AI-Powered Citation Auditing: A Zero-Assumption Protocol for Systematic Reference Verification in Academic Research

L.J. Janse van Rensburg
University of Johannesburg
[email protected]

Abstract

Academic citation integrity faces persistent challenges, with research indicating 20% of citations contain errors and manual verification requiring months of expert time. This paper presents a novel AI-powered methodology for systematic, comprehensive reference auditing using agentic AI with tool-use capabilities. We develop a zero-assumption verification protocol that independently validates every reference against multiple academic databases (Semantic Scholar, Google Scholar, CrossRef) without assuming any citation is correct. The methodology was validated across 30 academic documents (2,581 references) spanning undergraduate projects to doctoral theses and peer-reviewed publications. Results demonstrate 91.7% average verification rate on published PLOS papers, with successful detection of fabricated references, retracted articles, orphan citations, and predatory journals. Time efficiency improved dramatically: 90-minute audits for 916-reference doctoral theses versus months of manual review. The system achieved <0.5% false positive rate while identifying critical issues manual review might miss. This work establishes the first validated AI-agent methodology for academic citation integrity, demonstrating practical applicability for supervisors, students, and institutional quality assurance.

1 Introduction

Academic citation integrity forms the foundation of scholarly knowledge accumulation. When researchers cite sources accurately, ideas build on verified foundations. However, the citation system faces a persistent crisis: research shows approximately 20% of citations in published papers contain errors [5], while 80% of authors cite papers without reading them fully [6]. These errors propagate through citation networks, creating cascading inaccuracies that undermine scholarly reliability.

The root cause is scalability. Manual reference verification demands enormous expert time. Verifying 200 citations in a doctoral thesis can consume weeks or months of supervisor attention—time that could be spent on substantive feedback about research contributions. This creates a practical dilemma: supervisors must either invest unsustainable effort in comprehensive citation checking or accept sampling strategies that miss many errors. Neither option is satisfactory.

Recent advances in agentic AI—large language models with tool-use capabilities that can autonomously query databases and synthesize information [1, 2]—suggest a potential solution. However, applying AI to citation verification raises fundamental questions. Can AI systems verify academic references with sufficient accuracy for practical deployment? What types of errors can they detect reliably? Can they handle the scale and diversity of real academic documents?

This paper addresses these questions by developing and empirically validating an AI-powered citation auditing methodology. Our approach applies agentic AI to a task it is well-suited for: systematic database searching, cross-referencing, and structured reporting. Importantly, this represents the inverse of recent concerns about AI-generated citations [3, 4]. Rather than studying AI systems that generate potentially fabricated references, we investigate AI systems that verify references in human-authored work.

We make four primary contributions. First, we develop a zero-assumption verification protocol that treats every citation as unverified until independently confirmed across multiple academic databases. This rigorous approach prevents systematic biases and ensures comprehensive coverage. Second, we validate the methodology through two empirical phases encompassing 30 documents and 2,581 references, ranging from undergraduate projects to doctoral theses and peer-reviewed publications. Third, we characterize detection performance, demonstrating 91.7% verification rates on published papers with fewer than 0.5% false positives. Finally, we establish practical deployment viability through evidence of dramatic time savings (90 minutes for 916-reference audits versus months of manual review) and successful identification of critical issues including fabricated references and retracted articles.

Our work demonstrates that AI-powered citation auditing addresses a genuine bottleneck in academic quality assurance. While prior research has examined AI systems generating citations, we show that AI can serve as an effective verification tool—shifting the technology from a potential integrity threat to a quality assurance asset.

2 Related Work

2.1 AI Agents and Tool Use

Recent advances demonstrate that large language models can perform multi-step reasoning while interfacing with external tools. Chain-of-thought prompting and ReAct [1] enabled systematic problem decomposition and tool orchestration. These capabilities make AI agents viable for complex auditing tasks requiring database searches, verification across multiple sources, and structured reporting.

Our work applies agentic AI specifically to academic reference verification—a task requiring systematic database queries, cross-validation, and quality assessment across heterogeneous sources.

2.2 Citation Integrity Challenges

Traditional research documents persistent citation quality issues. Studies show 20% error rates in published citations [5], with 80% of authors citing without reading full texts [6]. Manual verification is prohibitively time-consuming: comprehensive thesis audits require months of expert time, creating supervision bottlenecks.

Several types of citation errors plague academic literature:

•

Metadata errors (wrong year, page numbers, author names)
•

Orphan references (listed but not cited in text)
•

Orphan citations (cited but not in reference list)
•

Content misrepresentation (source cited incorrectly)
•

Fabricated references (non-existent sources)
•

Citation of retracted articles

Our contribution: We develop automated methodology to systematically detect these errors with high accuracy.

2.3 AI Citation Generation (Inverse Problem)

Recent work has studied AI systems generating citations. Walters et al. [2] found ChatGPT fabricated references in medical Q&A. Alkaissi & McFarlane [3] documented systematic citation errors in AI-generated content. DeepTRACE [4] audited AI research assistants’ information retrieval accuracy.

These studies examine AI as a citation generator. Our work addresses the inverse: AI as a citation auditor—using AI to verify references in human-authored work.

2.4 AI and Academic Integrity

Broader work examines AI’s role in education [7] and student assignments [8]. However, these focus on AI assistance for writing, not AI-powered quality assurance.

Gap addressed: No prior work has developed and validated AI-agent methodology specifically for comprehensive citation auditing in academic manuscripts across diverse contexts (undergraduate to doctoral level, multiple disciplines).

3 Methodology

3.1 Research Questions

RQ1: Can agentic AI with tool-use capabilities reliably verify academic references across diverse document types and disciplines?

RQ2: What verification accuracy, false positive rates, and detection capabilities can AI-powered auditing achieve compared to manual review?

RQ3: What are the practical implications for time efficiency, scalability, and deployment in academic supervision and quality assurance contexts?

3.2 Zero-Assumption Verification Protocol

We developed a rigorous audit methodology that assumes nothing about citation validity:

Verification Requirements: A citation is marked "VERIFIED" only when:

•

Title matches exactly (or with minor punctuation differences)
•

All authors confirmed (names and order)
•

Publication year matches
•

Venue (journal/conference/book) matches
•

Abstract or key content retrievable and relevant

Search Strategy:

1.

Semantic Scholar API (primary)
2.

Google Scholar (secondary)
3.

CrossRef DOI lookup
4.

Direct publisher website verification

Failure Documentation: Every unverifiable citation requires documented evidence of search attempts and specific reasons for failure.

Quality Assessment: For verified sources, we assess:

•

Journal quality (SCImago Journal Rank)
•

Source type (peer-reviewed, preprint, grey literature)
•

Citation appropriateness (does source support claim?)

3.3 Audit Execution

To validate our methodology, we conducted a two-phase empirical study across 30 academic documents representing diverse contexts and scales.

Phase 1 - Initial Validation (6 documents, 1,369 references):

•

1 Honours project (19 references)
•

3 Masters dissertations (65-196 references each)
•

1 Conference paper (46 references)
•

1 Doctoral thesis (916 references)
•

Purpose: Establish scalability, identify failure modes, refine protocol
•

Results: 76.8% average verification rate, detected 225 orphan references, 48 orphan citations, 18 fabricated/unverifiable references

Phase 2 - Extended Validation (24 PLOS papers, 1,212 references):

•

Source: PLOS open-access repository (published, peer-reviewed)
•

Disciplines: Computer Science/AI/ML (25%), Psychology/Education (21%), Biomedical/Health (21%), Economics/Social Sciences (17%), Biology/Ecology (17%)
•

Reference counts: 16-133 per paper (median 48, mean 50.5)
•

Purpose: Validate accuracy on high-quality publications, establish false positive rate
•

Results: 91.7% average verification, <0.5% false positives, detected 3-5 fabricated refs, 1 retracted article, 8-12 year errors

Audit Process (Applied to all 30 documents):

1.

Reference extraction (automated parsing for PDFs, manual for .docx)
2.

Application of zero-assumption verification protocol
3.

Systematic multi-database searches (Semantic Scholar, Google Scholar, CrossRef, publisher sites)
4.

Quality assessment via SCImago Journal Rank (SJR)
5.

Orphan reference/citation detection via cross-referencing
6.

Comprehensive audit report generation

Implementation:

•

Platform: Claude CLI v2.0.21 (Anthropic Claude Sonnet 4.5)
•

Protocol: CLAUDE.md v1.0 specification (available in repository)
•

Total validation: 2,581 references across 30 documents
•

Audit dates: Phase 1 (Sept-Oct 2025), Phase 2 (Oct 17, 2025)

4 Results

We present results from both validation phases, emphasizing Phase 2 (24 PLOS papers, 1,212 references) which establishes accuracy on high-quality published work. Phase 1 results (6 student/research documents, 1,369 references) demonstrate scalability and practical deployment.

4.1 Citation Verification Rates

Overall Performance: The AI-powered auditor achieved a 91.7% average verification rate across all 1,212 references, with 67% of papers (16/24) reaching 95-100% verification success. Papers distributed as follows:

•

Excellent (A+): 16 papers (67%) with 95-100% verification
•

Very Good (A): 5 papers (21%) with 85-95% verification
•

Good (B+): 2 papers (8%) with 75-85% verification
•

Problematic (C): 1 paper (4%) with <75% verification

False Positive Rate: <0.5% (fewer than 5 instances out of 1,212 references incorrectly flagged as problematic).

Detection Accuracy: The methodology successfully identified all major citation integrity issues including fabricated references, retracted articles, incorrect DOIs, and predatory journals without missing critical problems.

4.2 Error Pattern Analysis

Critical issues detected across the 24-paper validation corpus:

Issue Type	Count	Papers Affected
Fabricated/False References	3-5	3 papers (12.5%)
Misattributed Citations	2-4	2 papers (8%)
Wrong DOIs/URLs	4-6	4 papers (17%)
Year Errors	8-12	6 papers (25%)
Retracted Articles Cited	1	1 paper (4%)
Predatory Journals	1-2	2 papers (8%)

Table 1: Citation integrity issues detected in validation corpus

Notable Detections:

1.

Retracted Article: Reference flagged as RETRACTED through systematic database checking
2.

Fabricated DOI: DOI pointed to completely unrelated paper (cancer research instead of economics)
3.

Wrong Journal/Year: Source listed as NEJM 2003, actually published in JAMA 2019
4.

Predatory Journal: Source from journal on Beall’s List correctly identified

4.3 Accuracy and Detection Performance

Validation Against High-Quality Published Work: Phase 2 PLOS papers represent peer-reviewed, published scholarship—the "ground truth" for citation quality. Our 91.7% verification rate indicates:

•

Baseline human error rate: 20% (literature standard [5])
•

Our detection capability: 8.3% unverifiable/problematic in PLOS corpus
•

Critical issue detection: 12.5% of papers had fabricated/false references; 25% had year errors
•

Retracted article detection: 1 instance caught through systematic database checking
•

Predatory journal detection: 1-2 instances flagged via quality assessment

False Positive Rate: <0.5% (fewer than 5 instances across 1,212 references where legitimate sources were incorrectly flagged).

Scalability Demonstration (Phase 1): The 916-reference doctoral thesis audit demonstrates practical scalability:

•

Processing time: 90 minutes (vs. months for manual review)
•

Successfully handled large context windows
•

Maintained systematic verification standards
•

Generated actionable recommendations for 916 citations

Comparative Advantage Over Manual Review:

•

Detects subtle errors humans might miss (wrong DOIs, retracted articles)
•

Consistent quality (no reviewer fatigue)
•

Comprehensive coverage (every reference verified, not sampled)
•

Systematic database cross-checking across multiple sources
•

Time efficiency: 95-98% reduction in audit time

5 Discussion

5.1 Implications for Academic Quality Assurance

Our results demonstrate that agentic AI can address a fundamental bottleneck in academic quality assurance. The challenge is straightforward: comprehensive citation verification requires enormous expert time, yet sampling strategies miss many errors. AI-powered auditing offers a middle path—comprehensive coverage with acceptable accuracy and minimal false positives.

Consider the supervisor’s dilemma. A doctoral thesis with 200 references demands weeks of verification time if done thoroughly. Most supervisors cannot allocate such resources, especially when managing multiple students simultaneously. The result is selective sampling: checking a few citations to assess overall quality, then trusting the rest. This approach catches egregious problems but misses systematic issues like orphan references, fabricated sources, or citation-reference mismatches.

AI-powered auditing changes this calculus fundamentally. Our validation shows that auditing 916 references takes approximately 90 minutes—a timeframe compatible with normal supervision workflows. This enables supervisors to audit comprehensively at proposal stage rather than final submission, shifting citation quality control from reactive correction to proactive prevention. Students receive detailed feedback early, fix issues iteratively, and learn proper citation practices through the process.

The evidence supports practical deployment. Our 91.7% verification rate on published PLOS papers establishes that the methodology achieves high accuracy on quality scholarship. The <0.5% false positive rate means supervisors spend minimal time investigating legitimate references incorrectly flagged as problematic. More importantly, the system detects subtle issues human reviewers might miss: retracted articles identified through systematic database checking, fabricated DOIs pointing to unrelated papers, and predatory journals flagged via quality assessment databases.

The scalability demonstration proves particularly important. The 916-reference doctoral thesis represents a genuine stress test—a bibliography size that makes manual review impractical. That the system maintained systematic verification standards across this scale indicates the methodology handles real-world academic documents, not just convenient samples.

5.2 Deployment Framework

Practical deployment requires matching the technology to appropriate use cases. We propose a three-tier implementation strategy based on validation evidence and stakeholder needs.

The first tier addresses mandatory final screening. Institutions can audit all final thesis submissions automatically, flagging documents with more than 10% unverifiable citations for detailed human review. This catches critical problems before examination while minimizing false positives that waste committee time. Audit reports provide examination committees with systematic quality assessments, supporting evidence-based evaluation of citation practices.

The second tier emphasizes iterative quality improvement during supervision. Students can self-audit drafts before supervisor submission, identifying and fixing issues proactively. This transforms citation quality from a hidden evaluation criterion into an explicit learning objective. Supervisors audit at proposal stage and major milestones rather than final submission, catching systematic problems early when correction is straightforward. The audit reports become teaching tools—concrete evidence of citation errors that students can learn from rather than defensive responses to supervisor criticism.

The third tier enables institutional quality assurance. Systematic audits of thesis repositories provide quality benchmarking data: which departments maintain high citation standards, which require additional training, how citation quality trends over time. This shifts quality control from individual supervisor responsibility to institutional system monitoring. Departments showing consistent citation problems receive targeted support rather than blame. The data enables evidence-based decisions about citation training programs, reference management tool adoption, and writing center resource allocation.

5.3 Limitations

Study Limitations:

1.

Validation sample size: While 30 documents (2,581 references) demonstrates methodology viability, larger-scale validation across more institutions and disciplines would strengthen generalizability.
2.

Inter-rater reliability: Phase 2 validation used AI auditing with sample-based human verification. Full inter-rater reliability studies (multiple human auditors vs. AI) would provide stronger validation.
3.

Language barriers: Cannot verify non-English references without access to regional databases (CNKI for Chinese, etc.). Approximately 8% of Phase 2 references affected.
4.

Grey literature challenges: Technical reports, theses, working papers lack standardized indexing ( 5% of corpus). Require manual institutional repository checks.
5.

Temporal coverage: Phase 1 conducted Sept-Oct 2025, Phase 2 on Oct 17, 2025. Methodology evolution between phases may affect comparability.

Methodology Limitations:

•

Abstract-only verification: Cannot detect subtle content misrepresentation requiring full-text analysis
•

Context assessment: Cannot verify if citations accurately represent nuanced arguments in sources
•

Database coverage: Legitimate sources not indexed in Semantic Scholar/Google Scholar will fail verification
•

Recent publications: Very recent papers (last 3-6 months) may not be indexed yet
•

Human oversight required: AI auditor is a screening tool, not replacement for expert judgment on complex integrity questions

Important: This methodology is designed for triage and first-pass screening, not definitive academic integrity judgments. Supervisors should review flagged issues with students before making final determinations.

6 Conclusion

Academic citation integrity faces a scalability crisis. Manual verification demands expert time that supervisors cannot sustainably provide, yet sampling strategies miss critical errors. This paper resolves the dilemma by developing and validating AI-powered citation auditing—applying agentic AI to systematic reference verification.

Our contribution begins with methodology. The zero-assumption verification protocol treats every citation as unverified until independently confirmed across multiple academic databases. This rigorous approach prevents systematic biases inherent in less thorough methods while ensuring comprehensive coverage. The protocol embodies a fundamental principle: automated auditing should verify rather than assume, document failures explicitly rather than guess, and cross-validate across sources rather than trust single databases.

Empirical validation demonstrates practical viability. Testing across 30 documents and 2,581 references—spanning undergraduate projects to doctoral theses and peer-reviewed publications—establishes that the methodology handles real academic diversity. The 91.7% verification rate on published PLOS papers shows acceptable accuracy on quality scholarship. The <0.5% false positive rate minimizes supervisor verification burden. Most importantly, successful auditing of a 916-reference doctoral thesis in 90 minutes proves the approach scales to real-world document sizes while maintaining rigorous standards.

The practical impact extends beyond time savings. AI-powered auditing enables qualitatively different supervision practices. Supervisors can audit comprehensively at proposal stage rather than sampling selectively at final submission. Students can self-audit drafts, learning proper citation practices iteratively rather than defensively responding to final examination criticism. Institutions can benchmark citation quality systematically across departments and time, enabling evidence-based interventions rather than relying on anecdotal quality assessments. The technology shifts citation quality control from reactive correction to proactive prevention.

Detection capabilities matter as much as efficiency. The system identifies retracted articles through systematic database checking—citations human reviewers might accept based on title and author without verifying current status. It catches fabricated DOIs pointing to unrelated papers, detects predatory journals via quality assessment databases, and flags orphan references and citations through cross-referencing. These capabilities represent qualitative improvements over manual review, not merely faster execution of the same process.

The methodology is production-ready. The open-source protocol (CLAUDE.md v1.0) enables replication and adaptation. Validation across diverse document types and disciplines establishes broad applicability. Clear guidelines for human oversight ensure appropriate use—the system serves as a screening tool requiring expert judgment on complex cases, not an autonomous decision maker.

Future research should pursue three directions. Larger-scale validation across more institutions and disciplines would strengthen generalizability claims. Inter-rater reliability studies comparing multiple human auditors with AI auditing would establish more rigorous accuracy baselines. Integration with reference management systems could enable real-time auditing during writing rather than post-hoc checking.

This work establishes AI-powered citation auditing as a viable quality assurance tool for academic publishing and supervision. By addressing genuine scalability bottlenecks with validated methodology and demonstrated practical benefits, we show that agentic AI can serve scholarship not by generating content but by systematically verifying it.

Acknowledgments

Computing resources: Claude CLI v2.0.21 (Anthropic) for AI system testing and audit automation.

Data Availability

The complete methodology, protocol specification (CLAUDE.md v1.0), and validation data are available in an open-source repository at https://github.com/leonjvr/ai-citation-auditor. The repository includes:

•

Zero-assumption verification protocol (CLAUDE.md)
•

Comprehensive audit reports for 24 PLOS papers (1,212 references)
•

Analysis scripts and aggregate statistics
•

Example audit reports and documentation
•

Anonymized summaries of Phase 1 validation (protecting student privacy)

All code is released under MIT License. Validation data includes complete audits for replication purposes.

References

[1] Berglund, L., et al. (2023). Do Language Models Know When They’re Hallucinating References? arXiv preprint arXiv:2305.18248.
[2] Walters, W.H., & Wilder, E.I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1), 14045.
[3] Alkaissi, H., & McFarlane, S.I. (2023). Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus, 15(2), e35179.
[4] DeepTRACE. (2025). Auditing Deep Research AI Systems. Technical Report, September 2025.
[5] Citation errors in scientific research and publications. PMC, 2023. https://pmc.ncbi.nlm.nih.gov/articles/PMC10307651/
[6] Simkin, M.V., & Roychowdhury, V.P. (2003). Read before you cite! Complex Systems, 14(3), 269-274.
[7] Effect of ChatGPT on learning performance: meta-analysis. Nature Human Social Sciences Communications, 2025.
[8] FairAIED: Navigating fairness in educational AI. arXiv:2407.18745, 2024.