Detection of metadata manipulations:
Finding sneaked references $\boldsymbol{\mathghost}$ in the scholarly literature

Lonni Besançon (Media and Information Technology, Linköping University, Norrköping, Sweden; ORCID: 0000-0002-7207-1276)
Guillaume Cabanac (Université Toulouse 3 – Paul Sabatier, IRIT UMR 5505 CNRS, 31062 Toulouse, France; Institut Universitaire de France (IUF), France; ORCID: 0000-0003-3060-6241)
Cyril Labbé (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France; ORCID: 0000-0003-4855-7038)
Alexander Magazinov (Yandex.Kazakhstan, 43 Dostyq av., Almaty 050010, Kazakhstan; ORCID: 0000-0002-9406-013X)
Jules di Scala (Université Toulouse 3 – Paul Sabatier, IRIT UMR 5505 CNRS, 31062 Toulouse, France; ORCID: 0009-0005-3460-0535)
Dominika Tkaczyk (Crossref; ORCID: 0000-0001-5055-7876)
Kathryn Weber-Boer (Digital Science, London, UK; ORCID: 0000-0002-4495-3001)
(Started late Summer 2024, version of January 7, 2025
Submitted to Journal of the Association for Information Science and Technology)
Abstract

We report evidence of a new set of sneaked references discovered in the scientific literature. Sneaked references are references registered in the metadata of publications without being listed in the reference section or in the full text of the actual publications where they ought to be found. We document here 80,205 references sneaked into the metadata of the International Journal of Innovative Science and Research Technology (IJISRT). These sneaked references are registered with Crossref and all cite, and thus benefit, this same journal. Using this dataset, we evaluate three different methods to automatically identify sneaked references. These methods compare reference lists registered with Crossref against the full text or the reference lists extracted from PDF files. In addition, we report attempts to scale the search for sneaked references to the scholarly literature.

1 Introduction

Citation-based indices or metrics like the $h$-index (Hirsch, 2005), the Journal Impact Factor (Garfield, 1994) or the Field-Weighted Citation Impact (FWCI) (Purkayastha et al., 2019) are cornerstones of many rankings: Clarivate’s ‘Highly Cited Researchers’ list (https://clarivate.com/highly-cited-researchers/), the Shanghai Ranking (http://www.shanghairanking.com/), Times Higher Education World University Rankings (https://www.timeshighereducation.com/world-university-rankings), QS World University Rankings (https://www.topuniversities.com/qs-world-university-rankings), or U.S. News Education Rankings (https://www.usnews.com/best-colleges/rankings/). These citation-based performance metrics are provided by various scientometrics services: Google Scholar ($h$-index), OpenAlex ($h$-index), Scopus ($h$-index and FWCI), and the Web of Science ($h$-index and Journal Impact Factor); Dimensions provides the Field Citation Ratio (FCR) and Relative Citation Ratio (RCR) (Bode et al., 2023).

Practically speaking, the computation of these indicators requires processing the metadata describing scientific publications: authors, institutions, registration dates, the fields attributed to journals and publications, and (critical to the research presented here) reference lists. Crossref (https://www.crossref.org/) provides infrastructure for registering metadata for scholarly works, including a DOI. To use this infrastructure, organisations join Crossref as members. In many cases, the metadata registered by Crossref members include reference lists. The identifiers of cited works are either provided by Crossref members or automatically added where matching is possible. Crossref is one of the major sources of scholarly data for publishers, authors, librarians, funders, and researchers (Hendricks et al., 2020). Various scientometrics services like Dimensions, OpenAlex, or SpringerLink make use of metadata deposited with Crossref.

Citation gaming to artificially boost citation-based metrics occurs in various forms (Biagioli & Lippman, 2020). While most of them involve simply adding references to the research papers directly through a varied set of methods and actors (see, e.g., Beel & Gipp, 2010; Davis, 2016; Foley & Valkonen, 2012; Franck, 1999; Heathers & Grimes, 2022; Kojaku et al., 2021; Labbé, 2010), sneaked references offer a different pathway to citation gaming (Besançon et al., 2024). The underlying strategy behind sneaked references is to inject irrelevant and undue citations into the metadata of an accepted article at the time of its registration with scientific repositories. Sneaked references are only present in the metadata of the article and are not part of the actual reference list of this document where they should be found. This malpractice generates undue citations that artificially inflate citation counts.

In this article, we report 2,782 Crossref records spoiled with at least 80,205 references sneaked into their metadata reference lists. All sneaked references benefit the same journal, namely, the journal in which the reference lists were published. The paper benefiting the most from sneaked references received a total of 6,059 undue citation counts, some of which did make their way into various scientometrics services (see Figures 1 and 2).

We designed and evaluated two different methods to automatically identify sneaked references by comparing references registered with Crossref against either the raw text or the reference lists extracted from PDF files. Both methods assume that references are registered with Crossref with enough information to be found in the extracted text (e.g., through the unstructured attribute).

The first method, $\mathcal{M}_1$, locates the last reference extracted from the PDF file in the list registered with Crossref. This method depends on an identical order of elements in the two lists and assumes that sneaked references appear at the end of the list registered with Crossref.

The second method, $\mathcal{M}_2$, automatically searches for each and every reference registered with Crossref in the raw text extracted from the PDF file. The rationale behind this method is that a particular reference field in a Crossref record closely reflects the text of this reference in the PDF file.

These two new methods are compared to an existing approach, $\mathcal{M}_0$, presented in Besançon et al. (2024), which relies on a comparison of reference list lengths. This approach was found to be effective in providing a lower bound to the number of sneaked references, by comparing reference lists retrieved from HTML document versions to the reference lists registered with Crossref.

These methods work only at the document level. To identify sneaked references in the scientific literature as a whole, one of these methods must be applied to each and every document, individually. We report here the result of an attempt to identify sneaked references at a large scale by applying method $\mathcal{M}_0$ to 47,170,721 documents previously processed by Dimensions, published since the year 2000. For each of these documents, the reference list was extracted from the PDF file and stored in a database to be compared with metadata registered with Crossref.

Previous work (Besançon et al., 2024) has mentioned that in data registered with Crossref, duplicated references sometimes appear together with sneaked references. We therefore attempted to identify duplicated references in Crossref metadata in the hopes of identifying new cases of sneaked references.

Section 2 presents the dataset and explains in detail $\mathcal{M}_1$ and $\mathcal{M}_2$. The proposed methods are assessed using the collected dataset. Section 3 gives precise information about the 80,205 references sneaked into the metadata of the International Journal of Innovative Science and Research Technology (IJISRT): when and where they were sneaked in, to the benefit of which document, and so on. Section 4 provides insight from attempts to detect sneaked references at a larger scale, including the systematic challenges that inhibit these efforts. Section 5 concludes with a discussion of some of the known routes to erroneous references, the actors involved, and some recommended actions which could address the problem of sneaked references.

Figure 1: The citation count of 10.38124/ijisrt/ijisrt24apr651 is 1.7k according to Dimensions; as of early December 2024, it benefits from at least 6,059 sneaked references (see Figure 6). There is no reason to think that the authors are responsible for this discrepancy.
Figure 2: The citation count of 10.38124/ijisrt/ijisrt24apr651 is 1.8k according to OpenAlex; as of early December 2024, it benefits from at least 6,059 sneaked references (see Figure 6). There is no reason to think that the authors are responsible for this discrepancy.

2 Dataset and comparison methods

Section 2.1 presents the information from which the dataset was identified and details how it was retrieved. Sections 2.2 and 2.3 give detailed descriptions of the proposed methods ($\mathcal{M}_1$ and $\mathcal{M}_2$, respectively). Sections 2.4, 2.5, and 2.6 provide performance results.

2.1 The International Journal of Innovative Science and Research Technology (IJISRT)

Visual inspection of several PDF files of the International Journal of Innovative Science and Research Technology (IJISRT) and of their corresponding Crossref JSON records reveals that, in some cases, references are sneaked in at the end of the Crossref reference-list attribute.

On 24 July 2024, Cristian Consonni alerted some of the authors about an entry of the Problematic Paper Screener (Cabanac et al., 2022) that highlighted tortured phrases in a certain IJISRT article (DOI: 10.38124/ijisrt/ijisrt24apr2410; see PubPeer). He further noted that the article had 237 citations, which was unusually high for an article published in April 2024, only 3 months earlier. Further inspection revealed that a significant number of other articles in IJISRT had a seemingly disproportionate number of citations and that most, if not all, citations had come from the same journal. Following this discovery, we searched the actual text of the citing articles for these citations, in vain, which indicates a pattern of sneaked references.

On the same day (24 July 2024), we queried Dimensions for all articles in IJISRT and retrieved the resulting CSV file, including the list of corresponding DOIs. For each retrieved DOI, the corresponding PDF file was downloaded from the publisher website (29 August 2024). Additionally, for all DOIs, Crossref records were downloaded from Crossref using the relevant API (28–29 August 2024). This served as a development dataset, and on 25 November 2024 we downloaded the final dataset presented here.
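
The Crossref retrieval step can be sketched as follows. This is a minimal sketch and not the authors' script: it assumes the public Crossref REST API endpoint https://api.crossref.org/works/{DOI}, whose JSON response carries the deposited reference list under message → reference. The helper names works_url, fetch_record, and reference_list are hypothetical, and the sample record is hand-made rather than real Crossref data.

```python
import json
import urllib.request
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"  # public Crossref REST API


def works_url(doi: str) -> str:
    """Build the API URL for one DOI (slashes kept, other characters escaped)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def fetch_record(doi: str) -> dict:
    """Download the Crossref JSON record for a DOI (needs network access)."""
    with urllib.request.urlopen(works_url(doi), timeout=30) as response:
        return json.load(response)


def reference_list(record: dict) -> list[dict]:
    """Return the deposited reference list, or [] when none was registered."""
    return record.get("message", {}).get("reference", [])


# Offline illustration with a hand-made record (not real Crossref data).
sample = {"message": {"reference": [
    {"key": "ref1", "unstructured": "A. Author. A genuine reference. 2020."},
    {"key": "ref2", "DOI": "10.38124/example",
     "unstructured": "B. Author. Another entry. 2024."},
]}}
refs = reference_list(sample)
print(len(refs), [r.get("DOI") for r in refs])
```

An empty list returned by reference_list does not distinguish a document without references from a record whose references were never deposited, so the two situations have to be told apart by looking at the PDF.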

In this final dataset, the observed sneaked references always benefit papers with DOIs prefixed with 10.38124/ijisrt. This prefix identifies the International Journal of Innovative Science and Research Technology (IJISRT). All observed sneaked references appear in Crossref records after the expected references, as an irrelevant addendum to the reference list. We used these two properties (journal-level self-citation, and position of the sneaked references at the bottom of the reference-list attribute in Crossref metadata) to study the extent to which sneaked references occur.

2.2 $\mathcal{M}_1$: Comparing Crossref records with references extracted from PDFs

The idea is to extract a reference list from each PDF file so that it can be compared with the one registered with Crossref. Extracting the reference list from a PDF file can be done using a tool that transforms PDF files into XML files. In XML format, the reference list is clearly identified and can be automatically analysed.

Our process was the following for each collected DOI:

  • The reference list $\mathcal{R}_C$ registered with Crossref is built from the JSON file provided by Crossref. In the following, $Last_C$ denotes the last element of the list $\mathcal{R}_C$.

  • A reference list $\mathcal{R}_G$ (in XML format) is extracted from the PDF file using Grobid (GROBID, 2008–2023) in its default configuration. Unfortunately, Grobid, while being quite reliable, sometimes skips some references. This results in missing references in $\mathcal{R}_G$. Items might be missing in bulk, at the beginning, middle, or end of the reference list. In very exceptional cases, Grobid inserts hallucinated references: $\mathcal{R}_G$ might contain references that do not appear in the PDF. We spotted cases where Grobid added the biography of an author as the last item of $\mathcal{R}_G$. This can happen when the biography appears, in the PDF, just after the last entry of the reference section (see the left panel in Figure 3). Nevertheless, we use the last reference of $\mathcal{R}_G$, which we denote $Last_G$.

Comparing $Last_G$ to $Last_C$ gives information about both sneaked references and Grobid’s capability to correctly identify the last reference in the PDF’s reference list. Let us consider the following three cases:

  • Case 1.

    If $Last_C = Last_G$, we conclude that $\mathcal{R}_C$ is correct with regard to the PDF version. The Crossref record does not contain any sneaked references.

  • Case 2.

    If $\exists r \in \mathcal{R}_C$ such that $r = Last_G \land r \neq Last_C$, we conclude that the references appearing after $r$ in $\mathcal{R}_C$ are sneaked references, forming a list denoted $\mathcal{L}_{\boldsymbol{\mathghost}}$. In the specific dataset of 10.38124/ijisrt, close analysis of $\mathcal{L}_{\boldsymbol{\mathghost}}$ instances reveals that, from time to time, the first items are not sneaked references. This happens when Grobid fails to extract the end of the reference section from the PDF, resulting in a truncated $\mathcal{R}_G$. Considering that, in all inspected cases, all the sneaked references concerned DOIs starting with 10.38124, we decided to remove from $\mathcal{L}_{\boldsymbol{\mathghost}}$ all elements preceding the first appearance of a DOI starting with 10.38124. In other words, we skipped the references at the top of the list when their DOIs are not prefixed by 10.38124. Without this cleaning operation, some of the legitimate references would have been wrongly considered as sneaked references.

  • Case 3.

    If $\nexists r \in \mathcal{R}_C$ such that $r = Last_G$, we can conclude that $Last_G$ is an artifact created by Grobid: $Last_G$ is a hallucinated reference that appears neither in $\mathcal{R}_C$ nor in the PDF file. In that case, no conclusion can be drawn. Nevertheless, in the specific dataset of 10.38124/ijisrt, a close analysis reveals that, sometimes, sneaked references can be found at the end of $\mathcal{R}_C$. Again, in all inspected cases, the sneaked references benefit DOIs starting with 10.38124. We decided to build $\mathcal{L}_{\boldsymbol{\mathghost}}$ with all elements of $\mathcal{R}_C$ appearing after the last element that does not have a DOI starting with 10.38124. Without this backward check (iterating over the list from the bottom up to the first non-10.38124 DOI), some of the sneaked references would have gone undetected.

Figure 3 illustrates a Case 3 situation. For a DOI like this, the list $\mathcal{L}_{\boldsymbol{\mathghost}}$ is built from the trailing references with DOIs starting with 10.38124. This method could be used for every DOI. Identifying Case 1 and Case 2 is a way to evaluate how precise the extraction of sneaked references using Grobid can be.

Figure 3: A PDF file with a list of 8 references. The reference list extracted by Grobid ($\mathcal{R}_G$) does not contain some of the expected references (e.g., references #1, #4, and #5) and does feature non-existing references (e.g., item [6.] of $\mathcal{R}_G$). The reference list registered with Crossref ($\mathcal{R}_C$) contains 3 sneaked references: [9.], [10.], [11.]. This is a Case 3 situation.
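
The three cases of this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: references are compared by exact equality after lowercasing and whitespace normalization (real data would probably need fuzzy matching), Crossref entries are represented as dictionaries with the unstructured and DOI keys, and the names normalize, has_journal_doi, and detect_m1 are hypothetical.

```python
JOURNAL_PREFIX = "10.38124"  # DOI prefix of the journal studied in the text


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def has_journal_doi(ref: dict) -> bool:
    """True when the Crossref entry carries a DOI with the journal prefix."""
    return str(ref.get("DOI", "")).lower().startswith(JOURNAL_PREFIX)


def detect_m1(crossref_refs: list[dict], grobid_refs: list[str]) -> tuple[int, list[dict]]:
    """Return (case number, suspected sneaked references)."""
    if not crossref_refs or not grobid_refs:
        raise ValueError("both reference lists must be non-empty")
    texts = [normalize(r.get("unstructured", "")) for r in crossref_refs]
    last_g = normalize(grobid_refs[-1])

    # Case 1: both lists end with the same reference.
    if texts[-1] == last_g:
        return 1, []

    # Case 2: Last_G occurs in the Crossref list, but not as its last item.
    if last_g in texts:
        position = len(texts) - 1 - texts[::-1].index(last_g)
        tail = crossref_refs[position + 1:]
        # Cleaning: skip leading items until the first journal-prefixed DOI.
        first = next((i for i, r in enumerate(tail) if has_journal_doi(r)), len(tail))
        return 2, tail[first:]

    # Case 3: Last_G is absent from the Crossref list; walk backward over
    # the trailing run of journal-prefixed DOIs (the backward check).
    cut = len(crossref_refs)
    while cut > 0 and has_journal_doi(crossref_refs[cut - 1]):
        cut -= 1
    return 3, crossref_refs[cut:]


# Hand-made example data (hypothetical references and DOIs).
genuine = [{"unstructured": "Ref A"}, {"unstructured": "Ref B"}, {"unstructured": "Ref C"}]
sneaked = [{"unstructured": "Sneaked 1", "DOI": "10.38124/x1"},
           {"unstructured": "Sneaked 2", "DOI": "10.38124/x2"}]

case1 = detect_m1(genuine, ["Ref A", "Ref C"])
case2 = detect_m1(genuine + sneaked, ["Ref A", "Ref B"])  # extraction truncated
case3 = detect_m1(genuine + sneaked, ["Ref A", "Biography of an author"])
print(case1[0], case2[0], case3[0])
```

In the Case 2 example, the genuine item "Ref C" follows the last extracted reference in the Crossref list; the cleaning step keeps it out of the suspected list because it carries no DOI with the journal prefix.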

2.3 $\mathcal{M}_2$: Comparing Crossref records with the full text extracted from PDFs

The rationale behind this method is that the unstructured field of a particular reference entry in a Crossref record closely reflects the text of this reference in the PDF file. As a consequence, the character string $s$ found in this field must appear in the text $\mathcal{T}$ extracted from the PDF file. There is not even a need to restrict the search for $s$ to the reference section, which is fortunate because, as shown in Section 2.2, correctly identifying the reference section is challenging. Unstructured references are quite long, so the search for $s$ in $\mathcal{T}$ is unlikely to generate false positives.

More specifically, the following steps were performed for every DOI:

  1. The reference list $\mathcal{R}_C$ registered with Crossref is built from the JSON file provided by Crossref. In the following, $s$ denotes an element of the list $\mathcal{R}_C$.

  2. The full text $\mathcal{T}$ is extracted from the PDF file using the pypdf Python library.

  3. $\forall s \in \mathcal{R}_C$, a search for $s$ in $\mathcal{T}$ is performed. The goal is to identify $s'$, the substring of $\mathcal{T}$ that is the closest to $s$ according to $\delta(s, s')$, a similarity score derived from the normalized Levenshtein distance (computed with the partial_ratio method from the RapidFuzz Python library). The similarity $\delta(s, s')$ ranges from 0 (totally different) to 100 (entirely similar).

  4. If $\delta(s, s') < 60$, no character string $s'$ highly similar to $s$ could be found in $\mathcal{T}$. This most probably happens when the reference $s \in \mathcal{R}_C$ does not exist in the document, revealing that $s$ is a sneaked reference. Conversely, if $\delta(s, s') \geq 60$, a character string $s'$ quite similar to $s$ exists in $\mathcal{T}$. In that case, $s$ is not a sneaked reference. The threshold of 60 was set experimentally, after manual examination of several cases.
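
The four steps above can be sketched as follows. This is a rough illustration, not the authors' code: to stay dependency-free it swaps in difflib.SequenceMatcher from the Python standard library as a stand-in for the RapidFuzz scorer, so the scores are not Levenshtein-based and the threshold of 60 is not guaranteed to transfer unchanged. The function names partial_similarity and find_sneaked and the example strings are hypothetical, and the full text is passed in as a string rather than extracted with pypdf.

```python
from difflib import SequenceMatcher

THRESHOLD = 60  # value reported in the text for the RapidFuzz-based score


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (PDF text is full of line breaks)."""
    return " ".join(text.lower().split())


def partial_similarity(s: str, text: str) -> float:
    """Best score (0-100) between s and any window of text of length len(s).

    Stand-in scorer based on difflib; quadratic, so only suited to a sketch.
    """
    s, text = normalize(s), normalize(text)
    if not s:
        return 0.0
    if s in text:
        return 100.0
    if len(text) <= len(s):
        return 100.0 * SequenceMatcher(None, s, text, autojunk=False).ratio()
    best = 0.0
    for start in range(len(text) - len(s) + 1):
        window = text[start:start + len(s)]
        score = 100.0 * SequenceMatcher(None, s, window, autojunk=False).ratio()
        best = max(best, score)
    return best


def find_sneaked(crossref_refs: list[str], full_text: str,
                 threshold: float = THRESHOLD) -> list[str]:
    """Return the registered references that cannot be found in the full text."""
    return [s for s in crossref_refs if partial_similarity(s, full_text) < threshold]


# Hand-made example: the first reference is in the text, the second is not.
full_text = "Some body text.\nReferences\n[1] Doe, J. (2020). A genuine\nreference about citation metrics."
registered = [
    "Doe, J. (2020). A genuine reference about citation metrics.",
    "ZZZZ 7777 KKKK 8888 WWWW",
]
print(find_sneaked(registered, full_text))
```

The whitespace normalization matters in practice: a reference that is broken across lines in the extracted text would otherwise fail an exact substring test even though it is present.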

2.4 Measuring the performance of $\mathcal{M}_1$, the detection method using the ‘last’ element of reference lists

We consider here the 3,132 records with $\mathcal{R}_C \neq \emptyset$ and $\mathcal{R}_G \neq \emptyset$ that contain a total of 78,736 sneaked references. These records are distributed among the three identified cases (see Section 2.2) as follows:

  • Case 1.

    331 DOIs (10.5% = 331/3,132) with no sneaked references were correctly identified because $Last_C = Last_G$.

  • Case 2.

    When $\exists r \in \mathcal{R}_C$ such that $r = Last_G \land r \neq Last_C$, then $\mathcal{L}_{\boldsymbol{\mathghost}}$ is composed of the references appearing after $r$ in $\mathcal{R}_C$. A cleaning operation (see Case 2 in Section 2.2) might be needed.

    • No cleaning needed: For 1,788 DOIs (57.1% = 1,788/3,132), $\mathcal{L}_{\boldsymbol{\mathghost}}$ was correct. This represents a total of 46,297 (58.8% = 46,297/78,736) sneaked references.

    • Cleaning needed: For 840 DOIs (26.8% = 840/3,132), $\mathcal{L}_{\boldsymbol{\mathghost}}$ contains potential false positives: references that would have been classified as sneaked without the cleaning operation. This represents a total of 2,032 references (2.2% = 2,032/78,736).

  • Case 3.

    For 173 DOIs, $\nexists r \in \mathcal{R}_C$ such that $r = Last_G$. A backward check (see Case 3 in Section 2.2) might be needed to identify sneaked references. For the 173 instances of this case, the backward check identifies sneaked references. Without this check, 3,176 sneaked references would have gone undetected (4% = 3,176/78,736).

2.5 Comparing $\mathcal{M}_1$ and $\mathcal{M}_2$

Method $\mathcal{M}_1$ relies to a high extent on the tool used to extract the reference list from the PDF. We implemented this method using Grobid. The performance of the method thus depends on the ability of Grobid to accurately extract the reference list. Since this task is not trivial, and Grobid occasionally makes mistakes, we had to use additional assumptions about the sneaked references to get reliable results: sneaked references appear after the last genuine reference extracted from the PDF file. This means that this method might not generalize easily to other instances of the sneaked references problem.

Method $\mathcal{M}_2$ does not depend on identifying the reference list in the PDF file and does not make any assumptions about where sneaked references are. As such, this method should generalize better than the first one. Nevertheless, in some corner cases, additional text fragments like headers or footers might appear in the middle of a reference in the text extracted from the PDF, thus making the identification of that reference impossible.

One common drawback of both methods is that the unstructured field must be correctly deposited with Crossref by the publisher for the methods to work properly: one could imagine cases where only reference DOIs are provided.

For every DOI with references in the metadata and an available PDF, we compared the numbers of sneaked references reported by both methods (see Table 1). For this dataset, among 3,186 (= 2,953 + 233) compared DOIs, the methods disagreed in 233 (7.3%) cases. Among these, a difference greater than 10 in the number of sneaked references reported is observed for only 11 DOIs. This explains why the total numbers of sneaked references detected by the methods differ only by 0.9% ((80,909 - 80,205)/80,205).

It seems that most discrepancies are due to cases where the first method underestimated the number of sneaked references.

Statistic                              DOIs
Total processed DOIs                   4,077
– DOIs with no references in JSON        855
– DOIs with no PDF                        36
– DOIs where methods agreed            2,953
– DOIs where methods disagreed           233
(a)

Method             Manipulated DOIs    Sneaked references
$\mathcal{M}_1$    2,782               80,205
$\mathcal{M}_2$    2,787               80,909
(b)

Table 1: Statistics on DOIs (a) and comparison of the methods’ findings (b)
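
As a quick consistency check, the percentages quoted in this section can be recomputed from the counts in Table 1. The short script below is only a worked-arithmetic sketch; the variable names are ours.

```python
# Counts taken from Table 1.
total_dois = 4_077
no_references = 855
no_pdf = 36
agreed, disagreed = 2_953, 233
m1_sneaked, m2_sneaked = 80_205, 80_909

# DOIs that both methods could process.
compared = agreed + disagreed
assert compared == total_dois - no_references - no_pdf

disagreement_rate = 100 * disagreed / compared
relative_gap = 100 * (m2_sneaked - m1_sneaked) / m1_sneaked

print(f"{compared} compared DOIs, {disagreement_rate:.1f}% disagreement, "
      f"{relative_gap:.1f}% gap")
# prints: 3186 compared DOIs, 7.3% disagreement, 0.9% gap
```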

2.6 Measuring the performance of $\mathcal{M}_0$, which uses the lengths of registered and extracted reference lists

Comparing the list lengths (adapting the approach of Besançon et al. (2024), which uses HTML reference lists and Crossref reference lists) would give 84,270 sneaked references. This is an overestimation of 5,534 (7% = (84,270 - 78,736)/78,736) of the total number of sneaked references.

For some DOIs, the error can be quite high (max = 465). Relying solely on a comparison of list lengths will generate many false positives.

This is an important drawback of method $\mathcal{M}_0$ when implemented using Grobid. It suggests that reference lists extracted using Grobid tend to be shorter than the ones actually existing in documents.

Using a length comparison method will thus overestimate the number of sneaked references. Using either the last-element method $\mathcal{M}_1$ or the raw comparison method $\mathcal{M}_2$ is a more accurate way to detect sneaked references.
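
The overestimation mechanism can be illustrated with a small sketch. It assumes that the length-based estimate is the difference between the two list lengths, floored at zero (the floor is our assumption); the function name m0_estimate and the per-document numbers are hypothetical, while the aggregate figures are those reported above.

```python
def m0_estimate(n_crossref: int, n_extracted: int) -> int:
    """Length-based estimate of sneaked references for one document."""
    return max(0, n_crossref - n_extracted)


# Hypothetical document: 10 genuine references in the PDF, 5 sneaked in Crossref.
true_sneaked = 5
n_crossref = 10 + true_sneaked

complete_extraction = m0_estimate(n_crossref, 10)  # extractor found all 10
truncated_extraction = m0_estimate(n_crossref, 7)  # extractor dropped 3
print(complete_extraction, truncated_extraction)

# Aggregate figures reported in the text.
overestimate = 84_270 - 78_736
print(overestimate, f"{100 * overestimate / 78_736:.0f}%")
```

Every reference that the extractor drops is counted as sneaked by this estimate, which is why a truncated extraction inflates the result.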

The next section provides a detailed description of the results obtained using the systematic comparison of Grobid output vs Crossref records ($\mathcal{M}_1$).

3 Characteristics of the sneaked references found in IJISRT

We review the characteristics of the IJISRT dataset: when and where were the sneaked references inserted, and whom do they benefit?

A detailed examination of the properties of the sneaked references aims at delineating their source and nature.

3.1 Broad overview: How many? Where and when were references sneaked in? Who are the beneficiaries?

  • The corpus is composed of 4,077 DOIs prefixed by 10.38124/ijisrt.

  • For 3,222 DOIs, a non-empty reference list $\mathcal{R}_C$ was downloaded from api.crossref. An empty result can reflect the fact that a document does not contain any reference list. But it also happens when the publisher did not register any reference list for this particular DOI. This means that the references listed in the real document are lost (see Besançon et al., 2024). Despite being present in the PDF, they are not registered with Crossref and are not credited to the cited document. Consequently, for 4,077 − 3,222 = 855 DOIs the number of sneaked references is zero, as strictly no references are registered with Crossref.

  • For 3,940 DOIs, a non-empty reference list $\mathcal{R}_G$ has been extracted by Grobid from the PDF files. An empty list is generated either when the document does not contain any reference list or when Grobid failed to identify the reference section.

  • The records for 3,132 DOIs have both $\mathcal{R}_C \neq \emptyset$ and $\mathcal{R}_G \neq \emptyset$. They contain a total of 78,736 sneaked references.

  • Overall, the 80,205 sneaked references were found in 2,782 Crossref records. The number of sneaked references in a single paper ranges from 1 to 71, with an average of 28.83 sneaked references per paper (see distributions in Figure 4).

  • The sneaked references benefit 2,703 different DOIs. The most extreme case is a single DOI benefiting from 6,059 undue citations.

  • DOIs with sneaked references were created with Crossref between March 2024 and November 2024.
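The breakdown above amounts to partitioning the corpus by whether each of the two reference lists is empty. A minimal sketch of that bookkeeping follows, assuming a hypothetical in-memory mapping from DOI to the pair of lists; the DOIs and entries are invented for illustration.

```python
def summarize(corpus):
    """Count DOIs by availability of registered and extracted reference lists.

    corpus maps a DOI to a pair (crossref_refs, grobid_refs), each a list.
    """
    with_crossref = {doi for doi, (r_c, _r_g) in corpus.items() if r_c}
    with_grobid = {doi for doi, (_r_c, r_g) in corpus.items() if r_g}
    return {
        "total": len(corpus),
        "crossref_non_empty": len(with_crossref),
        "grobid_non_empty": len(with_grobid),
        "both_non_empty": len(with_crossref & with_grobid),
        # With nothing registered, no reference can have been sneaked in.
        "no_registered_refs": len(corpus) - len(with_crossref),
    }


# Hypothetical toy corpus.
toy_corpus = {
    "10.0000/example.1": (["a", "b", "s"], ["a", "b"]),  # both lists available
    "10.0000/example.2": ([], ["a"]),                     # nothing registered
    "10.0000/example.3": (["a"], []),                     # extraction returned nothing
    "10.0000/example.4": ([], []),                        # no list on either side
}
summary = summarize(toy_corpus)
print(summary)
```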

(a) All DOIs, including those with zero sneaked references.
(b) DOIs with at least one sneaked reference; 181 DOIs have at least 70 sneaked references.
Figure 4: How many DOIs have $x$ sneaked references. The mode (the most frequent value) is around 20 sneaked references.

3.2 Per beneficiary analysis

(a) All DOIs, including those (2,469) that benefit from a single undue citation.
(b) Only DOIs benefiting from more than one sneaked reference.
Figure 5: How many DOIs benefit from $x$ undue citations? For example, 2,607 DOIs benefited from 1 to 29 undue citations, while only 138 DOIs are benefiting from 2 to 31 undue citations.
Figure 6: The 30 DOIs that benefit the most from sneaked references.

A total of 80,205 sneaked references are benefiting 2,782 different DOIs. The average count of undue citations per benefiting DOI is 28.83. Nevertheless, the distribution is very unbalanced, as shown in Figure 5. The overwhelming majority of DOIs ($n = 2{,}469$) are credited with only a single undue citation. On the other hand, a small number of DOIs benefit from a significant number of undue citations.

The 30 DOIs that benefit the most from sneaked references are shown in Figure 6. The figure also shows the count of undue citations that these DOIs benefit from. The DOI benefiting the most from sneaked references (10.38124/ijisrt/ijisrt24apr651) is credited with 6,059 undue citations. Consequently, this particular DOI is incorrectly credited with 1.8k and 1.7k citations by OpenAlex and Dimensions, respectively (see Figure 1 and Figure 2). This shows that some of the sneaked references effectively made their way through the counting processes onto some scientometric platforms.
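A per-beneficiary tally of this kind can be sketched as follows. The pair layout (citing DOI, cited DOI) and the toy identifiers are assumptions made for illustration only.

```python
from collections import Counter


def undue_citations_per_beneficiary(sneaked_pairs):
    """Map each cited DOI to the number of sneaked references pointing to it."""
    return Counter(cited for _citing, cited in sneaked_pairs)


# Hypothetical sneaked references as (citing, cited) pairs.
sneaked_pairs = [
    ("citing.1", "cited.A"), ("citing.2", "cited.A"), ("citing.3", "cited.A"),
    ("citing.1", "cited.B"),
    ("citing.2", "cited.C"),
]
per_beneficiary = undue_citations_per_beneficiary(sneaked_pairs)

# Number of beneficiaries credited with a single undue citation.
single = sum(1 for n in per_beneficiary.values() if n == 1)
# Average number of undue citations per benefiting DOI.
mean = sum(per_beneficiary.values()) / len(per_beneficiary)
# Beneficiaries ranked by decreasing count, as in a top-k figure.
top = per_beneficiary.most_common(1)
print(single, round(mean, 2), top)
```

Even in this tiny example the mean (about 1.67) hides the fact that most beneficiaries receive a single undue citation while one receives most of them.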

3.3 Time Analysis

The information available from Crossref includes the date on which a DOI was first registered: creation date. For sneaked references, we decided to compare the creation date of the citing DOI and the creation date of the cited DOI.

3.3.1 When were benefiting DOIs created?

The oldest unduly cited DOI is 10.38124/volume4issue7, which is the identifier of a volume published in April 2020 (see Figure 7(a)). It seems that this citation does not benefit any individual paper of the volume.

All but nine of the benefiting DOIs were created between 2024-03-08 at 12:14 and 2024-11-07 at 12:54. The DOI that benefits the most from sneaked references (6,059 undue citations) was published in April 2024 (see Figure 7(b)).

(a) The observation on the extreme left represents sneaked references benefiting 10.38124/volume4issue7, a whole volume published in April 2020.
(b) Same as the left panel but zoomed in on the year 2024. The DOI 10.38124/ijisrt/ijisrt24apr651 benefiting the most from sneaked references (6,059) was created in April 2024.
Figure 7: Number of sneaked references with regard to when the benefiting DOI was created with Crossref.

3.3.2 When were DOIs with sneaked references created?

According to Crossref metadata, DOIs with sneaked references were created between 2024-03-14 and 2024-11-25 (see Figure 8).

The first 14 records featuring sneaked references were registered on 2024-03-14, each with only one sneaked reference. These individual sneaked references benefit different DOIs.

The number of sneaked references per DOI increased quite rapidly to reach a maximum of 71. The 6 DOIs with 71 sneaked references were all created on either 2024-10-28 or 2024-10-18.

Figure 8: Date when DOIs with sneaked references were created with Crossref. The more intense the color, the more DOIs with the same number of sneaked references are registered at that time.

3.3.3 Coherence between citing and cited creation date

Since sneaked references began to be included on 2024-03-14, and all but one of the beneficiaries started to be created 5 days earlier (2024-03-09), it is important to check the temporal coherence between the creation dates of the citing and cited works.

Figure 9 shows the number of days between the ‘citing’ creation date and the ‘cited’ creation date. This number is never negative, showing that, for all sneaked references, the citing DOIs were created no earlier than the cited ones.

(a) The light blue outliers in the upper left corner are the sneaked references to the oldest unduly cited DOIs, sneaked in on 2024-03. The upper band are sneaked references to ‘old’ DOIs created in 2020 (see Figure 7(a)).
(b) Same as panel (a), zoom with the upper band removed.
Figure 9: Temporal coherence between citing and cited DOIs. The $y$ axis is the number of days between the creation date of the cited DOIs and the creation date of the citing DOIs ($x$ axis). The darker the blue is, the more observations there are.

Nevertheless, time differences are quite small, starting from 0 days (one instance on 2024-05-22), and slowly increase as time passes (see Figure 9(b)).

Figure 10(b) shows the distribution of time differences ($0 \leq \delta \leq 250$, in days). The most frequent value (with ∼800 occurrences) is a difference of six days between citing and cited DOIs. Half of the sneaked references cite DOIs that were created less than 73 days (median) before the citing DOI.
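The coherence check and the summary statistics above reduce to simple date arithmetic. The following sketch assumes creation dates are available as ISO strings; the four toy pairs are invented and only mimic the shape of the data.

```python
from collections import Counter
from datetime import date
from statistics import median


def day_gaps(pairs):
    """Days elapsed between the cited and the citing creation dates."""
    return [
        (date.fromisoformat(citing) - date.fromisoformat(cited)).days
        for citing, cited in pairs
    ]


# Hypothetical (citing_created, cited_created) pairs.
pairs = [
    ("2024-05-22", "2024-05-22"),
    ("2024-06-10", "2024-06-04"),
    ("2024-07-01", "2024-06-25"),
    ("2024-10-28", "2024-04-10"),
]
gaps = day_gaps(pairs)

# Temporal coherence: no citing DOI may predate the DOI it cites.
all_coherent = all(gap >= 0 for gap in gaps)
# Most frequent gap and the median gap, as reported for the real data.
most_frequent_gap, occurrences = Counter(gaps).most_common(1)[0]
median_gap = median(gaps)
print(gaps, all_coherent, most_frequent_gap, median_gap)
```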

(a) Distribution of time differences between the cited and citing paper for sneaked references.
(b) Distribution of time differences (zoom) between the cited and citing paper for sneaked references.
Figure 10: Distribution of time differences between the cited and citing paper for sneaked references.

3.4 Summarizing evidence

In summary, sneaked references were first added in small numbers in April 2024. At that time only a handful of references were unduly added at registration time. The number of sneaked references increased little by little, reaching a maximum of 71 sneaked references per paper in November 2024.

The sneaked references all benefit the journal in which they are sneaked, and thus the publisher that registered them.

The time difference between the cited and the citing paper, for sneaked references, is often surprisingly small. It is also worth noting that sneaked references benefit the cited papers in a very unbalanced way. Most of the papers benefit from a single sneaked reference, while a few benefit from hundreds of undue citations.

In the light of these data, it is hard to find definitive evidence that differentiates intentional manipulation from genuine malfunctions in the metadata registration process.

Nevertheless, awkward metadata registered with Crossref might help to identify venues where references are sneaked in. The next section investigates this hypothesis.

4 Attempts to detect sneaked references at a large scale

Section 4.1 explains how sneaked references can be discovered when they occur together with duplicated references. Section 4.2 discusses the results of an attempt to identify sneaked references at a large scale by applying method $\mathcal{M}_0$ to 47,170,721 documents.

4.1 Duplicate-based heuristic to circumscribe sneaked references

We hypothesize that sneaked references appear together with duplicated references. Therefore, we try to detect groups of duplicated references in the hope of circumscribing sneaked references.

A bibliography should not contain the same DOI multiple times. The approach we adopted is to consider only the DOI of references to identify duplicates. Thus, two references in the same citing article that point to the same DOI are considered duplicate references. Working at the DOI level to spot duplicates is imperfect. We noticed that some DOIs occur multiple times in the Crossref record of a bibliography that does not feature duplicates. This can happen when the DOI of a book is used to identify different chapters of this book. For this reason, we excluded books and book chapters from our analysis.
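A DOI-level duplicate check of this kind can be sketched as follows. This is an illustration under assumptions, not the pipeline used here: reference entries are assumed to be dictionaries carrying an optional `DOI` field, DOIs are lowercased before comparison, and `ignored_dois` is a hypothetical set of DOIs known to identify books or book chapters (how that set is built is left out).

```python
from collections import Counter


def duplicated_dois(references, ignored_dois=frozenset()):
    """Return the DOIs cited more than once in one reference list, with their counts."""
    counts = Counter(
        ref["DOI"].lower()  # assumption: compare DOIs case-insensitively
        for ref in references
        if ref.get("DOI") and ref["DOI"].lower() not in ignored_dois
    )
    return {doi: n for doi, n in counts.items() if n > 1}


# Hypothetical reference list of a single citing article.
references = [
    {"key": "ref1", "DOI": "10.0000/Article.1"},
    {"key": "ref2", "DOI": "10.0000/article.1"},
    {"key": "ref3", "DOI": "10.0000/book.9"},
    {"key": "ref4", "DOI": "10.0000/book.9"},
    {"key": "ref5", "unstructured": "A reference registered without a DOI"},
    {"key": "ref6", "DOI": "10.0000/article.2"},
]

with_books = duplicated_dois(references)
without_books = duplicated_dois(references, ignored_dois={"10.0000/book.9"})
print(with_books, without_books)
```

References registered without a DOI are invisible to this check, which is one reason why working at the DOI level remains imperfect.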

We used the latest Crossref snapshot downloaded on 23 November 2023. It contains a total of 991,206,078 reference entries including duplicates. 3,755,847 (0.38%) of these are duplicated 1+ times. Thus, the dump contains a total of 986,772,474 distinct references. Overall, 4,433,404 (0.45%) reference entries are duplicates; and there is an average of 1.18 duplicates per duplicated reference.

Our goal is to identify entities that benefit the most from duplicated references. The entities we consider are either publishers, authors, articles, or journals. Let us introduce the following notations:

  • Let $D$ be the set of documents (limited to articles),

  • $A$ be the set of authors,

  • $P$ be the set of publishers,

  • and $J$ be the set of journals.

Authors are authoring documents that are published in journals and documents are referencing other documents. To denote this, we’ll use the following notations:

  • For $a \in A$ and $d \in D$, “$a \;\text{✑}\; d$” means: $a$ is an author of $d$.

  • For $j \in J$ and $d \in D$, “$d \;\text{\faHandshake[regular]}\; j$” means: $d$ is published in $j$.

  • For $(d_1, d_2) \in D^2$ and $n \in \mathbb{N}$,
    “$d_1 \xrightarrow{\;n\;} d_2$” means: $d_1$’s metadata contains exactly $n$ references to $d_2$.
    “$d_1 \xrightarrow{\;\;\;} d_2$” means: $d_1$’s metadata contains at least one reference to $d_2$.

We also consider the following sets:

  • $D_j$, the set of the documents published in journal $j \in J$: $D_j = \{ d \in D \mid d \;\text{\faHandshake[regular]}\; j \}$.

  • $R_d$, the set of the references of document $d \in D$: $R_d = \{ (d, r) \in D^2 \mid d \xrightarrow{\;\;\;} r \}$.

  • $R^+_d$, the set of references of document $d \in D$ that are duplicated 1+ times:

    $R^+_d = \{ (d, r) \in R_d \mid n \in \mathbb{N},\ n > 1,\ d \xrightarrow{\;n\;} r \}$. Note that $R^+_d \subseteq R_d$.

  • $C_d$, the set of the documents citing document $d \in D$: $C_d = \{ c \in D \mid c \xrightarrow{\;\;\;} d \}$.

To measure how much the metadata registered for a DOI contain duplicated references, or how much a paper benefits from duplicated references, the following measures are defined:

  • $\operatorname{NbRef}(d_1, d_2)$ denotes the number of times $d_1 \in D$ contains a reference to $d_2 \in D$.

  • $Benef^+(d) = \sum_{c \in C_d} (\operatorname{NbRef}(c, d) - 1)$ is the number of duplicated references benefiting $d$, and
    $Benef(d) = |\{ c \in C_d \mid \operatorname{NbRef}(c, d) > 1 \}|$ is the number of references to $d$ that are duplicated 1+ times, i.e., the number of citing documents that reference $d$ more than once.

  • $NbRefDup^+(d) = \sum_{(d, r) \in R^+_d} (\operatorname{NbRef}(d, r) - 1)$ is the number of duplicated references in the metadata registered for document $d$, and
    $NbRefDup(d) = |R^+_d|$ is the number of references duplicated 1+ times in the metadata registered for document $d$.
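The measures above can be checked on a toy citation graph. In this sketch the metadata are represented, by assumption, as a dictionary mapping each citing document to its list of cited documents (repetitions included); the Python function names simply mirror the notation and the identifiers are invented.

```python
from collections import Counter


def nb_ref(refs, d1, d2):
    """NbRef(d1, d2): number of times d1's metadata references d2."""
    return refs.get(d1, []).count(d2)


def benef_plus(refs, d):
    """Benef+(d): surplus references to d, summed over all citing documents."""
    return sum(max(0, nb_ref(refs, c, d) - 1) for c in refs)


def benef(refs, d):
    """Benef(d): number of citing documents that reference d more than once."""
    return sum(1 for c in refs if nb_ref(refs, c, d) > 1)


def nb_ref_dup_plus(refs, d):
    """NbRefDup+(d): surplus references in the metadata registered for d."""
    return sum(n - 1 for n in Counter(refs.get(d, [])).values() if n > 1)


def nb_ref_dup(refs, d):
    """NbRefDup(d): number of distinct references duplicated 1+ times in d."""
    return sum(1 for n in Counter(refs.get(d, [])).values() if n > 1)


# Hypothetical metadata: citing document -> cited documents, repeats kept.
refs = {
    "c1": ["x", "x", "x", "y"],
    "c2": ["x", "y", "y"],
    "c3": ["x"],
}
print(benef_plus(refs, "x"), benef(refs, "x"))
print(nb_ref_dup_plus(refs, "c1"), nb_ref_dup(refs, "c1"))
```

Here `x` is cited three times by `c1`, which yields two surplus references coming from a single citing document.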

DOI of the cited document $d$ | $Benef^+(d)$ | $Benef(d)$
10.17265/2159-5313/2016.09.003 | 10 994 | 6147
10.1109/geoinformatics.2015.7378602 | 2042 | 464
10.1038/scientificamerican0703-56 | 696 | 1
10.1089/glre.2016.201011 | 657 | 336
10.4064/fm-146-3-215-238 | 504 | 229
Table 2: Top five documents for $Benef^+(d)$, the number of duplicated references benefiting article $d$. $Benef(d)$ is the number of references to each document that are duplicated 1+ times.
DOI of the citing document $d$ | $NbRefDup^+(d)$ | $NbRefDup(d)$
10.1190/segam2016-full | 1029 | 485
10.2903/sp.efsa.2017.en-1246 | 1020 | 815
10.1190/segam2016-full2 | 919 | 470
10.14412/1995-4484-2020-191-197 | 863 | 62
10.4236/abb.2012.324065 | 696 | 1
Table 3: Top five documents for $NbRefDup^+(d)$, the number of duplicated references in the metadata registered for document $d$. $NbRefDup(d)$ is the number of references duplicated 1+ times in the metadata registered for document $d$.

Some articles ‘benefit’ from an impressive number of duplications. The top five are shown in Table 2. Quite interestingly, the landing page (using doi.org) for the first row is currently a generic error page. Table 3 provides the top five papers whose metadata contain the most duplicated references. Again, the landing page for the first DOI leads to a Page not Found error. Visual inspection of the Crossref records for these five DOIs reveals simple duplicated references without any obvious sneaked references.

To measure how many duplicated references a journal contains, the following measures are computed:

  • $JourDup^+(j) = \sum_{d \in D_j} \sum_{(d, r) \in R^+_d} (\operatorname{NbRef}(d, r) - 1)$, the number of duplicated references found in the metadata of journal $j$.

  • $JourDup(j) = |\{ d \in D_j \mid R^+_d \neq \emptyset \}|$, the number of documents registered for journal $j$ that contain at least one duplicated reference.
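The two journal-level measures aggregate the per-document counts. A minimal sketch on invented data follows; the dictionary layout and the function names are assumptions made for illustration.

```python
from collections import Counter


def duplicate_surplus(cited):
    """Surplus entries of one reference list: a reference repeated n > 1 times adds n - 1."""
    return sum(n - 1 for n in Counter(cited).values() if n > 1)


def jour_dup_plus(journal_docs, refs):
    """JourDup+(j): duplicated references summed over the documents of a journal."""
    return sum(duplicate_surplus(refs.get(d, [])) for d in journal_docs)


def jour_dup(journal_docs, refs):
    """JourDup(j): number of documents of the journal with at least one duplicate."""
    return sum(1 for d in journal_docs if duplicate_surplus(refs.get(d, [])) > 0)


# Hypothetical journal with three documents.
refs = {
    "d1": ["x", "x", "x", "y"],
    "d2": ["x", "y"],
    "d3": ["z", "z", "y", "y"],
}
journal_docs = ["d1", "d2", "d3"]

total = jour_dup_plus(journal_docs, refs)
affected_docs = jour_dup(journal_docs, refs)
ratio = total / affected_docs
print(total, affected_docs, ratio)
```

The ratio of the two measures gives the average number of duplicated references per affected document, which is the quantity reported in the last column of Table 4 (for instance, 22 286 / 603 ≈ 37.0).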

Journal $j$ | $JourDup^+(j)$ | $JourDup(j)$ | $JourDup^+(j)/JourDup(j)$
SSRN Electronic Journal | 110 390 | 40 082 | 2.8
Journal of Behavioral Addictions | 22 286 | 603 | 37.0
RSC Advances | 21 378 | 13 316 | 1.6
The Journal of Contemporary Dental Practice | 20 407 | 1731 | 11.8
International Journal of Sports Physiology and Performance | 19 351 | 1020 | 19.0
Health | 18 966 | 1412 | 13.4
Scientific Reports | 17 856 | 14 456 | 1.2
PLOS ONE | 17 541 | 13 216 | 1.3
American Journal of Plant Sciences | 16 453 | 1226 | 13.4
Creative Education | 16 355 | 863 | 19.0
Table 4: The ten journals that registered the most duplicated references with Crossref. $JourDup^+(j)$ is the total number of duplicated references registered by this journal; $JourDup(j)$ is the number of documents registered for journal $j$ containing at least one duplicated reference.

Table 4 lists the journals for which the highest number of duplicated references are found. Such journals push a great amount of duplicated reference metadata to Crossref. It is important to note that, despite the name, SSRN Electronic Journal is a platform for pre-prints. Its first position might be explained by the high number of papers the platform registers with Crossref. In second position, the Journal of Behavioral Addictions registered an average of 37 duplicated references per article across 603 articles. Again, close inspection of some examples reveals only simple duplicated references without any obvious sneaked references.

Previous work (Besançon et al., 2024) showed that sneaked references sometimes benefit particular authors. Thus, authors who benefit from duplicated references may also be the ones who benefit from sneaked references.

We denote by $s_{1a}(j, a)$ the proportion of references found in journal $j$ that cite a paper authored by $a$:

$$s_{1a}(j, a) = \frac{\sum_{d \in D_j} \left| \{ (d, r) \in R_d \mid a \;\text{✑}\; r \} \right|}{\sum_{d \in D_j} \left| R_d \right|}$$

$s_{1b}(j, a)$ refers to the proportion of duplicated references found in journal $j$ that cite a paper authored by $a$:

$$s_{1b}(j, a) = \frac{\sum_{d \in D_j} \left| \{ (d, r) \in R^+_d \mid a \;\text{✑}\; r \} \right|}{\sum_{d \in D_j} \left| R^+_d \right|}$$

Hypothesising that duplications occur randomly because they are mistakes, we should observe $s_{1a}(j, a) \sim s_{1b}(j, a)$. If $s_{1b}(j, a)$ is far greater than $s_{1a}(j, a)$, this means that duplicated references in journal $j$ benefit author $a$ in an abnormal proportion.

That is why we compute $s_1(j, a)$, an estimate of the number of duplicated references in journal $j$ that are, statistically speaking, unexpected for author $a$:

$$s_1(j, a) = \left( s_{1b}(j, a) - s_{1a}(j, a) \right) \cdot \sum_{d \in D_j} \left| \{ (d, r) \in R^+_d \mid a \;\text{✑}\; r \} \right|$$
Journal | Author | No. dupli. in journal to author | No. dupli. in journal | No. ref. from journal to author | No. ref. in journal | Score
Econometrics: Alchemy or Science? | david f hendry | 76 | 104 | 204 | 1391 | 44.4
Construction and Architecture | timofey krakhmalnyy | 35 | 49 | 58 | 1273 | 23.4
International Journal of Scientific Research in Science and Technology | harikriishna b jethva | 142 | 981 | 213 | 18 990 | 19.0
International Journal of Scientific Research in Science and Technology | bhavesh kataria | 142 | 981 | 242 | 18 990 | 18.7
International Journal of Laser Dentistry | a l mckenzie | 75 | 315 | 75 | 7653 | 17.1
Construction and Architecture | sergej evtushenko | 25 | 49 | 57 | 1273 | 11.6
International Journal on Disability and Human Development | daniel t l shek | 199 | 2650 | 283 | 10 303 | 9.5
International Journal on Applied Engineering and Management Letters | p s aithal | 54 | 182 | 551 | 4154 | 8.9
Berichte der deutschen chemischen Gesellschaft | h staudinger | 213 | 3901 | 1027 | 56 035 | 7.7
An Introduction to Community and Primary Health Care | elizabeth halcomb | 36 | 164 | 64 | 2224 | 6.9
Cambridge Handbook of Multimedia Learning | richard e mayer | 49 | 216 | 227 | 2455 | 6.6
Bears of the World | jon e swenson | 114 | 1071 | 249 | 4602 | 6.0
Table 5: Pairs of authors and journals sorted by $s_1(j, a)$ score. The columns reflect: name of journal, name of author, number of references that are duplicated 1+ times in the journal and benefit the author, number of references duplicated 1+ times in the journal, number of references (without duplications) benefiting the author in the journal, total number of references (without duplications) in the journal, and score.

Computing this score (Table 5) can reveal statistical anomalies that may (or may not) reflect citation gaming.
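The score can be recomputed from the four counts listed in each row of Table 5. The sketch below uses the sign convention $s_{1b} - s_{1a}$ and weights the difference by the number of duplicated references to the author, which is the combination that reproduces the scores reported in the table; the function and parameter names are hypothetical.

```python
def s1_score(dup_to_author, dup_in_journal, refs_to_author, refs_in_journal):
    """Estimate of the duplicated references that are unexpected for an author.

    s1a: share of the journal's references that cite the author.
    s1b: share of the journal's duplicated references that cite the author.
    """
    s1a = refs_to_author / refs_in_journal
    s1b = dup_to_author / dup_in_journal
    return (s1b - s1a) * dup_to_author


# Counts taken from the first, second and fifth rows of Table 5.
rows = [
    (76, 104, 204, 1391),
    (35, 49, 58, 1273),
    (75, 315, 75, 7653),
]
scores = [round(s1_score(*row), 1) for row in rows]
print(scores)
```

A score close to zero means that duplicated references cite the author about as often as ordinary references do, which is what random duplication errors would produce.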

The computed leaderboard features Harikrishna B. Jethva and Bhavesh Kataria together with the International Journal of Scientific Research in Science and Technology from the publisher Technoscience Academy. This is the case described in (Besançon et al., 2024). This result is consistent with our hypothesis that duplicated references might be correlated with sneaked references. At the time the Crossref snapshot was created, the metadata had not yet been corrected. Since then, at Crossref's request, the publisher has corrected the records and removed the sneaked references.

Some other authors in this list are suspected of manipulating their $h$-index, e.g., P. S. Aithal (footnote 15: https://www.researchgate.net/post/Excessiveself-citationinhisresearchpaperswhichhasartificiallyinflatedhisH-indexscore).

We did not check all the articles published by the journal–author pairs of this leaderboard, and further analysis might yield new results. Nevertheless, the pair International Journal of Laser Dentistry and A. L. McKenzie is of some interest: we did find sneaked references benefiting A. L. McKenzie's articles. For example, 10.5005/jp-journals-10022-1031 contains a duplicated sneaked reference to 10.1109/geoinformatics.2015.7378602. Deeper investigation suggests that these sneaked references do not result from intentional manipulation but are most probably the consequence of genuine errors. It seems that this journal always sent the same reference list in the metadata of all its articles, except that the first $n$ references of the list were replaced by the $n$ references of the article being published. Each registered list thus contains the expected references (found in the PDF file) but is always padded up to 155 items with the same set of sneaked references. This journal is no longer active, and for every published article $d$ the website (landing page) provides a list of exactly 155 references, which are the ones registered with Crossref.
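The padding pattern described above (lists of constant length whose trailing items are identical across articles) can be checked mechanically. The following is a minimal sketch, not the procedure used in this study: the function names `shared_tail` and `looks_padded`, the `min_tail` parameter, and the toy identifiers are hypothetical, and references are assumed to be already reduced to comparable strings such as DOIs.

```python
from typing import Dict, List


def shared_tail(ref_lists: List[List[str]]) -> List[str]:
    """Return the longest run of trailing references common to every list."""
    if not ref_lists:
        return []
    shortest = min(len(refs) for refs in ref_lists)
    tail: List[str] = []
    for offset in range(1, shortest + 1):
        # Collect the item found `offset` positions from the end of each list.
        candidates = {refs[-offset] for refs in ref_lists}
        if len(candidates) != 1:
            break
        tail.append(candidates.pop())
    tail.reverse()
    return tail


def looks_padded(registered: Dict[str, List[str]], min_tail: int = 1) -> bool:
    """Heuristic (assumption): all lists have the same length and share a tail."""
    lists = list(registered.values())
    same_length = len({len(refs) for refs in lists}) == 1
    return len(lists) > 1 and same_length and len(shared_tail(lists)) >= min_tail


# Toy data: a 6-item template whose first n items are overwritten per article.
registered = {
    "article-A": ["a1", "a2", "t3", "t4", "t5", "t6"],
    "article-B": ["b1", "b2", "b3", "b4", "t5", "t6"],
}
print(shared_tail(list(registered.values())))
print(looks_padded(registered))
```

A shared tail alone does not establish intent; as in the case above, it may equally point to a faulty deposit workflow, so flagged journals still call for manual inspection.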

4.2 Scaling the detection of sneaked references to the entire scientific literature

Hoping that a combination of methods might enable the detection and validation of sneaked references at scale, we compared the length of the reference list extracted from PDFs using Grobid ($\mathcal{M}_0$) with the number of references registered with Crossref. For articles published since 2000, the number of reference items identified by Grobid was compared to the number of references provided by Crossref. To account for a reported Grobid uncertainty rate of 0.05, we allowed for a margin of error, comparing the number of references identified by the full-text Grobid processing to 0.95 times the Crossref reference count. A total of 4,172,499 articles out of the 47,170,721 processed were found to have fewer references than 95 percent of the Crossref reference count.
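The thresholded comparison can be written down in a few lines. This is a minimal sketch under stated assumptions: the names `fewer_than_margin` and `flag_articles` are hypothetical, and the per-article counts are assumed to be available as (Grobid count, Crossref count) pairs. The last lines compute the share of flagged articles from the two totals reported above, which comes to roughly 8.8%.

```python
from typing import Dict, List, Tuple


def fewer_than_margin(grobid_count: int, crossref_count: int,
                      margin: float = 0.05) -> bool:
    """True when Grobid found fewer references than (1 - margin) x Crossref count."""
    return grobid_count < (1.0 - margin) * crossref_count


def flag_articles(counts: Dict[str, Tuple[int, int]],
                  margin: float = 0.05) -> List[str]:
    """Return identifiers whose (grobid_count, crossref_count) pair is flagged."""
    return [doi for doi, (g, c) in counts.items()
            if fewer_than_margin(g, c, margin)]


# Share of flagged articles, from the totals reported in the text.
flagged_share = 4_172_499 / 47_170_721
print(f"{flagged_share:.1%}")
```

A flag here only signals a length mismatch; it says nothing yet about which references are extra or why.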

To determine which publications had been added and which authors or journals had been inserted erroneously, we attempted to match the references extracted by Grobid with the identifiers supplied by Crossref. This approach turned out to be challenging, partly because of inconsistent reference formatting in the PDFs (many references lacked DOIs). It also exposed an error in the initial approach: references provided in supplementary attachments were not counted in the original Grobid-processed PDFs.

Although this attempt to systematically identify erroneous references was not effective, there remain a total of 1,564,408 publications with between 5 and 500 additional references reported to Crossref beyond those identified by Grobid processing. The possibility remains that this approach may work for a subset of the total dataset (see Section 2.6 for more details on the limitations of the precision of the extraction of references using Grobid).

5 Conclusions

We investigated three ways ($\mathcal{M}_0$, $\mathcal{M}_1$, and $\mathcal{M}_2$) to automatically identify sneaked references by comparing references registered with Crossref and the ones extracted from PDF files using the Grobid software.

The first one, $\mathcal{M}_0$ (Besançon et al., 2024), based on the direct comparison of list lengths, leads to an overestimation (7%) of the number of sneaked references.

The second one, $\mathcal{M}_1$, relies on the last items of each list (Section 2.2). It assumes that the order of the two lists is the same, which is quite a strong assumption. If $Last_C = Last_G$, we conclude that $\mathcal{R}_C$ is correct with regard to the PDF version. This is always the case in this specific dataset. However, some references could still have been sneaked in, which cannot be checked without a thorough manual analysis.
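The last-item test can be illustrated as follows. This is only a sketch of the idea: the actual matching rule is the one defined in Section 2.2, whereas here two items are assumed to match when their text is equal after a crude normalization. The helper names `normalize` and `last_items_agree` are hypothetical.

```python
import re
from typing import List, Optional


def normalize(ref: str) -> str:
    """Lowercase and collapse punctuation and whitespace (assumed matching rule)."""
    return re.sub(r"[^a-z0-9]+", " ", ref.lower()).strip()


def last_items_agree(crossref_refs: List[str],
                     grobid_refs: List[str]) -> Optional[bool]:
    """Compare the last registered item with the last extracted item.

    Returns None when either list is empty, since no comparison is possible.
    """
    if not crossref_refs or not grobid_refs:
        return None
    return normalize(crossref_refs[-1]) == normalize(grobid_refs[-1])


r_c = ["Hirsch, J. E. (2005). An index to quantify...", "Garfield, E. (1994). The impact factor."]
r_g = ["Hirsch J E 2005 An index to quantify", "Garfield E (1994) The impact factor"]
print(last_items_agree(r_c, r_g))
```

As noted above, agreement of the last items does not rule out references inserted earlier in the list, so a positive result is weaker evidence than a negative one.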

The third one, $\mathcal{M}_2$ (Section 2.3), seems to provide the most accurate results. It does not require any complex reference extraction from the PDF file.

The current state of full-text data prevents $\mathcal{M}_0$ from working at large scale. Because the comparison is computationally expensive, we tried to limit the search using a heuristic (reference duplication).

5.1 Corrections and updates to references

We identified a new set of sneaked references that benefit a single journal: IJISRT. The sneaked references are registered with Crossref along with the metadata for this journal. As a result, sneaked references inflate citation counts for this journal and for some of its articles.

Crossref provides infrastructure for registering metadata for scholarly works, including a DOI. To use this infrastructure, organisations join Crossref as members, taking on related obligations (footnote 16: https://www.crossref.org/membership/terms/). At the most basic level, Crossref members are responsible for depositing accurate metadata for each content item they produce.

When a serious issue with the metadata is detected, Crossref contacts the member to investigate the situation and work with them to rectify the problem where applicable. In some rare cases, the member’s access to register new scholarly works or update their existing records might be temporarily suspended or their membership permanently revoked.

The Crossref records for $\sim$2.7k DOIs from the International Journal of Innovative Science and Research Technology require corrections to remove the $\sim$81k sneaked references. In November 2024, Crossref contacted the member responsible for the International Journal of Innovative Science and Research Technology to ask for an explanation. Based on the member's replies to the enquiries, it was clear there was an intention to manipulate the citation record, and as such Crossref has started the process to revoke this organization's membership.

More information on Crossref's membership revocation process can be found on the Crossref website (footnote 17: https://www.crossref.org/operations-and-sustainability/membership-operations/revocation/). The most up-to-date list of revoked Crossref members is available online (footnote 18: https://docs.google.com/spreadsheets/d/1cCkdvtqEM1urmrUQZ4-LGzOmf5812aVkRFJc5UryHw/edit). This member will appear on the list if their revocation is ratified by the Crossref board.

Crossref encourages the community to report cases via the dedicated “metadata quality improvements” channel (footnote 19: https://community.crossref.org/c/tech-support/metadata-quality-improve/45) on its forum.

5.2 Implications

There are a number of possible sources of erroneous references, not all of which are nefarious. Identifying publications whose registered reference count fails to match the references actually listed in the references section (which may differ, in turn, from the in-text citations) is only the first, judgment-neutral step. Some patterns detectable in the erroneous references hint at their source.

For example, in one situation there were several publications with a valid list of references which seemed to have been written on top of a constant, longer list of references in identical order. The total number of references was constant, while the number of valid references varied. This situation suggests to us an error in pasting, where a shorter list of references was pasted over, rather than replacing, a longer list of references from an earlier metadata submission. In other cases, we found multiple identical references pasted at the end of the list of valid references. This behavior suggests a less technical source, and less benign intentions.

There are different methods for members to register scholarly metadata with Crossref. These range from plugins integrated into publishing platforms to registration forms with different metadata fields for members to fill in. XML files can also be sent directly using HTTPS POST. On the member side, during the publishing process, different actors might have different kinds of access to the metadata records, providing different kinds of opportunities to sneak references in or register erroneous metadata.

Sneaked references remain one of many ways to practice citation gaming (see Section 1). Such gaming will persist as long as academic value is tightly coupled to specific metrics.

Given the variety of reasons for erroneous references, there are multiple approaches that could be taken to improve the situation, which include improving the tools by which editors submit reference lists, the automated deduplication of reference lists after submission, and systematically cross-checking publications using Grobid in collections (proprietary or otherwise) which contain full-text PDFs. Deliberate efforts, which measure the rate of success of these approaches, are advisable.
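One of the approaches listed above, the automated deduplication of a deposited reference list, could look like the sketch below. It is an illustration under stated assumptions, not a description of any existing Crossref process: each reference is assumed to be a dictionary carrying an optional `DOI` field and an optional free-text `unstructured` field, and the function names `reference_key` and `deduplicate` are hypothetical.

```python
from typing import Dict, List, Tuple

Reference = Dict[str, str]


def reference_key(ref: Reference) -> str:
    """Build a comparison key: the DOI when present, else the normalized text."""
    doi = ref.get("DOI", "").strip().lower()
    if doi:
        return "doi:" + doi
    text = " ".join(ref.get("unstructured", "").lower().split())
    return "text:" + text if text else ""


def deduplicate(refs: List[Reference]) -> Tuple[List[Reference], List[Reference]]:
    """Split a list into first occurrences (order kept) and repeated items."""
    seen = set()
    unique: List[Reference] = []
    duplicates: List[Reference] = []
    for ref in refs:
        key = reference_key(ref)
        if key and key in seen:
            duplicates.append(ref)
        else:
            # References with no usable key are kept, never treated as repeats.
            if key:
                seen.add(key)
            unique.append(ref)
    return unique, duplicates


deposited = [
    {"DOI": "10.1002/asi.24896"},
    {"unstructured": "Garfield, E. (1994). The impact factor."},
    {"DOI": "10.1002/ASI.24896"},
    {"unstructured": "Garfield,  E. (1994). The impact factor."},
]
unique, duplicates = deduplicate(deposited)
print(len(unique), len(duplicates))
```

Reporting the duplicates rather than silently dropping them would preserve the signal used in Section 4.1, where duplication serves as a heuristic for locating sneaked references.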

5.3 Future work

Beyond this specific data set, the extent to which sneaked references are distorting citation counts is unknown. It might remain a very limited phenomenon but this needs to be verified by further investigations.

Identifying those journals or authors which have been most frequently associated with erroneous references, at scale, may allow us to identify the beneficiaries of sneaked references. This could act as a heuristic device to search for additional sneaked references (see Section 4.1), and to distinguish between erroneous and sneaked references.

Future work will attempt to identify erroneous references at a larger scale. We will continue to attempt to define patterns that can be used to flag sneaked references. This big data approach will help us determine whether the sneaked reference would have had a bibliometric effect, resulting in any increase in the Journal Impact Factor (or other journal-based metrics).

Data and code availability

Along with the source code implementing $\mathcal{M}_1$ and $\mathcal{M}_2$, we are releasing the dataset, which can be found at 10.5281/zenodo.14319568. The code can be found at 10.5281/zenodo.14291988.

Conflicts of interest

Two of the authors are employed by the providers of the data used in the analysis: Dimensions (KWB) and Crossref (DT). GC is an AE at JASIST.

Acknowledgement

CL and GC acknowledge the NanoBubbles project that has received Synergy grant funding from the European Research Council (ERC), within the European Union's Horizon 2020 program, grant agreement no. 951393, doi:10.3030/951393. LB was supported, in part, by the Knut and Alice Wallenberg Foundation (grant KAW 2019.0024). KWB would like to thank Balbir Thomas and Ruth Whittam for their participation in discovery and coding.

References

  • Beel, J., & Gipp, B. (2010). On the robustness of Google Scholar against spam. In HT’10: Proceedings of the 21st ACM conference on Hypertext and hypermedia (pp. 297–298). ACM. doi:10.1145/1810617.1810683
  • Besançon, L., Cabanac, G., Labbé, C., & Magazinov, A. (2024). Sneaked references: Fabricated reference metadata distort citation counts. Journal of the Association for Information Science and Technology, 75(12), 1368–1379. doi:10.1002/asi.24896
  • Biagioli, M., & Lippman, A. (Eds.). (2020). Gaming the metrics: Misconduct and manipulation in academic research. MIT Press.
  • Bode, C., Christian Herzog, R. M., Daniel Hook, & Wade, A. (2023). A Guide to the Dimensions Data Approach. Dimensions Report. doi:10.6084/m9.figshare.5783094
  • Cabanac, G., Labbé, C., & Magazinov, A. (2022, May). The ‘Problematic Paper Screener’ automatically selects suspect publications for post-publication (re)assessment. In 7th World Conference on Research Integrity (WCRI 2022). Cape Town, South Africa. https://hal.science/hal-03829578 (The theme of the conference is ‘Fostering Research Integrity in an Unequal World’.) doi:10.48550/arXiv.2210.04895
  • Davis, P. (2016, September 26). Visualizing Citation Cartels. https://wp.me/peaj1R-cdk (Scholarly Kitchen)
  • Foley, J. A., & Valkonen, L. (2012). Are higher cited papers accepted faster for publication? [Editorial]. Cortex, 48(6), 647–653. doi:10.1016/j.cortex.2012.03.018
  • Franck, G. (1999). Scientific Communication—A Vanity Fair? [Essays on Science and Society]. Science, 286(5437), 53–55. doi:10.1126/science.286.5437.53
  • Garfield, E. (1994, June 20). The impact factor. Current Contents, 25, 3-7.
  • GROBID. (2008–2023). https://github.com/kermitt2/grobid. GitHub.
  • Heathers, J. A., & Grimes, D. R. (2022). Impact Factor Manipulation — The Mechanics Behind A Precipitous Rise In Impact Factor: A Case Study From the British Journal of Sports Medicine. OSF preprint. doi:10.17605/osf.io/4c6xa
  • Hendricks, G., Tkaczyk, D., Lin, J., & Feeney, P. (2020). Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies, 1(1), 414–427. doi:10.1162/qssa00022
  • Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569-16572. https://www.pnas.org/doi/abs/10.1073/pnas.0507655102 doi:10.1073/pnas.0507655102
  • Kojaku, S., Livan, G., & Masuda, N. (2021). Detecting anomalous citation groups in journal networks. Scientific Reports, 11(1). doi:10.1038/s41598-021-93572-3
  • Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. ISSI Newsletter, 6(2), 48–52. https://www.issi-society.org/media/1126/newsletter22.pdf
  • Purkayastha, A., Palmaro, E., Falk-Krzesinski, H. J., & Baas, J. (2019). Comparison of two article-level, field-independent citation metrics: Field-Weighted Citation Impact (FWCI) and Relative Citation Ratio (RCR). Journal of Informetrics, 13(2), 635-642. https://www.sciencedirect.com/science/article/pii/S1751157718303559 https://doi.org/10.1016/j.joi.2019.03.012

6 Appendix

(a) Distribution of the number of references registered for a DOI with Crossref (length of $\mathcal{R}_C$).
(b) Distribution of the number of references extracted from the PDF (length of $\mathcal{R}_G$).
(c) Distribution of the raw differences $\mathcal{R}_C - \mathcal{R}_G$. Positive numbers are marked with the ghost symbol (\mathghost), negative ones with the skull-and-crossbones symbol (\faSkullCrossbones).