Comparing Published Scientific
Journal Articles
to Their Pre-print Versions
Martin Klein Peter Broadwell
@mart1nkle1n @peterbroadwell
with Sharon E. Farb and Todd Grappone
@farbthink, @liber8er
{martinklein,broadwell,farb,grappone}@library.ucla.edu
University of California Los Angeles
Comparing Published Scientific Journal Articles
to Their Pre-print Versions
@mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016
Scientific Output in Numbers
Global STM publishing market > $25 billion
• 55% of this from USA
• 28% from Europe, Middle East
• Journals core part of scholarly communication process
• English language journal revenue: ~ $10 billion
• ~ 70% of that out of libraries’ budget
• > 28k scholarly peer-reviewed journals (+3.5% p.a.)
• ~ 2.5 million articles per year (+3% p.a.)
• 21% of research papers from USA
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
University of California Publication Impact
“Research Performance of the UC System,” Elsevier, March 2015
Open Access by Disciplines
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. 2010
http://dx.doi.org/10.1371/journal.pone.0011273
Open Access Rate Overall
2010
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al.
(http://dx.doi.org/10.1371/journal.pone.0011273)
→ 20.4% OA rate
2015
“Open Access and Sources of Full-Text Articles in Google Scholar in Different
Subject Fields”, Jamali and Nabavi
(http://dx.doi.org/10.1007/s11192-015-1642-2)
→ 61.1% OA rate
Pre-print v. Final Published
arXiv.org
• Average annual operating cost for 2013 - 2017:
$826,000
Final Published
• English language STM journals: $10 billion in 2013
http://arxiv.org/help/support/faq#3D
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
Role of Publisher
• Entrepreneur
• Copyediting
• Tagging
• Marketer
• Distributor
• E-Host
Value of Publisher
“Once you’ve gone through the peer review process, if you look
at the article that is actually published in a journal, it looks
radically different [to the one submitted due to] that process of
transformation, the copy-editing, the database linking, the data
visualisation tools, making sure that the metadata for the article
is all right, so when people come to [Elsevier database]
ScienceDirect or type a search into Google, they can actually
find what they are looking for on their platforms.”
Gemma Hersh
http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037
Working Assumptions
1. If the publishers’ argument is valid, the text of a
pre-print paper should vary significantly from its
corresponding post-print version.
2. By applying standard similarity measures, we
should be able to detect and quantify such
differences.
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• Metadata (typical DC, including DOI) obtained
via OAI-PMH interface
• PDF versions of articles available via Amazon’s
S3 service (using “requester pays” option)
Finding a matching post-print corpus
1. Extract DOIs from arXiv metadata
• 44.5% of articles have DOI
2. CrossRef’s Metadata Search API
• Match by DOI
• Download article & metadata in XML/PDF
→ Results in:
• 11,017 full text articles
• Majority published by Elsevier between 2003 and
2015
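The DOI extraction step could be sketched roughly as follows. This is only an illustration: it assumes that a DOI shows up in a dc:identifier or dc:relation element of the harvested Dublin Core record, possibly behind a "doi:" prefix or a resolver URL. The sample record, the extract_doi helper, and the regular expression are hypothetical and are not taken from the authors' pipeline.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of simple Dublin Core elements.
DC_NS = "{http://purl.org/dc/elements/1.1/}"
# Loose DOI pattern (an assumption): "10.", a registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(record_xml):
    """Return the first DOI found in a Dublin Core record, lowercased, or None."""
    root = ET.fromstring(record_xml)
    for elem in root.iter():
        if elem.tag in (DC_NS + "identifier", DC_NS + "relation") and elem.text:
            match = DOI_RE.search(elem.text)
            if match:
                # DOIs are case-insensitive, so lowercase them before matching.
                return match.group(0).rstrip(".,;").lower()
    return None


# Hypothetical record, made up for this example.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A made-up record</dc:title>
  <dc:identifier>http://arxiv.org/abs/0000.00000</dc:identifier>
  <dc:identifier>doi:10.1016/j.physletb.2006.10.068</dc:identifier>
</oai_dc:dc>"""

NO_DOI = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A made-up record without a DOI</dc:title>
</oai_dc:dc>"""

doi = extract_doi(SAMPLE)
missing = extract_doi(NO_DOI)
```

Normalizing the DOI to lowercase before lookup keeps the match step a plain string comparison; records without a DOI simply return None and drop out of the matched corpus.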
Text Comparison Methods
1. Length ratio
2. Levenshtein ratio
3. Cosine similarity
4. Jaccard coefficient
5. Sorensen similarity
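The five measures can be sketched in a few lines of Python. The slide does not give the exact definitions, so the ones below are assumptions: length ratio as shorter over longer character length, Levenshtein ratio as one minus the edit distance divided by the longer length, cosine similarity over word-count vectors, and Jaccard and Sorensen over sets of lowercased, whitespace-separated words. The example strings are made up.

```python
import math
from collections import Counter


def length_ratio(a, b):
    """Shorter length divided by longer length, in characters (assumed definition)."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def levenshtein_distance(a, b):
    """Dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a, b):
    """1 minus edit distance over the longer length (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def _words(text):
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    va, vb = Counter(_words(a)), Counter(_words(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Size of the word-set intersection over the size of the union."""
    sa, sb = set(_words(a)), set(_words(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sorensen(a, b):
    """Twice the word-set intersection over the sum of the set sizes."""
    sa, sb = set(_words(a)), set(_words(b))
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


# Made-up pre-print and final published titles.
pre = "a note on dark matter"
post = "a note on dark energy"
scores = {
    "length": length_ratio(pre, post),
    "levenshtein": levenshtein_ratio(pre, post),
    "cosine": cosine_similarity(pre, post),
    "jaccard": jaccard(pre, post),
    "sorensen": sorensen(pre, post),
}
```

All five scores fall between 0 and 1, with 1 meaning the two texts are indistinguishable under that measure, which matches the "1 = most similar" convention used in the charts.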
Comparison of Sections
“Analyzing News Events in Non-Traditional Digital Library Collections” M.Klein, P.Broadwell, 2015
http://dx.doi.org/10.1145/2756406.2756948
Title Comparison
[Figure: distribution of papers across title-similarity bins (1 ... 0.9 down to 0.1 ... 0; 1 = most similar) for the Length, Levenshtein, Cosine, Sorensen, and Jaccard measures. Axes: Papers, % of all papers.]
Explore our findings at http://sologlo.library.ucla.edu/prepost
Abstract Comparison
[Figure: distribution of papers across abstract-similarity bins (1 ... 0.9 down to 0.1 ... 0; 1 = most similar) for the Length, Levenshtein, Cosine, Sorensen, and Jaccard measures. Axes: Papers, % of all papers.]
10.1016/j.physletb.2006.10.068
Physics Letters B
Body Comparison
[Figure: distribution of papers across body-similarity bins (1 ... 0.9 down to 0.1 ... 0; 1 = most similar) for the Length, Levenshtein, Cosine, Sorensen, and Jaccard measures. Axes: Papers, % of all papers.]
Publication Dates
[Figure: number of papers by number of days between the pre-print and the final published version, in 90-day bins (1−90 through 631−720, then >720), split into "Pre-print first" and "Final published first".]
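The binning behind a chart like this can be sketched as follows. The 90-day bin width and the ">720" overflow bin come from the axis labels; the function names, the handling of same-day pairs, and the example dates are assumptions made for illustration.

```python
from datetime import date

BIN_WIDTH = 90        # days per bin, as on the chart axis
LAST_BIN_END = 720    # anything above this goes into the ">720" bin


def bin_label(days):
    """Map a positive day count to its 90-day bin label, e.g. 95 -> '91-180'."""
    if days > LAST_BIN_END:
        return ">720"
    low = ((days - 1) // BIN_WIDTH) * BIN_WIDTH + 1
    return f"{low}-{low + BIN_WIDTH - 1}"


def classify(preprint_date, published_date):
    """Return (which version appeared first, bin label) for one matched pair."""
    delta = (published_date - preprint_date).days
    if delta == 0:
        # Assumption: same-day pairs are kept out of both series.
        return ("same day", None)
    first = "pre-print first" if delta > 0 else "final published first"
    return (first, bin_label(abs(delta)))


# Hypothetical dates, 100 days apart.
example = classify(date(2006, 1, 1), date(2006, 4, 11))
reverse = classify(date(2006, 4, 11), date(2006, 1, 1))
```

Counting the resulting (series, bin) pairs over all matched papers, for instance with collections.Counter, would give the two bar series.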
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• metadata (typical DC, including DOI) obtained
via OAI-PMH interface
• PDF versions of articles available via Amazon’s
S3 service (using “requester pays” option)
• *Latest version used if multiple available*
• 35% of all arXiv papers have > 1 version
• 58% of our matched papers have > 1 version
• Repeat experiment with *earliest version*
Publication Dates of Earliest Versions
[Figure: number of papers by number of days between the earliest pre-print version and the final published version, in 90-day bins (1−90 through 631−720, then >720), split into "Pre-print first" and "Final published first".]
Title Deltas
[Figure: title deltas per similarity bin (1 ... 0.9 down to 0.1 ... 0) for the Length, Levenshtein, Cosine, Sorensen, and Jaccard measures. Axes: Papers, % of all papers.]
Abstract Deltas
[Figure: abstract deltas per similarity bin (1 ... 0.9 down to 0.1 ... 0) for the Length, Levenshtein, Cosine, Sorensen, and Jaccard measures. Axes: Papers, % of all papers.]
Body Deltas
[Figure: body deltas per similarity bin (1 ... 0.9 down to 0.1 ... 0) for the Length, Levenshtein, Cosine, Sorensen, and Jaccard measures. Axes: Papers, % of all papers.]
Discussion & Future Work
• Single corpus experiment
• Pre-print/final published matches based on:
• DOIs
• CrossRef API results
• UCLA serial subscriptions (majority Elsevier
publications)
• Expand to other disciplines/publishers
• Overlay with ISI Impact factor and usage statistics
• Refine extraction/comparison of authors and
references
• Operate at scale
Comparing Published Scientific
Journal Articles
to Their Pre-print Versions
Martin Klein Peter Broadwell
@mart1nkle1n @peterbroadwell
with Sharon E. Farb and Todd Grappone
@farbthink, @liber8er
{martinklein,broadwell,farb,grappone}@library.ucla.edu
University of California Los Angeles

More Related Content

What's hot (20)

PPTX
A replication crisis in the making: how we reward unreliable science
Björn Brembs
 
PPT
Bibliosight Project - JournalTOCs Workshop
azami
 
PPTX
Why canceling subscriptions may just yet save scholarship
Björn Brembs
 
PPT
ER&L KBART Update
Jason Price, PhD
 
PPTX
Creating Pockets of Persistence
Herbert Van de Sompel
 
PPTX
Forging New Links: Libraries in the Semantic Web
Gillian Byrne
 
PPT
RPI Research in Linked Open Government Systems
James Hendler
 
PPTX
Hiberlink: Investigating Reference Rot, December 2013
Herbert Van de Sompel
 
PDF
Semantic Web Applications in Libraries: The Road to BIBFRAME
National Information Standards Organization (NISO)
 
PPTX
BIBFRAME : the future of cataloguing?
Thomas Meehan
 
PPTX
MLA CE Course: Third-Party PubMed Tools
National Network of Libraries of Medicine, Pacific Northwest Region
 
PDF
Presentation1
brigoliphhoebelyn1
 
PPTX
Open Access NBIC Workshop April 19, 2011
Philip Bourne
 
PDF
How to build your own citation index
GESIS
 
PPT
Linked Open Data for Libraries
Lukas Koster
 
PPT
Federated Search Falls Short
slknight
 
PDF
Giving researchers credit for data
Jisc
 
PDF
Crossref webinar - Maintaining your metadata - latest
Crossref
 
A replication crisis in the making: how we reward unreliable science
Björn Brembs
 
Bibliosight Project - JournalTOCs Workshop
azami
 
Why canceling subscriptions may just yet save scholarship
Björn Brembs
 
ER&L KBART Update
Jason Price, PhD
 
Creating Pockets of Persistence
Herbert Van de Sompel
 
Forging New Links: Libraries in the Semantic Web
Gillian Byrne
 
RPI Research in Linked Open Government Systems
James Hendler
 
Hiberlink: Investigating Reference Rot, December 2013
Herbert Van de Sompel
 
Semantic Web Applications in Libraries: The Road to BIBFRAME
National Information Standards Organization (NISO)
 
BIBFRAME : the future of cataloguing?
Thomas Meehan
 
Presentation1
brigoliphhoebelyn1
 
Open Access NBIC Workshop April 19, 2011
Philip Bourne
 
How to build your own citation index
GESIS
 
Linked Open Data for Libraries
Lukas Koster
 
Federated Search Falls Short
slknight
 
Giving researchers credit for data
Jisc
 
Crossref webinar - Maintaining your metadata - latest
Crossref
 

Viewers also liked (7)

PPTX
Jason chinchilla
Jason Paz
 
PPTX
Companies that produce & distribute rn b genre
fahrinsultana
 
PDF
Ood启思录01
yiditushe
 
PPTX
Carol vernallis theory
fahrinsultana
 
PPTX
About Webtechnologies
BinTech Services
 
PDF
Interrogating the Politics and Performativity of Web Archiving
Jessica Ogden
 
Jason chinchilla
Jason Paz
 
Companies that produce & distribute rn b genre
fahrinsultana
 
Ood启思录01
yiditushe
 
Carol vernallis theory
fahrinsultana
 
About Webtechnologies
BinTech Services
 
Interrogating the Politics and Performativity of Web Archiving
Jessica Ogden
 
Ad

Similar to Comparing Published Scientific Journal Articles to Their Pre-print Versions (20)

PDF
Preprints: a journey though time
Graham Steel
 
PPTX
Publishing and impact Wageningen University IL for PhD 20141202
Hugo Besemer
 
PPTX
British Library
clarivate
 
PDF
A Science Mapping Analysis Of Blood Donation Behaviour
Bria Davis
 
PPT
Author workshop TU Delft 20111122
Anke Versteeg
 
PDF
STRETCHING THE BOUNDARIES OF PUBLISHING: ALTERNATIVES
Nicolaie Constantinescu
 
PDF
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
PPTX
Publish be cited, or perish
Wouter Gerritsma
 
PPTX
SciVerse @ TJU
rachelmccullough
 
PPT
Peer Review and Science2.0
Jean-Claude Bradley
 
PPT
The future of scholarly publishing: where do we go from here?
Research Information Network
 
PPTX
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
PPT
Stevan Harnad - Scholarly/Scientific Impact Metrics in the Open Access Era
Conferência Luso-Brasileira de Ciência Aberta
 
PPT
Open Access Publishing: More Readers, More Impact
Western New York/Ontario Chapter Association of College and Research Libraries
 
PPTX
Where to publish_130709
opl10
 
PDF
The Initiative for Open Citations and the OpenCitations Corpus
University of Bologna
 
PPTX
Publishing and impact 20141028
Hugo Besemer
 
PDF
Science in the context of journals, Open, and the future
Benjamin Laken
 
PPTX
Holy Cross Lunch and Learn
rachelmccullough
 
Preprints: a journey though time
Graham Steel
 
Publishing and impact Wageningen University IL for PhD 20141202
Hugo Besemer
 
British Library
clarivate
 
A Science Mapping Analysis Of Blood Donation Behaviour
Bria Davis
 
Author workshop TU Delft 20111122
Anke Versteeg
 
STRETCHING THE BOUNDARIES OF PUBLISHING: ALTERNATIVES
Nicolaie Constantinescu
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
Publish be cited, or perish
Wouter Gerritsma
 
SciVerse @ TJU
rachelmccullough
 
Peer Review and Science2.0
Jean-Claude Bradley
 
The future of scholarly publishing: where do we go from here?
Research Information Network
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
Stevan Harnad - Scholarly/Scientific Impact Metrics in the Open Access Era
Conferência Luso-Brasileira de Ciência Aberta
 
Where to publish_130709
opl10
 
The Initiative for Open Citations and the OpenCitations Corpus
University of Bologna
 
Publishing and impact 20141028
Hugo Besemer
 
Science in the context of journals, Open, and the future
Benjamin Laken
 
Holy Cross Lunch and Learn
rachelmccullough
 
Ad

More from Martin Klein (20)

PPTX
On the Persistence of Persistent Identifiers of the Scholarly Web
Martin Klein
 
PPTX
On the Persistence of Persistent Identifiers of the Scholarly Web
Martin Klein
 
PPTX
An Institutional Perspective to Rescue Scholarly Orphans
Martin Klein
 
PPTX
Who is Asking - Humans and Machines Experience a Different Scholarly Web
Martin Klein
 
PPTX
The Memento Tracer Framework: Balancing Quality and Scalability for Web Arch...
Martin Klein
 
PPTX
Memento Tracer An Innovative Approach Towards Balancing Scale and Fidelity f...
Martin Klein
 
PPTX
Comparing the Performance of OAI-PMH with ResourceSync
Martin Klein
 
PPTX
Evaluating Memento Service Optimizations
Martin Klein
 
PPTX
An Institutional Perspective to Rescue Scholarly Orphans
Martin Klein
 
PPTX
A Vision of the Library’s Role in Archiving Scholarly Artifacts
Martin Klein
 
PPTX
First Steps in Research Data Management Under Constraints of a National Secur...
Martin Klein
 
PPTX
Smart Routing of Memento Requests
Martin Klein
 
PPTX
Building Event Collections from Crawling Web Archives
Martin Klein
 
PPTX
A Web-Centric Pipeline for Archiving Scholarly Artifacts
Martin Klein
 
PPTX
Focused Crawl of Web Archives to Build Event Collections
Martin Klein
 
PPTX
Creating Topical Collections: Web Archives vs. Live Web
Martin Klein
 
PPTX
Robust Linking to Web Resources
Martin Klein
 
PPTX
Signposting for Repositories
Martin Klein
 
PPTX
Discovering Scholarly Orphans Using ORCID
Martin Klein
 
PPTX
Using the Memento Framework to Assess Content Drift in Scholarly Communication
Martin Klein
 
On the Persistence of Persistent Identifiers of the Scholarly Web
Martin Klein
 
On the Persistence of Persistent Identifiers of the Scholarly Web
Martin Klein
 
An Institutional Perspective to Rescue Scholarly Orphans
Martin Klein
 
Who is Asking - Humans and Machines Experience a Different Scholarly Web
Martin Klein
 
The Memento Tracer Framework: Balancing Quality and Scalability for Web Arch...
Martin Klein
 
Memento Tracer An Innovative Approach Towards Balancing Scale and Fidelity f...
Martin Klein
 
Comparing the Performance of OAI-PMH with ResourceSync
Martin Klein
 
Evaluating Memento Service Optimizations
Martin Klein
 
An Institutional Perspective to Rescue Scholarly Orphans
Martin Klein
 
A Vision of the Library’s Role in Archiving Scholarly Artifacts
Martin Klein
 
First Steps in Research Data Management Under Constraints of a National Secur...
Martin Klein
 
Smart Routing of Memento Requests
Martin Klein
 
Building Event Collections from Crawling Web Archives
Martin Klein
 
A Web-Centric Pipeline for Archiving Scholarly Artifacts
Martin Klein
 
Focused Crawl of Web Archives to Build Event Collections
Martin Klein
 
Creating Topical Collections: Web Archives vs. Live Web
Martin Klein
 
Robust Linking to Web Resources
Martin Klein
 
Signposting for Repositories
Martin Klein
 
Discovering Scholarly Orphans Using ORCID
Martin Klein
 
Using the Memento Framework to Assess Content Drift in Scholarly Communication
Martin Klein
 

Recently uploaded (20)

PPTX
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PDF
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
PPTX
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
PPTX
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
PPTX
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
PDF
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
PDF
McKinsey - Global Energy Perspective 2023_11.pdf
niyudha
 
PPTX
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PDF
apidays Munich 2025 - The Double Life of the API Product Manager, Emmanuel Pa...
apidays
 
PDF
Classifcation using Machine Learning and deep learning
bhaveshagrawal35
 
PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PDF
Top Civil Engineer Canada Services111111
nengineeringfirms
 
PDF
apidays Munich 2025 - Making Sense of AI-Ready APIs in a Buzzword World, Andr...
apidays
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
McKinsey - Global Energy Perspective 2023_11.pdf
niyudha
 
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
apidays Munich 2025 - The Double Life of the API Product Manager, Emmanuel Pa...
apidays
 
Classifcation using Machine Learning and deep learning
bhaveshagrawal35
 
Introduction to Data Analytics and Data Science
KavithaCIT
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
Top Civil Engineer Canada Services111111
nengineeringfirms
 
apidays Munich 2025 - Making Sense of AI-Ready APIs in a Buzzword World, Andr...
apidays
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 

Comparing Published Scientific Journal Articles to Their Pre-print Versions

  • 1. Comparing Published Scientific Journal Articles to Their Pre-print Versions Martin Klein Peter Broadwell @mart1nkle1n @peterbroadwell with Sharon E. Farb and Todd Grappone @farbthink, @liber8er {martinklein,broadwell,farb,grappone}@library.ucla.edu University of California Los Angeles
  • 2. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 2 Scientific Output in Numbers Global STM publishing market > $25 billion • 55% of this from USA • 28% from Europe, Middle East • Journals core part of scholarly communication process • English language journal revenue: ~ $10 billion • ~ 70% of that out of libraries’ budget • > 28k scholarly peer-reviewed journals (+3.5% p.a.) • ~ 2.5 million articles per year (+3% p.a.) • 21% of research papers from USA “STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
  • 3. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 3 University of California Publication Impact “Research Performance of the UC System,” Elsevier, March 2015
  • 4. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 4 Open Access by Disciplines “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. 2010 http://dx.doi.org/10.1371/journal.pone.0011273
  • 5. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 5 Open Access Rate Overall 2010 “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)
  • 6. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 6 Open Access Rate Overall 2010 “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)  20.4% OA rate
  • 7. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 7 Open Access Rate Overall 2010 “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)  20.4% OA rate 2015 “Open Access and Sources of Full-Text Articles in Google Scholar in Different Subject Fields”, Hammid et al. (http://dx.doi.org/10.1007/s11192-015-1642-2)
  • 8. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 8 Open Access Rate Overall 2010 “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)  20.4% OA rate 2015 “Open Access and Sources of Full-Text Articles in Google Scholar in Different Subject Fields”, Hammid et al. (http://dx.doi.org/10.1007/s11192-015-1642-2)  61.1% OA rate
  • 9. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 9 Pre-print v. Final Published arXiv.org • Average annual operating cost for 2013 - 2017: $826,000 Final Published • English language STM journals: $10 billion in 2013 http://arxiv.org/help/support/faq#3D “STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
  • 10. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 10 Role of Publisher • Entrepreneur • Copyediting • Tagging • Marketer • Distributor • E-Host
  • 11. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 11 Value of Publisher “Once you’ve gone through the peer review process, if you look at the article that is actually published in a journal, it looks radically different [to the one submitted due to] that process of transformation, the copy-editing, the database linking, the data visualisation tools, making sure that the metadata for the article is all right, so when people come to [Elsevier database] ScienceDirect or type a search into Google, they can actually find what they are looking for on their platforms.” Gemma Hersh http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037
  • 12. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 12 Working Assumptions 1. If the publishers’ argument is valid, the text of a pre-print paper should vary significantly from its corresponding post-print version. 1. By applying standard similarity measures, we should be able to detect and quantify such differences.
  • 13. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 13 Assembling a pre-print corpus Source: arXiv.org • 1.1 million publication records • Metadata (typical DC, including DOI) obtained via OAI-PMH interface • PDF versions of articles available via Amazon’s S3 service (using “requester pays” option)
  • 14. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 14 Finding a matching post-print corpus 1. Extract DOIs from arXiv metadata • 44.5% or articles have DOI 2. CrossRef’s Metadata Search API • Match by DOI • Download article & metadata in XML/PDF  Results in: • 11,017 full text articles • Majority published by Elsevier between 2003 and 2015
  • 15. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 15 Text Comparison Methods 1. Length ratio 2. Levenshtein ratio 3. Cosine similarity 4. Jaccard coefficient 5. Sorensen similarity
  • 16. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 16 Comparison of Sections “Analyzing News Events in Non-Traditional Digital Library Collections” M.Klein, P.Broadwell, 2015 http://dx.doi.org/10.1145/2756406.2756948
  • 17. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 17 Comparison of Sections
  • 18. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 18 Title Comparison Explore our findings at http://sologlo.library.ucla.edu/prepost Papers Similarity (1 = most similar) %ofallpapers 1 ... 0.9 0.9 ... 0.8 0.8 ... 0.7 0.7 ... 0.6 0.6 ... 0.5 0.5 ... 0.4 0.4 ... 0.3 0.3 ... 0.2 0.2 ... 0.1 0.1 ... 0 1100020003000400050006000700080009000 0102030405060708090100 Length Levenshtein Cosine Sorensen Jaccard Percentage
  • 19. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 19 Comparison of Sections
  • 20. Comparing Published Scientific Journal Articles to Their Pre-print Versions @mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016 20 Abstract Comparison Papers Similarity (1 = most similar) %ofallpapers 1 ... 0.9 0.9 ... 0.8 0.8 ... 0.7 0.7 ... 0.6 0.6 ... 0.5 0.5 ... 0.4 0.4 ... 0.3 0.3 ... 0.2 0.2 ... 0.1 0.1 ... 0 1100020003000400050006000700080009000 0102030405060708090100 Length Levenshtein Cosine Sorensen Jaccard Percentage Explore our findings at http://sologlo.library.ucla.edu/prepost
  • 21. 10.1016/j.physletb.2006.10.068, Physics Letters B
  • 22. Comparison of Sections
  • 23. Body Comparison. [Figure: axes “Papers”, “Similarity (1 = most similar)”, and “% of all papers”; similarity bins from 1 ... 0.9 down to 0.1 ... 0; series: Length, Levenshtein, Cosine, Sorensen, Jaccard, Percentage.] Explore our findings at http://sologlo.library.ucla.edu/prepost
  • 24. Publication Dates. [Figure: “Papers” by “Number of days”, in bins 1−90, 91−180, 181−270, 271−360, 361−450, 451−540, 541−630, 631−720, >720; series: “Pre−print first” and “Final published first”.]
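The binning behind this chart can be sketched as below. It is an illustrative reconstruction only: the slide does not say which dates were compared, so treating each paper as a pair of (pre-print date, final publication date) is an assumption, as is skipping same-day pairs (the chart shows no bin for zero days). The function names are hypothetical, and bin labels use a plain hyphen.

```python
from collections import Counter
from datetime import date


def bin_label(days: int) -> str:
    """Map a positive day count to a 90-day bin label such as '1-90' or '>720'."""
    if days > 720:
        return ">720"
    low = ((days - 1) // 90) * 90 + 1
    return f"{low}-{low + 89}"


def date_histogram(pairs):
    """Count papers per (series, bin) for (preprint_date, published_date) pairs."""
    counts = Counter()
    for preprint_date, published_date in pairs:
        delta = (published_date - preprint_date).days
        if delta == 0:
            continue  # assumption: same-day pairs fall outside the charted bins
        series = "Pre-print first" if delta > 0 else "Final published first"
        counts[(series, bin_label(abs(delta)))] += 1
    return counts


# Hypothetical example dates, not taken from the study.
example = date_histogram([
    (date(2015, 1, 1), date(2015, 4, 1)),   # pre-print 90 days earlier
    (date(2015, 6, 1), date(2015, 5, 1)),   # final version 31 days earlier
    (date(2012, 1, 1), date(2015, 1, 1)),   # pre-print more than 720 days earlier
])
```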
  • 25. Assembling a pre-print corpus. Source: arXiv.org • 1.1 million publication records • metadata (typical DC, including DOI) obtained via OAI-PMH interface • PDF versions of articles available via Amazon’s S3 service (using “requester pays” option) • *Latest version used if multiple available* • 35% of all arXiv papers have > 1 version • 58% of our matched papers have > 1 version • Repeat experiment with *earliest version*
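The metadata step on this slide, harvesting Dublin Core records over OAI-PMH and reading the DOI from them, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' harvester: the endpoint `https://example.org/oai2` and the sample record are hypothetical, the function names are illustrative, and the DOI is assumed to appear inside a `dc:identifier` element in a form that the regular expression below can recognize. Only the `ListRecords` verb and the `metadataPrefix` and `resumptionToken` arguments come from the OAI-PMH protocol itself.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Assumed DOI shape: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def extract_records(xml_text):
    """Return one dict per record with its OAI identifier, title, and DOI (or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        doi = None
        for ident in record.iterfind(".//dc:identifier", NS):
            match = DOI_PATTERN.search(ident.text or "")
            if match:
                doi = match.group(0)
                break
        records.append({"id": oai_id, "title": title.strip(), "doi": doi})
    return records


# Hypothetical response fragment used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical pre-print</dc:title>
          <dc:identifier>http://example.org/abs/0001</dc:identifier>
          <dc:identifier>doi:10.1234/example.5678</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

parsed = extract_records(SAMPLE)
```

Records that come back with a DOI can then be paired with the corresponding published article, while records without one would need another matching route.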
  • 26. Publication Dates of Earliest Versions. [Figure: “Papers” by “Number of days”, in bins 1−90, 91−180, 181−270, 271−360, 361−450, 451−540, 541−630, 631−720, >720; series: “Pre−print first” and “Final published first”.]
  • 27. Title Deltas. [Figure: axes “Papers” and “% of all papers”; bins from 1 ... 0.9 down to 0.1 ... 0; series: Length, Levenshtein, Cosine, Sorensen, Jaccard, Percentage.]
  • 28. Title Deltas. [Figure: axes “Papers” and “% of all papers”; bins from 1 ... 0.9 down to 0.1 ... 0; series: Length, Levenshtein, Cosine, Sorensen, Jaccard, Percentage.]
  • 29. Title Deltas. [Figure: axes “Papers” and “% of all papers”; bins from 1 ... 0.9 down to 0.1 ... 0; series: Length, Levenshtein, Cosine, Sorensen, Jaccard, Percentage.]
  • 30. Abstract Deltas. [Figure: axes “Papers” and “% of all papers”; bins from 1 ... 0.9 down to 0.1 ... 0; series: Length, Levenshtein, Cosine, Sorensen, Jaccard, Percentage.]
  • 31. Body Deltas. [Figure: axes “Papers” and “% of all papers”; bins from 1 ... 0.9 down to 0.1 ... 0; series: Length, Levenshtein, Cosine, Sorensen, Jaccard, Percentage.]
  • 32. Discussion & Future Work • Single corpus experiment • Pre-print/final published matches based on: • DOIs • CrossRef API results • UCLA serial subscriptions (majority Elsevier publications) • Expand to other disciplines/publishers • Overlay with ISI Impact factor and usage statistics • Refine extraction/comparison of authors and references • Operate at scale
  • 33. Comparing Published Scientific Journal Articles to Their Pre-print Versions Martin Klein Peter Broadwell @mart1nkle1n @peterbroadwell with Sharon E. Farb and Todd Grappone @farbthink, @liber8er {martinklein,broadwell,farb,grappone}@library.ucla.edu University of California Los Angeles