Reference Rot in Scholarly Communication
@mart1nkle1n
CNI Fall Meeting, 12/12/2016, Washington, DC
Reference Rot in Scholarly Communication:
A Reliable Quantification and
a Proposed Solution
Martin Klein
@mart1nkle1n
Research Library
Los Alamos National Laboratory
Acknowledgements:
Herbert Van de Sompel, Shawn Jones, Harihar Shankar (LANL)
Richard Tobin, Claire Grover (University of Edinburgh)
Andy Jackson (British Library)
Agenda
1. Definition: Reference Rot
2. Study: Quantification of Content Drift
3. Proposed Solution: Robust Links
http://dl00.org: 2000, 2004, 2005, 2008
http://hiberlink.org/
Definition:
• Link Rot + Content Drift = Reference Rot
Observation:
• Web at large resources are referenced in scholarly articles
• Links to these resources are subject to Reference Rot
Problem:
• Threat to the integrity of the web-based scholarly record
• Resources do not have the same sense of fixity as, e.g., journal articles
• Resources’ custodianship is different in terms of long-term archiving, integrity, and access
http://dx.doi.org/10.1371/journal.pone.0115253
Agenda
1. Definition: Reference Rot
2. Study: Quantification of Content Drift
3. Proposed Solution: Robust Links
http://dx.doi.org/10.1371/journal.pone.0167475
Study Dataset
• 3.5 million articles from arXiv, Elsevier, PMC
• Published between January 1997 and December 2012
• Converted from PDF to XML
• URIs to web at large resources extracted (>1 million)
• Articles’ publication dates recorded
Corpora
Novel Approach to Assess Content Drift
Step 1: Find Mementos
• ~ 1 million URI references
• ~ 650k Memento Pre/Post pairs discovered via Memento
https://mementoweb.org
https://tools.ietf.org/html/rfc7089
[Timeline: t-1, t, t+1]
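The Pre/Post pairing in Step 1 can be sketched as follows. This is only an illustration, not the study's code: it assumes the capture datetimes for a URI have already been retrieved (for example from a Memento TimeMap), and the function name `pick_pre_post` and the rule "Pre is strictly before the publication date, Post is at or after it" are assumptions made for the example.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Optional, Sequence, Tuple


def pick_pre_post(
    captures: Sequence[datetime], published: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (closest capture before, closest capture at or after) `published`.

    Either element is None when no capture exists on that side.
    """
    ordered = sorted(captures)
    idx = bisect_left(ordered, published)  # index of first capture >= published
    pre = ordered[idx - 1] if idx > 0 else None
    post = ordered[idx] if idx < len(ordered) else None
    return pre, post


# Hypothetical capture dates around a publication date of August 15th 2009.
captures = [
    datetime(2009, 8, 27),
    datetime(2008, 1, 1),
    datetime(2009, 5, 8),
    datetime(2010, 3, 1),
]
pre, post = pick_pre_post(captures, datetime(2009, 8, 15))
```

A URI contributes a Pre/Post pair only when both values are present.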
Step 2: Select Representative Mementos
Referenced in http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110
published on August 15th 2009
Captures: May 8th 2009 and August 27th 2009
Referenced in http://arxiv.org/abs/astro-ph/9707064
published on July 4th 1997
Captures: June 7th 1997 and today
Step 2: Select Representative Mementos
• Apply content similarity measures
• How similar is "representative"?
Content Similarity Measures
• Compute normalized scores (values between 0...100) for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
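Three of the four measures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's implementation: tokenization by lower-cased word characters, set-based Jaccard and Sørensen-Dice, and term-frequency cosine are choices made for the example, and Simhash is left out for brevity. All scores are scaled to 0 to 100.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lower-cased word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over token sets, scaled to 0..100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0  # two empty documents are treated as identical
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a: str, b: str) -> float:
    """2|A ∩ B| / (|A| + |B|) over token sets, scaled to 0..100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between term-frequency vectors, scaled to 0..100."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0
```

Identical token sequences score 100 on every measure, which is the "perfect score" condition used to select representative Mementos.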
Representative Mementos
• Idea
  • If perfect score in all 4 similarity measures
    → Memento Pre and Post are the same
    → Representative Mementos
• Sanity check needed
  • Via HTTP headers: ETag and Last-Modified
  • If same for Pre and Post Memento
    → HTTP-same
• Sanity check passed!
  • 98.88% of Memento pairs that are HTTP-same have a perfect score in all 4 similarity measures
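The two checks above can be written down as small predicates. This is a sketch under assumptions: scores are normalized to 0 to 100 and keyed by the hypothetical names in `MEASURES`, header names are lower-cased, and "HTTP-same" is taken to mean that both ETag and Last-Modified are present and equal (the slide does not say whether one matching header would suffice).

```python
from typing import Mapping

# Hypothetical keys for the four similarity measures.
MEASURES = ("simhash", "jaccard", "sorensen_dice", "cosine")


def is_representative(scores: Mapping[str, float]) -> bool:
    """True when all four normalized scores (0..100) are perfect."""
    return all(scores.get(m) == 100 for m in MEASURES)


def http_same(pre_headers: Mapping[str, str], post_headers: Mapping[str, str]) -> bool:
    """True when both captures carry identical ETag and Last-Modified values.

    Assumes header names have been lower-cased by the caller.
    """
    for name in ("etag", "last-modified"):
        pre = pre_headers.get(name)
        if pre is None or pre != post_headers.get(name):
            return False
    return True
```

The sanity check then amounts to counting how many `http_same` pairs also satisfy `is_representative`.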
Step 2: Select Representative Mementos
• ~ 313k referenced URIs have representative Mementos
Step 3: Dereference Live Web Version of URI
• 241k out of 313k URIs have a live web version
Step 4: Representative Memento vs. Live Version
• Apply content similarity measures
• Bin results into 6 clusters
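Binning can be sketched as below. The slide does not give the cluster boundaries, so the ones used here (a cluster for a perfect score plus five equal-width ranges) are purely hypothetical, as is the function name `cluster_label`; the input is assumed to be a single similarity score between 0 and 100.

```python
from collections import Counter


def cluster_label(score: float) -> str:
    """Map a 0..100 similarity score to one of six hypothetical clusters."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score == 100:
        return "100"  # no drift at all
    lower = int(score // 20) * 20  # 0, 20, 40, 60 or 80
    return f"[{lower}, {lower + 20})"


# Hypothetical scores, for illustration only.
histogram = Counter(cluster_label(s) for s in [100, 100, 99.9, 57, 12.5, 0])
```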
Aggregate Similarity Score
• Good: 23.7% of URIs have *not* drifted!
• Bad: 3/4 URIs *have* drifted!
Content Drift & Link Rot Over Time - arXiv
arXiv, Elsevier, PMC
Agenda
1. Definition: Reference Rot
2. Study: Quantification of Content Drift
3. Proposed Solution: Robust Links
Why don’t DOIs solve this problem?
• Designed to combat link rot of references to scholarly articles
• Rely fully on the custodians of DOI-identified resources to maintain links
  • These custodians have strong incentives to invest in link stability: the usability of their content is at stake
• Custodians of web at large resources do not have such incentives (typically web admins, not academic publishers)
  • Not overly concerned about the longevity of their website or of the scholarly record
• Reference rot is a problem largely rooted outside the scholarly communication community, but it will have to be solved by that community
Common Practice – original URI (& last accessed date)
Original URI
Accessed datetime
Robust Links
1. Create a snapshot of referenced resources in a public web
archive
Common Practice – Archived URI
URI of archived snapshot
Capture datetime
Robust Links
1. Create a snapshot of referenced resources in a publicly available web archive
2. Decorate links with:
• URI of archived snapshot
• datetime of archiving
• resource’s original URI
Benefits:
• Original URI allows finding captures in all web archives
• Capture datetime allows finding an appropriate capture in all web archives
• Uniform, machine-actionable
http://robustlinks.mementoweb.org/
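The second benefit relies on Memento datetime negotiation (RFC 7089), in which a client sends the desired datetime in an `Accept-Datetime` request header to a TimeGate for the original URI. The sketch below only builds that header; the function name is hypothetical, naive datetimes are assumed to be in UTC, and no particular TimeGate endpoint is implied.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def accept_datetime_header(capture: datetime) -> Dict[str, str]:
    """Build an Accept-Datetime request header from a capture datetime."""
    if capture.tzinfo is None:
        # Assumption for this sketch: naive datetimes are already in UTC.
        capture = capture.replace(tzinfo=timezone.utc)
    capture = capture.astimezone(timezone.utc)
    # usegmt=True produces the HTTP date format ending in "GMT".
    return {"Accept-Datetime": format_datetime(capture, usegmt=True)}


headers = accept_datetime_header(datetime(2016, 12, 11))
# headers == {"Accept-Datetime": "Sun, 11 Dec 2016 00:00:00 GMT"}
```

Because the header carries only a datetime and the request targets the original URI, the same information works against any Memento-compliant archive.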
Robust Links - Desired Practice
URI of archived snapshot
Capture datetime
Original URI
Link Decoration
<a href="https://www.cni.org/">CNI</a>
http://robustlinks.mementoweb.org/spec
Link Decoration
<a href="https://www.cni.org/"
data-versionurl="http://archive.is/fItfw"
data-versiondate="2016-12-11">
CNI</a>
http://robustlinks.mementoweb.org/spec
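A decorated link like the one above could be generated as follows. This is a minimal sketch: the helper name `robust_link` is hypothetical, and only the `data-versionurl` and `data-versiondate` attributes shown on the slide are emitted, with the original URI kept in `href`.

```python
from html import escape


def robust_link(original_uri: str, snapshot_uri: str, snapshot_date: str, text: str) -> str:
    """Return an <a> element decorated with the snapshot URI and capture date."""
    return (
        f'<a href="{escape(original_uri, quote=True)}" '
        f'data-versionurl="{escape(snapshot_uri, quote=True)}" '
        f'data-versiondate="{escape(snapshot_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


link = robust_link("https://www.cni.org/", "http://archive.is/fItfw", "2016-12-11", "CNI")
```

Keeping the original URI in `href` means the link degrades gracefully: software that ignores the `data-` attributes still sees an ordinary link.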
Link Decoration in Action
http://dx.doi.org/10.1045/november2015-vandesompel
Take-Aways
1. Scholarly articles increasingly contain URI references to web at
large resources.
2. Such resources are subject to reference rot (link rot + content drift).
3. Custodians of these resources are typically not overly concerned
with archiving of their content and longevity of the scholarly record.
4. Authors, publishers, and other parties can help tackle this problem
by making links more robust.


Editor's Notes

  • #17: Previously, archival status (14-day window) as proxy
  • #18: Previously, archival status (14-day window) as proxy
  • #20: IceCube Neutrino Observatory at the University of Wisconsin http://icecube.wisc.edu
  • #21: Institute for Astronomy at the University of Hawaii http://www.ifa.hawaii.edu/~cowie/k_table.html