Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
@mart1nkle1n
TPDL, Oslo, Norway, September 10 2019
Martin Klein
Los Alamos National Laboratory
martinklein0815@gmail.com
@mart1nkle1n
with
Harihar Shankar (98point6)
Lyudmila Balakireva (LANL)
Herbert Van de Sompel (DANS)
The Memento Tracer Framework:
Balancing Quality and Scalability
for Web Archiving
A major challenge in web archiving:
Scale vs. Quality
IA’s Scale!
https://twitter.com/brewster_kahle/status/1016003169589981184
IA’s Scale!!
https://twitter.com/brewster_kahle/status/1118172506777509890
IA’s Scale!!!
https://twitter.com/brewster_kahle/status/1139700494748663809
IA’s Scale!!!!
https://twitter.com/brewster_kahle/status/1170820482104348672
Fidelity?
http://web.archive.org/web/*/http://cnn.com
Fidelity?
http://web.archive.org/web/20190808041346/https://www.cnn.com/
Fidelity?
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
Webrecorder’s Fidelity!
https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/
Webrecorder’s Fidelity!!
https://twitter.com/ianmilligan1/status/1136703505442324481
https://twitter.com/MellonFdn/status/1138811967060267011
Webrecorder’s Scale?
https://twitter.com/mart1nkle1n/status/1136705116738904067
Scale vs. Quality
• Crawler-based approaches scale well
• Crawling quality is not always as desired
• Human-driven approaches often result in great quality
• Not necessarily designed for (web) scale
Memento Tracer
http://tracer.mementoweb.org
Memento Tracer Framework
http://tracer.mementoweb.org
Inspired by:
• LOCKSS
  • Same automated approach for resources of a class
• Webrecorder
  • Manual recording of web resources
• Various attempts aimed at automating interactions/behaviors
  • E.g., Brozzler, Browsertrix
Memento Tracer Implementation
• Client-side:
  • Tracer Chrome extension leveraging Selenium IDE
  • JSON-formatted Trace for download
• Server-side:
  • StormCrawler
  • Selenium (Chrome) with Tracer plug-in
  • WarcProxy
  • File-system storage for WARC files
http://stormcrawler.net/
https://www.seleniumhq.org/projects/webdriver/
https://github.com/odie5533/WarcProxy
Tracer Interactions
https://github.com/mementoweb/memento_extensions
https://www.slideshare.net/martinklein0815/evaluating-memento-service-optimizations
Current Memento Tracer Capabilities
• Single clicks/links
• All links in an area
• Repeated click on links, with stop condition
  • Slides
  • Pagination
• Nested traces, i.e., “trace in a trace”
  • Trace for portal A → follow link to portal B → execute trace for portal B
• Identification of page/portal for which a trace exists by URI (pattern)
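The "repeated click with stop condition" capability can be pictured with a small simulation. This is a sketch only, not Tracer's code: `FakePager`, `click_next`, and `replay_repeated_click` are hypothetical names, and an in-memory object stands in for the browser that the real framework drives through Selenium.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakePager:
    """Hypothetical stand-in for a paginated page (e.g. a slide deck)."""
    pages: List[str]
    index: int = 0
    requested: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Loading the first page counts as a request.
        self.requested.append(self.pages[self.index])

    def has_next(self) -> bool:
        return self.index < len(self.pages) - 1

    def click_next(self) -> None:
        self.index += 1
        self.requested.append(self.pages[self.index])


def replay_repeated_click(pager: FakePager, max_clicks: int = 1000) -> List[str]:
    """Click "next" until the stop condition holds.

    Stop condition assumed here: no further "next" target is available,
    or a safety cap on the number of clicks has been reached.
    """
    clicks = 0
    while pager.has_next() and clicks < max_clicks:
        pager.click_next()
        clicks += 1
    return pager.requested


deck = FakePager(pages=[f"https://example.org/deck/slide-{n}" for n in range(1, 6)])
captured = replay_repeated_click(deck)
print(len(captured))  # 5
```

The cap on clicks is an added assumption; it keeps the loop bounded on a page whose "next" control never disappears.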
Memento Tracer Benefits
• Scalability
  • Trace created once is applicable to all web resources of the same class
  • Traces shared via repository (edits, versioning)
• Quality
  • Trace used as set of instructions for browser-based capture framework
  • Resource boundary explicit
• Tradeoff
  • Quality vs. performance
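The scalability point, that one trace serves every resource of the same class, rests on matching a URI against a pattern. A minimal sketch follows, assuming a hypothetical in-memory registry: the regular expressions, the trace names, and `find_trace` are illustrative and are not taken from Tracer or its trace repository.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical registry of (URI pattern, trace name) pairs.
# Both the patterns and the names are assumptions for illustration.
TRACE_REGISTRY: List[Tuple[str, str]] = [
    (r"https://github\.com/[^/]+/[^/]+/?", "github-repository"),
    (r"https://www\.slideshare\.net/[^/]+/[^/]+/?", "slideshare-deck"),
]


def find_trace(uri: str) -> Optional[str]:
    """Return the name of the first trace whose pattern matches the whole URI."""
    for pattern, trace_name in TRACE_REGISTRY:
        if re.fullmatch(pattern, uri):
            return trace_name
    return None


for uri in (
    "https://github.com/mementoweb/memento_extensions",
    "https://www.slideshare.net/martinklein0815/evaluating-memento-service-optimizations",
    "https://example.org/",
):
    print(uri, "->", find_trace(uri))
```

A URI that matches no pattern yields `None`, which a crawler could treat as "no trace available, fall back to a plain capture".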
Evaluation of Scalability & Quality
• Dataset made of GitHub repositories and Slideshare slide decks
  • 17,646 GitHub repositories (via changelog.com)
  • 12,280 Slideshare decks (via Explore feature)
• Archival goals:
  • GitHub: get all repository files and ZIP file
  • Slideshare: get all slides and notes
• Quality eval:
  • Compare against Webrecorder
• Scalability eval:
  • Large number of high-quality captures
  • Compare against crawl time of common crawler
Quality
• Not a trivial dimension to evaluate!
• Decision to evaluate by number of URIs in live web version vs. archived snapshot
  • Based on manually generated snapshots with Webrecorder
  • Random sample of 100 repos and 100 slide decks
• Expectation:
  • 100% of URIs from live web in archived snapshot
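One reading of this quality measure is the share of live-web URIs that also appear in the archived snapshot. The sketch below assumes that reading; `uri_coverage` and the example URIs are hypothetical, and the slides do not say how URIs were normalized before comparison.

```python
from typing import Iterable


def uri_coverage(live_uris: Iterable[str], archived_uris: Iterable[str]) -> float:
    """Fraction of live-web URIs that are also present in the archived snapshot."""
    live = set(live_uris)
    if not live:
        raise ValueError("live_uris must not be empty")
    archived = set(archived_uris)
    return len(live & archived) / len(live)


# Made-up example: four URIs observed on the live web, three of them archived.
live = {
    "https://example.org/repo",
    "https://example.org/repo/README.md",
    "https://example.org/repo/src/main.py",
    "https://example.org/repo/archive.zip",
}
archived = {
    "https://example.org/repo",
    "https://example.org/repo/README.md",
    "https://example.org/repo/src/main.py",
    "https://example.org/unrelated.css",  # extra URIs do not raise the score
}
print(uri_coverage(live, archived))  # 0.75
```

Under this reading, the stated expectation corresponds to a coverage of 1.0.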
Quality
[Figure panels: 100 @ GitHub; 100 @ Slideshare]
Quality at Scale
[Figure panels: 17,646 @ GitHub; 12,280 @ Slideshare]
Cost of Quality at Scale
• Runtime difference between Memento Tracer and common web crawler for the same number of URIs
  • Plus 20 seconds per URI, on average
• Faster than previous approaches, discovers many more URIs
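To get a feel for what 20 extra seconds per URI means at the size of the evaluation dataset, the arithmetic can be worked through. This is a back-of-the-envelope sketch resting on two assumptions that the slides do not state: that the overhead applies once per seed URI (one repository or one deck), and that URIs are processed strictly one after another.

```python
# Dataset sizes and average overhead as given on the slides.
GITHUB_REPOS = 17_646
SLIDESHARE_DECKS = 12_280
EXTRA_SECONDS_PER_URI = 20

# Assumption: the overhead applies once per seed URI, processed sequentially.
seeds = GITHUB_REPOS + SLIDESHARE_DECKS
extra_seconds = seeds * EXTRA_SECONDS_PER_URI
extra_hours = extra_seconds / 3600
extra_days = extra_seconds / 86_400

print(seeds)                 # 29926
print(extra_seconds)         # 598520
print(f"{extra_hours:.1f}")  # 166.3
print(f"{extra_days:.1f}")   # 6.9
```

Under these assumptions the added cost is roughly a week of sequential runtime; running captures in parallel would divide the wall-clock figure accordingly.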
Takeaways
• Memento Tracer aims at finding a balance between quality and scale
• Human in the loop, benefits from patterns of web resources
• Experiments provide indicators for high quality, reliability, scale
• Cost involved, slower than simple crawlers
• Optimizations possible, further potential and limitations to be explored

