Building Event Collections from Crawling Web Archives
@mart1nkle1n
IIPC WAC 2018, 11/13/2018, Wellington, NZ
Martin Klein¹, Lyudmila Balakireva¹, Herbert Van de Sompel²
¹ Research Library, Los Alamos National Laboratory
² Data Archiving and Networked Services, The Netherlands
Inspiration from Previous Work
https://doi.org/10.1007/978-3-319-67008-9_10
Published at WebSci 2018
https://doi.org/10.1145/3201064.3201085
1. Can we create event collections by focused crawling of web archives that are available online?
2. How do event collections created from the archived web compare to those created from the live web?
3. How does the amount of time passed since the event affect the collections built from the live and the archived web?
4. How do event collections built from the archived web compare to manually curated collections?
Questions
• Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It and Social Feed Manager
Background – Event Collection Building
• Temporal: time passed since event is of concern
  → Use of web archives via Memento infrastructure
• Selection: seeds often picked manually
  → Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  → Use of focused crawling with content and temporal relevance assessment
Problems and our Approach
Focused Crawling
[Figure: crawl tree with a seed, its children (Child 1 to Child 3), and their children. Legend: not crawled; crawled and not relevant; crawled and relevant.]
1. Content of Wikipedia page + random 60% of page’s references
  • Generate topic vector (TF-IDF of 1-grams + 2-grams)
2. Content of remaining 40% of Wikipedia page’s references
  • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average cosine similarity value as content threshold
Content Relevance
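The threshold computation on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokenization, a smoothed IDF computed only over the Wikipedia page and its references, and all function names (`ngrams`, `tfidf_vector`, `cosine`, `content_threshold`) are hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Return the 1-grams and 2-grams of a text (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_vector(docs, idf):
    """Build one TF-IDF topic vector for a group of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc))
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(wiki_text, reference_texts, runs=10, split=0.6, seed=0):
    """Average cosine similarity over `runs` random 60/40 splits of the references."""
    rng = random.Random(seed)
    all_docs = [wiki_text] + list(reference_texts)
    # Assumption: document frequencies come from this small corpus only.
    df = Counter()
    for doc in all_docs:
        df.update(set(ngrams(doc)))
    n = len(all_docs)
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}

    similarities = []
    for _ in range(runs):
        refs = list(reference_texts)
        rng.shuffle(refs)
        cut = round(split * len(refs))
        vector_1 = tfidf_vector([wiki_text] + refs[:cut], idf)  # page + 60% of references
        vector_2 = tfidf_vector(refs[cut:], idf)                # remaining 40%
        similarities.append(cosine(vector_1, vector_2))
    return sum(similarities) / len(similarities)
```

A crawled page would then be compared against the topic vector, and its cosine similarity checked against the returned threshold.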
• Define temporal interval for which crawled pages are considered relevant
• Event date extracted from Wikipedia event page
Temporal Relevance
[Figure: temporal relevance plotted over time, with values 0 and 1 and axis marks at Event Date, Change Point, and Today.]
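One possible reading of the temporal interval is a binary score: 1 for pages dated between the event date and the change point, 0 otherwise. The sketch below encodes that reading. The binary shape, the inclusive endpoints, the treatment of undated pages, and the name `temporal_relevance` are assumptions, not details confirmed by the slides.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """Return 1.0 if the page date lies in [event_date, change_point], else 0.0.

    Assumptions: binary scoring, inclusive endpoints, and pages without
    an extractable date (None) are treated as not temporally relevant.
    """
    if page_date is None:
        return 0.0
    return 1.0 if event_date <= page_date <= change_point else 0.0


# Hypothetical change point, chosen only to illustrate the call.
event = date(2016, 6, 12)
change_point = date(2016, 7, 28)
score = temporal_relevance(date(2016, 6, 20), event, change_point)
```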
Change Point Detection
[Figure: percentage (0 to 100) plotted over Wikipedia page edit dates from 2016-06-12 to 2017-08-24, annotated with the value 46.]
• Plot number of Wikipedia page edits per day
• Run the changepoint algorithm from R’s changepoint package
• Detect significant change in curve
https://cran.r-project.org/web/packages/changepoint/index.html
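To illustrate the idea of a change point in a daily edit-count series, here is a small Python sketch. It is not the R changepoint package and makes no claim to match its algorithms: it swaps in a basic at-most-one-change search that picks the split minimizing the summed squared deviation from each segment's mean, and it does not test whether the change is statistically significant. The function name `single_change_point` is hypothetical.

```python
def single_change_point(counts):
    """Return the index k that best splits counts into counts[:k] and counts[k:].

    "Best" means the smallest total squared deviation from the two segment
    means (a simple change-in-mean criterion, assumed for illustration).
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")

    # Prefix sums make each segment cost an O(1) computation.
    prefix_sum = [0.0]
    prefix_sq = [0.0]
    for x in counts:
        prefix_sum.append(prefix_sum[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def segment_cost(i, j):
        """Sum of squared deviations from the mean for counts[i:j]."""
        s = prefix_sum[j] - prefix_sum[i]
        q = prefix_sq[j] - prefix_sq[i]
        return q - s * s / (j - i)

    best_k, best_cost = 1, float("inf")
    for k in range(1, n):
        cost = segment_cost(0, k) + segment_cost(k, n)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```

With daily edit counts ordered from the event date onward, the returned index would mark the first day of the quieter editing regime.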
• Extract datetime from pages via:
  • URI: http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags: <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU’s Carbondate tool: http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
Datetime Extraction
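Three of the extraction routes listed above can be sketched with the Python standard library. The regular expression, the set of meta attributes checked, and the function names are assumptions for illustration; the `Memento-Datetime` response header is defined by the Memento protocol (RFC 7089) and uses the HTTP date format. The Carbondate tool and the X-Header route are not covered here.

```python
import re
from datetime import date, datetime
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Assumed URI pattern: a /YYYY/MM/DD/ path segment, as in the CNN example.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_uri(uri):
    """Return a date found in the URI path, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    try:
        return date(*(int(part) for part in match.groups()))
    except ValueError:  # e.g. /2017/13/45/
        return None


class _MetaDateParser(HTMLParser):
    """Collect the content of the first publication-date meta tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attributes = dict(attrs)
        prop = (attributes.get("property") or "").lower()
        itemprop = (attributes.get("itemprop") or "").lower()
        if prop.startswith("article:published") or itemprop == "datepublished":
            self.value = attributes.get("content")


def datetime_from_meta(html):
    """Return the datetime from a publication meta tag, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    if parser.value is None:
        return None
    try:
        return datetime.fromisoformat(parser.value)
    except ValueError:
        return None


def datetime_from_memento_header(headers):
    """Parse the Memento-Datetime response header, or return None."""
    value = headers.get("Memento-Datetime")
    return parsedate_to_datetime(value) if value else None
```

A crawler could try these in a fixed order and fall back to the next route when one returns None; the order itself would be a design decision not stated on the slide.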
• Topics limited to terror attacks and mass shootings in the U.S.
  • From different times in the past
• Take content and temporal relevance into account
  • Equally weighted
• Use events’ Wikipedia page as input for focused crawler
  • Version that was live at change point
Experiment Details
• Focused crawl of:
  • 22 archives, simultaneously, via Memento infrastructure
  • The live web
• Seeds
  • Memento of Wikipedia page references closest to and after event time
  • Subject to temporal and contextual relevance assessment
• Crawled outlinks
  • Memento of outlinks closest to and after event time
  • Subject to temporal and contextual relevance assessment
Crawl Details
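The selection rule "closest to and after event time" can be sketched as a pure function over a list of Mementos. In practice such a list might come from a Memento TimeMap (RFC 7089), but the retrieval step is omitted here; the `(datetime, uri)` tuple shape and the function name are assumptions.

```python
from datetime import datetime


def closest_memento_after(mementos, event_time):
    """Return the earliest (memento_datetime, uri) at or after event_time.

    `mementos` is an iterable of (datetime, uri) tuples in any order.
    Returns None when no Memento was captured at or after the event time.
    """
    candidates = [m for m in mementos if m[0] >= event_time]
    return min(candidates, key=lambda m: m[0]) if candidates else None


# Hypothetical captures of one reference, for illustration only.
captures = [
    (datetime(2011, 1, 7, 12, 0), "memento-a"),
    (datetime(2011, 1, 9, 8, 0), "memento-c"),
    (datetime(2011, 1, 8, 18, 0), "memento-b"),
]
chosen = closest_memento_after(captures, datetime(2011, 1, 8))
```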
• Crawl stop conditions:
  • No more relevant documents left
  • 5 levels deep
• Utilized crawl priority queue
Crawl Details
[Figure: crawl tree with the seed at level 0, Child 1 to Child 3 at level 1, and their children at level 2.]
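A focused crawl with a priority queue and the two stop conditions can be sketched as below. This is an illustration under assumptions, not the crawler used in the study: `fetch_outlinks` and `score` are hypothetical callables supplied by the caller, outlinks inherit their parent's score as queue priority, and only pages at or above the threshold are expanded.

```python
import heapq


def focused_crawl(seeds, fetch_outlinks, score, threshold, max_depth=5):
    """Return the relevant URIs in the order they were crawled.

    fetch_outlinks(uri) -> iterable of URIs; score(uri) -> float.
    The crawl ends when the queue is empty (no relevant documents left to
    expand) and never expands pages at depth max_depth.
    """
    queue = []    # entries: (negated priority, insertion order, uri, depth)
    order = 0     # tie-breaker so heapq never compares URIs directly
    seen = set()
    for seed in seeds:
        heapq.heappush(queue, (0.0, order, seed, 0))
        seen.add(seed)
        order += 1

    relevant = []
    while queue:
        _, _, uri, depth = heapq.heappop(queue)
        page_score = score(uri)
        if page_score < threshold:
            continue  # crawled and not relevant: its outlinks stay uncrawled
        relevant.append(uri)
        if depth >= max_depth:
            continue
        for link in fetch_outlinks(uri):
            if link not in seen:
                seen.add(link)
                # Assumption: links from higher-scoring pages are crawled first.
                heapq.heappush(queue, (-page_score, order, link, depth + 1))
                order += 1
    return relevant
```

Because `heapq` is a min-heap, the priority is stored negated so that the highest-scoring parent's links are popped first.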
• New York City, October 31st 2017
• Las Vegas, October 1st 2017
• Orlando, June 12th 2016
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009
Collections Crawled (in November 2017)
NYC, 10/31/2017 – URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl. x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 2000); right y-axis: percent (0 to 100); series: All URIs and Relevant URIs.]
TUC, 01/08/2011 – URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl. x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 80000); right y-axis: percent (0 to 100); series: All URIs and Relevant URIs.]
NYC, 10/31/2017 – Relevance over…
[Figure: two panels, relevance over crawled documents and relevance over crawl time.]
TUC, 01/08/2011 – Relevance over…
[Figure: two panels, relevance over crawled documents and relevance over crawl time.]
TUC, 01/08/2011 – Comparison to Archive-It
[Figure: accumulated relevance (0 to 15000) over documents (0 to 15000); series: Web Archive Crawl and Archive-It Crawl.]
TUC, 01/08/2011 – Web Archive Contributions
• web.archive.org: 75%
• wayback.archive-it.org: 14%
• webarchive.loc.gov: 7%
• web.archive.bibalex.org: 2%
• archive.is: 2%
• Web archives are great resources for building event collections of web resources
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the live web than from the archived web, but collections about events from the distant past benefit more from the archived web than from the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building
Takeaways
https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384