MementoMap Framework for Flexible and Adaptive Web Archive Profiling
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
Fernando Melo, Daniel Bicho, and Daniel Gomes
FCT: Arquivo.pt, Lisbon, Portugal
@ibnesayeed @WebSciDL @PT_WebArchive
Supported by NSF Grant IIS-1526700
JCDL '19, June 4, 2019, Urbana-Champaign, Illinois
$ memgator -a archives.json -f cdxj example.com \
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
198014 web.archive.org
13548 wayback.archive-it.org
1191 webarchive.loc.gov
1044 swap.stanford.edu
953 arquivo.pt
525 wayback.vefsafn.is
225 perma-archives.org
221 archive.md
23 www.webarchive.org.uk
$ memgator -a archives.json -f cdxj jcdl.org \
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
410 web.archive.org
2 www.webarchive.org.uk
2 arquivo.pt
1 archive.md
Cross-archive Memento Lookup With MemGator
https://github.com/oduwsdl/MemGator
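The per-archive tally produced by the shell pipeline above can also be sketched in Python. This is a minimal illustration, not part of MemGator: `count_hosts` and `SAMPLE` are hypothetical names, and the sample lines are made up to imitate the general shape of the output (metadata lines starting with `!`, data lines containing a URI-M).

```python
from collections import Counter

# Made-up lines that only imitate the shape of a CDXJ TimeMap.
SAMPLE = [
    '!context ["http://example.org/hypothetical-context"]',
    '20190101000000 {"uri": "http://web.archive.org/web/20190101000000/http://example.com/"}',
    '20190201000000 {"uri": "http://web.archive.org/web/20190201000000/http://example.com/"}',
    '20190301000000 {"uri": "http://arquivo.pt/wayback/20190301000000/http://example.com/"}',
]


def count_hosts(lines):
    """Count data lines per archive host, most frequent first."""
    counts = Counter()
    for line in lines:
        if line.startswith("!"):      # same filter as: grep -v "^!"
            continue
        fields = line.split("/")      # same split as: cut -d '/' -f 3
        if len(fields) >= 3:          # lines without a third field are skipped here
            counts[fields[2]] += 1
    return counts.most_common()       # same ordering idea as: sort | uniq -c | sort -nr


if __name__ == "__main__":
    for host, count in count_hosts(SAMPLE):
        print(count, host)
```

The third slash-delimited field is the host of the URI-M, which is why the pipeline and this sketch both end up grouping mementos by archive.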
Memento Aggregator
Broadcasting is Evil
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the
traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP
addr from where you're seeing the traffic? I presume the requests are for Memento
TimeMaps? It should not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has
gotten really high, and also I was asked to remove an archive due to the traffic it was
causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an
aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the
container to change the archivelist ;)
Ilya
Broadcasting is wasteful, both clients & archives suffer!
Memento Lookup Routing
Let’s fix the broadcasting issue with more informed routing.
MemGator Log Responses from Various Archives
93% of the requests made from MemGator to upstream archives were wasteful.
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
Blind spot of a content-based profile
Blind spot of a usage-based profile
If Only Archives Could Tell When to Ask Them
● Websites advertise their holdings using sitemap.xml; why can’t archives?
○ Archives have billions or even hundreds of billions of URI-Ms
○ Such exhaustive lists would go stale very quickly
● How about robots.txt?
○ It is compact, but it is an exclusion format; it does not tell what the site has
○ It assumes a single domain; patterns are for paths (not the domain name)
● How about combining the two ideas?
○ Introducing MementoMap!
A MementoMap Example
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
https://github.com/oduwsdl/ORS/blob/master/ukvs.md
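The example above can be read programmatically with very little code. The following is a minimal sketch, not the reference implementation: `parse_ukvs`, `lookup`, and `may_hold` are hypothetical names. It assumes that lines starting with `!` are metadata, that each data line is a key followed by whitespace and a value, that an exact key beats a wildcard key, that the longest matching wildcard key wins otherwise, and that a value whose first number is `0` means the archive holds nothing under that key.

```python
EXAMPLE = """\
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
"""


def parse_ukvs(text):
    """Split a MementoMap into metadata lines and a key -> value mapping."""
    headers, entries = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("!"):
            headers.append(line)
            continue
        parts = line.split(None, 1)
        entries[parts[0]] = parts[1] if len(parts) > 1 else ""
    return headers, entries


def lookup(entries, surt):
    """Return (key, value) for the best matching entry, or None."""
    if surt in entries:                      # assumption: exact key wins
        return surt, entries[surt]
    best = None
    for key in entries:
        if key.endswith("*") and surt.startswith(key[:-1]):
            if best is None or len(key) > len(best):
                best = key                   # assumption: longest wildcard wins
    return (best, entries[best]) if best is not None else None


def may_hold(entries, surt):
    """Decide whether an aggregator should bother asking this archive."""
    match = lookup(entries, surt)
    if match is None:
        return True                          # no information: ask anyway
    first_number = match[1].split("/")[0].rstrip("+-~")
    return first_number != "0"


if __name__ == "__main__":
    _, entries = parse_ukvs(EXAMPLE)
    for surt in ("org,arxiv)/pdf/1905.12607", "uk,co,bbc)/images/logo.png"):
        print(surt, lookup(entries, surt), may_hold(entries, surt))
```

Treating the value as an opaque string, apart from the zero check, keeps the sketch independent of how the frequency suffixes are defined.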
SURTs Representation with Wildcard
Original SURTs did not have wildcards.
In practice, the common “http://(” prefix is removed.
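A rough sketch of how a URI might be turned into the prefix-free SURT form used as a MementoMap key follows. It is a simplification under stated assumptions: `to_surt` is a hypothetical helper, it only reverses the host labels and appends the path and query, and it skips the further normalization that real canonicalizers apply.

```python
from urllib.parse import urlsplit


def to_surt(uri):
    """Convert a URI to a simplified SURT key without the "http://(" prefix."""
    if "://" not in uri:
        uri = "http://" + uri            # allow bare hosts such as "example.com"
    parts = urlsplit(uri)
    host = parts.hostname or ""          # urlsplit lowercases the hostname
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    for uri in ("https://arxiv.org/pdf/1905.12607", "example.com"):
        print(uri, "->", to_surt(uri))
```

Because the host labels are reversed, all keys of a domain share a common prefix, which is what makes a trailing `*` wildcard a natural way to cover a whole TLD, domain, or path subtree.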
Arquivo.pt Index Statistics
The Internet Archive is about 150 times bigger than Arquivo.pt.
Top Arquivo.pt TLDs
Arquivo.pt was created to archive sites of interest to the Portuguese people.
Over time, web archives collect many things they did not intend to and miss much that they would have liked to archive.
Who Would have Thought
Arquivo.pt has 10K+ .онлайн Sites?
“.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD which means “.online”
Distribution of URI-Ms over URI-Rs in Arquivo.pt
70% of mementos belong to only 30% of URI-Rs.
URI-M vs. URI-R Summary of Arquivo.pt
Yearly URI-Rs, URI-Ms, and Status Codes in Arquivo.pt
The last two years are still in the embargo period.
Data for the early years came from various other archives.
Cumulative Growth of URI-Ms and URI-Rs in Arquivo.pt
50% of mementos were captured in the last two active years alone.
Most Archived URI-Rs in Arquivo.pt
Arquivo.pt is obsessed with transparent single-pixel images and corner graphics.
Unique Items With Exact Host and Path Depths
Where do we draw the line?
5+ or 10+ deep?
HxPx Host and Path Depth Statistics of Arquivo.pt
Shape of HxPx Key Tree of Arquivo.pt
Global HxPx Reduction Rate
Incremental Children Reduction Rate
Processed Lines vs. Compacted MementoMap Growth
Processed lines:
com,example)/a/1/x
com,example)/a/2
com,example)/a/3
com,example)/b/1
com,example)/b/2
com,example)/c/1
After rolling up the a/ subtree:
com,example)/a/*
com,example)/b/1
com,example)/b/2
com,example)/c/1
After rolling up the whole host:
com,example)/*
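The roll-up shown above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the single-pass algorithm of the MementoMap implementation: `compact`, `depth`, and `max_children` are hypothetical names, a prefix is rolled up whenever more than `max_children` keys fall under it, and frequency values are ignored.

```python
from collections import defaultdict


def compact(keys, depth, max_children):
    """Roll up keys that share a prefix of `depth` path segments.

    If more than `max_children` keys fall under the same prefix, they are
    replaced by a single wildcard key (prefix + "*").
    """
    groups = defaultdict(list)
    kept = []
    for key in keys:
        host, sep, path = key.partition(")/")
        segments = path.split("/") if path else []
        if sep and len(segments) > depth:
            prefix = host + ")/" + "".join(s + "/" for s in segments[:depth])
            groups[prefix].append(key)
        else:
            kept.append(key)              # too shallow to be a child at this depth
    for prefix, members in groups.items():
        if len(members) > max_children:
            kept.append(prefix + "*")     # many children: summarize them
        else:
            kept.extend(members)          # few children: keep them as they are
    return sorted(kept)


if __name__ == "__main__":
    keys = [
        "com,example)/a/1/x", "com,example)/a/2", "com,example)/a/3",
        "com,example)/b/1", "com,example)/b/2", "com,example)/c/1",
    ]
    step1 = compact(keys, depth=1, max_children=2)
    step2 = compact(step1, depth=0, max_children=2)
    print(step1)
    print(step2)
```

Lowering `max_children` or `depth` trades lookup accuracy for a smaller file, which is the cost and accuracy trade-off the evaluation slides quantify.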
MementoMap Generation, Compaction, and Lookup
A 1.5% relative cost yields 60% accuracy.
Arquivo.pt can save 60% of the wasted traffic by publishing a 119 MB summary file!
Dissemination and Discovery Methods
Well-known URI:
GET /.well-known/mementomap HTTP/1.1
Host: arquivo.pt
Link Header:
Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
Link HTML Element:
<link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
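A client could try the first two discovery methods roughly as follows. This is a hedged sketch: `well_known_url` and `mementomap_from_link_header` are hypothetical helpers, the `mementomap` well-known suffix and link relation are the proposals from this talk, and the regular expressions are a simplification that does not implement the full Web Linking grammar (for example, quoted parameter values containing commas are not handled). No network request is made.

```python
import re
from urllib.parse import urljoin

WELL_KNOWN_PATH = "/.well-known/mementomap"   # suffix as proposed in the slides


def well_known_url(archive_base):
    """Build the well-known MementoMap URL for an archive's base URL."""
    return urljoin(archive_base, WELL_KNOWN_PATH)


def mementomap_from_link_header(value):
    """Return the target of the first link with rel="mementomap", if any."""
    pattern = r'<([^>]*)>((?:\s*;\s*[^,;]+)*)'
    for target, params in re.findall(pattern, value):
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params)
        if rel and "mementomap" in rel.group(1).split():
            return target
    return None


if __name__ == "__main__":
    print(well_known_url("https://arquivo.pt/"))
    header = '<https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"'
    print(mementomap_from_link_header(header))
```

The HTML `<link>` element variant is left out; it would need an HTML parser but follows the same idea of looking for the `mementomap` relation.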
Future Work
● Generate MementoMap on the whole index, not a sample
● Generate blacklists by processing access logs
● Incorporate MementoMap in replay systems
● Encourage archives and aggregators to adopt it
Conclusions
● Described MementoMap, a flexible and efficient archive profiling framework
● Analyzed the complete index of Arquivo.pt to understand the nature of web archives
● Evaluated MementoMap against Arquivo.pt’s index
● Showed that 60% of the wasted MemGator traffic can be saved at a 1.5% relative cost (a 119 MB file)
● Proposed “mementomap” as a well-known URI suffix as well as a link relation
for dissemination of MementoMap
● Implemented a single-pass, memory-efficient, and parallelization-friendly
MementoMap generation/compaction algorithm
● Open-sourced the implementation
○ https://github.com/oduwsdl/MementoMap