Posts

Showing posts with the label Archival Replay

2025-06-11: Paper Summary: "Jawa: Web Archival in the Era of JavaScript" (Goel et al. OSDI '22)

Figure 1: Jawa Overview. Figure from https://www.usenix.org/system/files/osdi22-goel.pdf

While working on the Saving Ads project, we identified problems with replaying ads that used JavaScript code to dynamically generate URLs (technical report: “Archiving and Replaying Current Web Advertisements: Challenges and Opportunities”). These URLs included random values that differed between crawl time and replay time, resulting in failed requests upon replay. Figure 2 shows an example ad iframe URL that failed to replay because a dynamically generated random value was used in the subdomain. URL matching approaches like fuzzy matching could resolve these problems by matching the dynamically generated URL with the URL that was crawled.

Figure 2: Different SafeFrame URLs during crawl and replay sessions. Google’s pubads_impl.js (WACZ | URI-R: https://securepubads.g.doubleclick.net/pagead/managed/js/gpt/m202308210101/pubads_impl.js?cb=31077272) generates the random SafeFrame URL.

Goe...
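The fuzzy matching idea mentioned in this excerpt can be illustrated with a minimal Python sketch. It is not the method used by Jawa or by any particular replay system. It assumes that the random value sits in the leading subdomain label and that `cb` is a cache-busting query parameter. The function names, the `VOLATILE_PARAMS` set, and the `example.com` URLs are all hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Assumption: query parameters listed here are cache-busters and carry no meaning.
VOLATILE_PARAMS = {"cb"}


def fuzzy_key(url, keep_labels=3):
    """Reduce a URL to a comparison key that ignores volatile parts.

    Only the last `keep_labels` host labels are kept, so a leading random
    subdomain label is dropped. Volatile query parameters are removed and
    the remaining ones are sorted so that their order does not matter.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    host = ".".join(labels[-keep_labels:])
    params = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in VOLATILE_PARAMS
    )
    return f"{host}{parts.path}?{urlencode(params)}"


def find_crawled_match(requested_url, crawled_urls):
    """Return the first crawled URL whose fuzzy key equals the request's key."""
    wanted = fuzzy_key(requested_url)
    for candidate in crawled_urls:
        if fuzzy_key(candidate) == wanted:
            return candidate
    return None


# Hypothetical URLs: the same frame requested under two different random subdomains.
crawled = ["https://abc123.frames.example.com/container.html?v=1&cb=111"]
replayed = "https://zzz999.frames.example.com/container.html?cb=222&v=1"
match = find_crawled_match(replayed, crawled)
```

A key-based comparison like this is simple, but it can also match URLs that should stay distinct, so a real replay system would likely restrict such rules to known hosts.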

2023-12-05: Updates to Memento Damage

Hello Internet and Archivists!

I'm back for another blog post, and I'm excited to share some overhauls, updates, and research that I have been working on for the Memento Damage service, previously developed by Dr. Justin Brunelle, Erika Siregar, and Grant Atkins. The project page is still running the original project build, but I wanted to share some behind-the-scenes updates before we roll out the new build soon!

Under the Hood

Fig. 1: Homepage for the Memento Damage web service

When I took on this project, many components needed updating due to age; the code base was still on Python version 2, as it had sat over the years while Dr. Brunelle and previous students graduated and moved on to other endeavors. Updating the code base to Python version 3 was one of the top to-do items! After a lot of time learning the code base and refactoring, I was able to clean up and modernize the code a bit thanks to the syntax and language...

2023-01-18: In A Terminal Far, Far Away...

HTTP and HTML are the reigning champs in terms of delivering content from the Web to your computer, typically through a Web browser. Content and data available over HTTP can generally be categorized as mostly within the surface web. This, however, constitutes only a small portion of the content available on the complete Web. Some protocols, such as FTP, are no longer supported by browsers and are accessed by way of more specialized programs. Modern formats, such as IPFS, also exist but have limited adoption and often still require external software. While content available over these non-HTTP protocols is still "on the Net", the extent to which it is archived remains murky.

A Trip Down Internet Lane

Originally, I was inspired to write this blog post because, while exploring different representations of content on the web, I remembered a gem from my younger days on the Internet. That gem was the ASCII Star Wars animation that you could watch over your terminal by ty...