Posts

Showing posts with the label PyWB

2023-02-26: Animating Changes in Webpages, Featuring George Santos's Biography

Figure 1: The Wayback Machine "Changes" tool showing the difference between George Santos's biography on December 19, 2022 and February 3, 2023. Deletions are highlighted in yellow and additions are highlighted in blue. Source: https://web.archive.org/web/diff/20221219173515/20230203162225/https://georgeforny.com/about/

The Washington Post recently published the article "See the evolution of lies in George Santos's campaign biography." George Santos is a member of the U.S. House of Representatives in the 118th U.S. Congress. He has steadily removed claims from his website that have been proven false, such as holding a bachelor's degree from Baruch College. In the Washington Post article, the journalists used the Internet Archive's Wayback Machine to find and view previous versions of his webpage that included the false claims. To show change over time, they interspersed colored text boxes containing the changed text throughout the article. Previo...
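
The source link in the figure caption appears to follow the pattern web.archive.org/web/diff/<first timestamp>/<second timestamp>/<page URL>, with 14-digit timestamps. As a minimal sketch, assuming that pattern holds for other pages, a comparison link could be assembled like this. The function name `wayback_changes_url` is hypothetical and is not part of any Wayback Machine client library.

```python
def wayback_changes_url(first_ts: str, second_ts: str, page_url: str) -> str:
    """Build a Wayback Machine "Changes" link for two captures of one page.

    Assumption: the URL layout is inferred from the single link cited in
    the figure caption; it is not taken from official documentation.
    """
    for ts in (first_ts, second_ts):
        # Wayback timestamps in the cited link are 14 digits: YYYYMMDDhhmmss.
        if not (ts.isdigit() and len(ts) == 14):
            raise ValueError(f"expected a 14-digit timestamp, got {ts!r}")
    return f"https://web.archive.org/web/diff/{first_ts}/{second_ts}/{page_url}"


if __name__ == "__main__":
    # The two captures compared in Figure 1.
    print(
        wayback_changes_url(
            "20221219173515",
            "20230203162225",
            "https://georgeforny.com/about/",
        )
    )
```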

2022-07-22: Summary of "Web Archiving and Search Personalized"

The Web Archiving and Search Personalized system automatically captures, archives, and indexes pages for both full-text search and replay. (Source: Kiesel et al., Figure 1a)

According to a study conducted by Teevan et al. in 2007, 39% of search queries represent users trying to re-find previously viewed pages [1]. One approach to supporting users in this task is automatic personal web archiving. Each page that the user visits is saved, so that it can be found later, similar to an automated version of the "bookmark as archive" feature in Mabe et al.’s Memento-aware browser prototype [2]. However, creating a system that can save web pages as they are viewed, index them for full-text search, and replay them later is an ambitious goal. Johannes Kiesel (@KieselJohannes), Arjen P. de Vries (@arjenpdevries), Matthias Hagen (@matthias_hagen), Benno Stein (@bennostein), and Martin Potthast (@martinpotthast) created a prototype system for this purpose in their paper “Web Arc...

2020-03-26: Memento Compliance Audit of PyWB

This document is an audit report of the latest development version of PyWB, a Web archive replay system, for its Memento (RFC 7089) compliance. As a growing number of public Web archives are moving towards deploying PyWB, it becomes critical to comply with standards to ensure that tools in the archiving ecosystem continue to function as expected. To audit the Memento compliance of PyWB I established the following setup:

- Captured example.com five times in separate WARC files, a few minutes apart, using warcio
- Created various test instances of PyWB's develop branch, which is one commit ahead of the v-2.4.0-rc6-test version (commit hash: 92e459bda52a2b03f33a4b0b8094ed424248d2a5)
- Initialized a collection named example and loaded the freshly captured WARC files into it for replay
- Placed multiple custom configuration files that are loaded by setting the PYWB_CONFIG_FILE environment variable for each test instance
- Preserved the state of the relevant folder tree in pywb...
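
To give a sense of what one check in such an audit might look like, here is a minimal sketch that inspects the response headers of a single memento. It is not the audit's actual tooling. It assumes, as I read RFC 7089, that a memento response needs a Memento-Datetime header and a Link header with rel="original", and that timegate and timemap links are recommended. The function name and the sample header values (including the localhost URLs) are hypothetical.

```python
import re


def memento_header_problems(headers: dict) -> list:
    """Return a list of Memento-related problems found in response headers.

    Simplification: only quoted rel values (rel="...") in the Link header
    are recognized; a full Link header parser would be more robust.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    problems = []

    if "memento-datetime" not in normalized:
        problems.append("missing Memento-Datetime header")

    # Collect every relation type, splitting values such as "first memento".
    rels = set()
    for match in re.finditer(r'rel="([^"]*)"', normalized.get("link", "")):
        rels.update(match.group(1).split())

    if "original" not in rels:
        problems.append('missing Link with rel="original"')
    for recommended in ("timegate", "timemap"):
        if recommended not in rels:
            problems.append(f'missing recommended Link with rel="{recommended}"')
    return problems


if __name__ == "__main__":
    # Hypothetical headers, shaped like a replay response for example.com.
    sample = {
        "Memento-Datetime": "Thu, 26 Mar 2020 12:00:00 GMT",
        "Link": (
            '<http://example.com/>; rel="original", '
            '<http://localhost:8080/example/http://example.com/>; rel="timegate"'
        ),
    }
    for problem in memento_header_problems(sample):
        print(problem)
```

Run against the sample above, the sketch reports only the missing timemap link, since the other relations and the Memento-Datetime header are present.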