Showing posts with label web archiving.

Tuesday, May 21, 2024

Pew Research On Link Rot

When Online Content Disappears, by Athena Chapekis, Samuel Bestvater, Emma Remy, and Gonzalo Rivero, reports the results of this research:
we collected a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.

We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.
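To make the two failure modes concrete, here is a minimal sketch of the kind of check the report describes, using plain HTTP status codes as the accessibility criterion (the report's actual methodology, including its handling of redirects and soft-404s, is more careful than this):

```python
import requests
from urllib.parse import urlparse

def classify(url, timeout=30):
    """Rough classification of a sampled URL: 'accessible',
    'page_inaccessible' (a dead page on a live domain), or
    'domain_inaccessible' (the whole root domain is gone)."""
    try:
        if requests.get(url, timeout=timeout).status_code < 400:
            return "accessible"
    except requests.RequestException:
        pass
    # The page failed; is the root-level domain itself still functional?
    root = urlparse(url)._replace(path="/", params="", query="", fragment="").geturl()
    try:
        if requests.get(root, timeout=timeout).status_code < 400:
            return "page_inaccessible"
    except requests.RequestException:
        pass
    return "domain_inaccessible"

print(classify("https://example.com/some/old/page"))
```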
Their results are not surprising, but there are a number of surprising things about their report. Below the fold, I explain.

Tuesday, July 26, 2022

The Internet Archive's "Long Tail" Program

In 2018 I helped the Internet Archive get a two-year Mellon Foundation grant aimed at preserving the "long tail" of academic literature from small publishers, which is often at great risk of loss. In 2020 I wrote The Scholarly Record At The Internet Archive explaining the basic idea:
The project takes two opposite but synergistic approaches:
  • Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether that article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
  • Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index.
Below the fold I report on subsequent developments in this project.
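The Top-Down check is conceptually simple, amounting to a pair of public API calls per article. Here is a minimal sketch using the CrossRef REST API and the Wayback Machine's availability API (my own illustration under those assumptions, not the project's actual pipeline):

```python
import requests

CROSSREF = "https://api.crossref.org/works"
WAYBACK = "https://archive.org/wayback/available"

def wayback_has(url):
    """Ask the Wayback Machine's availability API whether any snapshot
    of this URL exists; return the snapshot URL or None."""
    r = requests.get(WAYBACK, params={"url": url}, timeout=30)
    closest = r.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Walk a small slice of CrossRef's bibliographic metadata and check
# whether each article's landing URL is already in the Wayback Machine.
resp = requests.get(CROSSREF, params={"rows": 5}, timeout=30)
for item in resp.json()["message"]["items"]:
    doi, url = item.get("DOI"), item.get("URL")
    snapshot = wayback_has(url) if url else None
    if snapshot:
        print(f"{doi}: archived at {snapshot}")
    else:
        print(f"{doi}: not archived; candidate for collection from the live Web")
```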

Thursday, March 31, 2022

Dangerous Complacency

The topic of web archiving has been absent from this blog for a while, but recently Sawood Alam alerted me to Cliff Lynch's post from January entitled The Dangerous Complacency of “Web Archiving” Rhetoric. Lynch's thesis is that using the term "web archiving" obscures the fact that we can only collect and preserve a fraction of "the Web". The topic is one I've written about many times, at least since my Spring 2009 CNI Plenary, so below the fold I return to it.

Tuesday, August 17, 2021

Zittrain On Internet Rot

I spent two decades working on the problem of preserving digital documents, especially those published on the Web, in the LOCKSS Program. So I'm in agreement with the overall argument of Jonathan Zittrain's The Internet Is Rotting, that digital information is evanescent and mutable, and that libraries are no longer fulfilling their mission to be society's memory institutions. He writes:
People tend to overlook the decay of the modern web, when in fact these numbers are extraordinary—they represent a comprehensive breakdown in the chain of custody for facts. Libraries exist, and they still have books in them, but they aren’t stewarding a huge percentage of the information that people are linking to, including within formal, legal documents. No one is. The flexibility of the web—the very feature that makes it work, that had it eclipse CompuServe and other centrally organized networks—diffuses responsibility for this core societal function.
And concludes:
Society can’t understand itself if it can’t be honest with itself, and it can’t be honest with itself if it can only live in the present moment. It’s long overdue to affirm and enact the policies and technologies that will let us see where we’ve been, including and especially where we’ve erred, so we might have a coherent sense of where we are and where we want to go.
In our first paper about LOCKSS, Vicky Reich and I wrote:
Librarians have a well-founded confidence in their ability to provide their readers with access to material published on paper, even if it is centuries old. Preservation is a by-product of the need to scatter copies around to provide access. Librarians have an equally well-founded skepticism about their ability to do the same for material published in electronic form. Preservation is totally at the whim of the publisher.

A subscription to a paper journal provides the library with an archival copy of the content. Subscribing to a Web journal rents access to the publisher’s copy. The publisher may promise "perpetual access", but there is no business model to support the promise. Recent events have demonstrated that major journals may vanish from the Web at a few months notice.
Although I agree with Zittrain's big picture, I have some problems with the details. Below the fold, I explain them.

Thursday, April 22, 2021

What Is The Point?

During a discussion of NFTs, Larry Masinter pointed me to his 2012 proposal The 'tdb' and 'duri' URI schemes, based on dated URIs. The proposal's abstract reads:
This document defines two URI schemes. The first, 'duri' (standing for "dated URI"), identifies a resource as of a particular time. This allows explicit reference to the "time of retrieval", similar to the way in which bibliographic references containing URIs are often written.

The second scheme, 'tdb' (standing for "Thing Described By"), provides a way of minting URIs for anything that can be described, by the means of identifying a description as of a particular time. These schemes were posited as "thought experiments", and therefore this document is designated as Experimental.
As far as I can tell, this proposal went nowhere, but it raises a question that is also raised by NFTs. What is the point of a link that is unlikely to continue to resolve to the expected content? Below the fold I explore this question.

Thursday, April 15, 2021

NFTs and Web Archiving

One of the earliest observations of the behavior of the Web at scale was "link rot". There were a lot of 404s, broken links. Research showed that the half-life of Web pages was alarmingly short. Even in 1996 this problem was obvious enough for Brewster Kahle to found the Internet Archive to address it. From the Wikipedia entry for Link Rot:
A 2003 study found that on the Web, about one link out of every 200 broke each week,[1] suggesting a half-life of 138 weeks. This rate was largely confirmed by a 2016–2017 study of links in Yahoo! Directory (which had stopped updating in 2014 after 21 years of development) that found the half-life of the directory's links to be two years.[2]
One might have thought that academic journals were a relatively stable part of the Web, but research showed that their references decayed too, just somewhat less rapidly. A 2013 study found a half-life of 9.3 years. See my 2015 post The Evanescent Web.
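For anyone who wants to check these figures, a constant per-period loss rate p and a half-life are related by simple exponential decay; the arithmetic below is mine, not the cited studies':

```latex
\[
t_{1/2} = \frac{\ln 2}{-\ln(1-p)}
\qquad\Rightarrow\qquad
p = \tfrac{1}{200}\ \text{per week} \;\Rightarrow\; t_{1/2} \approx 138\ \text{weeks}.
\]
Conversely, a two-year ($\approx 104$-week) half-life implies
$p = 1 - 2^{-1/104} \approx 0.66\%$ per week, and a $9.3$-year half-life
implies an annual loss rate of $1 - 2^{-1/9.3} \approx 7.2\%$.
```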

I expect you have noticed the latest outbreak of blockchain-enabled insanity, Non-Fungible Tokens (NFTs). Someone "paying $69M for a JPEG" or $560K for a New York Times column attracted a lot of attention. Follow me below the fold for the connection between NFTs, "link rot" and Web archiving.

Thursday, February 11, 2021

More On Archiving Twitter

Himarsha Jayanetti from Michael Nelson's group at Old Dominion follows up on the work I discussed in Michael Nelson's Group On Archiving Twitter with Twitter rewrites your URLs, but assumes you’ll never rewrite theirs: more problems replaying archived Twitter:
URLs shared on Twitter are automatically shortened to t.co links. Twitter does this to track its engagements and also protect its users from sites with malicious content. Twitter replaces these t.co URLs with HTML that suggests the original URL so that the end-user does not see the t.co URLs while browsing. When these t.co URLs are replayed through web archives, they are rewritten to an archived URL (URI-M) and should be rendered in the web archives as in the live web, without displaying these t.co URI-Ms to the end-user.
But, as the screen-grab from the Wayback Machine shows, they may not be. Below the fold, a look at Jayanetti's explanation.
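To make the rewriting concrete: a replay system maps an original URL to a URI-M by embedding it in the archive's replay URL, and recovering the destination of a t.co shortlink on the live web just means following its redirects. A minimal sketch, assuming the Wayback Machine's familiar /web/<timestamp>/<URL> replay pattern (the t.co value is hypothetical):

```python
import requests

def to_urim(original_url: str, timestamp14: str) -> str:
    """Build a Wayback-style URI-M: the archived copy of original_url
    nearest to the 14-digit timestamp (YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp14}/{original_url}"

def expand_tco(tco_url: str) -> str:
    """Follow redirects on the live web to recover the destination a
    t.co shortlink points to; a replay system has to do the equivalent
    against archived content instead."""
    r = requests.get(tco_url, allow_redirects=True, timeout=30)
    return r.url

# Hypothetical values, for illustration only.
print(to_urim("https://t.co/abc123", "20200601000000"))
```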

Tuesday, January 5, 2021

The New Oldweb.today

Two days before Christmas Ilya Kreymer posted Announcing the New OldWeb.today. The old oldweb.today was released five years ago, and Ilya described the details in a guest post here. It was an important step forward in replaying preserved Web content because users could view the old Web content as it would have been rendered at the time it was published, not as rendered in a modern browser. I showed an example of the difference this made in The Internet is for Cats.

Below the fold, I look at why the new oldweb.today is an improvement on the old version, which is still available at classic.oldweb.today.

Tuesday, December 29, 2020

Michael Nelson's Group On Archiving Twitter

The rise and fall of the Trump administration has amply illustrated the importance of Twitter in the historical record. Alas, Twitter has no economic motivation to cater to the needs of historians. As they work to optimize Twitter's user experience, the engineers are likely completely unaware of the problems they are causing the Web archives trying to preserve history. Even if they were aware, they would be unable to justify the time and effort necessary to mitigate them.

Over the last six months Michael Nelson's group at Old Dominion University have continued their excellent work to evaluate exactly how much trouble future historians will have to contend with, in three new blog posts from Kritika Garg and Himarsha Jayanetti.
Below the fold, some commentary on each of them.

Thursday, September 17, 2020

Don't Say We Didn't Warn You

Just over a quarter-century ago, Stanford Libraries' HighWire Press pioneered the switch of academic journal publishing from paper to digital when they put the Journal of Biological Chemistry on-line. Even in those early days of the Web, people understood that Web pages, and links to them, decayed over time. A year later, Brewster Kahle founded the Internet Archive to preserve them for posterity.

One difficulty was that although academic journals contained some of the Web content that was most important to preserve for the future, the Internet Archive could not access them because they were paywalled. Two years later, Vicky Reich and I started the LOCKSS (Lots Of Copies Keep Stuff Safe) program to address this problem. In 2000's Permanent Web Publishing we wrote:
Librarians have a well-founded confidence in their ability to provide their readers with access to material published on paper, even if it is centuries old. Preservation is a by-product of the need to scatter copies around to provide access. Librarians have an equally well-founded skepticism about their ability to do the same for material published in electronic form. Preservation is totally at the whim of the publisher.

A subscription to a paper journal provides the library with an archival copy of the content. Subscribing to a Web journal rents access to the publisher's copy. The publisher may promise "perpetual access", but there is no business model to support the promise. Recent events have demonstrated that major journals may vanish from the Web at a few months notice.

This poses a problem for librarians, who subscribe to these journals in order to provide both current and future readers with access to the material. Current readers need the Web editions. Future readers need paper; there is no other way to be sure the material will survive.
Now, Jeffrey Brainard's Dozens of scientific journals have vanished from the internet, and no one preserved them and Diana Kwon's More than 100 scientific journals have disappeared from the Internet draw attention to this long-standing problem. Below the fold I discuss the paper behind the Science and Nature articles.

Tuesday, May 5, 2020

Carl Malamud Wins (Mostly)

In Supreme Court rules Georgia can’t put the law behind a paywall Timothy B. Lee writes:
A narrowly divided US Supreme Court on Monday upheld the right to freely share the official law code of Georgia. The state claimed to own the copyright for the Official Code of Georgia Annotated and sued a nonprofit called Public.Resource.Org for publishing it online. Monday's ruling is not only a victory for the open-government group, it's an important precedent that will help secure the right to publish other legally significant public documents.

"Officials empowered to speak with the force of law cannot be the authors of—and therefore cannot copyright—the works they create in the course of their official duties," wrote Chief Justice John Roberts in an opinion that was joined by four other justices on the nine-member court.
Below the fold, commentary on various reports of the decision, and more.

Thursday, April 9, 2020

Yay, Library of Congress!

[Image: LoC Web Archive team]
The web archiving team at the Library of Congress got some high-visibility, well-deserved publicity in the New York Times with Steven Kurutz's Meet Your Meme Lords:
For the past 20 years, a small team of archivists at the Library of Congress has been collecting the web, quietly and dutifully in its way. The initiative was born out of a desire to collect and preserve open-access materials from the web, especially U.S. government content around elections, which makes this the team’s busy season.

But the project has turned into a sweeping catalog of internet culture, defunct blogs, digital chat rooms, web comics, tweets and most other aspects of online life.
Kurutz did a good job; the article is well worth reading.

Tuesday, February 18, 2020

The Scholarly Record At The Internet Archive

The Internet Archive has been working on a Mellon-funded grant aimed at collecting, preserving and providing persistent access to as much of the open-access academic literature as possible. The motivation is that much of the "long tail" of academic literature comes from smaller publishers whose business model is fragile, and who are at risk of financial failure or takeover by the legacy oligopoly publishers. This is particularly true if their content is open access, since they don't have subscription income. This "long tail" content is thus at risk of loss or vanishing behind a paywall.

The project takes two opposite but synergistic approaches:
  • Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether that article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
  • Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index (a rough sketch of such a check follows this list).
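Such a check need not be sophisticated to be useful. Here is a crude, purely illustrative sketch operating on text already extracted from a PDF; the heuristics are hypothetical, and the project's real pipeline is far more sophisticated:

```python
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+\b")

def looks_like_article(text: str) -> bool:
    """Crude guess at whether extracted PDF text is an academic article."""
    lowered = text.lower()
    signals = [
        bool(DOI_RE.search(text)),         # a DOI anywhere in the text
        "abstract" in lowered[:5000],      # an abstract near the top
        "references" in lowered[-20000:],  # a reference list near the end
    ]
    return sum(signals) >= 2

def extract_metadata(text: str) -> dict:
    """Pull out the minimal fields needed to add the article to an index."""
    doi = DOI_RE.search(text)
    first_line = text.strip().splitlines()[0] if text.strip() else ""
    return {"doi": doi.group(0) if doi else None, "title_guess": first_line[:200]}

sample = "Some Title\n\nAbstract\n...\ndoi:10.1234/example.5678\n...\nReferences\n[1] ..."
if looks_like_article(sample):
    print(extract_metadata(sample))
```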
Below the fold, a discussion of the progress that has been made so far.

Tuesday, December 31, 2019

Web Packaging for Web Archiving

Supporting Web Archiving via Web Packaging by Sawood Alam, Michele C Weigle, Michael L Nelson, Martin Klein, and Herbert Van de Sompel is their position paper for the Internet Architecture Board's ESCAPE workshop (Exploring Synergy between Content Aggregation and the Publisher Ecosystem). It describes the considerable potential importance of Web Packaging, the topic of the workshop, for Web archiving, but also the problems it poses because, like the Web before Memento, it ignores the time dimension.

[Chart source: Frederic Filloux]
Although we live in the heart of Silicon Valley, our home Internet connection is 3M/1Mbit DSL from Sonic; we love our ISP, and I refuse to do business with AT&T or Comcast. As you can imagine, the speed with which Web pages load has been a topic of particular interest for this blog, for example here and here (which starts from a laugh-out-loud, must-read post from Maciej Cegłowski). Then, three years ago, Frederic Filloux's Bloated HTML, the best and the worse triggered my rant Fighting the Web Flab:
Filloux continues:
In due fairness, this cataract of code loads very fast on a normal connection.
His "normal" connection must be much faster than my home's 3Mbit/s DSL. But then the hope kicks in:
The Guardian technical team was also the first one to devise a solid implementation of Google's new Accelerated Mobile Page (AMP) format. In doing so, it eliminated more than 80% of the original code, making it blazingly fast on a mobile device.
Great, but AMP is still 20 bytes of crud for each byte of content. What's the word for 20 times faster than "blazingly"?
Web Packaging is a response to:
In recent years, a number of proprietary formats have been defined to enable aggregators of news and other articles to republish Web resources; for example, Google’s AMP, Facebook’s Instant Articles, Baidu’s MIP, and Apple’s News Format.
Below the fold I look into the history that got us to this point, and where we may be going.

Tuesday, October 22, 2019

MementoMap

Since at least 2011, I've been writing about how important Memento is for Web archiving, and how its success depends upon the effectiveness of Memento Aggregators:
In a recent post I described how Memento allows readers to access preserved web content, and how, just as accessing current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes from original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators.
Memento Aggregators turned out to be both useful, and a hard engineering problem. Below the fold, a discussion of MementoMap Framework for Flexible and Adaptive Web Archive Profiling by Sawood Alam et al from Old Dominion University and Arquivo.pt, which both reviews the history of finding out how hard it is, and reports on fairly encouraging progress in attacking it.
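As a concrete illustration of what an aggregator lookup involves, here is a minimal sketch against the public Time Travel service; the endpoint shape and JSON key names are from my recollection of its documentation, so treat them as assumptions:

```python
import requests

# Memento "Time Travel" aggregator endpoint, as I remember it documented.
AGGREGATOR = "http://timetravel.mementoweb.org/api/json"

def find_mementos(url: str, when: str = "20110101"):
    """Ask the aggregator which archives hold copies of `url` near `when`;
    the aggregator fans the query out across many web archives."""
    r = requests.get(f"{AGGREGATOR}/{when}/{url}", timeout=30)
    r.raise_for_status()
    return r.json()

result = find_mementos("http://www.example.com/")
# Key names below are from memory of the API's JSON and may differ.
closest = result.get("mementos", {}).get("closest", {})
print(closest.get("datetime"), closest.get("uri"))
```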

Thursday, October 3, 2019

Guest post: Ilya Kreymer's Client-Side Replay Technology

Ilya Kreymer gave a brief description of his recent development of client-side replay for WARC-based Web archives in this comment on my post Michael Nelson's CNI Keynote: Part 3. It uses Service Workers, which Matt Gaunt describes in Google's Web Fundamentals thus:
A service worker is a script that your browser runs in the background, separate from a web page, opening the door to features that don't need a web page or user interaction. Today, they already include features like push notifications and background sync. In the future, service workers might support other things like periodic sync or geofencing. The core feature discussed in this tutorial is the ability to intercept and handle network requests, including programmatically managing a cache of responses.
Client-side replay was clearly an important advance, so I asked him for a guest post with the details. Below the fold, here it is.

Thursday, June 20, 2019

Michael Nelson's CNI Keynote: Part 3

Here is the conclusion of my three-part "lengthy disquisition" on Michael Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides).

Part 1 and Part 2 addressed Nelson's description of the problems of the current state of the art. Below the fold I address the way forward.

Tuesday, June 18, 2019

Michael Nelson's CNI Keynote: Part 2

My "lengthy disquisition" on Michael Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides). continues here. Part 1 had an introduction and discussion of two of my issues with Nelson's big picture.
Below the fold I address my remaining issues with Nelson's big picture of the state of the art. Part 3 will compare his and my views of the path ahead.

Thursday, June 13, 2019

Michael Nelson's CNI Keynote: Part 1

Michael Nelson and his group at Old Dominion University have made major contributions to Web archiving. Among them are a series of fascinating papers on the problems of replaying archived Web content. I've blogged about several of them, most recently in All Your Tweets Are Belong To Kannada and The 47 Links Mystery. Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides) understandably focuses on recounting much of this important research. I'm a big fan of this work, and there is much to agree with in the rest of the talk.

But I have a number of issues with the big picture Nelson paints. Part of the reason for the gap in posting recently was that I started on a draft that discussed both the big picture issues and a whole lot of minor nits, and I ran into the sand. So I finally put that draft aside and started this one. I tried to restrict myself to the big picture, but despite that it is still too long for a single post. Follow me below the fold for the first part of a lengthy disquisition.

Thursday, May 2, 2019

Lets Put Our Money Where Our Ethics Are

I found a video of Jefferson Bailey's talk at the Ethics of Archiving the Web conference from a year ago. It was entitled Lets Put Our Money Where Our Ethics Are. The talk is the first 18.5 minutes of this video. It focused on the paucity of resources devoted to archiving the huge proportion of our culture that now lives on the evanescent Web. I've also written on this topic, for example in Pt. 2 of The Amnesiac Civilization. Below the fold, some detailed numbers (that may by now be somewhat out-of-date) and their implications.