Showing posts with label digital preservation. Show all posts

Thursday, July 10, 2025

The Festschrift For Cliff Lynch

Source
The festschrift that includes the edited version of the draft we posted back in April entitled Lots Of Cliff Keeps Stuff Safe has been officially published as Networking Networks: A Festschrift in Honor of Clifford Lynch, an open access supplement to portal: Libraries and the Academy 25, no. 3. Joan K. Lippincott writes:
The final CNI membership meeting of Cliff’s tenure, held April 7–8, 2025, in Milwaukee, was to include a surprise presentation of the Festschrift’s table of contents. Though Cliff’s health prevented him from attending in person, he participated virtually and heard readings of excerpts from each contribution. Clifford Lynch passed away shortly after, on April 10, 2025. Authors completed their essays before his passing, and the original text remains unchanged.
Below the fold is a brief snippet of each of the invited contributions and some comments.

Tuesday, April 15, 2025

Cliff Lynch RIP

Source
Last Tuesday Cliff Lynch delivered an abbreviated version of his traditional closing summary and bon voyage to CNI's 2025 Spring Membership Meeting via Zoom from his sick-bed. Last Thursday night he died, still serving as Executive Director. CNI has posted In Memoriam: Clifford Lynch.

Cliff impacted a wide range of areas. The best overview is Mike Ashenfelder's 2013 profile of Cliff Lynch in the Library of Congress' Digital Preservation Pioneer series, which starts:
Clifford Lynch is widely regarded as an oracle in the culture of networked information. Lynch monitors the global information ecosystem for cultural trends and technological developments. He ponders their variables, interdependencies and influencing factors. He confers with colleagues and draws conclusions. Then he reports his observations through lectures, conference presentations and writings. People who know about Lynch pay close attention to what he has to say.

Lynch is a soft-spoken man whose work, for more than thirty years, has had an impact — directly or indirectly — on the computer, information and library science communities.
Below the fold are some additional personal notes on Cliff's contributions.

Thursday, April 10, 2025

Cliff Lynch's festschrift

Vicky and I were invited to contribute to a festschrift celebrating Cliff Lynch's retirement from the Coalition for Networked Information. We decided to focus on his role in the long-running controversy over how digital information was to be preserved for the long haul.

Below the fold is our contribution, before it was copy-edited for portal: Libraries and the Academy.

Monday, April 7, 2025

Paul Evan Peters Award Lecture

At the Spring 2025 Membership Meeting of the Coalition for Networked Information, Vicky and I received the Paul Evan Peters Award.

You can tell this is an extraordinary honor from the list of previous awardees, and the fact that it is the first time it has been awarded in successive years. Part of the award is the opportunity to make an extended presentation to open the meeting. Our talk was entitled Lessons From LOCKSS, and the abstract was:
Vicky and David will look back over their two decades with the LOCKSS Program. Vicky will focus on the Program's initial goals and how they evolved as the landscape of academic communication changed. David will focus on the Program's technology, how it evolved, and how this history reveals a set of seductive, persistent but impractical ideas.
CNI has posted the video of the entire opening plenary to YouTube. Don Waters' generous introduction starts at 14:28 and Vicky starts talking at 20:00.

Below the fold is the text with links to the sources, information that appeared on slides but was not spoken, and much additional information in footnotes.

Friday, January 31, 2025

Paul Evan Peters Award

Year  Awardee
2024  Tony Hey
2022  Paul Courant
2020  Francine Berman
2017  Herbert Van de Sompel
2014  Donald A.B. Lindberg
2011  Christine L. Borgman
2008  Daniel E. Atkins
2006  Paul Ginsparg
2004  Brewster Kahle
2002  Vinton Gray Cerf
2000  Tim Berners-Lee
It has just been announced that at the Spring 2025 Membership Meeting of the Coalition for Networked Information in Milwaukee, WI April 7th and 8th, Vicky and I are to receive the Paul Evan Peters Award. The press release announcing the award is here.

Vicky and I are honored and astonished by this award. Honored because it is the premier award in the field, and astonished because we left the field more than seven years ago to take up our new full-time career as grandparents. We are all the more astonished because we are not even eligible for the award; the rules clearly state that the "award will be granted to an individual".

You can tell this is an extraordinary honor from the list of previous awardees, and the fact that it is the first time it has been awarded in successive years. Vicky and I are extremely grateful to the Association of Research Libraries, CNI and EDUCAUSE, who sponsor the award.

Original Logo
Part of the award is the opportunity to make an extended presentation to open the meeting. The text of our talk, entitled Lessons From LOCKSS, with links to the sources and information that appeared on slides but was not spoken, should appear here on April 7th.

The work that the award recognizes was not ours alone, but the result of a decades-long effort by the entire LOCKSS team. It was made possible by support from the LOCKSS community and many others, including Michael Lesk then at NSF, Donald Waters then at the Mellon Foundation, the late Karen Hunter at Elsevier, Stanford's Michael Keller and CNI's Cliff Lynch.

Tuesday, September 3, 2024

"Owning" e-books

The basic aspiration of the LOCKSS Program when we started a quarter century ago was to enable libraries to continue their historical mission of collecting, preserving, and providing readers with access to academic journals. In the paper world libraries which subscribed to a journal owned a copy; in the digital world they could only rent access to the publisher's copy. This allowed the oligopoly academic publishers to increase their rent extraction from research and education budgets.

LOCKSS provided a cheap way for libraries to collect, preserve and provide access to their own copy of journals. The competing e-journal preservation systems accepted the idea of rental; they provided an alternate place from which access could be rented if it were denied by the publisher.

Similarly, libraries that purchased a paper book owned a copy that they could loan to readers. The transition to e-books meant that they were only able to rent access to the publisher's copy, and over time the terms of this rental grew more and more onerous.

Below the fold I look into a recent effort to mitigate this problem.

Tuesday, June 11, 2024

Video Game Preservation

Source
I have written fairly often about the problems of preserving video games, most recently last year in Video Game History. It was based upon Phil Salvador's Survey of the Video Game Reissue Market in the United States. Salvador's main focus was on classic console games but he noted a specific problem with more recent releases:
The largest major platform shutdown in recent memory is the closure of the digital stores for the Nintendo 3DS and Wii U platforms. Nintendo shut down the 3DS and Wii U eShops on March 27, 2023, resulting in the removal of 2,413 digital titles. Although many of these are likely available on other platforms, Video Games Chronicle estimates that over 1,000 of those games were exclusive to those platforms’ digital stores and are no longer available in any form, including first-party Nintendo titles like Dr. Luigi, Dillon’s Rolling Western, Mario & Donkey Kong: Minis on the Move, and Pokémon Rumble U. The closures also affected around 500 historical games reissued by Nintendo through their Virtual Console storefronts, over 300 of which are believed not available on any other platform or service.
Below the fold I discuss recent developments in this area.

Thursday, November 16, 2023

NDSA Sustainability Excellence Award

Yesterday, at the DigiPres conference, Vicky Reich and I were awarded a "Sustainability Excellence Award" by the National Digital Stewardship Alliance. This is a tribute to the sustained hard work of the entire LOCKSS team over more than a quarter-century.

Below the fold are the citation and our response.

Thursday, August 3, 2023

Video Game History

Arguably, video games have become the most important entertainment medium. 2022 US video game revenue topped $85B, compared with global movie industry revenue of $76.7B. Game history is thus an essential resource for scholars of culture, but the industry's copyright paranoia means they have little access to it.

Salvador Table 1
In 87% Missing: the Disappearance of Classic Video Games Kelsey Lewin of the Video Game History Foundation describes their recent study in cooperation with the Software Preservation Network, published by Phil Salvador as Survey of the Video Game Reissue Market in the United States. The report's abstract doesn't mince words:
Only 13 percent of classic video games published in the United States are currently in release (n = 1500, ±2.5%, 95% CI). These low numbers are consistent across platform ecosystems and time periods. Troublingly, the reissue rate drops below 3 percent for games released prior to 1985—the foundational era of video games—indicating that the interests of the marketplace may not align with the needs of video game researchers. Our experiences gathering data for this study suggest that these problems will intensify over time due to a low diversity of reissue sources and the long-term volatility of digital game storefronts.
Below the fold I discuss some of the details.

Tuesday, June 14, 2022

Where Did The Number 3 Come From?

The Keepers Registry, which tracks the preservation of academic journals by various "keepers" (preservation agencies), currently says:
20,127 titles are being ‘kept safe’ by 3 or more Keepers
The registry backs this up with this page, showing the number of journals being preserved by N keepers.
Source
The NDSA Levels of Digital Preservation: An Explanation and Uses from 2013 is still in wide use as a guide to preserving digital content. It specifies the number of independent copies as 2 for "Level 1" and 3 for "Levels 2-4".

Alicia Wise of CLOCKSS asked "where did the number 3 come from?" Below the fold I discuss the backstory.
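The intuition behind a minimum replica count can be sketched with a little arithmetic. Assuming replicas fail independently, each with the same probability over some period, content is lost only if every replica fails. The failure probability below is hypothetical, chosen purely for illustration; real failures are correlated, which is why the true answer is more subtle than this:

```python
# Illustrative only: if each of n independent replicas fails with
# probability p over some period, content survives unless all n fail.
# The value of p here is a hypothetical stand-in, not a measured rate.

def p_loss(p_replica_failure: float, n_replicas: int) -> float:
    """Probability that all n independent replicas fail, losing the content."""
    return p_replica_failure ** n_replicas

p = 0.01  # hypothetical per-replica failure probability
for n in (1, 2, 3):
    print(f"{n} replica(s): P(loss) = {p_loss(p, n):.0e}")
```

Under these toy assumptions each added replica buys two orders of magnitude; correlated failures (shared software, shared funding, shared disasters) erode that gain, which is part of the backstory the post discusses.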

Friday, October 22, 2021

A Quarter-Century Of Preservation

The Internet Archive turned 25 yesterday! Congratulations to Brewster and the hordes of miniature people who have built this amazing institution.

For the Archive's home-town newspaper, Chase DiFeliciantoni provided a nice appreciation in He founded the Internet Archive with a utopian vision. That hasn't changed, but the internet has:
Kahle’s quest to build what he calls “A Library of Alexandria for the internet” started in the 1990s when he began sending out programs called crawlers to take digital snapshots of every page on the web, hundreds of billions of which are available to anyone through the archive’s Wayback Machine.

That vision of free and open access to information is deeply entwined with the early ideals of Silicon Valley and the origins of the internet itself.

“The reason for the internet and specifically the World Wide Web was to make it so that everyone’s a publisher and everybody can go and have a voice,” Kahle said. To him, the need for a new type of library for that new publishing system, the internet, was obvious.

We (virtually) attended the celebration — you can watch the archived stream here, and please donate to help with the $3M match they announced.

Tuesday, June 8, 2021

Unreliability At Scale

Thomas Claiburn's FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof discusses two recent papers that are relevant to the extraordinary levels of reliability needed in long-term digital preservation at scale. Below the fold some commentary on both papers.

Tuesday, January 5, 2021

The New Oldweb.today

Two days before Christmas Ilya Kreymer posted Announcing the New OldWeb.today. The old oldweb.today was released five years ago, and Ilya described the details in a guest post here. It was an important step forward in replaying preserved Web content because users could view the old Web content as it would have been rendered at the time it was published, not as rendered in a modern browser. I showed an example of the difference this made in The Internet is for Cats.

Below the fold, I look at why the new oldweb.today is an improvement on the old version, which is still available at classic.oldweb.today.

Tuesday, November 24, 2020

I Rest My Case

Jeff Rothenberg's seminal 1995 Ensuring the Longevity of Digital Documents focused on the threat of the format in which the documents were encoded becoming obsolete, rendering its content inaccessible. This was understandable: it was a common experience in the preceding decades. Rothenberg described two different approaches to the problem, migrating the document's content from the doomed format to a less doomed one, and emulating the software that accessed the document in a current environment.

The Web has dominated digital content since 1995, and in the Web world formats go obsolete very slowly, if at all, because they are in effect network protocols. The example of IPv6 shows how hard it is to evolve network protocols. But now we are facing the obsolescence of a Web format that was very widely used as the long effort to kill off Adobe's Flash comes to fruition. Fortunately, Jason Scott's Flash Animations Live Forever at the Internet Archive shows that we were right all along. Below the fold, I go into the details.

Thursday, September 17, 2020

Don't Say We Didn't Warn You

Just over a quarter-century ago, Stanford Libraries' HighWire Press pioneered the switch of academic journal publishing from paper to digital when they put the Journal of Biological Chemistry on-line. Even in those early days of the Web, people understood that Web pages, and links to them, decayed over time. A year later, Brewster Kahle founded the Internet Archive to preserve them for posterity.

One difficulty was that although academic journals contained some of the Web content that was most important to preserve for the future, the Internet Archive could not access them because they were paywalled. Two years later, Vicky Reich and I started the LOCKSS (Lots Of Copies Keep Stuff Safe) program to address this problem. In 2000's Permanent Web Publishing we wrote:
Librarians have a well-founded confidence in their ability to provide their readers with access to material published on paper, even if it is centuries old. Preservation is a by-product of the need to scatter copies around to provide access. Librarians have an equally well-founded skepticism about their ability to do the same for material published in electronic form. Preservation is totally at the whim of the publisher.

A subscription to a paper journal provides the library with an archival copy of the content. Subscribing to a Web journal rents access to the publisher's copy. The publisher may promise "perpetual access", but there is no business model to support the promise. Recent events have demonstrated that major journals may vanish from the Web at a few months notice.

This poses a problem for librarians, who subscribe to these journals in order to provide both current and future readers with access to the material. Current readers need the Web editions. Future readers need paper; there is no other way to be sure the material will survive.
Now, Jeffrey Brainard's Dozens of scientific journals have vanished from the internet, and no one preserved them and Diana Kwon's More than 100 scientific journals have disappeared from the Internet draw attention to this long-standing problem. Below the fold I discuss the paper behind the Science and Nature articles.

Tuesday, March 31, 2020

Archival Cloud Storage Pricing

Although there are significant technological risks to data stored for the long term, its most important vulnerability is to interruptions in the money supply. The current pandemic is likely to cause archives to suffer significant interruptions in the money supply.

In Cloud For Preservation I described how much of the motivation for using cloud services was their month-by-month pay-for-what-you-use billing, which transforms capital expenditures (CapEx) into operational expenditures (OpEx). Organizations typically find OpEx much easier to justify than CapEx because:
  • The numbers they look at are smaller, even if what they add up to over time is greater.
  • OpEx is less of a commitment, since it can be decreased if circumstances change.
Unfortunately, the lower the commitment the higher the risk to long-term preservation. Since it doesn't deliver immediate returns, it is likely to be first on the chopping block. Thus both reducing storage cost and increasing its predictability are important for sustainable digital preservation. Below the fold I revisit this issue.
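The CapEx-versus-OpEx trade-off the bullets describe can be made concrete with toy numbers. Both figures below are hypothetical, chosen only to show the pattern: the monthly bill looks smaller than the purchase price, but its cumulative total eventually overtakes it:

```python
# Hypothetical numbers, purely to illustrate the CapEx-vs-OpEx pattern.
# A real comparison would also include power, admin, and hardware refresh.

def capex_total(purchase: float, years: int) -> float:
    """One-time hardware purchase; the spend doesn't grow with time."""
    return purchase

def opex_total(monthly: float, years: int) -> float:
    """Pay-as-you-go cloud billing accumulates month by month."""
    return monthly * 12 * years

purchase = 30_000.0  # assumed one-time storage purchase
monthly = 600.0      # assumed cloud bill for equivalent capacity
for years in (1, 3, 5, 10):
    print(f"{years:2d} yr: CapEx {capex_total(purchase, years):>9,.0f}  "
          f"OpEx {opex_total(monthly, years):>9,.0f}")
```

At three years the cumulative OpEx is still well under the purchase price; at ten years it is more than double, which is exactly the "numbers they look at are smaller, even if what they add up to over time is greater" effect.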

Tuesday, February 18, 2020

The Scholarly Record At The Internet Archive

The Internet Archive has been working on a Mellon-funded grant aimed at collecting, preserving and providing persistent access to as much of the open-access academic literature as possible. The motivation is that much of the "long tail" of academic literature comes from smaller publishers whose business model is fragile, and who are at risk of financial failure or takeover by the legacy oligopoly publishers. This is particularly true if their content is open access, since they don't have subscription income. This "long tail" content is thus at risk of loss or vanishing behind a paywall.

The project takes two opposite but synergistic approaches:
  • Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether that article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
  • Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index.
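The top-down check can be sketched against the Wayback Machine's public availability API. The article URL and the canned JSON response below are hypothetical stand-ins for a live lookup; a real harvester would first resolve each DOI to a landing-page URL:

```python
import json
from urllib.parse import urlencode

# Sketch of the "top-down" check, assuming the Wayback Machine's
# availability API (http://archive.org/wayback/available). The article
# URL and the sample response are hypothetical, not live data.

def availability_query(article_url: str) -> str:
    """Build an availability-API query URL for one article URL."""
    return "http://archive.org/wayback/available?" + urlencode({"url": article_url})

def is_preserved(api_response: dict) -> bool:
    """True if the API response reports an archived snapshot."""
    snap = api_response.get("archived_snapshots", {}).get("closest", {})
    return bool(snap.get("available"))

# Canned response in the API's documented shape, standing in for a live call:
sample = json.loads(
    '{"archived_snapshots": {"closest": {"available": true,'
    ' "url": "http://web.archive.org/web/2020/http://example.org/a1"}}}'
)
print(availability_query("http://example.org/a1"))
print(is_preserved(sample))                       # snapshot found
print(is_preserved({"archived_snapshots": {}}))   # not in the Wayback Machine
```

Articles that fail this check would be queued for crawling from the live Web; the bottom-up pass works in the opposite direction, classifying PDFs already in the archive.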
Below the fold, a discussion of the progress that has been made so far.

Thursday, January 9, 2020

Library of Congress Storage Architecture Meeting

The Library of Congress has finally posted the presentations from the 2019 Designing Storage Architectures for Digital Collections workshop that took place in early September. I've greatly enjoyed the earlier editions of this meeting, so I was sorry I couldn't make it this time. Below the fold, I look at some of the presentations.

Tuesday, November 19, 2019

Seeds Or Code?

Svalbard Summer '69  
I'd like to congratulate Microsoft on a truly excellent PR stunt, drawing attention to two important topics about which I've been writing for a long time, the cultural significance of open source software, and the need for digital preservation. Ashlee Vance provides the channel to publicize the stunt in Open Source Code Will Survive the Apocalypse in an Arctic Cave. In summary, near Longyearbyen on Spitzbergen is:
the Svalbard Global Seed Vault, where seeds for a wide range of plants, including the crops most valuable to humans, are preserved in case of some famine-inducing pandemic or nuclear apocalypse.
Nearby, in a different worked-out coal mine, is the Arctic World Archive:
The AWA is a joint initiative between Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and very-long-term digital preservation provider Piql AS. AWA is devoted to archival storage in perpetuity. The film reels will be stored in a steel-walled container inside a sealed chamber within a decommissioned coal mine on the remote archipelago of Svalbard. The AWA already preserves historical and cultural data from Italy, Brazil, Norway, the Vatican, and many others.
Github, the newly-acquired Microsoft subsidiary, will deposit there:
The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will sweep up every active public GitHub repository, in addition to significant dormant repos as determined by stars, dependencies, and an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.
Follow me below the fold for an explanation of why I call this admirable effort a PR stunt, albeit a well-justified one.

Thursday, November 14, 2019

Auditing The Integrity Of Multiple Replicas

The fundamental problem in the design of the LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:
  • Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyright, and the copyright was owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
  • Unreliable, mutually untrusting systems was a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
  • Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation and subjected us to criminal liability under the DMCA.
Our approach, published now more than 16 years ago, was to have each node in the network compare its content with the consensus among a randomized subset of the other nodes holding the same content. They did so via a peer-to-peer protocol based on proof-of-work, in some respects one of the many precursors of Satoshi Nakamoto's Bitcoin protocol.
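The core idea can be sketched in a few lines: peers vote with a nonce-salted hash of their local copy, so agreement can be checked without shipping the content, and a fresh nonce per poll prevents a node from precomputing or replaying answers. This is a minimal illustration only; the real protocol adds sampling, rate-limiting, and proof-of-work on top:

```python
import hashlib
import secrets

# Minimal sketch of nonce-salted hash voting: each peer hashes the poll
# nonce plus its own copy, and only the digests are compared. A fresh
# nonce per poll means digests cannot be precomputed or replayed.

def vote(nonce: bytes, local_copy: bytes) -> str:
    """A peer's vote: hash of the poll nonce concatenated with its copy."""
    return hashlib.sha256(nonce + local_copy).hexdigest()

def tally(nonce: bytes, copies: list) -> bool:
    """True if every peer's vote matches the majority (consensus) digest."""
    votes = [vote(nonce, c) for c in copies]
    consensus = max(set(votes), key=votes.count)
    return all(v == consensus for v in votes)

nonce = secrets.token_bytes(16)  # fresh random nonce for this poll
good = b"journal article contents"
print(tally(nonce, [good, good, good]))        # all copies agree
print(tally(nonce, [good, good, b"damaged"]))  # one copy disagrees, needs repair
```

A node whose vote loses the poll would then repair its copy from the winners, which is how damage is detected and healed without any node trusting any other.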

Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many for obvious economic reasons. Back then there were integrity audit systems developed that didn't need an excess of replicas, including work by Mehul Shah et al, and Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.

A decade and a half later the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's Multiple‐replica integrity auditing schemes for cloud data storage provides a useful review of the current state-of-the-art. Below the fold, a discussion of their, and some related work.