Showing posts with label library of congress. Show all posts

Thursday, April 9, 2020

Yay, Library of Congress!

LoC Web Archive team
The web archiving team at the Library of Congress got some high-visibility, well-deserved publicity in the New York Times with Steven Kurutz's Meet Your Meme Lords:
For the past 20 years, a small team of archivists at the Library of Congress has been collecting the web, quietly and dutifully in its way. The initiative was born out of a desire to collect and preserve open-access materials from the web, especially U.S. government content around elections, which makes this the team’s busy season.

But the project has turned into a sweeping catalog of internet culture, defunct blogs, digital chat rooms, web comics, tweets and most other aspects of online life.
Kurutz did a good job; the article is well worth reading.

Thursday, January 9, 2020

Library of Congress Storage Architecture Meeting

The Library of Congress has finally posted the presentations from the 2019 Designing Storage Architectures for Digital Collections workshop that took place in early September. I've greatly enjoyed the earlier editions of this meeting, so I was sorry I couldn't make it this time. Below the fold, I look at some of the presentations.

Tuesday, November 19, 2019

Seeds Or Code?

Svalbard Summer '69  
I'd like to congratulate Microsoft on a truly excellent PR stunt, drawing attention to two important topics about which I've been writing for a long time: the cultural significance of open source software, and the need for digital preservation. Ashlee Vance provides the channel to publicize the stunt in Open Source Code Will Survive the Apocalypse in an Arctic Cave. In summary, near Longyearbyen on Spitsbergen is:
the Svalbard Global Seed Vault, where seeds for a wide range of plants, including the crops most valuable to humans, are preserved in case of some famine-inducing pandemic or nuclear apocalypse.
Nearby, in a different worked-out coal mine, is the Arctic World Archive:
The AWA is a joint initiative between Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and very-long-term digital preservation provider Piql AS. AWA is devoted to archival storage in perpetuity. The film reels will be stored in a steel-walled container inside a sealed chamber within a decommissioned coal mine on the remote archipelago of Svalbard. The AWA already preserves historical and cultural data from Italy, Brazil, Norway, the Vatican, and many others.
GitHub, the newly-acquired Microsoft subsidiary, will deposit there:
The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will sweep up every active public GitHub repository, in addition to significant dormant repos as determined by stars, dependencies, and an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.
Follow me below the fold for an explanation of why I call this admirable effort a PR stunt, albeit a well-justified one.

Tuesday, July 16, 2019

The EFF vs. DMCA Section 1201

As the EFF's Parker Higgins wrote:
Simply put, Section 1201 means that you can be sued or even jailed if you bypass digital locks on copyrighted works—from DVDs to software in your car—even if you are doing so for an otherwise lawful reason, like security testing.
Section 1201 is obviously a big problem for software preservation, especially when it comes to games.

Last December in Software Preservation Network I discussed both the SPN's important documents relating to the DMCA:
Below the fold, some important news about Section 1201.

Thursday, May 2, 2019

Lets Put Our Money Where Our Ethics Are

I found a video of Jefferson Bailey's talk at the Ethics of Archiving the Web conference from a year ago. It was entitled Lets Put Our Money Where Our Ethics Are. The talk is the first 18.5 minutes of this video. It focused on the paucity of resources devoted to archiving the huge proportion of our culture that now lives on the evanescent Web. I've also written on this topic, for example in Pt. 2 of The Amnesiac Civilization. Below the fold, some detailed numbers (that may by now be somewhat out-of-date) and their implications.

Tuesday, December 4, 2018

Selective Amnesia

Last year's series of posts and PNC keynote entitled The Amnesiac Civilization were about the threats to our cultural heritage from inadequate funding of Web archives, and the resulting important content that is never preserved. But content that Web archives do collect and preserve is also under a threat that can be described as selective amnesia. David Bixenspan's When the Internet Archive Forgets makes the important, but often overlooked, point that the Internet Archive isn't an elephant:
On the internet, there are certain institutions we have come to rely on daily to keep truth from becoming nebulous or elastic. Not necessarily in the way that something stupid like Verrit aspired to, but at least in confirming that you aren’t losing your mind, that an old post or article you remember reading did, in fact, actually exist. It can be as fleeting as using Google Cache to grab a quickly deleted tweet, but it can also be as involved as doing a deep dive of a now-dead site’s archive via the Wayback Machine. But what happens when an archive becomes less reliable, and arguably has legitimate reasons to bow to pressure and remove controversial archived material?
...
Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history.
Below the fold, some commentary on the vulnerability of Web history to censorship.

Thursday, June 21, 2018

Software Heritage Archive Goes Live

June 7th was a big day for software preservation; it was the formal opening of Software Heritage's archive. Congratulations to Roberto di Cosmo and the team! There's a post on the Software Heritage blog with an overview:
Today, June 7th 2018, we are proud to be back at Unesco headquarters to unveil a major milestone in our roadmap: the grand opening of the doors of the Software Heritage archive to the public (the slides of the presentation are online). You can now look at what we archived, exploring the largest collection of software source code in the world: you can explore the archive right away, via your web browser. If you want to know more, an upcoming post will guide you through all the features that are provided and the internals backing them.
Morane Gruenpeter's Software Preservation: A Stepping Stone for Software Citation is an excellent explanation of the role that Software Heritage's archive plays in enabling researchers to cite software:
In recent years software has become a legitimate product of research gaining more attention from the scholarly ecosystem than ever before, and researchers feel increasingly the need to cite the software they use or produce. Unfortunately, there is no well established best practice for doing this, and in the citations one sees used quite often ephemeral URLs or other identifiers that offer little or no guarantee that the cited software can be found later on.

But for software to be findable, it must have been preserved in the first place: hence software preservation is actually a prerequisite of software citation.
The importance of preserving software, and in particular open source software, is something I've been writing about for nearly a decade. My initial post about the Software Heritage Foundation started:
Back in 2009 I wrote:
who is to say that the corpus of open source is a less important cultural and historical artifact than, say, romance novels.
Back in 2013 I wrote:
Software, and in particular open source software is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society.
Please support this important work by donating to the Software Heritage Foundation.

Thursday, December 21, 2017

Science Friday's "File Not Found"

Science Friday's Lauren Young has a three-part series on digital preservation:
  1. Ghosts In The Reels is about magnetic tape.
  2. The Librarians Saving The Internet is about Web archiving.
  3. Data Reawakening is about the search for a quasi-immortal medium.
Clearly, increasing public attention to the problem of preserving digital information is a good thing, but I have reservations about these posts. Below the fold, I lay them out.

Tuesday, March 21, 2017

The Amnesiac Civilization: Part 5

Part 2 and Part 3 of this series established that, for technical, legal and economic reasons there is much Web content that cannot be ingested and preserved by Web archives. Part 4 established that there is much Web content that can currently be ingested and preserved by public Web archives that, in the near future, will become inaccessible. It will be subject to Digital Rights Management (DRM) technologies which will, at least in most countries, be illegal to defeat. Below the fold I look at ways, albeit unsatisfactory, to address these problems.

Tuesday, November 29, 2016

Talks at the Library of Congress Storage Architecture Meeting

Slides from the talks at last September's Library of Congress Storage Architecture meeting are now on-line. Below the fold, links to and commentary on three of them.

Thursday, March 17, 2016

Dr. Pangloss loves technology roadmaps

It's nearly three years since we last saw the renowned Dr. Pangloss chuckling with glee at the storage industry's roadmaps. But last week he was browsing Slashdot and found something much to his taste. Below the fold, an explanation of what the good Doctor enjoyed so much.

Tuesday, October 20, 2015

Storage Technology Roadmaps

At the recent Library of Congress Storage Architecture workshop, Robert Fontana of IBM gave an excellent overview of the roadmaps for tape, disk, optical and NAND flash (PDF) storage technologies in terms of bit density and thus media capacity. His slides are well worth studying, but here are his highlights for each technology:
  • Tape has a very credible roadmap out to LTO10 with 48TB/cartridge somewhere around 2022.
  • Optical's roadmap shows increases from the current 100GB/disk to 200, 300, 500 and 1000GB/disk, but there are no dates on them. At least two of those increases will encounter severe difficulties making the physics work.
  • The hard disk roadmap shows the slow increase in density that has prevailed for the last 4 years continuing until 2017, when it accelerates to 30%/yr. The idea is that in 2017 Heat Assisted Magnetic Recording (HAMR) will be combined with shingling, and then in 2021 Bit Patterned Media (BPM) will take over, and shortly after be combined with HAMR.
  • The roadmap for NAND flash is for density to increase in the near term by 2-3X and over the next 6-8 years by 6-8X. This will require significant improvements in processing technology but "processing is a core expertise of the semiconductor industry so success will follow".
Below the fold, my comments.

Tuesday, October 6, 2015

Another good prediction

After patting myself on the back about one good prediction, here is another. Ever since Dave Anderson's presentation to the 2009 Storage Architecture meeting at the Library of Congress, I've been arguing that for flash to displace disk as the bulk storage medium would require flash vendors to make such enormous investments in new fab capacity that there would be no possibility of making an adequate return on the investments. Since the vendors couldn't make money on the investment, they wouldn't make it, and flash would not displace disk. Six years later, despite the arrival of 3D flash, that is still the case.

Source: Gartner & Stifel
Chris Mellor at The Register has the story in a piece entitled Don't want to fork out for NAND flash? You're not alone. Disk still rules. It's summed up in this graph, showing the bytes shipped by flash and disk vendors. It shows that the total bytes shipped is growing rapidly, but the proportion that is flash is about stable. Flash is:
expected to account for less than 10 per cent of the total storage capacity the industry will need by 2020.
Stifel estimates that:
Samsung is estimated to be spending over $23bn in capex on its 3D NAND for an estimated ~10-12 exabytes of capacity.
If it is fully ramped-in by 2018 it will make about 1% of what the disk manufacturers will that year. So the investment to replace that capacity would be $2.3T, which clearly isn't going to happen. Unless the investment to make a petabyte of flash per year is much less than the investment to make a petabyte of disk, disk will remain the medium of choice for bulk storage.
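
The arithmetic behind the $2.3T figure can be checked with a short sketch. It takes the two numbers from the text (about $23bn of capex, buying about 1% of the disk vendors' annual output) and assumes that investment scales linearly with capacity, which is a simplification. The function and variable names are illustrative only.

```python
# Figures come from the text above; linear cost scaling is an assumption.
samsung_capex_usd = 23e9       # roughly $23bn capex on 3D NAND (Stifel estimate)
share_of_disk_output = 0.01    # roughly 1% of what disk vendors ship in a year


def capex_to_match_disk(capex_usd, share):
    """Capex needed to match all disk output if cost scales linearly with capacity."""
    return capex_usd / share


total = capex_to_match_disk(samsung_capex_usd, share_of_disk_output)
print(f"${total / 1e12:.1f}T")
```

If flash fabs turned out to be much cheaper per byte of annual output than this linear estimate implies, the conclusion would change, which is exactly the caveat in the last sentence above.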


Friday, September 11, 2015

Prediction: "Security will be an on-going challenge"

The Library of Congress' Storage Architectures workshop gave a group of us each 3 minutes to respond to a set of predictions for 2015 and questions accumulated at previous instances of this fascinating workshop. Below the fold, the brief talk in which I addressed one of the predictions. At the last minute, we were given 2 minutes more, so I made one of my own.

Friday, June 5, 2015

Archiving games

This is just a quick note to flag two good recent posts on the important but extremely difficult problem of archiving computer games. Gita Jackson at Boing-Boing in The vast, unplayable history of video games describes the importance to scholars of archiving games. Kyle Orland at Ars Technica in The quest to save today’s gaming history from being lost forever covers the technical reasons why it is so difficult in considerable detail, including quotes from many of the key players in the space.

My colleagues at the Stanford Libraries are actively working to archive games. Back in 2013, on the Library of Congress' The Signal digital preservation blog, Trevor Owens interviewed Stanford's Henry Lowood, who curates our games collection.

Wednesday, January 21, 2015

New Yorker on Web Archiving

Do not hesitate, do not pass Go, right now please read Jill Lepore's really excellent New Yorker article Cobweb: can the Web be archived?

Monday, November 10, 2014

Gossip protocols: a clarification

a subtype of “gossip” protocols" and cites LOCKSS as an example, saying:
Not coincidentally, LOCKSS “consists of a large number of independent, low-cost, persistent Web caches that cooperate to detect and repair damage to their content by voting in “opinion polls” (PDF). In other words, gossip and anti-entropy.
The main use for gossip protocols is to disseminate information in a robust, randomized way, by having each peer forward information it receives from other peers to a random selection of other peers. As the function of LOCKSS boxes is to act as custodians of copyright information, this would be a very bad thing for them to do.
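
To make the forwarding behaviour described above concrete, here is a minimal sketch of generic push-style gossip. It is not LOCKSS code and does not model any LOCKSS protocol. The peer count, fanout, round limit and function name are all hypothetical, chosen only to illustrate how each informed peer forwards what it has received to a random selection of other peers.

```python
import random


def gossip_rounds(num_peers, fanout, seed=0, max_rounds=1000):
    """Simulate push gossip; return rounds taken until every peer is informed.

    Stops early at max_rounds (a hypothetical safety limit) if the
    information has not reached every peer by then.
    """
    rng = random.Random(seed)
    informed = {0}  # peer 0 originates the information
    rounds = 0
    while len(informed) < num_peers and rounds < max_rounds:
        newly_informed = set()
        for peer in informed:
            # Each informed peer forwards to a random selection of other peers.
            others = [p for p in range(num_peers) if p != peer]
            newly_informed.update(rng.sample(others, fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


print(gossip_rounds(num_peers=100, fanout=3))
```

The point of the sketch is that whatever one peer holds ends up at every peer after a handful of rounds, which is the property that makes gossip attractive for dissemination and unsuitable for custodians of restricted content.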

It is true that LOCKSS peers communicate via an anti-entropy protocol, and it is even true that the first such protocol they used, the one I implemented for the LOCKSS prototype, was a gossip protocol in the sense that peers forwarded hashes of content to each other. Alas, that protocol was very insecure. Some of the ways in which it was insecure related directly to its being a gossip protocol.

An intensive multi-year research effort in cooperation with Stanford's CS department to create a more secure anti-entropy protocol led to the current protocol, which won "Best Paper" at the 2003 Symposium on Operating Systems Principles. It is not a gossip protocol in any meaningful sense (see below the fold for details). Peers never forward information they receive from other peers; all interactions are strictly pair-wise and private.

For the TRAC audit of the CLOCKSS Archive we provided an overview of the operation of the LOCKSS anti-entropy protocol; if you are interested in the details of the protocol this, rather than the long and very detailed paper in ACM Transactions on Computer Systems (PDF), is the place to start.

Thursday, September 25, 2014

Plenary Talk at 3rd EUDAT Conference

I gave a plenary talk at the 3rd EUDAT Conference's session on sustainability entitled Economic Sustainability of Digital Preservation. Below the fold is an edited text with links to the sources.

Tuesday, September 23, 2014

A Challenge to the Storage Industry

I gave a brief talk at the Library of Congress Storage Architecture meeting, pulling together themes from a number of recent blog posts. My goal was twofold:
  • to outline the way in which current storage architectures fail to meet the needs of long-term archives,
  • and to set out what an architecture that would meet those needs would look like.
Below the fold is an edited text with links to the earlier posts here that I was condensing.