Showing posts with label government information. Show all posts

Monday, April 7, 2025

Paul Evan Peters Award Lecture

At the Spring 2025 Membership Meeting of the Coalition for Networked Information, Vicky and I received the Paul Evan Peters Award.

You can tell this is an extraordinary honor from the list of previous awardees, and the fact that it is the first time it has been awarded in successive years. Part of the award is the opportunity to make an extended presentation to open the meeting. Our talk was entitled Lessons From LOCKSS, and the abstract was:
Vicky and David will look back over their two decades with the LOCKSS Program. Vicky will focus on the Program's initial goals and how they evolved as the landscape of academic communication changed. David will focus on the Program's technology, how it evolved, and how this history reveals a set of seductive, persistent but impractical ideas.
CNI has posted the video of the entire opening plenary to YouTube. Don Waters' generous introduction starts at 14:28 and Vicky starts talking at 20:00.

Below the fold is the text with links to the sources, information that appeared on slides but was not spoken, and much additional information in footnotes.

Tuesday, May 5, 2020

Carl Malamud Wins (Mostly)

In Supreme Court rules Georgia can’t put the law behind a paywall, Timothy B. Lee writes:
A narrowly divided US Supreme Court on Monday upheld the right to freely share the official law code of Georgia. The state claimed to own the copyright for the Official Code of Georgia Annotated and sued a nonprofit called Public.Resource.Org for publishing it online. Monday's ruling is not only a victory for the open-government group, it's an important precedent that will help secure the right to publish other legally significant public documents.

"Officials empowered to speak with the force of law cannot be the authors of—and therefore cannot copyright—the works they create in the course of their official duties," wrote Chief Justice John Roberts in an opinion that was joined by four other justices on the nine-member court.
Below the fold, commentary on various reports of the decision, and more.

Thursday, January 9, 2020

Library of Congress Storage Architecture Meeting

The Library of Congress has finally posted the presentations from the 2019 Designing Storage Architectures for Digital Collections workshop that took place in early September. I've greatly enjoyed the earlier editions of this meeting, so I was sorry I couldn't make it this time. Below the fold, I look at some of the presentations.

Thursday, February 7, 2019

Cloud For Preservation

Imagine you're responsible for preserving the long-established digital collection at a large research or national library. It is currently preserved in home-grown software, or off-the-shelf software that's been extensively customized, that you are responsible for running on hardware run by your institution's IT department. You are probably not a large customer of theirs. They are probably laying down the law, saying "cloud first", especially as you are looking at a looming hardware refresh. Below the fold, I examine a set of issues that need to be clarified in the decision-making process.

Tuesday, January 22, 2019

Trump's Shutdown Impacts Information Access

Government shutdown causing information access problems by James A. Jacobs and James R. Jacobs is important. It documents the effect of the Trump government shutdown on access to globally important information:
Twitter and newspapers are buzzing with complaints about widespread problems with access to government information and data (see for example, Wall Street Journal (paywall 😐 ), ZDNet News, Pew Center, Washington Post, Scientific American, TheVerge, and FedScoop to name but a few).
Matthew Green, a professor at Johns Hopkins, said “It’s worrying that every single US cryptography standard is now unavailable to practitioners.” He was responding to the fact that he could not get the documents he needed from the National Institute of Standards and Technology (NIST) or its branch, the Computer Security Resource Center (CSRC). The government shutdown is the direct cause of these problems.
They point out how this illustrates the importance of libraries collecting and preserving web-published information:
Regardless of who you (or your user communities) blame for the shutdown itself, this loss of access was entirely foreseeable and avoidable. It was foreseeable because it has happened before. It was avoidable because libraries can select, acquire, organize, and preserve these documents and provide access to them and services for them whether the government is open or shut-down.
Go read the whole thing, and weep for the way libraries have abandoned their centuries-long mission of safeguarding information for future readers.

Tuesday, March 27, 2018

Bad Blockchain Content

A Quantitative Analysis of the Impact of Arbitrary Blockchain Content on Bitcoin by Roman Matzutt et al examines the stuff in the Bitcoin blockchain that isn't a monetary transaction. They:
provide the first systematic analysis of the benefits and threats of arbitrary blockchain content. Our analysis shows that certain content, e.g., illegal pornography, can render the mere possession of a blockchain illegal. Based on these insights, we conduct a thorough quantitative and qualitative analysis of unintended content on Bitcoin's blockchain. Although most data originates from benign extensions to Bitcoin's protocol, our analysis reveals more than 1600 files on the blockchain, over 99% of which are texts or images.
Below the fold, some details.

Wednesday, February 14, 2018

Tuesday, September 5, 2017

Long-Lived Scientific Observations

Keeping scientific data, especially observations that are not repeatable, for the long term is important. In our 2006 Eurosys paper we used an example from China. During the Shang dynasty:
astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.
Last week we had another, if only one-fifth as old, example of the value of long-ago scientific observations. Korean astronomers' records of a nova in 1437 provide strong evidence that:
"cataclysmic binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested. After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.
How were these 580-year-old records preserved? Follow me below the fold.

Thursday, June 8, 2017

Public Resource Audits Scholarly Literature

I (from personal experience), and others, have commented previously on the way journals paywall articles based on spurious claims that they own the copyright, even when there is clear evidence that they know that these claims are false. This is copyfraud, but:
While falsely claiming copyright is technically a criminal offense under the Act, prosecutions are extremely rare. These circumstances have produced fraud on an untold scale, with millions of works in the public domain deemed copyrighted, and countless dollars paid out every year in licensing fees to make copies that could be made for free.
The clearest case of journal copyfraud is when journals claim copyright on articles authored by US federal employees:
Work by officers and employees of the government as part of their official duties is "a work of the United States government" and, as such, is not entitled to domestic copyright protection under U.S. law. So, inside the US there is no copyright to transfer, and outside the US the copyright is owned by the US government, not by the employee. It is easy to find papers that apparently violate this, such as James Hansen et al's Global Temperature Change. It carries the statement "© 2006 by The National Academy of Sciences of the USA" and states Hansen's affiliation as "National Aeronautics and Space Administration Goddard Institute for Space Studies".
Perhaps the most compelling instance is the AMA falsely claiming to own the copyright on United States Health Care Reform: Progress to Date and Next Steps by one Barack Obama.

Now, Carl Malamud tweets:
Public Resource has been conducting an intensive audit of the scholarly literature. We have focused on works of the U.S. government. Our audit has determined that 1,264,429 journal articles authored by federal employees or officers are potentially void of copyright.
They extracted metadata from Sci-Hub and found:
Of the 1,264,429 government journal articles I have metadata for, I am now able to access 1,141,505 files (90.2%) for potential release.
This is already extremely valuable work. But in addition:
2,031,359 of the articles in my possession are dated 1923 or earlier. These 2 categories represent 4.92% of scihub. Additional categories to examine include lapsed copyright registrations, open access that is not, and author-retained copyrights.
It is long past time for action against the rampant copyfraud by academic journals.

Tip of the hat to James R. Jacobs.

Tuesday, August 9, 2016

Correlated Distraction

It is 11:44AM Pacific and I'm driving, making a left onto Central Expressway in Mountain View, CA and trying to avoid another vehicle whose driver isn't paying attention when an ear-splitting siren goes off in my car. After a moment of panic I see "Connected" on the infotainment system display. It's the emergency alert system. When it is finally safe to stop and check, I see this message:
Emergency Alert: Dust Storm Warning in this area until 12:00PM MST. Avoid travel. Check local media - NWS.
WTF? Where to even begin with this stupidity? Well, here goes:
  • "this area" - what area? In the Bay Area we have earthquakes, wildfires, flash floods, but we don't yet have dust storms. Why does the idiot who composed the message think they know where everyone who will read it is?
  • It's 11:44AM Pacific, or 18:44 UTC. That's 12:44PM Mountain. Except we're both on daylight saving time. So did the message mean 12:00PM MDT, in which case the message was already 44 minutes too late? Or did the message mean 12:00PM MST, or 19:00 UTC, in which case it had 16 minutes to run? Why send a warning 44 minutes late or use the wrong time zone?
  • A dust storm can be dangerous, so giving people 16 minutes of warning (but not -44 minutes) could save some lives. Equally, distracting everyone in "this area" who is driving, operating machinery, performing surgery, etc. could cost some lives. Did anyone balance the upsides and downsides of issuing this warning, even assuming it only reached people in "this area"?
  • I've written before about the importance and difficulty of modelling correlated failures. Now that essentially every driver is carrying (but hopefully not talking on) a cellphone, the emergency alert system is a way to cause correlated distraction of every driver across the entire nation. Correlated distraction caused by rubbernecking at accidents is a well-known cause of additional accidents. But at least that is localized in space. Who thought that building a system to cause correlated distraction of every driver in the nation was a good idea?
  • Who has authority to trigger the distraction? Who did trigger the distraction? Can we get that person fired?
  • This is actually the third time the siren has gone off while I'm driving. The previous two were Amber alerts. Don't get me wrong. I think getting drivers to look out for cars that have abducted children is a good idea, and I'm glad to see the overhead signs on freeways used for that purpose. But it isn't a good enough idea to justify the ear-splitting siren and consequent distraction. So I had already followed instructions to disable Amber alerts. I've now also disabled Emergency alerts.
So, once again, because no-one thought What Could Possibly Go Wrong?, a potentially useful system has crashed and burned.
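For anyone who wants to check the time zone arithmetic in the second bullet above, here is a minimal sketch. It assumes fixed UTC offsets (PDT = UTC-7, MDT = UTC-6, MST = UTC-7) and uses this post's date for the alert; the function name minutes_remaining is made up for illustration and has nothing to do with how the alert system itself works.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets, for illustration only.
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time
MDT = timezone(timedelta(hours=-6), "MDT")  # Mountain Daylight Time
MST = timezone(timedelta(hours=-7), "MST")  # Mountain Standard Time


def minutes_remaining(received: datetime, expiry: datetime) -> int:
    """Minutes between receiving the alert and its stated expiry.

    A negative result means the alert had already expired on arrival.
    """
    return int((expiry - received).total_seconds() // 60)


# The alert arrived at 11:44AM Pacific (date assumed to be the post date).
received = datetime(2016, 8, 9, 11, 44, tzinfo=PDT)

# Two readings of "until 12:00PM MST":
expiry_if_mdt = datetime(2016, 8, 9, 12, 0, tzinfo=MDT)  # sender meant local daylight time
expiry_if_mst = datetime(2016, 8, 9, 12, 0, tzinfo=MST)  # sender meant what they wrote

print("Received (UTC):", received.astimezone(timezone.utc).strftime("%H:%M"))
print("If MDT was meant:", minutes_remaining(received, expiry_if_mdt), "minutes")
print("If MST was meant:", minutes_remaining(received, expiry_if_mst), "minutes")
```

Because both datetimes carry their own offsets, the subtraction is done in absolute time, so the two readings of the expiry differ by exactly the one-hour gap between MDT and MST.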

Tuesday, July 5, 2016

The Major Threat is Economic

I've frequently said that the major threat to digital preservation is economic; back in 2013 I posted The Major Threat is Economic. We are reminded of this by the announcement last March that:
The future of the Trove online database is in doubt due to funding cuts to the National Library of Australia.
Trove is the National Library's system:
In 2014, the database's fifth year, an estimated 70,000 people were using the website each day.

Australia Library and Information Association chief executive Sue McKarracher said Trove was a visionary move by the library and had turned into a world-class resource.
...
"If you look at things like the digital public libraries in the United States, really a lot of that came from looking at our Trove and seeing what a nation could do investing in a platform that would hold museum, gallery and library archives collections and make them accessible to the world."

Tuesday, April 5, 2016

The Curious Case of the Outsourced CA

I took part in the Digital Preservation of Federal Information Summit, a pre-meeting of the CNI Spring Membership Meeting. Preservation of government information is a topic that the LOCKSS Program has been concerned with for a long time; my first post on the topic was nine years ago. In the second part of the discussion I had to retract a proposal I made in the first part that had seemed obvious. The reasons why the obvious was in fact wrong are interesting. The explanation is below the fold.

Friday, March 11, 2016

Talk on Evolving the LOCKSS Technology at PASIG

At the PASIG meeting in Prague I gave a brief update on the ongoing evolution of the LOCKSS technology. Below the fold, an edited text of the talk with links to the sources.

Thursday, March 10, 2016

Talk on Private LOCKSS Networks at PASIG

I stood in for Vicky Reich to give an overview of Private LOCKSS Networks to the PASIG meeting. Below the fold, an edited text of the talk with links to the sources.

Thursday, February 11, 2016

James Jacobs on Looking Forward

Government documents have long been a field that the LOCKSS Program has been involved in. Recent history, such as that of the Harper administration in Canada, is full of examples of Winston Smith style history editing by governments. This makes it essential that copies of government documents are maintained outside direct government custody, and several private LOCKSS networks are doing this for various kinds of government documents. Below the fold, a look at the US Federal Depository Library Program, which has been doing this in the paper world for a long time, and the state of its gradual transition to the digital world.

Wednesday, September 23, 2015

Canadian Government Documents

Eight years ago, in the sixth post to this blog, I was writing about the importance of getting copies of government information out of the hands of the government:
Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception. Government documents are routinely recalled from the FDLP, and some are re-issued after alteration.
Anne Kingston at Maclean's has a terrifying article, Vanishing Canada: Why we’re all losers in Ottawa’s war on data, about the Harper administration's crusade to prevent anyone finding out what is happening as they strip-mine the nation. They don't even bother rewriting, they just delete, and prevent further information being gathered. The article mentions the desperate struggle Canadian government documents librarians have been waging using the LOCKSS technology to stay ahead of the destruction for the last three years. They won this year's CLA/OCLC Award for Innovative Technology, and details of the network are here.

Read the article and weep.

Saturday, February 28, 2015

Don't Panic

I was one of the crowd of people who reacted to Wednesday's news that Argonne National Labs would shut down the NEWTON Ask A Scientist service, on-line since 1991, this Sunday by alerting Jason Scott's ArchiveTeam. Jason did what I should have done before flashing the bat-signal. He fed the URL into the Internet Archive's Save Page Now, to be told "relax, we're all over it". The site has been captured since 1996 and the most recent capture before the announcement was Feb 7th. Jason arranged for captures Thursday and today.

As you can see by these examples, the Wayback Machine has a pretty good copy of the final state of the service and, as the use of Memento spreads, it will even remain accessible via its original URL.

Thursday, December 11, 2014

"Official" Senate CIA Torture Report

Please go and read James Jacobs' post The Official Senate CIA Torture Report to understand the challenges government documents librarians face. You would think that a document generating such worldwide interest would be easy to find and preserve. In your dreams, as it turns out.

Friday, January 10, 2014

Alex Stamos at EE380

Alex Stamos gave an excellent talk yesterday in Stanford's EE380 course. The video is linked from the EE380 schedule page. His title was Building a Trustworthy Business in the Post-Snowden Era, and the talk was based on analyzing the source material that has been released, rather than the media interpretation of those materials. The video is well worth your time to watch because, as Alex says, even if you are sure you will never do anything to attract the attention of the NSA:
  • You have to assume that, in a few years, many of the capabilities the NSA has today will be available in the market for exploits and be usable by the average bad guy.
  • Among the few products whose markets the US still dominates are Internet services and networking hardware. Success in these markets depends heavily on trust, and the revelations have destroyed this trust.
  • In particular, you have to assume that much of the software on which the integrity of your archive depends has backdoors inserted at the request of the three-letter agencies.
More generally, Robert Putnam in Making Democracy Work and Bowling Alone has shown the vast difference in economic success between high-trust and low-trust societies. The way the revelations have been able to repeatedly disprove successive Government denials is, together with the too-big-to-jail banksters, a serious threat to the US and other developed nations remaining high-trust societies. So even if you think you don't care about this stuff, you do.

Matt Blaze's piece in The Guardian is well worth a read too.

Thursday, December 12, 2013

UK National Archive

Joe Fay at The Register has an interesting piece about a tour of the UK National Archive.

The archive has an excellent and comprehensive approach to preserving the UK government's Web presence:
It uses a crawler to trawl the UK government’s web estate, aiming to hit sites every six months. With the government looking to shutter many obscure or unloved sites, the pressure is on. The web archive currently stands at around 80TB, with the crawler pulling in 1.6TB a month. At time of writing, there are 3 billion urls in the archive, with 1 billion captured last year alone. But does anyone really care? Seems like they do. Espley said the archive gets around 15 to 20 million page views a month. This often maps to current events - the assumption being that visitors are often cross checking current government positions/statements against previous positions.
One must hope that the cross-checking doesn't turn up anything embarrassing enough to imperil the Archive's budget ...