Showing posts with label fault tolerance.

Tuesday, June 14, 2022

Where Did The Number 3 Come From?

The Keepers Registry, which tracks the preservation of academic journals by various "keepers" (preservation agencies), currently says:
20,127 titles are being ‘kept safe’ by 3 or more Keepers
The registry backs this up with this page, showing the number of journals being preserved by N keepers.
The NDSA Levels of Digital Preservation: An Explanation and Uses from 2013 is still in wide use as a guide to preserving digital content. It specifies the number of independent copies as 2 for "Level 1" and 3 for "Levels 2-4".

Alicia Wise of CLOCKSS asked "where did the number 3 come from?" Below the fold I discuss the backstory.
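As context for that backstory, here is a minimal sketch of the naive model that lies behind small copy counts (an illustration of mine, with an invented per-copy loss rate, not the Registry's or NDSA's reasoning): if copies fail independently, each extra copy multiplies the chance of total loss by the per-copy failure rate, so the returns diminish quickly after three.

```python
# Naive independent-failure model for N copies. Illustrative only; real
# failures are correlated, which is where the backstory gets interesting.
def annual_loss_probability(p_copy: float, n_copies: int) -> float:
    """Probability that all n_copies are lost in a year, assuming each
    copy is lost independently with probability p_copy."""
    return p_copy ** n_copies

for n in range(1, 5):
    print(n, f"{annual_loss_probability(0.01, n):.0e}")
# 1 1e-02, 2 1e-04, 3 1e-06, 4 1e-08 -- rapidly diminishing returns
```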

Tuesday, June 8, 2021

Unreliability At Scale

Thomas Claburn's FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof discusses two recent papers that are relevant to the extraordinary levels of reliability needed in long-term digital preservation at scale. Below the fold, some commentary on both papers.
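To give a sense of the scale involved, a back-of-the-envelope sketch (the error rate below is a hypothetical assumption for illustration, not a figure from either paper):

```python
# Back-of-the-envelope: expected undetected errors when reading a petabyte,
# assuming a (hypothetical, for illustration) per-bit undetected error rate.
PETABYTE_BITS = 8 * 10**15
undetected_error_rate = 1e-15   # assumption, not a figure from either paper

expected_errors = PETABYTE_BITS * undetected_error_rate
print(f"Expected undetected errors per petabyte read: {expected_errors:.0f}")
# => 8 -- at archive scale, "vanishingly rare" events become routine
```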

Tuesday, March 16, 2021

Correlated Failures

The invaluable statistics published by Backblaze show that, despite being built from technologies close to the physical limits (Heat-Assisted Magnetic Recording, 3D NAND Flash), modern digital storage media are extraordinarily reliable. However, I have long believed that the models that attempt to project the reliability of digital storage systems from the statistics of media reliability are wildly optimistic. They ignore foreseeable causes of data loss such as Coronal Mass Ejections and ransomware attacks, which cause correlated failures among the media in the system. No matter how many replicas there are, if all of them are destroyed or corrupted the data is irrecoverable.

Modelling these "black swan" events is clearly extremely difficult, but much less dramatic causes are in practice important too. It has been known at least since Talagala's 1999 Ph.D. thesis that media failures in storage systems are significantly correlated, and at least since Jiang et al's 2008 Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics that only about half the failures in storage systems are traceable to media failures. The rest happen in the pipeline from the media to the CPU. Because this typically aggregates data from many media components, it naturally causes correlations.

As I wrote in 2015's Disk reliability, discussing Backblaze's experience of a 40% Annual Failure Rate (AFR) in over 1,100 Seagate 3TB drives:
Alas, there is a long history of high failure rates among particular batches of drives. An experience similar to Backblaze's at Facebook is related here, with an AFR over 60%. My first experience of this was nearly 30 years ago in the early days of Sun Microsystems. Manufacturing defects, software bugs, mishandling by distributors, vibration resonance, there are many causes for these correlated failures.
Despite plenty of anecdotes, there is little useful data on which to base models of correlated failures in storage systems. Below the fold I summarize and comment on an important paper by a team from the Chinese University of Hong Kong and Alibaba that helps remedy this.
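To illustrate why correlation matters so much for such models, here is a toy calculation (my assumed probabilities, not data from the paper) comparing purely independent replica failures with the addition of a single correlated event, such as a bad batch, that takes out all replicas at once:

```python
# Toy model (assumed numbers, not the paper's data) of three replicas.
p = 0.01    # per-replica failure probability over some period
q = 0.001   # probability of a correlated event (e.g. bad batch) killing all replicas

independent = p ** 3                  # all three fail independently
correlated = q + (1 - q) * p ** 3     # correlated event OR independent failures

print(f"independent only: {independent:.2e}")   # 1.00e-06
print(f"with correlation: {correlated:.2e}")    # 1.00e-03, dominated by q
```

Even a rare correlated event dominates the loss probability, which is why models based only on per-device statistics are so optimistic.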

Thursday, February 18, 2021

Blast Radius

Last December Simon Sharwood reported on an "Infrastructure Keynote" by Amazon's Peter DeSantis in AWS is fed up with tech that wasn’t built for clouds because it has a big 'blast radius' when things go awry:
Among the nuggets he revealed was that AWS has designed its own uninterruptible power supplies (UPS) and that there’s now one in each of its racks. AWS decided on that approach because the UPS systems it needed were so big they required a dedicated room to handle the sheer quantity of lead-acid batteries required to keep its kit alive. The need to maintain that facility created more risk and made for a larger “blast radius” - the extent of an incident's impact - in the event of failure or disaster.

AWS is all about small blast radii, DeSantis explained, and in the past the company therefore wrote its own UPS firmware for third-party products.

“Software you don’t own in your infrastructure is a risk,” DeSantis said, outlining a scenario in which notifying a vendor of a firmware problem in a device commences a process of attempting to replicate the issue, followed by developing a fix and then deployment.

“It can take a year to fix an issue,” he said. And that’s many months too slow for AWS given a bug can mean downtime for customers.
This is a remarkable argument for infrastructure based on open source software, but that isn't what this post is about. Below the fold is a meditation on the concept of "blast radius", the architectural dilemma it poses, and its relevance to recent outages and compromises.
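As a preview of that dilemma, here is a toy sketch (numbers invented for illustration) of the trade-off DeSantis describes, a single shared UPS versus one per rack: the expected capacity lost is the same either way, but the worst-case impact of a single failure, the blast radius, is very different.

```python
# Toy comparison (invented numbers) of one shared UPS versus per-rack UPSes.
p_fail = 0.001   # hypothetical annual failure probability of any one UPS
racks = 100

# One big UPS: a single failure takes down every rack behind it.
shared = {"expected_racks_down": racks * p_fail, "blast_radius_racks": racks}
# Per-rack UPSes: more failures in total, but each affects only one rack.
per_rack = {"expected_racks_down": racks * p_fail, "blast_radius_racks": 1}

print("shared UPS  :", shared)     # same expected loss...
print("per-rack UPS:", per_rack)   # ...but a hundredth of the blast radius
```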

Tuesday, December 1, 2020

737 MAX Ungrounding

My post 737 MAX: The Case Against Boeing is a year old and has accumulated 58 updates in comments. Now that the aircraft is returning to service, it is time for a new post. Below the fold, Bjorn Fehrm has two interesting posts about the ungrounding.

Thursday, July 9, 2020

Inefficiency Is Good!

Back in 2015 I wrote Brittle systems and Pushing back against network effects about, among other things, the need for resilient systems and the importance of antitrust enforcement in getting them:
All over this blog (e.g. here) you will find references to W. Brian Arthur's Increasing Returns and Path Dependence in the Economy because it pointed out the driving forces, often called network effects, that cause technology markets to be dominated by one, or at most a few, large players. This is a problem for digital preservation, and for society in general, for both economic and technical reasons. The economic reason is that these natural but unregulated monopolies extract rents from their customers. The technical reason is that they make the systems upon which society depends brittle, subject to sudden, catastrophic and hard-to-recover-from failures.
Now, the pandemic has inspired two writers to address the bigger version of the same problem, Bruce Schneier in The Security Value of Inefficiency and Jonathan Aldred in This pandemic has exposed the uselessness of orthodox economics. Below the fold, some commentary.

Tuesday, March 24, 2020

More On Failures From FAST 2020

A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al, which I discussed in Enterprise SSD Reliability, wasn't the only paper at this year's Usenix FAST conference about storage failures. Below the fold I comment on one specifically about hard drives rather than SSDs, making it more relevant to archival storage.

Tuesday, July 23, 2019

Not To Pick On Toyota

Just under five years ago Prof. Phil Koopman gave a talk entitled A Case Study of Toyota Unintended Acceleration and Software Safety (slides, video). I only just discovered it, and it's an extraordinarily valuable resource for understanding the risks of embedded software, especially in life-critical products, and the processes needed to avoid failures such as those that caused deaths from sudden unintended acceleration (SUA) of Toyota cars, and from unintended pitch-down of Boeing 737 MAX aircraft. I doubt Toyota is an outlier in this respect, and I would expect that the multi-billion dollar costs of the problems Koopman describes have motivated much improvement in their processes. Follow me below the fold for the details.

Tuesday, March 26, 2019

FAST 2019

I wasn't able to attend this year's FAST conference in Boston, and, reading through the papers, I don't think I missed much relevant to long-term storage. Below the fold, a couple of quick notes and a look at the one really relevant paper.

Tuesday, January 22, 2019

Trump's Shutdown Impacts Information Access

Government shutdown causing information access problems by James A. Jacobs and James R. Jacobs is important. It documents the effect of the Trump government shutdown on access to globally important information:
Twitter and newspapers are buzzing with complaints about widespread problems with access to government information and data (see for example, Wall Street Journal (paywall 😐 ), ZDNet News, Pew Center, Washington Post, Scientific American, TheVerge, and FedScoop to name but a few).
Matthew Green, a professor at Johns Hopkins, said “It’s worrying that every single US cryptography standard is now unavailable to practitioners.” He was responding to the fact that he could not get the documents he needed from the National Institute of Standards and Technology (NIST) or its branch, the Computer Security Resource Center (CSRC). The government shutdown is the direct cause of these problems.
They point out how this illustrates the importance of libraries collecting and preserving web-published information:
Regardless of who you (or your user communities) blame for the shutdown itself, this loss of access was entirely foreseeable and avoidable. It was foreseeable because it has happened before. It was avoidable because libraries can select, acquire, organize, and preserve these documents and provide access to them and services for them whether the government is open or shut-down.
Go read the whole thing, and weep for the way libraries have abandoned their centuries-long mission of safeguarding information for future readers.

Tuesday, October 23, 2018

Gini Coefficients Of Cryptocurrencies

The Gini coefficient expresses a system's degree of inequality or, in the blockchain context, centralization. It therefore factors into arguments, like mine, that claims of blockchains' decentralization are bogus.

In his testimony to the US Senate Committee on Banking, Housing and Community Affairs' hearing on “Exploring the Cryptocurrency and Blockchain Ecosystem" entitled Crypto is the Mother of All Scams and (Now Busted) Bubbles While Blockchain Is The Most Over-Hyped Technology Ever, No Better than a Spreadsheet/Database, Nouriel Roubini wrote:
wealth in crypto-land is more concentrated than in North Korea where the inequality Gini coefficient is 0.86 (it is 0.41 in the quite unequal US): the Gini coefficient for Bitcoin is an astonishing 0.88.
The link is to Joe Weisenthal's How Bitcoin Is Like North Korea from nearly five years ago, which was based upon a Stack Exchange post, which in turn was based upon a post by the owner of the Bitcoinica exchange from 2011! Which didn't look at all holdings of Bitcoin, let alone the whole of crypto-land, but only at Bitcoinica's customers!

Follow me below the fold as I search for more up-to-date and comprehensive information. I'm not even questioning how Roubini knows the Gini coefficient of North Korea to two decimal places.
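For reference in that search, here is a minimal sketch of one standard way to compute a Gini coefficient from a list of holdings (the toy data is illustrative, not any real distribution of Bitcoin balances):

```python
def gini(holdings):
    """Gini coefficient of a list of non-negative holdings.
    0 = perfect equality, values near 1 = extreme concentration."""
    xs = sorted(holdings)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over the values sorted in ascending order.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Illustrative toy distribution: one "whale" and many small holders.
print(gini([1000] + [1] * 99))   # ~0.9, heavily concentrated
print(gini([10] * 100))          # 0.0, perfectly equal
```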

Tuesday, October 2, 2018

Bitcoin's Academic Pedigree

Bitcoin's Academic Pedigree (also here) by Arvind Narayanan and Jeremy Clark starts:
If you've read about bitcoin in the press and have some familiarity with academic research in the field of cryptography, you might reasonably come away with the following impression: Several decades' worth of research on digital cash, beginning with David Chaum, did not lead to commercial success because it required a centralized, banklike server controlling the system, and no banks wanted to sign on. Along came bitcoin, a radically different proposal for a decentralized cryptocurrency that didn't need the banks, and digital cash finally succeeded. Its inventor, the mysterious Satoshi Nakamoto, was an academic outsider, and bitcoin bears no resemblance to earlier academic proposals.
They comprehensively debunk this view, showing that each of the techniques Nakamoto used had been developed over the preceding three decades of academic research, and that Nakamoto's brilliant contribution was:
the specific, complex way in which the underlying components are put together.
Below the fold, details on the specific techniques.
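As an example of one of those pre-existing components, here is a minimal hashcash-style proof-of-work sketch (a simplification for illustration, not Bitcoin's actual implementation):

```python
import hashlib

def proof_of_work(data: bytes, difficulty_bits: int = 20) -> int:
    """Hashcash-style puzzle: find a nonce such that SHA-256(data || nonce)
    falls below a target, i.e. has enough leading zero bits. Sketch only."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

# Low difficulty so the demo finishes quickly.
nonce = proof_of_work(b"example block header", difficulty_bits=16)
print("found nonce:", nonce)
```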

Thursday, June 28, 2018

Rate limits

Andrew Marantz writes in Reddit and the Struggle to Detoxify the Internet:
[On 2017's] April Fools’, instead of a parody announcement, Reddit unveiled a genuine social experiment. It was called r/Place, and it was a blank square, a thousand pixels by a thousand pixels. In the beginning, all million pixels were white. Once the experiment started, anyone could change a single pixel, anywhere on the grid, to one of sixteen colors. The only restriction was speed: the algorithm allowed each redditor to alter just one pixel every five minutes. “That way, no one person can take over—it’s too slow,” Josh Wardle, the Reddit product manager in charge of Place, explained. “In order to do anything at scale, they’re gonna have to coöperate."
The r/Place experiment successfully forced coöperation, for example with r/AmericanFlagInPlace drawing a Stars and Stripes, or r/BlackVoid trying to rub out everything:
Toward the end, the square was a dense, colorful tapestry, chaotic and strangely captivating. It was a collage of hundreds of incongruous images: logos of colleges, sports teams, bands, and video-game companies; a transcribed monologue from “Star Wars”; likenesses of He-Man, David Bowie, the “Mona Lisa,” and a former Prime Minister of Finland. In the final hours, shortly before the experiment ended and the image was frozen for posterity, BlackVoid launched a surprise attack on the American flag. A dark fissure tore at the bottom of the flag, then overtook the whole thing. For a few minutes, the center was engulfed in darkness. Then a broad coalition rallied to beat back the Void; the stars and stripes regained their form, and, in the end, the flag was still there.
What is important about the r/Place experiment? Follow me below the fold for an explanation.
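For concreteness, here is a minimal sketch of the kind of per-user cooldown r/Place enforced, one pixel per redditor per five minutes (my illustration, not Reddit's code):

```python
import time
from typing import Dict, Optional

class CooldownLimiter:
    """Allow each user one action per `cooldown` seconds (r/Place used five
    minutes per pixel). A minimal in-memory sketch, not Reddit's implementation."""

    def __init__(self, cooldown: float = 300.0):
        self.cooldown = cooldown
        self.last_action: Dict[str, float] = {}  # user -> time of last allowed action

    def allow(self, user: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self.last_action.get(user)
        if last is None or now - last >= self.cooldown:
            self.last_action[user] = now
            return True
        return False

limiter = CooldownLimiter(cooldown=300)
print(limiter.allow("alice", now=0))    # True  -- first pixel allowed
print(limiter.allow("alice", now=100))  # False -- still cooling down
print(limiter.allow("alice", now=301))  # True  -- five minutes have elapsed
```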

Thursday, July 27, 2017

Decentralized Long-Term Preservation

Lambert Heller is correct to point out that:
name allocation using IPFS or a blockchain is not necessarily linked to the guarantee of permanent availability, the latter must be offered as a separate service.
Storage isn't free, and thus the "separate services" need to have a viable business model. I have demonstrated that increasing returns to scale mean that the "separate service" market will end up being dominated by a few large providers just as, for example, the Bitcoin mining market is. People who don't like this conclusion often argue that, at least for long-term preservation of scholarly resources, the service will be provided by a consortium of libraries, museums and archives. Below the fold I look into how this might work.

Thursday, July 6, 2017

Archive vs. Ransomware

Archives perennially ask the question "how few copies can we get away with?"
This is a question I've blogged about in 2016 and 2011 and 2010, when I concluded:
  • The number of copies needed cannot be discussed except in the context of a specific threat model.
  • The important threats are not amenable to quantitative modeling.
  • Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".
I've also written before about the immensely profitable business of ransomware. Recent events, such as WannaCrypt, NotPetya and the details of NSA's ability to infect air-gapped computers, should convince anyone that ransomware is a threat to which archives are exposed. Below the fold I look into how archives should be designed to resist this credible threat.
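As a preview of the argument, a toy calculation (my numbers, purely illustrative) of why copy count alone does not defend against a correlated threat like ransomware unless some copies are isolated from it:

```python
# Toy model (illustrative numbers): an attack that reaches every online
# replica at once, plus independent media loss for each copy.
def loss_probability(online_copies: int, offline_copies: int,
                     p_media: float = 0.01, q_attack: float = 0.05) -> float:
    """Probability of losing all copies in a period. The attack destroys
    every online copy; offline (air-gapped) copies suffer only independent
    media loss."""
    all_online_lost = q_attack + (1 - q_attack) * p_media ** online_copies
    all_offline_lost = p_media ** offline_copies if offline_copies else 1.0
    return all_online_lost * all_offline_lost

print(loss_probability(3, 0))   # ~0.05   -- adding online copies barely helps
print(loss_probability(6, 0))   # ~0.05   -- still dominated by the attack
print(loss_probability(3, 1))   # ~0.0005 -- one isolated copy changes the picture
```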

Thursday, May 12, 2016

The Future of Storage

My preparation for a workshop on the future of storage included giving a talk at Seagate and talking to the all-flash advocates. Below the fold I attempt to organize into a coherent whole the results of these discussions and content from a lot of earlier posts.

Tuesday, March 1, 2016

The Cloudy Future of Disk Drives

For many years, following Dave Anderson of Seagate, I've been pointing out that the constraints of manufacturing capacity mean that the only medium available on which to store the world's bulk data is hard disk. Eric Brewer's fascinating FAST 2016 keynote, entitled Spinning Disks and their Cloudy Future, and Google's associated white paper start from this premise:
The rise of portable devices and services in the Cloud has the consequence that (spinning) hard disks will be deployed primarily as part of large storage services housed in data centers. Such services are already the fastest growing market for disks and will be the majority market in the near future.
Eric's argument is that, since cloud storage will shortly be the majority of the market and the other segments are declining, the design of hard drives no longer needs to be a compromise suitable for a broad range of uses, but should be optimized for the Cloud. Below the fold, I look into some details of the optimizations and provide some supporting evidence.

Sunday, October 4, 2015

Pushing back against network effects

I've had occasion to note the work of Steve Randy Waldman before. Today, he has a fascinating post up entitled 1099 as Antitrust that may not at first seem relevant to digital preservation. Below the fold I trace the important connection.

Tuesday, June 2, 2015

Brittle systems

In my recent rant on the Internet of Things, I linked to Mike O'Dell's excellent post to Dave Farber's IP list, Internet of Obnoxious Things, and suggested you read it. I'm repeating that advice as, below the fold, I start from a different part of Mike's post.