Showing posts with label storage failures.

Tuesday, October 28, 2025

The Bathtub Curve

The economics of long-term data storage are critically dependent not just upon the Kryder rate, the rate at which the technology improves cost per byte, but also upon the reliability of the media over time. You want to replace media because they are no longer economic, not because they have become unreliable while still being economic.

For more than a decade Backblaze has been providing an important public service by publishing data on the reliability of their hard drives, and more recently their SSDs. Below the fold I comment on this month's post from their Drive Stats Team, Are Hard Drives Getting Better? Let’s Revisit the Bathtub Curve.

Wikipedia defines the Bathtub Curve as a common concept in reliability engineering:
The 'bathtub' refers to the shape of a line that curves up at both ends, similar in shape to a bathtub. The bathtub curve has 3 regions:
  1. The first region has a decreasing failure rate due to early failures.
  2. The middle region is a constant failure rate due to random failures.
  3. The last region is an increasing failure rate due to wear-out failures.
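
The same shape can be sketched numerically as the sum of a decreasing Weibull hazard (early failures), a constant hazard (random failures) and an increasing Weibull hazard (wear-out failures). Here is a minimal Python sketch with invented parameters, purely to illustrate the three regions rather than to model any real drive population:

    import numpy as np

    def weibull_hazard(t, shape, scale):
        # Weibull hazard rate: h(t) = (shape/scale) * (t/scale)**(shape - 1)
        return (shape / scale) * (t / scale) ** (shape - 1)

    def bathtub_hazard(t):
        # Illustrative bathtub curve: early failures + random failures + wear-out.
        # The parameters are invented for illustration, not fitted to drive data.
        early = weibull_hazard(t, shape=0.5, scale=2.0)      # decreasing with age
        random_rate = 0.01                                   # constant
        wear_out = weibull_hazard(t, shape=5.0, scale=8.0)   # increasing with age
        return early + random_rate + wear_out

    ages = np.linspace(0.1, 10.0, 25)
    for age, rate in zip(ages, bathtub_hazard(ages)):
        print(f"age {age:4.1f} years: failure rate {rate:.4f} per year")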

Tuesday, August 19, 2025

2025 Optical Media Durability Update

Seven years ago I posted Optical Media Durability and discovered:
Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary.
Here are the subsequent annual updates:
It is time once again for the mind-numbing process of feeding 45 disks through the readers to verify their checksums, and yet again this year every single MD5 was successfully verified. Below the fold, the details.
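
For readers curious what the verification step involves, here is a minimal Python sketch of checking files against a stored md5sum-format manifest; the manifest name and mount point below are assumptions for illustration, not a description of my actual setup:

    import hashlib
    from pathlib import Path

    def md5_of(path, chunk_size=1 << 20):
        # Compute the hex MD5 of a file, reading it in 1 MB chunks.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            while block := f.read(chunk_size):
                digest.update(block)
        return digest.hexdigest()

    def verify_manifest(manifest_path):
        # Check every "md5  relative/path" line (md5sum format) against the files.
        base = Path(manifest_path).parent
        failures = 0
        for line in Path(manifest_path).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            actual = md5_of(base / name)
            if actual != expected:
                failures += 1
                print(f"MISMATCH {name}: expected {expected}, got {actual}")
        print("all checksums verified" if failures == 0 else f"{failures} failure(s)")

    # Hypothetical usage, assuming the disc is mounted and carries a manifest:
    # verify_manifest("/media/cdrom/manifest.md5")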

Thursday, January 18, 2024

A Lesson Learned

You know how backups work great until you really need them? Below the fold, a lesson learned from my recent example of this phenomenon.

Thursday, August 17, 2023

Optical Media Durability Update

Five years ago I posted Optical Media Durability and discovered:
Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary.
Four years ago I repeated the mind-numbing process of feeding 45 disks through the reader and verifying their checksums. Three years ago I did it again, and then again two years ago, and then again a year ago.

It is time again for this annual chore, and yet again this year every single MD5 was successfully verified. Below the fold, the details.

Tuesday, June 14, 2022

Where Did The Number 3 Come From?

The Keepers Registry, which tracks the preservation of academic journals by various "keepers" (preservation agencies), currently says:
20,127 titles are being ‘kept safe’ by 3 or more Keepers
The registry backs this up with this page, showing the number of journals being preserved by N keepers.
The NDSA Levels of Digital Preservation: An Explanation and Uses from 2013 is still in wide use as a guide to preserving digital content. It specifies the number of independent copies as 2 for "Level 1" and 3 for "Levels 2-4".

Alicia Wise of CLOCKSS asked "where did the number 3 come from?" Below the fold I discuss the backstory.
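
Part of the intuition is easy to state: if copy failures were independent, the probability of losing every copy would fall exponentially with the number of copies, so a small number like 3 already buys a lot. A toy Python calculation of that idealized case, not the backstory itself:

    def p_all_copies_lost(p_single_loss, n_copies):
        # Probability that every copy is lost in the period, assuming the
        # copies fail independently of one another.
        return p_single_loss ** n_copies

    # Toy numbers: a 1% chance of losing any single copy in a given year.
    for n in range(1, 6):
        print(f"{n} copies: annual loss probability {p_all_copies_lost(0.01, n):.0e}")

The catch, of course, is that the independence assumption rarely holds in practice, which is where the backstory gets interesting.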

Thursday, June 9, 2022

Backblaze On Hard Disk Reliability

It has been a long time since I blogged about the invaluable hard drive reliability data that Backblaze has been publishing quarterly since 2015, so I checked their blog and found Andy Klein's Star Wars themed Backblaze Drive Stats for Q1 2022, as well as his fascinating How Long Do Disk Drives Last?. Below the fold I comment on both.

Tuesday, March 22, 2022

Storage Update: Part 2

This is part 2 of my latest update on storage technology. Part 1, covering developments in DNA as a storage medium, is here. This part was sparked by a paper at Usenix's File And Storage Technologies conference from Bianca Schroeder's group at U. Toronto and NetApp on the performance of SSDs at scale. It followed on from their 2020 FAST "Best Paper" that I discussed in Enterprise SSD Reliability, and it prompted me to review the literature of this area. The result is below the fold.

Thursday, August 19, 2021

Optical Media Durability Update

Three years ago I posted Optical Media Durability and discovered:
Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary.
Two years ago I repeated the mind-numbing process of feeding 45 disks through the reader and verifying their checksums. A year ago I did it again.

It is time again for this annual chore, and yet again this year I failed to find any errors. Below the fold, the details.

Tuesday, June 8, 2021

Unreliability At Scale

Thomas Claburn's FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof discusses two recent papers that are relevant to the extraordinary levels of reliability needed in long-term digital preservation at scale. Below the fold some commentary on both papers.

Thursday, March 25, 2021

Internet Archive Storage

The Internet Archive is a remarkable institution, which has become increasingly important during the pandemic. For many years it has been among the world's top 300 Web sites; it is currently ranked #209, sustaining almost 60Gb/s of outbound bandwidth from its collection of almost half a trillion archived Web pages and much other content. It does this on a budget of under $20M/yr, yet maintains 99.98% availability.

Jonah Edwards, who runs the Core Infrastructure team, gave a presentation on the Internet Archive's storage infrastructure to the Archive's staff. Below the fold, some details and commentary.

Tuesday, March 16, 2021

Correlated Failures

The invaluable statistics published by Backblaze show that, despite being built from technologies close to the physical limits (Heat-Assisted Magnetic Recording, 3D NAND Flash), modern digital storage media are extraordinarily reliable. However, I have long believed that the models that attempt to project the reliability of digital storage systems from the statistics of media reliability are wildly optimistic. They ignore foreseeable causes of data loss such as Coronal Mass Ejections and ransomware attacks, which cause correlated failures among the media in the system. No matter how many replicas there are, if all of them are destroyed or corrupted the data is irrecoverable.

Modelling these "black swan" events is clearly extremely difficult, but much less dramatic causes are in practice important too. It has been known at least since Talagala's 1999 Ph.D. thesis that media failures in storage systems are significantly correlated, and at least since Jiang et al's 2008 Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics that only about half the failures in storage systems are traceable to media failures. The rest happen in the pipeline from the media to the CPU. Because this typically aggregates data from many media components, it naturally causes correlations.
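
To illustrate why the correlations matter, here is a minimal Monte Carlo sketch of my own (not the paper's model) comparing the chance of losing all three replicas when failures are purely independent against the case where a rare shared-cause event can destroy every replica at once:

    import random

    def p_data_loss(years, p_indep, p_shared, replicas=3, trials=100_000):
        # Fraction of trials in which all replicas are lost within the period.
        # p_indep:  per-replica, per-year independent failure probability.
        # p_shared: per-year probability of a shared-cause event (ransomware,
        #           Coronal Mass Ejection, ...) that destroys every replica.
        # No repair or replacement is modelled, to keep the sketch short.
        losses = 0
        for _ in range(trials):
            alive = replicas
            for _ in range(years):
                if random.random() < p_shared:
                    alive = 0
                else:
                    alive -= sum(random.random() < p_indep for _ in range(alive))
                if alive == 0:
                    break
            losses += (alive == 0)
        return losses / trials

    print("independent failures only:", p_data_loss(10, p_indep=0.02, p_shared=0.0))
    print("plus rare shared cause:   ", p_data_loss(10, p_indep=0.02, p_shared=0.001))

With these invented numbers, a shared cause twenty times rarer than an individual media failure still accounts for more of the data loss over a decade than all the independent failures combined.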

As I wrote in 2015's Disk reliability, discussing Backblaze's experience of a 40% Annual Failure Rate (AFR) in over 1,100 Seagate 3TB drives:
Alas, there is a long history of high failure rates among particular batches of drives. An experience similar to Backblaze's at Facebook is related here, with an AFR over 60%. My first experience of this was nearly 30 years ago in the early days of Sun Microsystems. Manufacturing defects, software bugs, mishandling by distributors, vibration resonance, there are many causes for these correlated failures.
Despite plenty of anecdotes, there is little useful data on which to base models of correlated failures in storage systems. Below the fold I summarize and comment on an important paper by a team from the Chinese University of Hong Kong and Alibaba that helps remedy this.

Thursday, August 20, 2020

Optical Media Durability: Update

Two years ago I posted Optical Media Durability and discovered:
Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary.
A year ago I repeated the mind-numbing process of feeding 45 disks through the reader and verifying their checksums. It is time again for this annual chore, and once again this year I failed to find any errors. Below the fold, the details.

Tuesday, March 24, 2020

More On Failures From FAST 2020

A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al, which I discussed in Enterprise SSD Reliability, wasn't the only paper at this year's Usenix FAST conference about storage failures. Below the fold I comment on one specifically about hard drives rather than SSDs, making it more relevant to archival storage.

Tuesday, March 10, 2020

Enterprise SSD Reliability

I couldn't attend this year's USENIX FAST conference. Because of the COVID-19 outbreak the normally high level of participation from Asia was greatly reduced, with many registrants and even some presenters unable to make it. But I've been reading the papers, and below the fold I have commentary on an extremely interesting one about the reliability of SSD media in enterprise applications.

Tuesday, September 17, 2019

Interesting Articles From Usenix

Unless you're a member of Usenix (why aren't you?) you'll have to wait a year to read two of the three interesting preservation-related articles in the Fall 2019 issue of ;login:. Below the fold is a little taste of each of them, with links to the full papers if you don't want to wait a year:

Thursday, August 22, 2019

Optical Media Durability: Update

A year ago I posted Optical Media Durability and discovered:
Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary.
It is time to repeat the mind-numbing process of feeding 45 disks through the reader and verifying their checksums. Below the fold, this year's results.

Tuesday, March 26, 2019

FAST 2019

I wasn't able to attend this year's FAST conference in Boston, and reading through the papers I didn't miss much relevant to long-term storage. Below the fold a couple of quick notes and a look at the one really relevant paper.

Tuesday, March 19, 2019

Compression vs. Preservation

An archive is in a hardware refresh cycle and they have asked me to comment on concerns arising because their favored storage hardware uses data compression, which may not be possible to disable even if doing so were a good idea. This is an issue I wrote about two years ago in Threats to stored data.

Because similar concerns keep re-appearing in discussions of digital preservation, I decided this time to discuss it in the same way as Cloud for Preservation, writing a post with a general discussion of the issues without referring to a specific institution. Below the fold, the details.

Tuesday, February 26, 2019

Economic Models Of Long-Term Storage

My work on the economics of long-term storage with students at the UC Santa Cruz Center for Research in Storage Systems stopped about six years ago, some time after the funding from the Library of Congress ran out. Last year, to help with some work at the Internet Archive, I developed a much simplified economic model, which runs on a Raspberry Pi.

Two recent developments provide alternative models:
  • Last year, James Byron, Darrell Long, and Ethan Miller's Using Simulation to Design Scalable and Cost-Efficient Archival Storage Systems (also here) reported on a vastly more sophisticated model developed at the Center. It includes both much more detailed historical data about, for example, electricity cost, and covers various media types including tape, optical, and SSDs.
  • At the recent PASIG Julian Morley reported on the model being used at the Stanford Digital Repository, a hybrid local and cloud system, and he has made the spreadsheet available for use.
Below the fold some commentary on all three models.
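
For readers unfamiliar with this kind of model, the core of all of them is a discounted cash flow: the endowment needed to keep data stored is the sum of the discounted costs of successive media purchases, each cheaper than the last if the Kryder rate holds up. Here is a greatly simplified Python sketch of that idea with invented numbers; it is not the Raspberry Pi model, Byron et al's simulator, or Morley's spreadsheet:

    def endowment(media_cost_today, kryder_rate, discount_rate,
                  replace_every=4, horizon=100):
        # Present value of buying replacement media every `replace_every` years,
        # with cost per byte falling at `kryder_rate` per year and future
        # spending discounted at `discount_rate` per year.
        total = 0.0
        for year in range(0, horizon, replace_every):
            cost = media_cost_today * (1 - kryder_rate) ** year   # cheaper media later
            total += cost / (1 + discount_rate) ** year           # discounted to today
        return total

    # Toy numbers: $100 of media today, 5% discount rate.
    print(endowment(100, kryder_rate=0.20, discount_rate=0.05))  # fast Kryder rate
    print(endowment(100, kryder_rate=0.10, discount_rate=0.05))  # slower Kryder rate

Even this toy version shows the point the real models make in far greater detail: the answer is acutely sensitive to the assumed Kryder rate.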

Tuesday, September 11, 2018

What Does Data "Durability" Mean?

In What Does 11 Nines of Durability Really Mean? David Friend writes:
No amount of nines can prevent data loss.

There is one very important and inconvenient truth about reliability: Two-thirds of all data loss has nothing to do with hardware failure.

The real culprits are a combination of human error, viruses, bugs in application software, and malicious employees or intruders. Almost everyone has accidentally erased or overwritten a file. Even if your cloud storage had one million nines of durability, it can’t protect you from human error.
Friend may be right that these are the top 5 causes of data loss, but over the timescale of preservation as opposed to storage they are far from the only ones. In Requirements for Digital Preservation Systems: A Bottom-Up Approach we listed 13 of them. Below the fold, some discussion of the meaning and usefulness of durability claims.
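
As a back-of-the-envelope illustration (my arithmetic, not Friend's), take a durability claim at face value by reading "N nines" as an annual per-object loss probability of 10^-N and scaling it by the number of objects stored:

    def expected_annual_losses(nines, num_objects):
        # Expected objects lost per year if the annual per-object loss
        # probability really were 10**(-nines).
        return num_objects * 10 ** (-nines)

    # Eleven nines across a trillion stored objects still implies roughly
    # ten objects lost per year, and that counts only the failure modes
    # the durability model includes.
    print(expected_annual_losses(11, 10 ** 12))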