Cory ODaniel’s Post

CEO/Co-Founder @ Massdriver | OpenTofu co-founder

This AWS outage opinion is different, I swear (I think):

AWS was down for 12 hours yesterday. To be clear, over a year that's still about 99.86% uptime. That's pretty good uptime. In fact, for most companies, it's fine. Annoying, yes… but fine.

The problem is what happens next. Everyone freaks out. Execs start yelling "this can never happen again." Engineers start spinning up multi-region or multi-cloud plans, and a few months later the CFO soils themselves. (JK: they come looking for necks to choke and you buy a cloud cost tool 🙃)

Multi-region is a good first step. It'll get you more nines without going nuclear, but it's still not cheap.

What's missing is the conversation in the middle. Instead of engineering out of panic, we should be educating through it. When execs ask for "never," we need to educate them: having some downtime might actually be the smart financial move. It's a cost-risk tradeoff, not a "never again" problem.

Some napkin math from when I was a cloud consultant, based on a 95% base SLA:

99.9% uptime → ~9 hours downtime / year → maybe 2–3× cost
99.99% → ~1 hour → 5–10× cost
99.999% → ~5 minutes → 20×+ cost

What's that uptime worth to the business? Because you can absolutely buy more nines, but they come with way more zero$.

Aside, fun little anecdote: I worked for a company that had a ~90-minute partial outage that cost one of our customers about $500k/hour. Their CEO called our CEO … bilateral rage ensued. Our CEO said "never again." I explained that "never" is very expensive. "What do you mean by never?" I asked. "If it happens again we lose their business and you all lose your jobs." (Very important logo.) Well, we can't guarantee never. "Try," I was told. A cloud cost increase of double-digit percent of revenue later, we were told to roll it back.
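For anyone who wants to sanity-check the napkin math, here's a minimal Python sketch of the downtime-per-nines arithmetic. The cost multipliers are the post's own rough estimates, not measured numbers:

```python
# Downtime allowed per year at each uptime target, plus the post's
# rough cost multipliers (estimates from the post, not measured data).
HOURS_PER_YEAR = 365 * 24  # 8,760

sla_targets = [
    (0.999,   "2-3x"),   # "three nines"
    (0.9999,  "5-10x"),  # "four nines"
    (0.99999, "20x+"),   # "five nines"
]

for uptime, cost_multiplier in sla_targets:
    downtime_hours = (1 - uptime) * HOURS_PER_YEAR
    print(f"{uptime * 100:g}% uptime -> ~{downtime_hours:.2f} h "
          f"(~{downtime_hours * 60:.0f} min) downtime/year -> roughly {cost_multiplier} cost")
```

The point of the exercise is just to make the dollars-per-nine conversation concrete before anyone commits to "never again."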

Nathan Wade

Senior AWS Architect | Building Agentic Workflows | AI Educator Teaching Production Configs & GRC

2w

Was having a similar conversation with my architect colleagues yesterday. We were wondering how many of the affected “big name” companies were in that position because of the strategic choice to avoid cost / complexity of cross-region failover for a rare occurrence vs just being caught out. It’s also worth understanding the exact nature of the DNS issue. Sometimes it’s hard to defend against that. And finally, even if you’re smart about AWS in the first-party sense, your platform could still be impacted by some other AWS-dependent vendor who wasn’t as clever.

Mutha Nagavamsi

Cloud, Kubernetes & Devops. DM if you want to talk 😊 92420+ all Socials.

2w

This is such a brilliant read and I totally agree with it. Understanding the financial commitment needed to achieve a specific SLA is very important. Budget preferences can change over time, and with them SLAs should change too. In most cases the latter won't happen, and that's where all the problems start.

Glen Kendell, CISSP

Private Cloud & Infrastructure | Network Security Expert, Cybersecurity | Host of "Data & Confused" Podcast | Speaker, Leader, Entrepreneur.

2w

Spot on! Every additional "9" can cost an order of magnitude more. Everyone has outages. Even the sun has an eclipse every now and then.

Sam Hatoum

Founder & CEO @ Auto | Founder & CEO @ Xolvio

2w

The fallacy of absolutes vs the reality of tradeoffs

Andy Caine

Making AWS security frictionless for high-velocity teams

1w

💯 I was about to post something similar, but you've said it much better than I could have. Before building an architecture that can survive a full region outage, or even a full cloud provider outage, you need to ask yourself: how often does that event actually happen? Is that extra resiliency really worth the cost? I think most orgs will be better served by a good single-region, multi-AZ architecture.
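A rough way to put numbers on that question. Every input below is an illustrative assumption (outage frequency, duration, revenue impact, multi-region cost), not data from this thread; the point is the comparison, not the figures:

```python
# Back-of-the-envelope check: is cross-region failover worth it?
# All inputs below are illustrative assumptions, not real figures.

region_outages_per_year = 0.5      # assume one multi-hour region event every ~2 years
outage_duration_hours = 12         # assume an outage like the one in the post
revenue_loss_per_hour = 20_000     # assume what an hour of downtime costs the business
multi_region_extra_cost = 300_000  # assume added annual infra + engineering cost

expected_annual_loss = (region_outages_per_year
                        * outage_duration_hours
                        * revenue_loss_per_hour)

print(f"Expected annual loss from region outages: ${expected_annual_loss:,.0f}")
print(f"Extra annual cost of multi-region:        ${multi_region_extra_cost:,.0f}")
print("Multi-region pays for itself"
      if expected_annual_loss > multi_region_extra_cost
      else "Single-region, multi-AZ looks like the better buy")
```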

David Hay

0xDEADBEEF Architect

2w

Accepting lower levels of uptime is wildly cheaper and more effective than chasing near-zero downtime. I spent about a decade building some of the most reliable systems on the planet. These are things that went into space, which were supposed to "never fail." And they didn't. They had three to five independent systems internally, failed over in less than a second, never lost internal state, and would gang up on any offending node and shoot it. They were relentlessly tested, painstakingly developed, and every edge case was considered. The best hardware was thrown at the problem. Huge compromises were made for performance and capacity. Once it was configured, it was never changed. It was entirely independent and relied on no external service whatsoever.

Could you imagine being chained to that shit for deploying your applications? What a nightmare. Better to reduce expectations, so that people downstream of your software can write their software to address failures. Overall, this is a lot better than assuming something is never going to fail. Because it's going to fail. Even after you do all of that crazy shit I mentioned above, it's still going to fail. Because none of that stops someone from fucking it up.

Jon Warren 💻📷

Someone who gets the job done, be it Programming or Photography.

2w

I'm still trying to figure out what made this AWS outage any different from the many before. Was it longer? Perhaps. Did it cost more? Probably, but only due to inflation. Will it happen again? Most definitely. Is it preventable? Heck no! Hardware failures aren't predictable. MTBF figures are just that, a mean time, not a guaranteed time. Having enough redundancy can make losing one drive, or one processor, or one of whatever, less of an issue, but failures still aren't going to be predictable or preventable.
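To put a number on why redundancy helps even when individual failures are unpredictable, here's the textbook parallel-availability formula with illustrative figures (the 99% per-component number is an assumption, not from the comment). It also shows the catch: the math assumes independent failures, which a correlated region-wide event breaks:

```python
# Availability of n redundant copies in parallel: the system is only
# down when every copy is down at the same time.
# Assumes failures are independent, which correlated events
# (e.g. a region-wide DNS issue) violate.
def parallel_availability(single: float, n: int) -> float:
    return 1 - (1 - single) ** n

for n in (1, 2, 3):
    combined = parallel_availability(0.99, n)  # assume each copy is 99% available on its own
    print(f"{n} cop{'y' if n == 1 else 'ies'} at 99% each -> {combined:.5%} combined availability")
```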

Jérémie Tarot

FOSS, DevOps, IT/Cloud Architecture, Training, Tech Docs

2w

"a few months later the CFO soils themselves"... Now to wipe my coffee off my screen 😂 Laughter passed, excellent post. I believe/hope many of us can relate to this kind of situation, at any scale. Mine was much smaller, with a local hospital radiology center infrastructure, fewer zeros, but still lives potentially at stake. Sadly, even there, in the end, trade-offs had to be made 🤷♂️

Om prakash kumar

DevOps Engineer | AWS | Docker | Kubernetes | Jenkins | CI/CD | EKS | Linux | Python | ArgoCD | Open to Work | Deploying Scalable Infra with Cloud & Containers | Immediate joiner

2w

Worth reading 💡

Wes Bailey

Founder • Builder • Maverick • Physicist

2w

I just posted the same thing last night, but you put more math behind it. The cost isn't worth it when your customers are only inconvenienced by an event that happens every 2–4 years.
