© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Resiliency and Availability Design
Patterns for the Cloud
B A R 4
K Y I V
11.06.2019
{
  "name": "Sébastien Stormacq",
  "role": "Technical Evangelist",
  "company": "Amazon Web Services",
  "twitter": "@sebsto",
  "github": "sebsto"
}
Can you guess what will happen?
“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO – Amazon.com
Distributed Systems
are hard
Complex systems
Amazon Twitter Netflix
Resiliency: Ability for a system to handle and
eventually recover from unexpected conditions
Partial failure mode
How do we build resilient software
systems?
People
Application
Network & Data
Infrastructure
Let’s talk about Availability
Availability in parallel
A = 1 - (1 - A_x)^2
Part X
Part X
Availability in parallel
Component | Availability | Downtime per year
X | 99% (2-nines) | 3 days 15 hours
Two X in parallel | 99.99% (4-nines) | 52 minutes
Three X in parallel | 99.9999% (6-nines) | 31 seconds
Component redundancy increases availability
significantly!
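The figures in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes component failures are independent and a 365-day year; the function names are illustrative, not taken from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components (independent failures assumed)."""
    return 1 - (1 - component_availability) ** copies


def downtime_seconds_per_year(availability: float) -> float:
    """Expected unavailable time per year, in seconds."""
    return (1 - availability) * SECONDS_PER_YEAR


for copies in (1, 2, 3):
    a = parallel_availability(0.99, copies)
    minutes = downtime_seconds_per_year(a) / 60
    print(f"{copies} x 99% in parallel: {a:.6%} available, {minutes:.1f} min/year down")
```

Each extra copy multiplies the unavailability by another factor of 0.01, which is why two nines become four and then six.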
Fully-scaled Availability Zone
Highly redundant regional network
AWS Region and availability zones
[Diagram: a Region containing Availability zones a, b, and c, each with several data centers]
1 or more data centers per AZ
2 or more AZs per region (new regions min 3)
Let’s talk about Multi-AZ
Multi-AZ architecture
Region
Availability zone a Availability zone b Availability zone c
Instances Instances Instances
DB Instance DB instance
standby
Elastic Load
Balancing (ELB)
Multi-AZ architecture
X
Region
Availability zone a Availability zone b Availability zone c
Instances Instances Instances
DB Instance DB instance
standby
Elastic Load
Balancing (ELB)
Multi-AZ architecture
X
Region
Availability zone a Availability zone b Availability zone c
Instances Instances Instances
DB Instance DB instance
new master
Elastic Load
Balancing (ELB)
Multi-AZ architecture
• Enables fault-tolerant applications
• AWS regional services designed to
withstand AZ failures
• Leveraged by AWS regional
services such as Amazon S3,
Amazon DynamoDB, Amazon
Aurora, Amazon ELBs, etc.
Region
Availability zone a Availability zone b Availability zone c
Instances Instances Instances
DB Instance DB instance
standby
Elastic Load
Balancing (ELB)
Let’s talk about auto scaling
Auto-Scaling
Fixed Variable
Availability zone 1
Auto Scaling group
AWS Region
Availability zone 2
Auto-scaling for self-healing
Elastic Load
Balancing (ELB)
X
Let’s talk about decoupling and async
Process A Process B Process A Process B
Synchronous Asynchronous
Waiting
Working
Continues
Get result
get or fetch result
Decoupling with async pattern
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
{JobID: 0001, Result: bar}
Cache node
Worker
Instance
Worker
Instance
Queue/Streaming
API
Instance
API
Instance
API
Instance
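The request flow on this slide (the API accepts a task, returns a JobID at once, a worker processes it, and the result is fetched later) can be sketched in-process as follows. This is only an illustration under assumptions: `queue.Queue` stands in for the queue or stream, a dict stands in for the cache node, and `submit`, `fetch_result`, and `worker` are hypothetical names.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stands in for the queue/streaming service
results = {}           # stands in for the cache node


def submit(task):
    """API side: enqueue the job and return its ID without waiting for the work."""
    job_id = str(uuid.uuid4())
    jobs.put({"JobID": job_id, "Task": task})
    return job_id


def fetch_result(job_id):
    """API side: return the result if it is ready, otherwise None (caller polls again)."""
    return results.get(job_id)


def worker():
    """Worker side: process jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results[job["JobID"]] = f"done: {job['Task']}"
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

job_id = submit("DO foo")
jobs.join()                  # demo only: wait until the worker has finished the job
print(fetch_result(job_id))
jobs.put(None)               # stop the worker
```

The caller never blocks on the worker: if workers are slow or down, jobs accumulate in the queue instead of failing the API call.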
Push Notification
User
Worker
Instance
Worker
Instance
API
Instance
API
Instance
Cache node
Fetch results
API
Instance
Queue/Streaming
Degrade & prioritize traffic
with queues
Worker
Instance
Worker
Instance
API
Instance
API
Instance
API
Instance
HighPriorityQueue
LowPriorityQueue
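A rough sketch of the two-queue idea, assuming in-memory deques in place of real queues and a hypothetical cap on low-priority work: workers always drain the high-priority queue first, and low-priority jobs are rejected once their queue is full.

```python
from collections import deque

high_priority = deque()
low_priority = deque()
LOW_PRIORITY_LIMIT = 2  # hypothetical bound, chosen only for the example


def enqueue(job, important):
    """Return True if the job was accepted, False if it was shed."""
    if important:
        high_priority.append(job)
        return True
    if len(low_priority) >= LOW_PRIORITY_LIMIT:
        return False  # degrade: drop low-priority work under pressure
    low_priority.append(job)
    return True


def next_job():
    """Workers take high-priority jobs first, then low-priority ones."""
    if high_priority:
        return high_priority.popleft()
    if low_priority:
        return low_priority.popleft()
    return None
```

With this arrangement a backlog of low-priority jobs delays only other low-priority jobs.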
Let’s talk about the failures in
distributed systems
Recommendation Engine
Service
Service
Service
Preserve
at all cost
Preventing failures
Some of the most important things to think about
Recommendation Engine
Service
Service
Service
Preserve
at all cost
Let’s talk about timeouts, backoff &
retries!
Users
App
DB
Conn
Pool
INSERT
INSERT
INSERT
INSERT
What happens if the DB “slows down”?
Timeout client side Timeout backend side ??
User 1
App
DB
Conn
Pool
INSERT
Timeout client side = 10s Timeout backend side = default = Infinite
Retry INSERT
Retry INSERT
ERROR: Failed to get connection from pool
Retry
https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
import requests
import timeout_decorator

@timeout_decorator.timeout(5, timeout_exception=StopIteration)  # abort after 5 seconds
def timed_get(url):
    return requests.get(url)
https://pypi.org/project/timeout-decorator/
Set the timeouts!
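The same rule applies to waiting for a pooled connection: every wait should be bounded. The sketch below is an assumption-laden toy (the `ConnectionPool` class and its string "connections" are hypothetical, not a real driver API) showing a pool whose `acquire` gives up after a timeout instead of blocking forever.

```python
import queue


class ConnectionPool:
    """Toy pool: connections are plain strings, just to show the bounded wait."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout_s):
        """Wait at most timeout_s seconds for a free connection."""
        try:
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise TimeoutError("no free connection within timeout") from None

    def release(self, conn):
        self._free.put(conn)


pool = ConnectionPool(size=1)
conn = pool.acquire(timeout_s=0.1)
try:
    pool.acquire(timeout_s=0.1)      # pool is exhausted: fails fast
except TimeoutError as err:
    print("gave up:", err)
finally:
    pool.release(conn)
```

Failing fast here lets the caller shed the request or back off, rather than piling more waiting threads onto a slow database.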
How else could we have prevented the error?
User 1
DB
Conn
Pool
INSERT
Retry INSERT
Retry INSERT
Retry
ERROR: Failed to get connection from pool
User 1
DB
Conn
Pool
INSERT
Timeout client side = 10s Timeout backend side = 10s
Wait 2s before Retry
INSERT
INSERT
Wait 4s before Retry
Wait 8s before Retry
Wait 16s before Retry
Backing off between retries
Releasing connections
Backoff
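The 2s, 4s, 8s, 16s schedule from this slide can be written as a small retry helper. This is a sketch, not a production client: `retry_with_backoff` is an illustrative name, and the `sleep` parameter is injectable so the example can run without actually waiting.

```python
import time


def retry_with_backoff(operation, max_retries=4, base_delay_s=2.0, sleep=time.sleep):
    """Call operation(); after each failure wait 2s, 4s, 8s, 16s before the next try."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay_s * 2 ** attempt)


# Demo: an operation that fails three times, then succeeds.
waits = []
calls = {"n": 0}


def flaky_insert():
    calls["n"] += 1
    if calls["n"] < 4:
        raise ConnectionError("db busy")
    return "inserted"


print(retry_with_backoff(flaky_insert, sleep=waits.append), waits)
```

While the caller sleeps it holds no connection, which gives the slow backend room to recover.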
No jitter With jitter
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Simple Exponential Backoff is not enough: Add Jitter
Adding Jitter
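As a sketch of the "Full Jitter" variant: the delay is drawn uniformly between zero and the capped exponential value, so clients that failed together do not retry together. The base and cap values below are assumptions for the example, and `full_jitter_delay` is an illustrative name.

```python
import random


def full_jitter_delay(attempt, base_s=1.0, cap_s=60.0, rng=random):
    """Uniform random delay in [0, min(cap_s, base_s * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap_s, base_s * 2 ** attempt))


for attempt in range(5):
    print(f"attempt {attempt}: sleep {full_jitter_delay(attempt):.2f}s")
```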
Example: add jitter 0-1000ms
import random
import time

import requests

def get_item(self, url, n=1):
    MAX_TRIES = 12
    try:
        res = requests.get(url)
    except requests.RequestException:
        if n > MAX_TRIES:
            return None  # give up once the retry budget is spent
        n += 1
        # Exponential backoff (2**n seconds) plus 0-1000 ms of random jitter.
        time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
        return self.get_item(url, n)
    else:
        return res
import backoff

# The slide does not name the exception type; narrow `Exception` to what queue.get() raises.
@backoff.on_exception(backoff.expo, Exception, max_time=60, jitter=backoff.full_jitter)
def poll_for_message(queue):
    return queue.get()
https://pypi.org/project/backoff/
As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the
AWS Architecture Blog’s Exponential Backoff And Jitter post.
Idempotent operation
No additional effect if it is called more than
once with the same input parameters.
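One common way to make a retried write idempotent is to attach a request ID and remember which IDs have been applied. The sketch below assumes an in-memory store; `PaymentService` and its methods are hypothetical names used only for illustration.

```python
class PaymentService:
    """Toy service: a deposit with a given request_id is applied at most once."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # request_id -> result returned the first time

    def deposit(self, request_id, amount):
        if request_id in self._seen:
            return self._seen[request_id]  # retry: no additional effect
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance


service = PaymentService()
service.deposit("req-1", 10)
service.deposit("req-1", 10)   # a client retry after a timeout
print(service.balance)
```

This is what makes retries safe: a client that timed out cannot tell whether the first attempt was applied, so the server has to tolerate the repeat.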
Circuit Breaker
• Wrap a protected function
call in a circuit breaker
object, which monitors for
failures.
• If failures reach a certain
threshold, the circuit
breaker trips.
Producer Circuit Breaker Consumer
Connection
Monitoring
Timeouts
Breaking Circuit
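A minimal circuit breaker along the lines of the bullets above might look like this. It is a sketch under assumptions, not the Hystrix design: the threshold, the reset interval, and the single trial call after the interval are choices made for the example, and the clock is injectable for testing.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset interval elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Usage would be along the lines of `breaker.call(fetch_recommendations, user_id)`, with the caller falling back to a default response on `CircuitOpenError`; `fetch_recommendations` is a hypothetical function.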
https://github.com/Netflix/Hystrix
https://spring.io/guides/gs/circuit-breaker/
Let’s talk about health checking!
Auto Scaling group
Service A
Availability zone 1
Auto Scaling group
AWS Region
Service A
Availability zone 2
Service B Service B
database Email
Probing for health
Cluster
Shallow health check
Instance
Cache node
Email
database
Cluster
Are you healthy?
yes
Deep health check
Instance
Cache node
Email
database
Cluster
Are you healthy?
yes
Are you healthy?
yes
yes
yes
yes
Deep health check
Instance
Cache node
Email
database
Cluster
Are you healthy?
no
Are you healthy?
no
yes
yes
yes
Prioritize shallow health checks during
hard times.
Cache.
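One reading of this advice, sketched below with hypothetical names: the shallow check answers immediately, while the deep check probes dependencies and caches the verdict for a short time so that frequent health pings do not add load to a struggling database. The TTL value is an assumption.

```python
import time


class HealthChecker:
    def __init__(self, dependency_checks, ttl_s=10.0, clock=time.monotonic):
        self.dependency_checks = dependency_checks  # name -> callable returning bool
        self.ttl_s = ttl_s
        self.clock = clock
        self._cached = None
        self._cached_at = None

    def shallow(self):
        """The process is up and able to answer; no dependency is contacted."""
        return True

    def deep(self):
        """Probe every dependency, reusing the last result while it is fresh."""
        now = self.clock()
        if self._cached is None or now - self._cached_at >= self.ttl_s:
            self._cached = {name: bool(check())
                            for name, check in self.dependency_checks.items()}
            self._cached_at = now
        return all(self._cached.values())


checker = HealthChecker({"database": lambda: True, "cache": lambda: True})
print(checker.shallow(), checker.deep())
```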
Let’s talk about load shedding.
Cheaply reject excess work
Be careful when selecting the right
metric
Don’t be overly optimistic and take on more than you can handle.
Find an operational metric to reject what you cannot take in.
Favor cached and static content.
Prioritize ELB health check (shallow) pings.
In an overload situation your resources are precious; do not
let any of them go to waste.
Load Shedding
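A sketch of cheap rejection, assuming the number of in-flight requests is the chosen operational metric (the slides do not prescribe one): once the limit is reached, new work is refused with a 503 before any expensive processing starts. `LoadShedder` and `handle` are illustrative names.

```python
import threading


class LoadShedder:
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Admit the request only if capacity remains."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Reject cheaply when overloaded; otherwise do the work and free the slot."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=100)
print(handle(shedder, lambda: "hello"))
```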
Service Degradation & Fallbacks
https://twitter.com/redditstatus/status/1116204502703493120
Let’s talk about shuffle sharding.
X X X X X X XX
♤♡♢ ⚀ ⚁ ⚂ ⚃♧♢
Measure for this: blast radius
Blast radius
• How many customers?
• What functionality?
• How many locations?
Cell-based architecture
XX
♤♡♢ ⚀ ⚁ ⚂ ⚃♧♢
Shuffle sharding
XX
♤♡♢ ⚀ ⚁⚂ ⚃♡ ♤ ♧♢ ⚀⚂♧ ⚁⚃♢ ♢
♡ ♧♢
Shuffle sharding
Nodes = 8
Shard size = 2
Combinations = 28
Overlap | % customers
0 | 53.6%
1 | 42.8%
2 | 3.6%
Shuffle sharding
Nodes = 100
Shard size = 5
Combinations = 75 million!
Overlap | % customers
0 | 77%
1 | 21%
2 | 1.8%
3 | 0.06%
4 | 0.0006%
5 | 0.0000013%
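These numbers follow from counting combinations. As a sketch, assuming every customer's shard is chosen uniformly at random from all possible shards of the given size, the share of customers whose shard overlaps a given shard in exactly k nodes is C(s, k) * C(n - s, s - k) / C(n, s). The function names are illustrative.

```python
from math import comb


def shard_combinations(nodes, shard_size):
    """Number of distinct shards of shard_size nodes out of nodes."""
    return comb(nodes, shard_size)


def overlap_probability(nodes, shard_size, overlap):
    """Probability that a random shard shares exactly `overlap` nodes with a given shard."""
    return (comb(shard_size, overlap)
            * comb(nodes - shard_size, shard_size - overlap)
            / comb(nodes, shard_size))


for nodes, shard_size in ((8, 2), (100, 5)):
    print(f"nodes={nodes}, shard size={shard_size}, "
          f"combinations={shard_combinations(nodes, shard_size)}")
    for k in range(shard_size + 1):
        print(f"  overlap {k}: {overlap_probability(nodes, shard_size, k):.7%}")
```

Only customers with full overlap share every node with a failing shard, which is why the blast radius shrinks so quickly as nodes and shard size grow.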
Shuffle sharding
Let’s talk about chaos!
Fire Drills
GameDay at Amazon
Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Chaos engineering
https://github.com/Netflix/SimianArmy
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the system’s
capability to withstand turbulent conditions in
production.”
http://principlesofchaos.org
Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
• “Paul” attack
https://www.gremlin.com https://github.com/Netflix/SimianArmy https://chaostoolkit.org
Bananas for Monkeys
How to DDoS yourself
~ wrk -t12 -c400 -d30s http://127.0.0.1/api/health
Adding delay to the network
~ tc qdisc add dev eth0 root netem delay 200ms
https://github.com/Netflix/SimianArmy
Set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• integrate with Spinnaker (CI/CD)
Let’s talk about operational resiliency
Value realized by example
Operational
resilience
1. Scaled to handle a 400% increase in page views (Kurt Geiger)
2. Improved security posture (CapitalOne)
3. 8600 transactions/second (McDonalds)
4. Transfer of over 750 TB of data from pipeline inspection machinery (GE)
5. Processing over 75 billion market events daily (FINRA)
6. Critical applications run in multiple AZs, x-Regions for robust disaster recovery (Expedia)
7. Supports over 300,000 requests per minute to its API (Easy Taxi)
8. 60% reduced downtime (Trainline)
9. Migration of SAP on Oracle to AWS with zero unplanned downtime across five countries
(Kellogg’s)
10. SAP availability boosted to 100% (MacMillan)
Operational Resilience
What is it? Benefit of improving SLAs and reducing unplanned outages
Example: Critical workloads run in multiple AZs and Regions for robust DR (Expedia)
The cost of downtime
Annual Fortune 1000 application downtime costs (IDC): $1.25 to $2.5B
Average cost of a data breach (Ponemon Institute): $3.6M
Cost/hr of a critical application failure (IDC): $500K to $1M
Average cost/hr of downtime (Ponemon Institute): $474K
Average cost per lost or stolen record (Ponemon Institute): $141
Operational resilience: Quantifying cost
Cost Category | % of Total | Definition
Third Parties | 1.3% | The cost of contractors, consultants, auditors and other specialists engaged to help resolve unplanned outages.
Equipment | 1.3% | The cost of new equipment purchases and repairs, including refurbishment.
Ex-post Activities | 1.1% | All after-the-fact incidental costs associated with business disruption and recovery.
Recovery | 2.9% | Activities and associated costs that relate to bringing the organization’s networks and core systems back to a state of readiness.
Detection | 3.6% | Activities associated with the initial discovery and subsequent investigation of the partial or complete outage incident.
IT Productivity | 8.4% | The lost time and related expenses associated with IT personnel downtime.
End-user Productivity | 18.7% | The lost time and related expenses associated with end-user downtime.
Lost Revenue | 28.2% | The total revenue loss from customers and potential customers because of their inability to access core systems during the outage period.
Business disruption | 34.6% | Additional economic loss of the outage, including reputational damages, customer churn and lost business opportunities.
TOTAL | 100.0% |
Operational resilience: Case studies
Migrated to AWS in 6 weeks with
no downtime and improved
availability to 99.99%+
Migrated all workloads to AWS to
reduce downtime by 60% with an
annual savings of £1.2M
Rebuilt patient engagement portal
on AWS and reduced downtime
from 120 to <5 min / month
Using AWS, Travelstart has seized
opportunities in emerging markets
and has cut operational costs by
43% and downtime by 25%
With its on-premises setup, the
availability of its system ran to 98%, but
on its cloud infrastructure, this has risen
to 99.965%
Three 9’s to five 9’s
“We no longer need to worry about data
center, server, or hypervisor
security…which allows us to focus our
attention on securing our applications.”
And before we go.
DON’T blame people for failure…
“Quality is not an act, it is a habit”
Aristotle, some time around 350BC
https://aws.amazon.com/wellarchitected
https://medium.com/@adhorn
Thank you!

"Resiliency and Availability Design Patterns for the Cloud", Sebastien Stormacq, AWS Dev Day Kyiv 2019

  • 1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Resiliency and Availability Design Patterns for the Cloud B A R 4 K Y I V 11.06.2019 { "name": "Sébastien Stormacq", "role": ”Technical Evangelist", "company": "Amazon Web Services”, "twitter": ”@sebsto”, ”github": ”sebsto” }
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Can you guess what will happen?
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed Systems are hard
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Complex systems Amazon Twitter Netflix
  • 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Resiliency: Ability for a system to handle and eventually recover from unexpected conditions
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Partial failure mode
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. How do we build resilient software systems?
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. People Application Network & Data Infrastructure
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Let’s talk about Availability
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Availability in parallel A = 1 – (1 – Ax)2 Part X Part X
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Availability in parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds
  • 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Component redundancy increases availability significantly!
  • 15. Fully-scaled Availability Zone
  • 16. Highly redundant regional network
  • 17. AWS Region and availability zones (diagram): a Region with Availability zones a, b and c, each containing several data centers. 1 or more data centers per AZ; 2 or more AZs per region (new regions min 3)
  • 18. Let’s talk about Multi-AZ
  • 19. Multi-AZ architecture (diagram): a Region with Availability zones a, b and c; Instances in each zone behind Elastic Load Balancing (ELB); a DB instance and a DB instance standby
  • 20. Multi-AZ architecture (diagram): same layout, with a failure marked X
  • 21. Multi-AZ architecture (diagram): same layout, with a failure marked X
  • 22. Multi-AZ architecture (diagram): same layout, with a failure marked X; the DB instance standby is now labeled DB instance new master
  • 23. Multi-AZ architecture • Enables fault-tolerant applications • AWS regional services designed to withstand AZ failures • Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Amazon ELBs, etc. (diagram: same layout as slide 19)
  • 24. Let’s talk about auto scaling
  • 25. Auto-Scaling: Fixed, Variable
  • 26. Auto-scaling for self-healing (diagram): an AWS Region with Availability zones 1 and 2, an Auto Scaling group behind Elastic Load Balancing (ELB), and a failure marked X
  • 27. Let’s talk about decoupling and async
  • 28. Decoupling with async pattern. Synchronous: Process A calls Process B, waits while B is working, then gets the result. Asynchronous: Process A calls Process B, continues while B is working, then gets or fetches the result.
  • 29. API: {DO foo}; PUT JOB: {JobID: 0001, Task: DO foo}; API: {JobID: 0001}; GET JOB: {JobID: 0001, Task: DO foo}; {JobID: 0001, Result: bar} (diagram: API Instances, Queue/Streaming, Worker Instances, Cache node)
  • 30. Push Notification to the User; Fetch results (diagram: User, API Instances, Queue/Streaming, Worker Instances, Cache node)
  • 31. Degrade & prioritize traffic with queues (diagram: API Instances, HighPriorityQueue, LowPriorityQueue, Worker Instances)
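The job flow on slides 29 to 31 can be sketched in-process with the standard library. This is a minimal illustration, not the deck's implementation: `queue.PriorityQueue` stands in for the queue/streaming service, a plain dict stands in for the cache node, and the names `submit`, `work_once` and `fetch` are hypothetical.

```python
import itertools
import queue

HIGH, LOW = 0, 1  # lower number is served first

jobs = queue.PriorityQueue()     # stands in for Queue/Streaming
results = {}                     # stands in for the cache node
_job_ids = itertools.count(1)


def submit(task, priority=LOW):
    """API instance: enqueue the task and return a job ID immediately."""
    job_id = f"{next(_job_ids):04d}"
    # job_id is unique, so the callable itself is never compared
    jobs.put((priority, job_id, task))
    return job_id


def work_once():
    """Worker instance: run the most urgent pending job and store its result."""
    _priority, job_id, task = jobs.get_nowait()
    results[job_id] = task()
    jobs.task_done()
    return job_id


def fetch(job_id):
    """API instance: return the result if it is ready, else None."""
    return results.get(job_id)
```

A caller would `submit` a task, carry on with other work, and later `fetch` the result by job ID (or be told through a push notification). Under overload, workers drain the high-priority queue first, so low-priority work degrades before critical work does.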
  • 32. Let’s talk about the failures in distributed systems
  • 33. Preventing failures (diagram: Recommendation Engine, Service, Service, Service; label: Preserve at all cost)
  • 34. Some of the most important things to think about (diagram: Recommendation Engine, Service, Service, Service; label: Preserve at all cost)
  • 35. Let’s talk about timeouts, backoff & retries!
  • 36. What happens if the DB “slows down”? (diagram: Users, App, Conn Pool, DB, with four INSERTs in flight) Timeout client side: ?? Timeout backend side: ??
  • 37. (diagram: User 1, App, Conn Pool, DB) Timeout client side = 10s; Timeout backend side = default = Infinite. Sequence: INSERT, Retry INSERT, Retry INSERT, Retry. ERROR: Failed to get connection from pool
  • 38. https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout
  • 41. https://pypi.org/project/timeout-decorator/
        import requests
        import timeout_decorator

        # Abort the call if it takes longer than 5 seconds
        @timeout_decorator.timeout(5, timeout_exception=StopIteration)
        def timed_get(url):
            return requests.get(url)
  • 42. Set the timeouts!
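When a client library offers no timeout parameter, the caller can still bound how long it waits. The following is a minimal standard-library sketch of that idea; `call_with_timeout` and `slow_insert` are hypothetical names, and the approach is an assumption on top of the slides, not something they show.

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and wait at most timeout_s for its result."""
    future = _pool.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        raise TimeoutError(f"no answer within {timeout_s}s") from None


def slow_insert(delay_s):
    """Stand-in for a database call that may be slow."""
    time.sleep(delay_s)
    return "ok"
```

This only limits how long the caller waits: the worker thread keeps running and keeps holding whatever connection it took. That is why a backend-side timeout is still needed alongside the client-side one.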
  • 43. How else could we have prevented the error? (diagram: User 1, Conn Pool, DB) Sequence: INSERT, Retry INSERT, Retry INSERT, Retry. ERROR: Failed to get connection from pool
  • 44. Backoff: backing off between retries, releasing connections (diagram: User 1, Conn Pool, DB) Timeout client side = 10s; Timeout backend side = 10s. Sequence: INSERT, wait 2s before Retry, wait 4s before Retry, wait 8s before Retry, wait 16s before Retry
  • 45. Simple Exponential Backoff is not enough: Add Jitter (charts: No jitter, With jitter) https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  • 46. Adding Jitter
  • 48. Example: add jitter 0-1000ms
        import random
        import time

        import requests

        MAX_TRIES = 12

        def get_item(url, n=1):
            try:
                # Bound every attempt so a slow server cannot hang the caller
                return requests.get(url, timeout=10)
            except requests.RequestException:
                if n >= MAX_TRIES:
                    return None
                # Exponential backoff plus 0-1000 ms of random jitter
                time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
                return get_item(url, n + 1)
  • 49. https://pypi.org/project/backoff/
        import backoff

        # Exponential waits with full jitter, giving up after 60 seconds.
        # Exception is a placeholder: catch the narrowest error your queue client raises.
        @backoff.on_exception(backoff.expo, Exception, jitter=backoff.full_jitter, max_time=60)
        def poll_for_message(queue):
            return queue.get()
    As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.
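For readers who prefer not to add a dependency, the ‘Full Jitter’ idea from the AWS Architecture Blog post can be sketched directly: each wait is drawn uniformly between zero and the capped exponential delay. The base, cap and helper names below are illustrative assumptions.

```python
import random
import time


def full_jitter_delay(attempt, base_s=1.0, cap_s=30.0):
    """Wait before retry `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))


def retry(func, max_tries=5, sleep=time.sleep):
    """Call func until it succeeds, sleeping a jittered delay between failures."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt))
```

The `sleep` parameter is injectable so the retry loop can be exercised in tests without real waiting. Retrying like this is only safe when the wrapped operation is idempotent.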
  • 50. Idempotent operation: No additional effect if it is called more than once with the same input parameters.
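One common way to make a non-idempotent operation safe to retry is to have the client attach a unique key to each logical request, and have the server replay the stored response when it sees the key again. The sketch below assumes that approach; the class and parameter names are hypothetical, and a real service would keep the keys in durable storage.

```python
class PaymentService:
    """Toy service whose deposit() is idempotent per idempotency key."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> response already returned

    def deposit(self, idempotency_key, amount):
        # A retried request carries the same key: replay the first response
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        response = {"status": "applied", "balance": self.balance}
        self._seen[idempotency_key] = response
        return response
```

With this in place, a client that timed out and retried `deposit("req-1", 10)` changes the balance once, not twice.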
  • 51. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. (diagram: Producer, Circuit Breaker, Consumer; labels: Connection Monitoring, Timeouts, Breaking Circuit)
  • 52. https://github.com/Netflix/Hystrix
  • 53. https://spring.io/guides/gs/circuit-breaker/
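The two bullets on slide 51 map onto a small wrapper object. This is a deliberately simple sketch, not how Hystrix or Spring implement it: the threshold, the cool-down and the single trial call after the cool-down are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of queueing behind a struggling dependency, which gives that dependency room to recover.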
  • 54. Let’s talk about health checking!
  • 55. Probing for health (diagram): an AWS Region with Availability zones 1 and 2; Service A and Service B in each zone, in Auto Scaling groups; dependencies: database, Email, Cluster
  • 56. Shallow health check (diagram: Instance with dependencies Cache node, Email, database, Cluster): “Are you healthy?” The instance answers “yes”.
  • 57. Shallow health check (diagram: Instance with dependencies Cache node, Email, database, Cluster): “Are you healthy?” The instance answers “yes”.
  • 58. Deep health check (diagram: Instance with dependencies Cache node, Email, database, Cluster): “Are you healthy?” The instance asks each dependency “Are you healthy?”; all four answer “yes”, so the instance answers “yes”.
  • 59. Deep health check (diagram: Instance with dependencies Cache node, Email, database, Cluster): “Are you healthy?” The instance asks each dependency “Are you healthy?”; one answers “no” and three answer “yes”, so the instance answers “no”.
  • 60. Prioritize shallow health checks during hard times. Cache.
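The difference between the two kinds of check can be sketched as follows. The probe callables, the response shape and the cache TTL are hypothetical, and reading slide 60's “Cache.” as “cache the deep-check result” is an assumption.

```python
import time


def shallow_health():
    """Answers only for the process itself; never touches dependencies."""
    return {"healthy": True}


class DeepHealthCheck:
    """Asks every dependency, caching the answer so probes stay cheap."""

    def __init__(self, probes, cache_ttl_s=5.0, clock=time.monotonic):
        self.probes = probes  # name -> zero-argument callable returning bool
        self.cache_ttl_s = cache_ttl_s
        self.clock = clock
        self._cached = None
        self._cached_at = None

    def check(self):
        now = self.clock()
        if self._cached is not None and now - self._cached_at < self.cache_ttl_s:
            return self._cached
        details = {}
        for name, probe in self.probes.items():
            try:
                details[name] = bool(probe())
            except Exception:
                details[name] = False  # a probe that throws counts as unhealthy
        self._cached = {"healthy": all(details.values()), "details": details}
        self._cached_at = now
        return self._cached
```

A deep check catches broken dependencies, but if a shared dependency fails it can mark every instance unhealthy at once. That is one reason to point load-balancer pings at the shallow check when the system is under stress.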
  • 61. Let’s talk about load shedding.
  • 64. Cheaply reject excess work
  • 66. Be careful when selecting the right metric
  • 67. Load Shedding • Don’t be overly optimistic and take on more than you can. • Find an operational metric to reject what you cannot take in. • Favor cached and static content • Prioritize ELB health check (shallow) pings • In an overload situation you have precious resources, do not let any of it go to waste.
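“Cheaply reject excess work” can be illustrated with a cap on in-flight requests. In this sketch the metric is the number of concurrent requests, which is an assumption (slide 66 warns that choosing the metric needs care), and `LoadShedder` is a hypothetical name.

```python
import threading


class LoadShedder:
    """Rejects requests immediately once max_in_flight are being served."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request_handler, *args):
        # Non-blocking acquire: rejecting costs almost nothing
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded, try again later"
        try:
            return 200, request_handler(*args)
        finally:
            self._slots.release()
```

The rejection path does no queueing and no real work, so the capacity that remains goes to requests the server can actually finish.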
  • 68. Service Degradation & Fallbacks
  • 69. https://twitter.com/redditstatus/status/1116204502703493120
  • 70. Let’s talk about shuffle sharding.
  • 72. (diagram: X X X X X X XX ♤♡♢ ⚀ ⚁ ⚂ ⚃♧♢)
  • 73. Measure for this: blast radius
  • 74. Blast radius • How many customers? • What functionality? • How many locations?
  • 76. Cell-based architecture (diagram: XX ♤♡♢ ⚀ ⚁ ⚂ ⚃♧♢)
  • 77. Shuffle sharding (diagram: XX ♤♡♢ ⚀ ⚁⚂ ⚃♡ ♤ ♧♢ ⚀⚂♧ ⚁⚃♢ ♢ ♡ ♧♢)
  • 78. Shuffle sharding: Nodes = 8, Shard size = 2, Combinations = 28
        Overlap | % customers
        0 | 53.6%
        1 | 42.8%
        2 | 3.6%
  • 79. Shuffle sharding: Nodes = 100, Shard size = 5, Combinations = 75 million!
        Overlap | % customers
        0 | 77%
        1 | 21%
        2 | 1.8%
        3 | 0.06%
        4 | 0.0006%
        5 | 0.0000013%
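The numbers on slides 78 and 79 follow from counting combinations. The sketch below assumes every customer gets a uniformly random shard of distinct nodes and computes, for one given customer, what fraction of the other possible shards share exactly k nodes with it; the function name is illustrative.

```python
from math import comb


def overlap_distribution(nodes, shard_size):
    """Map k -> fraction of possible shards sharing exactly k nodes with a given shard."""
    total = comb(nodes, shard_size)
    return {
        k: comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


if __name__ == "__main__":
    for nodes, shard_size in ((8, 2), (100, 5)):
        print(f"nodes={nodes}, shard size={shard_size}, "
              f"combinations={comb(nodes, shard_size):,}")
        for k, share in overlap_distribution(nodes, shard_size).items():
            print(f"  overlap {k}: {share:.7%}")
```

The last row is the one that matters for blast radius: only customers whose shard overlaps completely with a poisoned shard lose all of their nodes.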
  • 80. Shuffle sharding
  • 81. Let’s talk about chaos!
  • 82. Fire Drills
  • 83. GameDay at Amazon: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  • 84. Chaos engineering https://github.com/Netflix/SimianArmy
  • 85. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  • 86. Failure injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks • “Paul” attack https://www.gremlin.com https://github.com/Netflix/SimianArmy https://chaostoolkit.org
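At the application level, “start small” can be as modest as a decorator that occasionally delays or fails a call. The sketch below is a hypothetical illustration and is unrelated to how Gremlin, Simian Army or Chaos Toolkit work; every name and parameter in it is an assumption.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Marks an error raised on purpose by the experiment."""


def inject_failure(error_rate=0.0, delay_s=0.0, enabled=lambda: True,
                   rng=random.random, sleep=time.sleep):
    """Wrap a function so that calls are delayed and fail with probability error_rate."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():  # kill switch, checked on every call
                if delay_s:
                    sleep(delay_s)
                if rng() < error_rate:
                    raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Keeping the `enabled` switch outside the code path under test makes it possible to stop an experiment immediately if it starts hurting real users.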
  • 87. Bananas for Monkeys
  • 88. How to DDoS yourself: ~ wrk -t12 -c400 -d30s http://127.0.0.1/api/health
  • 89. Adding delay to the network: ~ tc qdisc add dev eth0 root netem delay 200ms
  • 90. https://github.com/Netflix/SimianArmy Set of scheduled agents: • shuts down services randomly • slows down performance • checks conformity • breaks an entire region • integrates with Spinnaker (CI/CD)
  • 91. Let’s talk about operational resiliency
  • 92. Operational resilience: Value realized by example
        1. Scaled to handle a 400% increase in page views (Kurt Geiger)
        2. Improved security posture (CapitalOne)
        3. 8600 transactions/second (McDonalds)
        4. Transfer of over 750 TB of data from pipeline inspection machinery (GE)
        5. Processing over 75 billion market events daily (FINRA)
        6. Critical applications run in multiple AZs, x-Regions for robust disaster recovery (Expedia)
        7. Supports over 300,000 requests per minute to its API (Easy Taxi)
        8. 60% reduced downtime (Trainline)
        9. Migration of SAP on Oracle to AWS with zero unplanned downtime across five countries (Kellogg’s)
        10. SAP availability boosted to 100% (MacMillan)
  • 93. Operational Resilience. What is it? Benefit of improving SLAs and reducing unplanned outages. Example: Critical workloads run in Multiple AZs and Regions for robust DR (Expedia)
  • 94. The cost of downtime
        Metric | Source | Cost
        Annual Fortune 1000 application downtime costs | IDC | $1.25 to $2.5B
        Average cost of a data breach | Ponemon Institute | $3.6M
        Cost/hr of a critical application failure | IDC | $500K to $1M
        Average cost/hr of downtime | Ponemon Institute | $474K
        Average cost per lost or stolen record | Ponemon Institute | $141
  • 95. Operational resilience: Quantifying cost
        Cost Category | % of Total | Definition
        Third Parties | 1.3% | The cost of contractors, consultants, auditors and other specialists engaged to help resolve unplanned outages.
        Equipment | 1.3% | The cost of new equipment purchases and repairs, including refurbishment.
        Ex-post Activities | 1.1% | All after-the-fact incidental costs associated with business disruption and recovery.
        Recovery | 2.9% | Activities and associated costs that relate to bringing the organization’s networks and core systems back to a state of readiness.
        Detection | 3.6% | Activities associated with the initial discovery and subsequent investigation of the partial or complete outage incident.
        IT Productivity | 8.4% | The lost time and related expenses associated with IT personnel downtime.
        End-user Productivity | 18.7% | The lost time and related expenses associated with end-user downtime.
        Lost Revenue | 28.2% | The total revenue loss from customers and potential customers because of their inability to access core systems during the outage period.
        Business disruption | 34.6% | Additional economic loss of the outage, including reputational damages, customer churn and lost business opportunities.
        TOTAL | 100.0% |
  • 96. Operational resilience: Case studies
        • Migrated to AWS in 6 weeks with no downtime and improved availability to 99.99%+
        • Migrated all workloads to AWS to reduce downtime by 60% with an annual savings of £1.2M
        • Rebuilt patient engagement portal on AWS and reduced downtime from 120 to <5 min / month
        • Using AWS, Travelstart has seized opportunities in emerging markets and has cut operational costs by 43% and downtime by 25%
        • With its on-premises setup, the availability of its system ran to 98%, but on its cloud infrastructure, this has risen to 99.965%
        • Three 9’s to five 9’s
        • “We no longer need to worry about data center, server, or hypervisor security…which allows us to focus our attention on securing our applications.”
  • 97. And before we go.
  • 98. DON’T blame people for failure…
  • 99. “Quality is not an act, it is a habit” Aristotle, some time around 350BC
  • 100. https://aws.amazon.com/wellarchitected
  • 101. https://medium.com/@adhorn
  • 102. Thank you! { "name": "Sébastien Stormacq", "role": "Technical Evangelist", "company": "Amazon Web Services", "twitter": "@sebsto", "github": "sebsto" }