Designing Apps for Resiliency
Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices
Agenda
• What is ‘resiliency’?
• Why is it so important?
• Process to improve resiliency
• Resiliency checklist
What is ‘Resiliency’?
• Resiliency is the ability to recover from failures and continue to
function. It's not about avoiding failures, but responding to failures in
a way that avoids downtime or data loss.
• High availability is the ability of the application to keep running in a
healthy state, without significant downtime.
• Disaster recovery is the ability to recover from rare but major incidents:
Non-transient, wide-scale failures, such as service disruption that affects an
entire region.
Why is it so important?
• More transient faults in the cloud
• Dependent services may go down
• SLA < 100% means something could go wrong at some point
• More focus on MTTR rather than MTBF
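To make "SLA < 100%" concrete, the short sketch below converts an SLA percentage into the downtime it still permits. The function name and the 30-day month are assumptions chosen for illustration, not part of any Azure tooling.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime, in minutes, that an SLA permits over a period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# Print the downtime budget for a few common SLA levels (30-day month assumed).
for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.2f} minutes per 30 days")
```

A 99.99% SLA leaves a little over four minutes of downtime per month, which is why the focus shifts to how fast the system recovers (MTTR).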
Process to improve resiliency
• Plan: Define requirements
• Design: Identify failures
• Implement: Implement recovery strategies
• Test: Inject failures, simulate failover (FO)
• Deploy: Deploy apps in a reliable manner
• Monitor: Monitor failures
• Respond: Take actions to fix issues
Defining resiliency requirements
(Timeline diagram: periodic data backups, major incident occurs, service recovered, business recovered.)
• Recovery Point Objective (RPO): The maximum time period in which data might be lost
• Recovery Time Objective (RTO): Duration of time in which the service must be restored after an incident
• Maximum Tolerable Outage (MTO)
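As a rough illustration of how these objectives can be checked, the sketch below compares a backup interval and a measured restore time against an RPO and RTO. It assumes the worst-case data loss equals the backup interval; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum time period in which data might be lost
    rto: timedelta  # time within which the service must be restored


def meets_objectives(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_restore_time: timedelta,
) -> dict:
    """Check a backup schedule and a measured restore time against RPO/RTO."""
    # Assumption: an incident just before the next backup loses one full interval.
    worst_case_loss = backup_interval
    return {
        "rpo_met": worst_case_loss <= objectives.rpo,
        "rto_met": measured_restore_time <= objectives.rto,
    }


# Illustrative numbers only.
objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
report = meets_objectives(
    objectives,
    backup_interval=timedelta(minutes=30),
    measured_restore_time=timedelta(hours=6),
)
print(report)
```

In this example the backup schedule satisfies the RPO, but the restore drill took longer than the RTO allows.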
SLA (Service Level Agreement)
Composite SLA
Composite SLA = ?
• Two dependent services in series: 99.95% × 99.99% = 99.94%
• With a fallback action (return data from local cache), both paths must fail at the same time: 1.0 − (0.0001 × 0.001) = 99.99999%
• Two regions behind Traffic Manager, where N is the composite SLA of one region:
  Composite SLA for two regions = (1 − (1 − N)(1 − N)) × Traffic Manager SLA
  1 − (1 − 0.9995) × (1 − 0.9995) = 0.99999975
  0.99999975 × 0.9999 ≈ 0.9998998 (about 99.99%)
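The composite SLA arithmetic above can be reproduced with a few helper functions. This is a minimal sketch; the function names are made up for illustration, and failures are assumed to be independent.

```python
from math import prod


def serial_sla(*slas: float) -> float:
    """All services must be up: multiply the individual SLAs."""
    return prod(slas)


def fallback_sla(*slas: float) -> float:
    """At least one path must be up: 1 minus the chance that all fail together."""
    return 1.0 - prod(1.0 - s for s in slas)


def multi_region_sla(region_sla: float, regions: int, router_sla: float) -> float:
    """Identical regions behind a traffic router such as Traffic Manager."""
    return fallback_sla(*([region_sla] * regions)) * router_sla


in_series = serial_sla(0.9995, 0.9999)               # about 0.9994
with_fallback = fallback_sla(0.9999, 0.999)          # about 0.9999999
two_regions = multi_region_sla(0.9995, 2, 0.9999)    # about 0.9998998

print(f"{in_series:.6f} {with_fallback:.7f} {two_regions:.7f}")
```

Note that the router's own SLA caps the multi-region result: adding regions cannot lift the composite above the SLA of the component that sits in front of them.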
Designing for resiliency
Reading data from SQL Server fails
A web server goes down
An NVA (network virtual appliance) goes down
1. Identify possible failures
2. Rate risk of each failure
(impact x likelihood)
3. Design resiliency strategy
- Detection
- Recovery
- Diagnostics
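Step 2 (impact x likelihood) can be sketched as a simple ranking. The 1 to 5 scales and the scores below are illustrative assumptions, not values from the talk; only the three failure names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        return self.impact * self.likelihood


def rank_by_risk(modes: list) -> list:
    """Return failure modes ordered from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Hypothetical scores for the failures listed on the slide.
modes = [
    FailureMode("Reading data from SQL Server fails", impact=4, likelihood=3),
    FailureMode("A web server goes down", impact=2, likelihood=4),
    FailureMode("An NVA goes down", impact=5, likelihood=1),
]
ranked = rank_by_risk(modes)
for mode in ranked:
    print(f"{mode.risk:>2}  {mode.name}")
```

The highest-ranked failures are the ones to design detection, recovery, and diagnostics for first.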
Failure mode analysis
https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-failure-mode-analysis/
Rack awareness
(Diagram: web tier, middle tier, and data tier, each in its own availability set, spread across fault domains 1, 2, and 3. Data tier labels: Shard #1, Shard #2, Replica #1, Replica #2.)
Load balance multiple instances
Application gateway for
- L7 routing
- SSL termination
Failover / Failback
(Diagram: Traffic Manager, using the priority routing method, in front of a primary region and a secondary region (regional pair), each with Web, Application, and Data tiers.)
• Automated failover
• Manual failback
(Diagram: multiple Web and Application instances in front of the Data stores.)
Data replication: Azure storage geo replica (RA-GRS)
• LocationMode = PrimaryThenSecondary
• LocationMode = SecondaryOnly
• Periodically check if it’s back online
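The read behaviour behind the location modes can be sketched as follows. This is not the Azure Storage client library: the enum, the function, and the stand-in read callables are hypothetical and only mirror the mode names on the slide.

```python
from enum import Enum
from typing import Callable


class LocationMode(Enum):
    PRIMARY_ONLY = "PrimaryOnly"
    PRIMARY_THEN_SECONDARY = "PrimaryThenSecondary"
    SECONDARY_ONLY = "SecondaryOnly"


def read_with_location_mode(
    mode: LocationMode,
    read_primary: Callable[[], str],
    read_secondary: Callable[[], str],
) -> str:
    """Route a read to the primary or secondary replica according to the mode."""
    if mode is LocationMode.SECONDARY_ONLY:
        return read_secondary()
    try:
        return read_primary()
    except ConnectionError:
        # Only PrimaryThenSecondary falls back; PrimaryOnly surfaces the error.
        if mode is LocationMode.PRIMARY_THEN_SECONDARY:
            return read_secondary()
        raise


# Stand-ins for real storage reads (hypothetical).
def failing_primary() -> str:
    raise ConnectionError("primary region unavailable")


def healthy_secondary() -> str:
    return "from-secondary"


fallback_result = read_with_location_mode(
    LocationMode.PRIMARY_THEN_SECONDARY, failing_primary, healthy_secondary
)
print(fallback_result)
```

With SecondaryOnly the primary is never contacted, which avoids paying for a failed call on every read while the primary is known to be down. The fallback applies to reads only.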
Retry transient failures
See ‘Azure retry guidance’ for more details
< E2E latency requirement
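A minimal sketch of retrying a transient failure with exponential back-off, randomized delays, and an overall time budget is shown below. The exception type, parameter names, and defaults are assumptions for illustration; the `deadline` parameter is one possible way to keep the total retry time under an end-to-end latency requirement.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a fault that is worth retrying (hypothetical)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       deadline=None, sleep=time.sleep, rng=random.random):
    """Call operation, retrying transient errors with exponential back-off."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential growth, capped, then randomized to avoid retry storms.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1)) * rng()
            # Give up early if the next wait would exceed the time budget.
            if deadline is not None and time.monotonic() - start + delay > deadline:
                raise
            sleep(delay)


# Demo: an operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("service busy")
    return "ok"


delays = []  # record the waits instead of sleeping, with jitter fixed at 1.0
result = retry_with_backoff(flaky, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)
```

Injecting `sleep` and `rng` keeps the demo deterministic; in real use the defaults apply and each delay is scaled by a random factor.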
Circuit Breaker
(Diagram: User → Your application → Remote service, which has failed.)
• Holding resources while retrying the operation
• Leads to cascading failures
Circuit Breaker
https://github.com/App-vNext/Polly
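Polly is a .NET library. The sketch below is a language-neutral illustration of the circuit breaker idea in Python, not Polly's API. The class, thresholds, and state names are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast while the remote service recovers")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("remote service down")


for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass

state_after_failures = breaker.state
now[0] = 11.0                      # the reset timeout has elapsed
state_after_timeout = breaker.state
recovered = breaker.call(lambda: "ok")
print(state_after_failures, state_after_timeout, recovered, breaker.state)
```

While the circuit is open, callers fail immediately instead of holding threads and connections, and the remote service gets time to recover.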
Bulkhead
(Diagram: thread pools assigned per service (Service A, Service B, Service C) and per workload (Workload 1, Workload 2).)
Resources to isolate:
• Memory
• CPU
• Disk
• Thread pool
• Connection pool
• Network connection
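One way to sketch a bulkhead is to cap the number of concurrent calls per downstream service. The example below uses a semaphore as a stand-in for a dedicated thread pool, so it is a simplification of the slide's diagram; the class and service names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slot."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Reject immediately instead of queueing, so callers are not tied up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no capacity left")
        try:
            return operation()
        finally:
            self._slots.release()


service_a = Bulkhead("Service A", max_concurrent=1)
service_b = Bulkhead("Service B", max_concurrent=1)


def slow_call_to_a():
    # While Service A's only slot is taken, another call to A is rejected...
    try:
        service_a.run(lambda: "unreachable")
        rejected = False
    except BulkheadFullError:
        rejected = True
    # ...but Service B has its own compartment and is unaffected.
    return rejected, service_b.run(lambda: "B ok")


a_rejected, b_result = service_a.run(slow_call_to_a)
print(a_rejected, b_result)
```

Exhausting one compartment does not drain capacity from the others, which is the point of the pattern.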
Other design patterns for resiliency
• Compensating transaction
• Scheduler-agent-supervisor
• Throttling
• Load leveling
• Leader election
See ‘Cloud design patterns’
Principles of chaos engineering
• Build a hypothesis around steady-state behavior
• Vary real-world events
• Run experiments in production
• Automate experiments to run consistently
http://principlesofchaos.org/
(Diagram: production traffic is fed to a control group and an experimental group; HW/SW failures and a spike in traffic are introduced; the difference between the groups is verified in terms of steady state.)
Testing for resiliency
• Fault injection testing
  - Shut down VM instances
  - Crash processes
  - Expire certificates
  - Change access keys
  - Shut down the DNS service on domain controllers
  - Limit available system resources, such as RAM or number of threads
  - Unmount disks
  - Redeploy a VM
• Load testing
  - Use production data as much as you can
  - VSTS, JMeter
• Soak testing
  - Longer period under normal production load
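At the application level, fault injection can be as simple as wrapping a dependency call so that it fails on purpose. This is a minimal sketch with hypothetical names; infrastructure-level faults such as shutting down VMs need other tooling.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that the test harness caused deliberately."""


def inject_faults(operation, failure_rate: float, rng=None):
    """Wrap operation so that it raises InjectedFault at the given rate."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so rate 1.0 always fails and 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return operation(*args, **kwargs)

    return wrapped


def lookup_price(item: str) -> int:
    return {"apple": 3}.get(item, 0)  # stand-in for a remote call


always_fails = inject_faults(lookup_price, failure_rate=1.0)
never_fails = inject_faults(lookup_price, failure_rate=0.0)

try:
    always_fails("apple")
    fault_seen = False
except InjectedFault:
    fault_seen = True

print(fault_seen, never_fails("apple"))
```

Running the normal test suite against the wrapped dependency shows whether the recovery strategies (retry, fallback, circuit breaker) actually engage.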
Blue/Green and Canary release
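Blue/Green Deployment
(Diagram: two identical Web / App / DB environments, one running the current version and one running the new version, with a load balancer or reverse proxy switching traffic between them.)
Canary release
(Diagram: a load balancer or reverse proxy sends 90% of traffic to the current version and 10% to the new version, each with its own Web / App / DB stack.)
Deployment slots at App Service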
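The 90% / 10% split of a canary release can be sketched as a deterministic routing decision. Hashing a user ID, as below, is one common approach and an assumption here; the slides do not say how traffic is assigned.

```python
import hashlib


def choose_version(user_id: str, canary_percent: int) -> str:
    """Send a stable share of users to the new version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99, stable per user
    return "new" if bucket < canary_percent else "current"


users = [f"user-{i}" for i in range(10_000)]
share = sum(choose_version(u, 10) == "new" for u in users) / len(users)
print(f"share routed to the new version: {share:.1%}")
```

Because the bucket depends only on the user ID, a user keeps seeing the same version during the rollout, and setting the percentage back to 0 is an immediate rollback.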
Dark launching
New feature
Toggle enable/disable
User Interface
Production environment
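A dark launch relies on a toggle that is checked at the point where the feature becomes visible. The sketch below keeps toggles in memory for simplicity; the class, the feature name, and the page sections are hypothetical.

```python
class FeatureToggles:
    """In-memory toggle store; a real system would load this from configuration."""

    def __init__(self):
        self._enabled = set()

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled


def render_home(toggles: FeatureToggles) -> list:
    sections = ["header", "feed"]
    # The new code is deployed to production but stays hidden until enabled.
    if toggles.is_enabled("new-recommendations"):
        sections.append("recommendations")
    return sections


toggles = FeatureToggles()
before = render_home(toggles)
toggles.enable("new-recommendations")
after = render_home(toggles)
toggles.disable("new-recommendations")  # switch off again if something goes wrong
rolled_back = render_home(toggles)
print(before, after, rolled_back)
```

Disabling the toggle removes the feature from the UI without a redeployment.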
Resiliency checklist
• https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-checklist/
Other resources
http://docs.microsoft.com/Azure
Resiliency / High Availability / Disaster Recovery
Throttling
Circuit breaker
Zero downtime deployment
Eventual consistency
Data restore
Retry
Graceful degradation
Geo-replica
Multi-region deployment


Editor's Notes

  • #3: Everybody is talking about resiliency, but its definition is not clear, so I’ll clarify what it means. Why is everybody talking about it? There are a number of reasons. The main part of this talk is how to make your app resilient. I’ll also show you some examples from the checklist.
  • #4: DR? Data backup? These are all true statements, but none of them clearly defines what resiliency means. To be HA, an app doesn’t need to go down and come back online. If your app is running with 100% uptime without any failures, it’s HA, but you never know if it’s resilient. Once something bad happens, it may take days to come back online, which is not resilient at all. DR is for a catastrophic failure, such as something that could take down an entire DC. For example...
  • #5: Why is it so important? Why is everybody talking about resiliency? Transient faults are more common because of commodity HW, networking, and the multi-tenant shared model. Remote services could go down at any time. 99.99% means about 4 minutes of downtime a month. Do you want to sit and wait for 4 minutes, or do something else? I’d rather do something, because you never know whether it’s going to be 4 minutes or 4 hours. Based on the assumption that anything can go wrong at some point, the focus has been shifting from MTBF to MTTR.
  • #6: We’re getting into the more interesting part. We discussed what resiliency means and why it’s so important; now we’re getting into the ‘how’ part. This is the process to improve resiliency in your system, in 7 steps from plan to respond. Let’s talk about each step. Clearly define your requirements; otherwise you don’t know what you’re aiming for. Identify all possible failures you may see, and implement recovery strategies to bounce back from these failures. To make sure these strategies work, you need to test them by injecting failures. Deployment needs to be resilient too, because deploying a new version is the most common cause of failures. Monitoring is key to QoS: monitor errors, latency, throughput, etc. in percentiles. You need to take action quickly to mitigate downtime.
  • #7: There are two common requirements when it comes to resiliency, RPO and RTO, plus MTO. RPO defines the interval of data backup. RTO defines the requirements for hot/warm/cold standby. MTO is how long a particular business process can be down.
  • #8: If you look at experienced customers, they define availability requirements per use case. Decompose your workload and define availability requirements (uptime, latency, etc.) for each part. A higher SLA comes at a cost because of redundant services/components. Measuring downtime becomes an issue when you target five nines.
  • #9: The fact that App Service offers 99.95% doesn’t mean that the entire system has 99.95%. The other important fact is that an SLA doesn’t guarantee that the service is always up 99.95% of the time; if we don’t meet the SLA, you get money back. It’s not just a numbers game, and this is where resiliency comes into play. The definition of the SLA varies depending on the service.
  • #10: To design your app to be resilient, you need to identify all possible failures first, then implement resiliency strategies against them.
  • #11: To help you identify all possible failures, we published a list of the most common failures on Azure. It has a few items per service, 30 to 40 items in total. Let’s take a look at the case of DocumentDB. When you fail to read data from it, the client SDK retries the operation for you. The only transient fault it retries against is throttling (429). If you constantly get 429, consider increasing its scale (RU). DocumentDB now supports geo-replicas: if the primary region fails, it will switch traffic to the other regions in the list you configure. For diagnostics, you need to log all errors on the client side.
  • #12: You can think of a rack as a power module. If it goes down, everything that belongs to it goes down with it. So it’s better to distribute VMs across different racks for the sake of redundancy. This is where availability sets come into play: each machine in the same AS belongs to a different rack. VMSS automatically puts VMs in 5 FDs and 5 UDs, but it doesn’t support data disks yet.
  • #13: Avoiding SPOF is critical for resiliency. Many customers still don’t know these basics and deploy critical workloads on a single machine. To avoid that, you need redundant components, so that when one goes down the others are still running. In this case, put the VMs in the same tier into the same availability set behind an LB. The LB distributes requests to the VMs in the backend address pool. The health probe can be either HTTP or TCP depending on the workload. By default it pings the root path ‘/’. You may want to expose a health endpoint to monitor all critical components.
  • #14: There’s a risk of data loss during FO, so take a snapshot and ensure data integrity.
  • #15: For less frequent transient faults, set the property to PrimaryThenSecondary; it’ll switch to the secondary region for you. For more frequent or non-transient faults, set the property to SecondaryOnly; otherwise it keeps hitting the primary and getting errors. You need to monitor the primary region and, when it comes back, set the property back to PrimaryOnly or PrimaryThenSecondary. One thing to note is that Azure storage wouldn’t fail over to the secondary until a region-wide disaster happens, which I don’t think we have had yet. This strategy is applicable to reads, not writes.
  • #16: Let’s take a look at a few resiliency strategies to recover from the failures you identified above. Use exponential back-off for non-interactive transactions and quick linear retry for interactive transactions. Anti-patterns: cascading retries (5 x 5 = 25); more than one immediate retry; many attempts at a regular interval (randomize the interval instead).
  • #17: People often say, don’t waste your time, let’s circuit break and fail fast. That is only part of the problem. The real issue is cascading failures. Also, if you keep retrying failed operations, the remote service can’t recover from the failed state.
  • #19: The types of resources to isolate are not limited to these, but these are the most common ones.
  • #21: Given the chaotic nature of the cloud and distributed systems, something is always happening somewhere, so it makes sense to follow chaos engineering principles. Define the steady state as the measurable output of a system, rather than internal attributes of the system. Introduce real-world chaotic events such as HW failures, SW failures, spikes in traffic, etc. The best way to validate the system at production scale is to run experiments in production. Netflix injects faults in one of their regions at least once a month to see if their system can stay up and running. Since it’s such a time-consuming task, you should automate the experiments and run them continuously. Chaos engineering is not testing; it’s validation of the system. https://www.youtube.com/watch?v=Q4nniyAarbs
  • #22: Tools = Chaos Monkey/Kong, ToxiProxy. https://en.wikipedia.org/wiki/Soak_testing
  • #23: Blue/green: deploy the current and new versions into two identical environments (blue, green). Do a smoke test on the new version, then switch traffic to it. A canary release incrementally switches traffic from the current to the new version using an LB. Use Akamai or an equivalent to do canary releases. The name comes from a tactic used by coal miners: they’d bring canaries with them into the coal mines to monitor the levels of carbon monoxide in the air; if the canary died, they knew that the level of toxic gas in the air was high, and they’d leave the mines. In either case you should be able to roll back if the new version doesn’t work. Graceful shutdown and switching DB/storage are the challenges. GitHub routes requests to both blue and green and compares the results to make sure they are identical. Dark launch: deploy new features without enabling them for users. Make sure they won’t cause any issues in production, then enable them.
  • #24: This is how it works in App Service. You can have up to 15 deployment slots
  • #25: Deploy a new feature to the prod env without enabling it for users. Make sure it works within the prod infrastructure, with no memory leaks or other issues. Then enable it for users in the UI. If something bad happens, disable it in the UI. Facebook does this.
  • #26: All other proven practices are in this doc. You can use this list when you have ADR with your customers. Give us feedback.