Chaos Engineering on Kubernetes
DAVID HSU
What is Chaos Engineering
- The story starts at Netflix
- Chaos on a daily basis
- Build confidence
- Identify weak points in a system
- Break things on purpose
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Fallacies of Distributed Computing
- Game Day
- Application Failure
- I/O, Network, Response Delay and Error
- CPU/Mem Overload / Container Kills
- Security Leak
- System Capability / Stress Test
- Kubernetes Master/Node Failure
- Cloud Provider AZ/Region Failure
What Does Chaos Engineering Do
Why Do We Need Chaos
Incidents can happen at any time
- What if the Gateway fails while traffic is coming in?
- What if Service Discovery fails?
- What if Redis/Aurora fails or starts a failover?
- What if Kafka fails?
- What if any of the Core Services fails?
- What if Grafana/Prometheus/Victoria-Metrics fails?
What will we do? And what will happen?
Have You Ever Thought About…?
- Auto-scaling based on CPU/memory load
- Rate Limiting
- ECS Tasks auto-recovery / EKS Pods auto-recovery
- Redis/Aurora/RDS/Kafka auto-failover
- Monitoring and Alerting system
But how much do you really know about them?
Yes, You Might Have...
We actually know nothing about our system without practice and testing.
Practice makes perfect, and this is why we need to implement Chaos Engineering.
It exercises not only our systems but also our confidence and our processes.
Why
IMPLEMENTATION
Where to Start
Phases
- An experiment should start and end in a Steady State
- This ensures the experiment is not affected by other, unexpected events
Example:
- If we are going to terminate Pods of a Deployment on K8S, we have to make sure the Deployment
is fully healthy before we start. Otherwise the experiment is pointless, because we will not be
able to evaluate its result.
- After executing the experiment, we expect the Deployment to return to “Steady State”.
If it does, the Deployment passed the experiment; if it does not, it failed.
Steady State
[Diagram: Steady State → Action Executed → Steady State]
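The Deployment example above can be sketched as a small steady-state check. This is only an illustration: it assumes the Deployment has already been fetched as a plain dictionary (for example from `kubectl get deployment -o json`), and the function name `deployment_is_steady` is hypothetical. The fields it reads (`spec.replicas`, `status.updatedReplicas`, `status.readyReplicas`, `status.availableReplicas`) are standard Deployment fields.

```python
from typing import Any, Mapping


def deployment_is_steady(deployment: Mapping[str, Any]) -> bool:
    """Return True when every desired replica is updated, ready and available."""
    # Kubernetes defaults spec.replicas to 1 when it is not set.
    desired = deployment.get("spec", {}).get("replicas", 1)
    status = deployment.get("status", {})
    # Zero-valued status counters are omitted by the API, so default to 0.
    return all(
        status.get(field, 0) == desired
        for field in ("updatedReplicas", "readyReplicas", "availableReplicas")
    )


healthy = {
    "spec": {"replicas": 3},
    "status": {"updatedReplicas": 3, "readyReplicas": 3, "availableReplicas": 3},
}
degraded = {
    "spec": {"replicas": 3},
    "status": {"updatedReplicas": 3, "readyReplicas": 2, "availableReplicas": 2},
}

print(deployment_is_steady(healthy))   # True
print(deployment_is_steady(degraded))  # False
```

Running the same check before and after the action gives the two "Steady State" boxes of the diagram.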
- What if Gateway fails?
- What if website latency increases by 300ms?
- What if Kubernetes cluster fails?
- What if Aurora/Redis fails over?
- What if …?
None of these questions has a single right answer!
Hypothesis
- Do not run experiments in PROD at the beginning
- Minimize the blast radius and start small
- Notify everyone before execution
- Have a “STOP” button for experiments
- Have a “Rollback” plan
Run Experiment
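A minimal sketch of how these guardrails could look in code. All names here (`STOP`, `pick_targets`, `inject`) and the 25% default cap are assumptions made for illustration, not part of any chaos tool: the target list is capped to a fraction of the Pods, PROD is refused unless explicitly allowed, and a stop flag is checked before every step.

```python
import math
import threading
from typing import Callable, List, Sequence

# Hypothetical "STOP" button: calling STOP.set() aborts a running injection.
STOP = threading.Event()


def pick_targets(pods: Sequence[str], environment: str,
                 max_fraction: float = 0.25,
                 allow_prod: bool = False) -> List[str]:
    """Choose a capped subset of pods so the blast radius stays small."""
    if environment == "prod" and not allow_prod:
        raise PermissionError("experiments in PROD must be enabled explicitly")
    if not pods:
        return []
    # Always allow at least one target, never more than the configured fraction.
    limit = max(1, math.floor(len(pods) * max_fraction))
    return list(pods[:limit])


def inject(targets: Sequence[str], action: Callable[[str], None]) -> List[str]:
    """Apply `action` to each target, checking the STOP flag before every step."""
    affected = []
    for target in targets:
        if STOP.is_set():
            break
        action(target)
        affected.append(target)
    return affected


pods = [f"web-{i}" for i in range(8)]
targets = pick_targets(pods, environment="staging")
print(targets)                            # ['web-0', 'web-1']
print(inject(targets, lambda pod: None))  # ['web-0', 'web-1']
```

The list returned by `inject` records what was actually touched, which is what a rollback plan would need to undo.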
- Did the experiment succeed or fail? WHY?
- Did the monitoring and alerting system detect the failures?
- Were any other services affected by this experiment?
Verify
- How do we improve resilience?
- Document the results
- What did “your team” learn?
- It’s all about building confidence, blamelessly!
Don’t wait for a real outage to learn; run chaos experiments regularly instead.
Improve
- Define the Steady State
- Create a Hypothesis
- Run Experiment
- Verify and Learn
- Improve and Fix
How Do We Plan
[Diagram: Normal Steady State → Action Executed → either Normal Steady State (Rollback, Learning) or Not In Steady State (FAILED, Learning)]
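The flow above can be sketched as a small function. This is an illustrative assumption about how the steps fit together, not the behaviour of a specific tool: `run_experiment` and its three callables are hypothetical names, and the sketch chooses to run the rollback on both outcomes so the injected fault never lingers.

```python
from typing import Callable


def run_experiment(steady_state: Callable[[], bool],
                   action: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Steady state, then action, then steady state again, then rollback."""
    if not steady_state():
        return "SKIPPED"       # never start from an unhealthy system
    action()
    try:
        recovered = steady_state()
    finally:
        rollback()             # always undo the injected fault
    return "SUCCESS" if recovered else "FAILED"


# Toy system: the action breaks it and only the rollback repairs it.
system = {"healthy": True}
result = run_experiment(
    steady_state=lambda: system["healthy"],
    action=lambda: system.update(healthy=False),
    rollback=lambda: system.update(healthy=True),
)
print(result, system)  # FAILED {'healthy': True}
```

Both "SUCCESS" and "FAILED" feed the learning step; only the follow-up work differs.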
Scenario
CoreDNS fails in a K8S cluster
Steady-State
All services in the cluster are healthy
Hypothesis
If we kill all CoreDNS Pods, they should recover automatically and all services should stay healthy.
Run Experiment
Notify everyone, then kill all CoreDNS Pods.
Verify
The CoreDNS failure alarm triggered after 1s, but one of our services kept logging error
messages even after CoreDNS had recovered.
Improve
I feel great rather than upset, because we found a weakness in the system. Let’s fix the
service and record this experiment.
Example
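As a rough sketch, the CoreDNS scenario could be written as a Chaos Toolkit experiment and serialized to JSON from Python. Several details are assumptions: the `k8s-app=kube-dns` label and `kube-system` namespace match common CoreDNS installs but may differ in your cluster, the `default` namespace for the health probe and the 30 second pause are placeholders, and the probe and action names (`all_microservices_healthy`, `terminate_pods`) are taken from the chaostoolkit-kubernetes extension as I understand it, so they should be checked against the installed version.

```python
import json

experiment = {
    "title": "CoreDNS Pods are killed",
    "description": "All services should stay healthy while CoreDNS recovers.",
    # Checked before and after the method by Chaos Toolkit.
    "steady-state-hypothesis": {
        "title": "All services in the cluster are healthy",
        "probes": [
            {
                "type": "probe",
                "name": "all-services-healthy",
                "tolerance": True,
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.probes",
                    "func": "all_microservices_healthy",
                    "arguments": {"ns": "default"},  # assumed namespace
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-all-coredns-pods",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                    "label_selector": "k8s-app=kube-dns",  # assumed label
                    "ns": "kube-system",
                    "all": True,
                },
            },
            # Give the Deployment time to recreate the Pods (assumed value).
            "pauses": {"after": 30},
        }
    ],
    "rollbacks": [],
}

print(json.dumps(experiment, indent=2))
```

The printed JSON can be saved as a file and passed to `chaos run`. Note that a cluster-level health probe like this one would not by itself catch the service that kept logging errors; that finding came from the verification step.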
TOOL
- Kubernetes compatible
- Extensible
- Observable
- Able to simulate:
  - Network delay
  - Pod failure
  - Node failure
  - etc.
Requirements
An implementation of Netflix's Chaos Monkey for Kubernetes
- Deploy: kube-monkey runs as a Pod and injects experiments into target
Pods that carry specific labels.
- Types: Pod kills only.
- Configuration: The schedule is configurable.
https://github.com/asobti/kube-monkey
Chaos Monkey (kube-monkey)
Alibaba's open-source experiment injection tool
- Deploy: Kubernetes CRD and command line
- Types: Multiple
- Configuration: CRD and YAML
https://github.com/chaosblade-io/chaosblade
ChaosBlade
https://aws.amazon.com/fis
AWS Fault Injection Simulator (FIS)
https://github.com/chaos-mesh/chaos-mesh
Chaos Mesh - Kubernetes
Gremlin (SaaS)
https://github.com/chaostoolkit/chaostoolkit-kubernetes
Chaos Toolkit - Kubernetes
[Diagram: Steady State → Action Executed → either Steady State (SUCCESS, Learning) or Not In Steady State (FAILED, Learning)]
CRD
Customized Service Account and Security Context
Fetch Experiment JSON from ConfigMap
K8S CronJob to Run Experiment
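To make the pieces above concrete, here is a hedged sketch of a plain Kubernetes CronJob that mounts the experiment JSON from a ConfigMap and runs it with `chaos run`. It is not the CRD-based setup shown on the slide. The names (`chaos-experiment`, `chaos-runner`, the `chaos` namespace), the schedule and the container image are placeholders; the image is assumed to contain Chaos Toolkit with the Kubernetes extension installed.

```python
import json

cronjob = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "chaos-experiment", "namespace": "chaos"},
    "spec": {
        "schedule": "0 3 * * 1-5",       # placeholder: 03:00 on weekdays
        "concurrencyPolicy": "Forbid",   # never overlap two experiment runs
        "jobTemplate": {"spec": {
            "backoffLimit": 0,           # do not retry a failed experiment
            "template": {"spec": {
                # Service account with only the permissions the experiment needs.
                "serviceAccountName": "chaos-runner",
                "restartPolicy": "Never",
                "securityContext": {"runAsNonRoot": True},
                "containers": [{
                    "name": "chaostoolkit",
                    # Hypothetical image bundling chaostoolkit and chaosk8s.
                    "image": "example.com/chaos-runner:latest",
                    "command": ["chaos", "run", "/experiments/experiment.json"],
                    "volumeMounts": [
                        {"name": "experiment", "mountPath": "/experiments"}
                    ],
                }],
                # The ConfigMap is expected to hold a key named experiment.json.
                "volumes": [
                    {"name": "experiment",
                     "configMap": {"name": "chaos-experiment"}}
                ],
            }},
        }},
    },
}

print(json.dumps(cronjob, indent=2))
```

Since kubectl accepts JSON manifests, the printed output can be piped into `kubectl apply -f -`.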
Check Steady State Hypothesis
Method Executes
Check Steady State Hypothesis
Experiment
Detailed Logs
Steady State Check is True
Production
ECS
ASG
PROS AND CONS
Pros
- CRD/CRO and JSON are really handy.
- Almost all basic experiments are supported natively, such as killing Pods/nodes and network delay.
- Uses a K8S CronJob to run experiments.
Cons
- The documentation is very incomplete, which cost us a lot of time exploring and testing on our own.
- Notifications and reports are weak.
- No UI or web console.
- The emergency STOP method is to kill the CRO, which is not easy to do on PROD.
Principles
- Chaos engineering is not only about breaking things in production; it's a journey, and it’s
about confidence.
- Before injecting failure, remember that it is essential to have excellent monitoring and
alerting in place.
- Minimize the blast radius and start small.
- It is not just about infrastructure, or even just about technology.
- Don’t rely on luck.
You don’t choose the moment, the moment
chooses you. You only choose how prepared
you are when it does.
Thanks
