Chaos Engineering
Injecting failure for building
resilience in systems
Nice to meet you
YURY NIÑO
Software Engineer and Chaos
Engineer Advocate.
Loves building software applications, solving
resilience issues and teaching. Passionate about
reading, writing and cycling.
Agenda
● Resilience vs Reliability
● Why does the world need Resilience and
Reliability?
● Chaos Engineering
● Principles of Chaos
● Chaos in Practice
● Game Days
How many of you
have encountered a crash of
your systems in production?
A recognition for ...
This talk is dedicated to
the well-caffeinated #SystemAdministrators
who get woken up in the
middle of the night when “things go
bump”.
#EngineeringTeam #DigitalFactory
@jnhernandz @
What is a
Resilient System?
A resilient system can maintain an acceptable level
of service in the face of failure.
A resilient system can weather a storm, such as a
large-scale natural disaster or a controlled chaos
engineering experiment.
Tammy Bütow Principal SRE at Gremlin
https://securethegrid.com
A distributed system in production needs to be
resilient in order to be reliable, and this is precisely
the target that we Software Engineers, Systems
Engineers, Site Reliability Engineers and Chaos
Engineers always aim for.
Mine :)
Why does the world need
Resilient Systems?
Because ...
We are surrounded by
distributed systems.
When we read the news on our
cellphones, send an email or buy our
lunch ...
We do not tolerate their
failure!
February 28th, 2017 will be remembered
● Simple Storage Service (S3) went down in US-EAST.
● Outage lasted about 4 hrs.
● More than 100,000 websites across the world were impacted.
Me :(
The World is Chaotic!
● Distributed systems contain many moving
parts.
● Many things can go wrong.
○ Hard disks can fail.
○ The network can go down.
○ Customer traffic can overload the system.
How many of you know
what Chaos
Engineering is?
Chaos Engineering
It is the discipline of experimenting in
production on a distributed system in
order to reveal its weaknesses and to
build confidence in its resilience
capability.
https://principlesofchaos.org/
Chaos Engineering
It is deliberately inducing stress or
fault into software and/or hardware as
a way of learning/verifying things
about systems.
https://www.gremlin.com
Chaos Engineering is about
● Simulating the failure of a datacenter.
● Injecting latency between services.
● Randomly causing exceptions.
● Changing the system time (time travel).
● Emulating I/O errors.
http://principlesofchaos.org/
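Two of the items above, injecting latency and randomly causing exceptions, can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular chaos tool works: the names inject_faults, InjectedFault and get_dashboard, and the latency_s and error_rate parameters, are assumptions made up for this example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised by the wrapper to stand in for a real failure."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a callable so that each call may be delayed or may fail.

    latency_s  -- extra delay added before every call (hypothetical knob)
    error_rate -- probability in [0, 1] that a call raises InjectedFault
    rng, sleep -- injectable so the wrapper can be tested deterministically
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated latency between services
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(latency_s=0.01, error_rate=0.0)
def get_dashboard():
    # Hypothetical service call, used only for illustration.
    return {"status": "ok"}
```

Raising error_rate or latency_s in small steps is one way to keep the blast radius of such an experiment under control.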
History of Chaos Engineering
● 2008: Chaos Engineering began at Netflix.
● 2010: Chaos Monkey was launched.
● 2014: The role of Chaos Engineer was created.
● 2018: A lot of resources for Chaos Engineering.
Kolton Andrus
Chaos in Practice
Principles of
Chaos
https://principlesofchaos.org/
1. Steady State
2. Hypothesis: Circuit Breaker builds Resilience
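For readers who have not met the pattern, a circuit breaker can be sketched as follows. This is a minimal sketch under stated assumptions, not the Hystrix implementation mentioned later in the talk: the class name, the failure_threshold and reset_timeout_s parameters, and the rule of opening after a number of consecutive failures are all choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed: consecutive failures
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock            # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast instead of waiting on a sick dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            if fallback is not None:
                return fallback()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The hypothesis "Circuit Breaker builds Resilience" then amounts to saying that callers keep getting a fast answer, the fallback, while the dependency is down.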
4. Run the Experiment
Application Name: Finer
Observability: DataDog
Hypothesis: Circuit Breaker works
Environment: My Home
Duration: 5 - 10 seconds
Load: 1 request
Results:
Actions:
4. Run the Experiment
Application Name: Finer
Observability: DataDog
Hypothesis: Circuit Breaker works. Facing latencies > 5 seconds between
dashboard_api and smart_api should open the circuit.
Environment: My Home
Duration: 20 milliseconds
Load: 1 request
Results: Issue #4356
Actions: Configure the proper Hystrix parameters according to the results.
Implement a fallback.
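The flow of this experiment (check the steady state, inject the latency, verify the hypothesis) can be sketched as below. Everything here is a simplified assumption: the latencies are plain numbers rather than real delays, LatencyBreaker is a toy that opens on a single slow call, and the 5 second timeout is taken from the hypothesis above. Real Hystrix decides based on an error percentage over a rolling window, which this sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Fields loosely mirror the experiment sheet above."""
    application: str
    hypothesis: str
    environment: str
    load_requests: int
    results: list = field(default_factory=list)


@dataclass
class LatencyBreaker:
    """Toy breaker: opens when a call is slower than timeout_s (assumed rule)."""
    timeout_s: float = 5.0
    is_open: bool = False

    def call(self, latency_s: float) -> str:
        if self.is_open:
            return "fallback"      # fail fast, dependency is not contacted
        if latency_s > self.timeout_s:
            self.is_open = True    # a slow call trips the breaker
            return "fallback"
        return "ok"


def run(experiment: Experiment, injected_latency_s: float) -> bool:
    """Return True when the injected latency opened the circuit."""
    breaker = LatencyBreaker(timeout_s=5.0)

    # 1. Steady state: without injected latency the call succeeds.
    steady = breaker.call(latency_s=0.02)
    experiment.results.append(f"steady state: {steady}")
    if steady != "ok":
        return False  # do not inject failure into an unhealthy system

    # 2. Inject the fault for each request in the load.
    for _ in range(experiment.load_requests):
        outcome = breaker.call(latency_s=injected_latency_s)
        experiment.results.append(f"with {injected_latency_s}s latency: {outcome}")

    # 3. Verify the hypothesis.
    experiment.results.append(f"circuit open: {breaker.is_open}")
    return breaker.is_open
```

Calling run(Experiment("Finer", "Circuit Breaker works", "My Home", 1), 6.0) reports whether the hypothesis held and leaves a small log in results that could feed the Results and Actions fields.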
Game Days
Game Day: Roles
● Master of Disaster
● First on-call
● Team
https://www.pinterest.es/pin/824299538021645731/
Game Days can Transform our Teams
Even though Game Days are not real, they
help engineers gain confidence.
Since we engineers experience failure as part
of our job, we should start designing for failure.
Me :)
The best time to learn about fire
is when you’re on fire.
—Jen Hammond, New Relic engineering manager
How to begin ...
https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaos-engineering
https://www.infoq.com/chaos-engineering
@yurynino
