From Resilient to Antifragile
Chaos Engineering Primer
By @Sergiu_Bodiu
Solution Architect
DevOpsDays Singapore Conference
Singapore Spring User Group
What is an ARCHITECT
https://www.thekua.com/atwork/2016/11/the-well-rounded-architect/ (@patkua)
Risk management
The new normal:
from RESILIENT
to ANTIFRAGILE
A new way to look at organizations
Fragile: At risk of total failure / financial ruin
Resilient: Takes damage, avoids total failure, recovers
Robust: Absorbs uncertainty, repels blows, avoids damage
Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.
Blueprint for living in a Black Swan world.
Antifragile, and only the Antifragile, will make it.
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Software is a Single Point of Failure
Root Cause Analysis: Component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER failures are anticipated and thus guarded with extra redundancies.
Distributed Systems Complexity
Complexity is like Addiction…
Case study: How complexity creeps in - @jasonfried
https://m.signalvnoise.com/case-study-how-complexity-creeps-in-cba48023e6a1
Chaos Engineering
Discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
NETFLIX
http://principlesofchaos.org
Some outages in the Region
• SingTel fined a record $6m for Bukit Panjang exchange fire
• Telstra goes down again, people can't drink beer or catch Ubers
• Amazon Web Services outage causes Australian website chaos
Backups
"Backups always succeed. It's the restores that fail. Test your backups by practicing restores!"
Using Chaos Monkey
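The advice in the quote can be turned into a small automated restore drill. The sketch below is illustrative only: it assumes a backup is a plain file copy, and the helper names (`backup`, `restore`, `verify_restore`) are hypothetical. They are not part of Chaos Monkey or of any real backup tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Hypothetical backup step: copy the file into the backup directory."""
    target = backup_dir / (source.name + ".bak")
    shutil.copy2(source, target)
    return target


def restore(backup_file: Path, restore_dir: Path) -> Path:
    """Hypothetical restore step: copy the backup into a scratch location."""
    # Path.stem drops the trailing ".bak" suffix added by backup().
    target = restore_dir / backup_file.stem
    shutil.copy2(backup_file, target)
    return target


def verify_restore(source: Path, backup_file: Path) -> bool:
    """Practice a restore in a temporary directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = restore(backup_file, Path(scratch))
        return sha256_of(restored) == sha256_of(source)
```

Running `verify_restore` on a schedule, rather than only checking that the backup job exited cleanly, is one way to practice restores before an incident forces the first one.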
Netflix Simian Army
Suite of tools for keeping your cloud operating in top form.
https://github.com/Netflix/SimianArmy
Chaos Monkey
1. Active during normal working hours
2. Break things in production
3. Design better software services
4. Embracing failure
http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
https://github.com/Netflix/security_monkey
Monitors AWS and GCP accounts for policy changes and alerts on insecure configurations.
Security Monkey can be extended with custom account types, custom watchers, custom auditors, and custom alerters.
Other Monkeys
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Doctor Monkey
PRINCIPLES > TOOLS
Why do we do > What we do
Dejirafication
Alexey Krivitsky https://www.slideshare.net/krivitsky/dejirafication-clean-your-process
Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
Chaos Engineering Whitepaper 2016
Hypothesize
> sudo watch
• Start with steady state behavior.
• Monitor metrics that are visible.
• Capture an interaction between the users and the system.
TIP: Utilisation is Virtually Useless as a Metric!
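A steady-state hypothesis can be written down as a checkable rule on a user-visible metric. The sketch below is a minimal illustration, not a prescribed method: the metric name, baseline, and tolerance are assumptions chosen for the example, and `SteadyState` is a hypothetical class.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class SteadyState:
    """Hypothesis: a user-facing metric stays within a tolerance of its baseline."""

    metric: str       # assumed example: successful checkouts per second
    baseline: float   # value observed during normal operation
    tolerance: float  # allowed relative deviation, e.g. 0.10 for 10%

    def holds(self, samples: Sequence[float]) -> bool:
        """Return True if the mean of the samples stays within the tolerance."""
        if not samples:
            return False  # without data the hypothesis cannot be confirmed
        deviation = abs(mean(samples) - self.baseline) / self.baseline
        return deviation <= self.tolerance


hypothesis = SteadyState("checkouts_per_second", baseline=100.0, tolerance=0.10)
print(hypothesis.holds([98.0, 103.0, 95.0]))  # within 10% of the baseline
print(hypothesis.holds([70.0, 65.0, 80.0]))   # well below the baseline
```

The metric here measures an interaction between users and the system, in line with the tip above about utilisation, which says little about what users actually experience.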
Vary Events
> sudo halt
• Terminate virtual machine instances
• Inject latency into requests between services
• Fail requests between services
• Fail an internal microservice
• Make an entire region unavailable
TIP: Select only a subset of users
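Two of the events listed above, injecting latency and failing requests, can be sketched as a wrapper around a service call that affects only a subset of users. This is an assumption-laden illustration: `with_chaos`, `in_experiment`, and `InjectedFailure` are hypothetical names, and the hash-based bucketing is just one way to select a stable subset.

```python
import hashlib
import time
from typing import Any, Callable


class InjectedFailure(Exception):
    """Raised in place of a real downstream error during an experiment."""


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent


def with_chaos(
    call: Callable[..., Any],
    *,
    percent: int,
    latency_s: float = 0.0,
    fail: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap `call` so that selected users see added latency or a failure."""

    def wrapper(user_id: str, *args: Any, **kwargs: Any) -> Any:
        if in_experiment(user_id, percent):
            if latency_s > 0:
                sleep(latency_s)  # inject latency before the real call
            if fail:
                raise InjectedFailure(f"injected failure for user {user_id}")
        return call(user_id, *args, **kwargs)

    return wrapper
```

Because the bucket is derived from the user ID, the same users stay in the experiment group across requests, which keeps the blast radius small and the results comparable.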
Experiment
• End-to-end TESTING (expensive)
• Process is slow
• Configuration drift from production
• 92% of ERRORS could be prevented by simple testing
TIP: Customers don't behave like your JMeter script.
https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf
Automate
> sudo while (1)
• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters, and upgrade and patch systems.
TIP: Depending on the context, change the rate of each experiment.
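Running experiments continuously, each at its own rate, can be sketched as a tiny scheduler. The code below is a minimal illustration under stated assumptions: `Experiment` and `run_due` are hypothetical names, and the intervals in the usage note are arbitrary examples rather than recommendations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """One chaos experiment with its own repeat interval."""

    name: str
    interval_s: float                # how often to run; tune per context
    run: Callable[[], None]          # the action that injects the failure
    last_run: float = float("-inf")  # never run yet


def run_due(experiments: List[Experiment], now: float) -> List[str]:
    """Run every experiment whose interval has elapsed; return the names run."""
    ran = []
    for experiment in experiments:
        if now - experiment.last_run >= experiment.interval_s:
            experiment.run()
            experiment.last_run = now
            ran.append(experiment.name)
    return ran
```

A long-running loop could call `run_due(experiments, time.monotonic())` every few seconds, with a cheap experiment such as terminating one instance on a short interval and a costly one such as a region failover on a much longer interval.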
Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
TIP: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
Chaos Engineering Whitepaper 2016
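Comparing measured with expected impact can be as simple as recording a prediction before each experiment and flagging the runs that exceed it. The sketch below assumes error rate as the impact measure; `ExperimentResult`, `findings`, and the `slack` margin are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ExperimentResult:
    """Predicted and observed impact of one experiment."""

    name: str
    expected_error_rate: float  # impact predicted before the experiment
    measured_error_rate: float  # impact observed while it ran


def findings(results: Iterable[ExperimentResult], slack: float = 0.01) -> List[str]:
    """Return the experiments whose measured impact exceeded the prediction."""
    return [
        result.name
        for result in results
        if result.measured_error_rate > result.expected_error_rate + slack
    ]


results = [
    ExperimentResult("terminate-instance", 0.00, 0.002),
    ExperimentResult("inject-latency", 0.01, 0.08),
]
print(findings(results))  # experiments that need follow-up
```

Each name returned is a problem uncovered by breaking things on purpose, and therefore a candidate for a fix followed by a repeat of the experiment.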
Reference Architecture for Cloud Native Platform
https://content.pivotal.io/white-papers/the-upside-down-economics-of-building-your-own-platform
Pivotal Cloud Foundry
Chaos Lemur demo
Chaos Lemur = Chaos Monkey + PCF
Locust demo
Locust is an open-source Python load testing framework.
• Define user behaviour in code
• Can execute end-to-end user tests with sessions and cookies
• Scales out to multiple slaves (worker nodes) to increase load capacity
• Allows for distributed user paths based on percentages
Gatling is an open-source Scala load testing framework.
• High performance
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL
Lessons Learned
• Take a systematic approach to chaos testing.
• This is incredibly hard under pressure.
• Don’t wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture at the last minute is extremely dangerous.
• Join the community.
• Build relationships with the networking team, database team, third-party partners, vendors, etc.
• Make everything asynchronous (embrace failure, background tasks, retry, idempotence).
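The retry and idempotence items in the last lesson work together: a retry is only safe if repeating the request cannot apply its effect twice. The sketch below illustrates this under stated assumptions. `TransientError`, `retry`, and `PaymentService` are hypothetical, and the in-memory dictionary stands in for whatever durable store a real service would use.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure that may succeed if the operation is tried again."""


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the failure
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._processed: Dict[str, str] = {}
        self.charges = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        """Charge once per key; repeated calls return the original receipt."""
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.charges += 1
        receipt = f"receipt-{self.charges}"
        self._processed[idempotency_key] = receipt
        return receipt
```

If the first call charges the customer but the response is lost, the retried call with the same key returns the original receipt instead of charging again.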
The importance of reliability
Don't trust claims systems make about themselves & their dependencies.
Verify by breaking.
Clean your process
Culture > Principles > Tools
> Post Mortem
> sudo halt
Incident Start
Impact
Testing Pyramid
https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/
Further Reading
https://www.infoq.com/br/presentations/exercising-failure-at-netflix
https://www.infoq.com/podcasts/failure-as-a-service
https://www.infoq.com/articles/chaos-engineering
@Ops_Engineering https://www.youtube.com/watch?v=CZ3wIuvmHeM
@caseyrosenthal https://www.youtube.com/watch?v=Q4nniyAarbs
Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
Adrian Colyer: Simple Testing Can Prevent Most Critical Failures
Thank You
@sergiu_bodiu
Questions
Principles
Any developer building applications which run as a service. Ops engineers who deploy or manage such applications.
https://12factor.net
Anyone working in software who writes tests or maintains continuous integration pipelines.
http://www.10factor.ci
