Applying Chaos Engineering to build resilient
serverless applications
Emrah Şamdan
(@emrahsamdan)
4/25/2019
Who am I?
● Developer for 6+ years
● Product guy for 2 years
● VP of Product for Thundra
● Organizing committee
● Serverlessdays İstanbul
On October 11th!
Agenda
● What’s chaos engineering?
● Why chaos testing on serverless?
● Best practices on chaos testing for serverless
● How to apply chaos testing on AWS Lambda
● How to apply silence in a world of chaos
Why chaos engineering?
Unit Tests
● My function is running properly
and meets the expectations.
Integration Tests
● My system is running properly
and meets the expectations.
UI/UX Tests
● It works like a charm!
Your third-party API slows down badly.
Some part of your system becomes unreachable.
Your cache/DB is down so you can’t load your data.
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
http://principlesofchaos.org/
Chaos Engineering is
● Like giving your system a vaccine to make it more
immune.
● Improving your system’s resilience by uncovering
weaknesses.
● Identifying failures before they become outages.
● Understanding the steady state of your system and
challenging it.
Chaos Engineering is not
● Breaking production on purpose.
● Blaming a group of people.
● Surprising your colleagues with partial outages.
● Taking down the whole system at once.
History of chaos engineering?
2010 2011 2014 2019
Companies applying Chaos Engineering
Stages of chaos engineering
● Define steady state
● Form a hypothesis about the steady state of the system under the designed failure
● Run your experiment
○ Define blast radius
○ Define halting condition
○ Have a rollback plan!
● Verify & Learn
○ If your system breaks, you have found an issue before it caused an outage. Go fix it!
○ If it is resilient, congrats! Now, inject some other failure!
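The loop above (steady state, injected failure, halting condition, rollback, verdict) can be sketched in a few lines. This is only an illustration: the `Experiment` class, its field names, and the `run` function are hypothetical and do not come from any chaos tool mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment; every field name here is hypothetical."""
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[], None]         # turn the failure on
    rollback: Callable[[], None]       # turn the failure off again
    max_checks: int = 5                # halting condition: stop after this many probes


def run(exp: Experiment) -> str:
    # 1. Never start from an unhealthy baseline.
    if not exp.steady_state():
        return "aborted: system was not in steady state"
    exp.inject()
    try:
        # 2. Probe the steady state while the failure is active.
        for _ in range(exp.max_checks):
            if not exp.steady_state():
                return "weakness found: go fix it"
        return "resilient: try another failure"
    finally:
        # 3. The rollback runs whether the hypothesis held or not.
        exp.rollback()
```

Putting the rollback in `finally` is the design choice that matters here: the failure is switched off even when a probe raises an exception.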
Don’t break on purpose!
● Start experimenting with the first row, the
leftmost cell: Known-knowns.
● Blast radius: keep the impact as small as
possible.
● Put a stop button somewhere!
● Plan how you learn.
● You don’t need to do it on production the
first time.
● Most important: let other people know!
Surprise chaos is not funny. Not at all!
Chaos examples
● Your system keeps records in the DB.
● The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely
accessible.
Result: People experience timeouts while waiting for results.
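One way to keep an experiment like this limited to roughly 1% of customers is to pick the affected customers with a stable hash, so the same customer is always in or out of the blast radius. A minimal sketch; the function name `in_blast_radius` and the idea of keying on a customer ID are assumptions made for illustration.

```python
import hashlib


def in_blast_radius(customer_id: str, percent: float = 1.0) -> bool:
    """Return True for a stable slice of roughly `percent`% of customers."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100
```

Only requests for which `in_blast_radius(customer_id)` is true would then get the slow DB path, and the selection does not change between invocations.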
You never fail!
Chaos when everything is more granular.
SERVERLESS
More Granular Functions
Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day
characteristics.
Different throttling, different retry mechanisms for different services.
Every function has its own configuration
● Timeouts
● IAM Roles
What would you do when your region is down?
Common weaknesses in serverless
● Nested functions with improper timeouts
● Unhandled errors from upstream services
● Failures in resources
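For the nested-function case, one possible guard is to derive the timeout of the inner call from the time the outer function has left, so the caller can still respond before its own timeout hits. The sketch below assumes the AWS Lambda Python runtime, whose context object provides `get_remaining_time_in_millis()`; the helper name, the 500 ms reserve, and the 3 s cap are illustrative values, not recommendations from the talk.

```python
def downstream_timeout_seconds(context, reserve_ms: int = 500, cap_s: float = 3.0) -> float:
    """Time budget for a nested call, derived from the caller's remaining time."""
    remaining_ms = context.get_remaining_time_in_millis()
    # Keep `reserve_ms` back so the caller can still build a fallback response.
    budget_s = max(0.0, (remaining_ms - reserve_ms) / 1000.0)
    return min(budget_s, cap_s)
```

The returned value would be passed as the timeout of whatever client makes the nested call.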
Chaos experiments in serverless
● Inject latency into downstream services
● Inject failures into resources
Injecting latency
● Don’t attack your system.
● You don’t need to do it on prod
first.
● There is no point in injecting
latency into async calls.
Hypothesis: The entry point Lambda will
degrade gracefully when the
downstream Lambda times out or returns
really late.
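The hypothesis can be tried locally before touching any deployed function. The sketch below is not the Thundra or Yan Cui implementation shown on the following slides. It is a plain-Python stand-in in which `inject_latency`, `downstream`, and `entry_point` are hypothetical names, and a thread pool plays the role of a synchronous call with a timeout.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from functools import wraps


def inject_latency(delay_s: float, rate: float = 1.0):
    """Decorator: sleep for delay_s before a fraction `rate` of calls."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                time.sleep(delay_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.3)
def downstream():
    # Hypothetical stand-in for a synchronous call to another Lambda.
    return {"source": "downstream"}


def entry_point(timeout_s: float = 0.1):
    """Call downstream, but fall back to a default answer if it is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(downstream)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return {"source": "fallback"}
    finally:
        pool.shutdown(wait=False)
```

With the injected 0.3 s delay and a 0.1 s timeout, `entry_point()` returns the fallback instead of hanging, which is the graceful degradation the hypothesis expects.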
Where else to inject?
Inject latency to resources, too.
How to inject latency
Injecting Latency to resources by Yan Cui
How to inject latency with Thundra
Injecting Error
● Connection errors with third party services
● Cache down
● AWS Resource is unreachable
What if we lose the connection to Redis?
Let’s inject an error into Redis with Thundra
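Without any particular tool, the same question can be asked with a thin wrapper that makes the cache fail on demand. This is a sketch under stated assumptions: `FailingCache`, `CacheDown`, and `load_profile` are hypothetical names, the cache is any object with a `get(key)` method, and a plain dict stands in for both Redis and the database.

```python
class CacheDown(ConnectionError):
    """Raised by the wrapper to imitate a lost cache connection."""


class FailingCache:
    """Wraps any object with a get(key) method and fails on demand."""

    def __init__(self, cache, enabled: bool = True):
        self._cache = cache
        self.enabled = enabled

    def get(self, key):
        if self.enabled:
            raise CacheDown("injected failure: cache unreachable")
        return self._cache.get(key)


def load_profile(user_id, cache, database):
    """Read through the cache, falling back to the database on cache errors."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # degrade: a dead cache should cost latency, not availability
    return database[user_id]
```

If `load_profile` still returns data while the wrapper is enabled, the function tolerates a lost cache connection. If it raises, the experiment has uncovered an unhandled error.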
Common fixes
● Exponential backoff
● Properly tuned timeouts
● Circuit breakers
● Use async communication when possible
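Two of these fixes, exponential backoff and a circuit breaker, are small enough to sketch. The code below is an illustrative minimal version, not a production library: all names, thresholds, and delays are assumptions, and the `sleep` and `clock` parameters exist only so the behavior can be exercised without real waiting.

```python
import random
import time


def retry_with_backoff(fn, attempts: int = 4, base_s: float = 0.1, cap_s: float = 2.0,
                       sleep=time.sleep):
    """Call fn(); on failure wait up to base_s * 2**n (capped, with jitter) and retry."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, min(cap_s, base_s * 2 ** n)))


class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures, retry after `reset_s`."""

    def __init__(self, threshold: int = 3, reset_s: float = 30.0, clock=time.monotonic):
        self.threshold, self.reset_s, self.clock = threshold, reset_s, clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: call skipped")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The random jitter spreads retries from many concurrent function instances over time, and the breaker stops a struggling dependency from being hit on every invocation.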
Don’t forget! The aim is
● Not to break but to improve
● Not to blame people but to give them room to fix
● Not to surprise your colleagues but to make your system resilient
Thank you!
