Applying Chaos Engineering to Build Resilient Serverless Applications

Emrah Şamdan
(@emrahsamdan)
4/25/2019

Who am I?
● Developer for 6+ years
● Product guy for 2 years
● VP of Product for Thundra
● Organizing committee
● ServerlessDays İstanbul
On October 11th!

Agenda
● What’s chaos engineering?
● Why chaos testing on serverless?
● Best practices on chaos testing for serverless
● How to apply chaos testing on AWS Lambda
● How to apply silence in a world of chaos

Why chaos engineering?
Unit Tests
● My function is running properly
and meets the expectations.
Integration Tests
● My system is running properly
and meets the expectations.
UI/UX Tests
● It works like a charm!

Your third-party API slows down badly.

Some part of your system becomes unreachable.

Your cache/DB is down so you can’t load your data.

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
http://principlesofchaos.org/

Chaos Engineering is
● Like injecting a vaccine into your system to make it more immune.
● Improving your system’s resilience by uncovering weaknesses.
● Identifying failures before they become outages.
● Understanding the steady state of your system and challenging it.

Chaos Engineering is not
● Breaking production on purpose.
● Blaming a group of people.
● Surprising your colleagues with partial outages.
● Taking down the whole system at once.

History of chaos engineering
[Timeline figure: 2010, 2011, 2014, 2019]

Companies applying Chaos Engineering

Stages of chaos engineering
● Define the steady state
● Form a hypothesis about the steady state of the system under the designed failure
● Run your experiment
○ Define the blast radius
○ Define the halting condition
○ Have a rollback plan!
● Verify & Learn
○ If your system breaks, you have found an issue before it causes an outage. Go fix it!
○ If it is resilient, congrats! Now, inject some other failure!
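
The stages above can be sketched as a small experiment runner. This is only an illustration under stated assumptions: the `Experiment` class, its fields, and `run_experiment` are hypothetical names, not part of any chaos tool, and the blast radius is assumed to be limited inside the `inject` callback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (all names are illustrative)."""
    name: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # start the failure (keep the blast radius small)
    rollback: Callable[[], None]      # undo the failure
    should_halt: Callable[[], bool]   # halting condition, checked on every step
    steps: int = 5                    # number of observations to take


def run_experiment(exp: Experiment) -> str:
    # Never start an experiment on a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped: system not in steady state"
    exp.inject()
    try:
        for _ in range(exp.steps):
            # Stop early if the halting condition fires.
            if exp.should_halt():
                return "halted: halting condition reached"
            # Hypothesis: the steady state still holds while the failure is active.
            if not exp.steady_state():
                return "weakness found: steady state violated"
        return "hypothesis held: system was resilient"
    finally:
        # Always roll back, whatever the outcome.
        exp.rollback()
```

In a real setup `steady_state` would probably read a business or latency metric, and `should_halt` would be wired to the stop button.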

Don’t break on purpose!
● Start experimenting with the first row, the leftmost cell: known-knowns.
● Blast radius: keep the effect as small as possible.
● Put a stop button somewhere!
● Plan how you will learn.
● You don’t need to do it on production the first time.
● Most important: let other people know! Surprise chaos is not funny. Not at all!

Chaos examples
● Your system keeps records in the DB.
● The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely
accessible.
Result: People experience timeouts while waiting for results.

Chaos when everything is more granular.
SERVERLESS

Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day
characteristics.
Different throttling and different retry mechanisms for different services.

Every function has its own configuration
● Timeouts
● IAM Roles

What would you do when your region is down?

Common weaknesses in serverless
● Nested functions with improper timeouts

● Unhandled errors from upstream services

● Failures in resources
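
For the first weakness, nested functions with improper timeouts, one possible guard is to derive the downstream timeout from the time the calling function has left. The sketch below is an assumption-laden illustration: `downstream_timeout_seconds`, `reserve_ms`, and `cap_ms` are hypothetical names and values. Only `context.get_remaining_time_in_millis()` comes from the AWS Lambda Python runtime.

```python
def downstream_timeout_seconds(remaining_ms: int, reserve_ms: int = 500, cap_ms: int = 3000) -> float:
    """Timeout for a downstream call, keeping `reserve_ms` back for fallback handling."""
    budget_ms = min(max(remaining_ms - reserve_ms, 0), cap_ms)
    return budget_ms / 1000.0


def handler(event, context):
    # The Lambda context object reports how long this invocation may still run.
    timeout = downstream_timeout_seconds(context.get_remaining_time_in_millis())
    if timeout == 0:
        # Not enough time left to call downstream safely: degrade instead of timing out.
        return {"status": "degraded", "reason": "no time left for downstream call"}
    # A real handler would pass `timeout` to its downstream client here.
    return {"status": "ok", "downstream_timeout_s": timeout}
```

The point of the reserve is that the caller times out the downstream call before Lambda times out the caller, so it still has time to return a fallback.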

Chaos experiments in serverless
● Inject latency into downstream services
● Inject failures into resources

Injecting latency
● Don’t attack your system.
● You don’t need to do it on prod first.
● There is no point in injecting latency into async calls.
Hypothesis: The entry point Lambda will degrade gracefully when the
downstream Lambda times out or returns really late.
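
A minimal sketch of latency injection around a synchronous call, assuming a hand-rolled decorator and not any particular tool. The environment variable names (`CHAOS_ENABLED`, `CHAOS_RATE`, `CHAOS_DELAY_MS`) and `call_downstream` are hypothetical. Injection is off by default, and the rate keeps the blast radius small.

```python
import functools
import os
import random
import time


def inject_latency(func):
    """Delay a call by CHAOS_DELAY_MS for a CHAOS_RATE fraction of invocations.

    The environment variable names are hypothetical; injection is off by default.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Read the settings on every call so the stop button works without a redeploy.
        enabled = os.environ.get("CHAOS_ENABLED", "false").lower() == "true"
        rate = float(os.environ.get("CHAOS_RATE", "0.01"))
        delay_ms = int(os.environ.get("CHAOS_DELAY_MS", "0"))
        if enabled and random.random() < rate:
            time.sleep(delay_ms / 1000.0)
        return func(*args, **kwargs)
    return wrapper


@inject_latency
def call_downstream(payload):
    # Stand-in for a synchronous call to a downstream function or service.
    return {"echo": payload}
```

Setting `CHAOS_ENABLED` back to `false` acts as the halting switch for the experiment.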

Where else to inject?
Inject latency to resources, too.

Injecting Latency to resources by Yan Cui

How to inject latency with Thundra

Injecting Error
● Connection errors with third party services
● Cache down
● AWS Resource is unreachable

What if we lose the connection to Redis?
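
One way to explore this question without any specific tool is to put a failure switch on a stand-in cache and check that reads fall back to the database. Everything in this sketch is hypothetical: `FlakyCache` is an in-memory stand-in, not a Redis client, and `load_user` and `load_from_db` are illustrative names.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache when a connection error is injected."""


class FlakyCache:
    """In-memory stand-in for a cache client; `fail` simulates a lost connection."""

    def __init__(self):
        self.data = {}
        self.fail = False

    def get(self, key):
        if self.fail:
            raise CacheUnavailable("injected connection error")
        return self.data.get(key)

    def set(self, key, value):
        if self.fail:
            raise CacheUnavailable("injected connection error")
        self.data[key] = value


def load_user(user_id, cache, load_from_db):
    """Read through the cache, falling back to the database if the cache is down."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return load_from_db(user_id)  # degrade: skip the cache entirely
    value = load_from_db(user_id)
    try:
        cache.set(user_id, value)
    except CacheUnavailable:
        pass  # a failed cache write must not fail the request
    return value
```

The hypothesis under test would be that requests still succeed, only more slowly, while `fail` is set.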

Let’s inject an error into Redis with Thundra

Common fixes
● Exponential backoff
● Properly tuned timeouts
● Circuit breakers
● Use async communication when possible
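
Two of the fixes above, exponential backoff and circuit breakers, can be sketched in a few lines. This is a simplified illustration and not a production library: `retry_with_backoff`, `CircuitBreaker`, and all default values are assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(func, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func, retrying on any exception with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial call after `reset_after` seconds."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call skipped")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The `sleep` and `clock` parameters exist only so the behavior can be checked without real waiting.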

Don’t forget! The aim is
● Not to break but to improve
● Not to blame people but to give them room to fix
● Not to surprise your colleagues but to make your system resilient
