Why Resilience?
A primer at varying flight altitudes

Uwe Friedrichsen, codecentric AG, 2014
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
Resilience? Never heard of it …
re•sil•ience (rɪˈzɪl yəns) also re•sil′ien•cy, n.

1.  the power or ability to return to the original form, position,
etc., after being bent, compressed, or stretched; elasticity.
2.  ability to recover readily from illness, depression, adversity,
or the like; buoyancy.

Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd.
Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved.


http://www.thefreedictionary.com/resilience
Resilience (IT)

The ability of an application to handle unexpected situations
-  without the user noticing it (best case)
-  with a graceful degradation of service (worst case)
Resilience is not about testing your application

(You should definitely test your application, but that's a different story)
public class MySUTTest {
    @Test
    public void shouldDoSomething() {
        MySUT sut = new MySUT();
        MyResult result = sut.doSomething();
        assertEquals(<Some expected result>, result);
    }
    …
}
It's all about production!
Why should I care?
Business
Production
Availability
Resilience
Your web server doesn't look good …
The dreaded SiteTooSuccessfulException …
Reasons to care about resilience





•  Loss of lives
•  Loss of goods (manufacturing facilities)
•  Loss of money
•  Loss of reputation
Why should I care about it today?

(The risks you mention are not new)
Resilience drivers


•  Cloud-based systems
•  Highly scalable systems
•  Zero Downtime
•  IoT & Mobile
•  Social

à Reliably running distributed systems
What’s the business case?

(I don’t see any money to be made with it)
Counter question

Can you afford to ignore it?

(It’s not about making money, it’s about not losing money)
Resilience business case

•  Identify risk scenarios

•  Calculate current occurrence probability
•  Calculate future occurrence probability

•  Calculate short-term losses
•  Calculate long-term losses

•  Assess risks and money
•  Do not forget the competitors
Let’s dive deeper into resilience
Classification attempt
ISO/IEC 9126 software quality characteristics:
•  Functionality
•  Reliability
•  Usability
•  Efficiency
•  Maintainability
•  Portability

Reliability: A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time.

Available with acceptable latency
Resilience goes beyond that
How can I maximize availability?
Availability ≔ MTTF / (MTTF + MTTR)
MTTF: Mean Time To Failure
MTTR: Mean Time To Recovery
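
As a worked example of the formula above, the following minimal sketch computes availability for a few made-up numbers. The class name `Availability` and the figures (1000 hours between failures, 1 hour to recover) are assumptions for illustration only; they are not taken from the slides.

```java
// Sketch: steady-state availability from MTTF and MTTR.
// Both values must be given in the same time unit (hours here).
final class Availability {

    // Availability = MTTF / (MTTF + MTTR)
    static double of(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException(
                "MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        // Hypothetical baseline: one failure per 1000 hours, 1 hour to recover.
        System.out.printf("baseline:    %.6f%n", of(1000, 1));
        // Two ways to improve it: fail half as often, or recover twice as fast.
        System.out.printf("double MTTF: %.6f%n", of(2000, 1));
        System.out.printf("halve MTTR:  %.6f%n", of(1000, 0.5));
    }
}
```

Doubling MTTF and halving MTTR yield the same availability, since 2·MTTF / (2·MTTF + MTTR) equals MTTF / (MTTF + MTTR/2). Which lever is cheaper to pull depends on the system.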
Traditional approach (robustness)
Availability ≔ MTTF / (MTTF + MTTR)
Maximize MTTF
A distributed system is one in which the failure
of a computer you didn't even know existed
can render your own computer unusable.

Leslie Lamport
Failures in today's complex, distributed,
interconnected systems are not the exception.

They are the normal case.
Contemporary approach (resilience)
Availability ≔ MTTF / (MTTF + MTTR)
Minimize MTTR
Do not try to avoid failures. Embrace them.
What kinds of failures do I need to deal with?
Failure types



•  Crash failure
•  Omission failure
•  Timing failure
•  Response failure
•  Byzantine failure
How do I implement resilience?
Bulkheads
•  Divide the system into failure units
•  Isolate failure units
•  Define fallback strategy
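
One possible way to sketch the three bullets above in plain Java is a semaphore-based bulkhead: each failure unit gets a fixed number of permits, and callers that find the unit full or failing receive a fallback. The class `Bulkhead` and its parameters are hypothetical names chosen for this illustration, not something the slides prescribe.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of a bulkhead: at most maxConcurrent callers may be inside the
// protected failure unit at once; everyone else gets the fallback.
final class Bulkhead<T> {
    private final Semaphore permits;
    private final Supplier<T> fallback;

    Bulkhead(int maxConcurrent, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.fallback = fallback;
    }

    T call(Supplier<T> action) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // unit is saturated: fail fast, do not queue
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            return fallback.get(); // the failure does not leave the unit
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        Bulkhead<String> recommendations = new Bulkhead<>(10, () -> "no recommendations");
        System.out.println(recommendations.call(() -> "top 5 items"));
        System.out.println(recommendations.call(() -> {
            throw new IllegalStateException("backend down");
        }));
    }
}
```

Using `tryAcquire` instead of a blocking `acquire` is a deliberate choice in this sketch: a saturated unit answers immediately rather than tying up the caller's thread.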
Redundancy
•  Elaborate use case

Minimize MTTR / scale transactions / handle response errors / …
•  Define routing & balancing strategy

Round robin / master-slave / fan-out & quickest one wins / …
•  Consider admin involvement

Automatic vs. manual / notification – monitoring / …
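
The "fan-out & quickest one wins" strategy mentioned above can be sketched with the standard `ExecutorService.invokeAny`, which returns the result of the first task that completes successfully and cancels the rest. The class `FanOut`, the method `quickestWins`, and the simulated replicas are assumptions made for this illustration.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class FanOut {

    // Sends the same request to all replicas and returns the first successful
    // answer. Replicas that fail are ignored as long as one of them succeeds.
    static <T> T quickestWins(List<Callable<T>> replicas, long timeout, TimeUnit unit) {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            return pool.invokeAny(replicas, timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for replicas", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("no replica answered in time", e);
        } finally {
            pool.shutdownNow(); // interrupt the replicas that lost the race
        }
    }

    public static void main(String[] args) {
        // Hypothetical replicas: one slow, one failing, one fast.
        List<Callable<String>> replicas = List.of(
            () -> { Thread.sleep(2_000); return "slow replica"; },
            () -> { throw new IllegalStateException("replica down"); },
            () -> "fast replica");
        System.out.println(quickestWins(replicas, 1, TimeUnit.SECONDS));
    }
}
```

Fan-out trades extra load on the replicas for lower latency and tolerance of individual failures, so it suits read-only or idempotent requests best.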
Loose Coupling
•  Isolate failure units (complements bulkheads)
•  Go asynchronous wherever possible
•  Use timeouts & circuit breakers
•  Make actions idempotent
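
To illustrate the last bullet, here is a minimal sketch of an idempotent receiver: the caller attaches a request ID, and a request that is retried with the same ID is executed only once. The class `IdempotentReceiver` and the request IDs are hypothetical; a real system would also need to persist and expire the stored results.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Sketch: deduplicate requests by ID so that retries are safe.
final class IdempotentReceiver<R> {
    private final Map<String, R> results = new ConcurrentHashMap<>();

    // Runs the action the first time an ID is seen; later deliveries of the
    // same ID get the stored result and cause no further side effects.
    R handle(String requestId, Supplier<R> action) {
        return results.computeIfAbsent(requestId, id -> action.get());
    }

    public static void main(String[] args) {
        AtomicInteger balance = new AtomicInteger();
        IdempotentReceiver<Integer> receiver = new IdempotentReceiver<>();

        // The client timed out and retried: the same request arrives twice.
        receiver.handle("req-1", () -> balance.addAndGet(100));
        receiver.handle("req-1", () -> balance.addAndGet(100));

        System.out.println(balance.get()); // prints 100, not 200
    }
}
```

Idempotency is what makes timeouts and retries safe to combine: after a timeout the caller does not know whether the action ran, so repeating it must not change the outcome.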
Implementation Example #1

Timeouts
Timeouts (1)
// Basics
myObject.wait(); // Do not use this by default
myObject.wait(TIMEOUT); // Better use this
// Some more basics
myThread.join(); // Do not use this by default
myThread.join(TIMEOUT); // Better use this
Timeouts (2)
// Using the Java concurrent library
Callable<MyActionResult> myAction = <My Blocking Action>
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<MyActionResult> future = executor.submit(myAction);
MyActionResult result = null;
try {
    result = future.get(); // Do not use this by default
    result = future.get(TIMEOUT, TIMEUNIT); // Better use this
} catch (TimeoutException e) { // Only thrown if timeouts are used
    ...
} catch (...) {
    ...
}
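
The snippet above leaves the blocking action and the exception handling open. A self-contained variant might look like the following sketch; the class `TimeoutDemo`, the helper `callWithTimeout`, and the choice to return a fallback value on any failure are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutDemo {

    // Runs the action on a worker thread and waits at most timeoutMillis for it.
    // Returns the fallback if the action is too slow or throws.
    static <T> T callWithTimeout(Callable<T> action, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(action);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not linger
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the action itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(() -> "fast", 500, "fallback"));
        System.out.println(callWithTimeout(() -> {
            Thread.sleep(5_000); // simulated slow resource
            return "slow";
        }, 100, "fallback"));
    }
}
```

Creating an executor per call keeps the sketch short; production code would normally share a bounded pool, which also acts as a bulkhead.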
Timeouts (3)
// Using Guava SimpleTimeLimiter
Callable<MyActionResult> myAction = <My Blocking Action>
SimpleTimeLimiter limiter = new SimpleTimeLimiter();
MyActionResult result = null;
try {
    result =
        limiter.callWithTimeout(myAction, TIMEOUT, TIMEUNIT, false);
} catch (UncheckedTimeoutException e) {
    ...
} catch (...) {
    ...
}
Implementation Example #2

Circuit Breaker
Circuit Breaker – concept
[Figure: the Client sends its Request to the Resource through the Circuit Breaker. Lifecycle states: Closed, Open, Half-Open; transitions labeled "Resource unavailable" and "Resource available".]
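
The lifecycle can be sketched as a small state machine in plain Java. This is an illustrative simplification, not the implementation of any particular library: the class `CircuitBreaker`, the failure threshold, the open interval, and the injectable clock are all assumptions.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Sketch of the Closed / Open / Half-Open lifecycle.
// Coarse synchronization keeps it simple; a real breaker would not hold a
// lock while calling the resource.
final class CircuitBreaker<T> {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to stay open before a trial
    private final LongSupplier clock;   // injectable time source (for tests)
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // let the next request probe the resource
        }
        return state;
    }

    synchronized T call(Supplier<T> resource, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast without touching the resource
        }
        try {
            T result = resource.get();
            failures = 0;
            state = State.CLOSED; // resource available
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN; // resource unavailable
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreaker<String> breaker =
            new CircuitBreaker<>(2, 1_000, System::currentTimeMillis);
        Supplier<String> down = () -> { throw new IllegalStateException("down"); };
        breaker.call(down, () -> "fallback");
        breaker.call(down, () -> "fallback");
        System.out.println(breaker.state()); // OPEN after two failures
    }
}
```

While the breaker is open the resource gets time to recover, and the client gets an immediate answer instead of waiting for another timeout.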
Implemented patterns






•  Timeout
•  Circuit breaker
•  Load shedder
Supported patterns

•  Bulkheads (a.k.a. Failure Units)
•  Fail fast
•  Fail silently
•  Graceful degradation of service
•  Failover
•  Escalation
•  Retry
•  ...
Hello, world!
public class HelloCommand extends HystrixCommand<String> {
    private static final String COMMAND_GROUP = "default";
    private final String name;

    public HelloCommand(String name) {
        super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
        this.name = name;
    }

    @Override
    protected String run() throws Exception {
        return "Hello, " + name;
    }
}

@Test
public void shouldGreetWorld() {
    String result = new HelloCommand("World").execute();
    assertEquals("Hello, World", result);
}
Source: https://github.com/Netflix/Hystrix/wiki/How-it-Works
Fallbacks
•  What will you do if a request fails?
•  Consider failure handling from the very beginning
•  Supplement with general failure handling strategies
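
One general failure handling strategy is a fallback chain that degrades step by step, for example live data first, then a cached value, then a static default. The sketch below uses plain Java rather than any specific library; the class `Fallbacks`, the method `firstSuccessful`, and the example strategies are hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

final class Fallbacks {

    // Tries each strategy in order and returns the first result that does not
    // throw. If every strategy fails, the last-resort value is returned.
    static <T> T firstSuccessful(List<Supplier<T>> strategies, T lastResort) {
        for (Supplier<T> strategy : strategies) {
            try {
                return strategy.get();
            } catch (RuntimeException e) {
                // this level failed: degrade to the next one
            }
        }
        return lastResort;
    }

    public static void main(String[] args) {
        Supplier<String> liveService = () -> {
            throw new IllegalStateException("service down");
        };
        Supplier<String> cache = () -> "cached price: 42";
        List<Supplier<String>> chain = List.of(liveService, cache);

        System.out.println(firstSuccessful(chain, "price unavailable"));
    }
}
```

Deciding on the order of the chain and on the last-resort value is a business question as much as a technical one, which is why it pays to consider it from the very beginning.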
Scalability
•  Define scaling strategy
•  Think full stack
•  Apply D-I-D rule
•  Design for elasticity
… and many more


•  Supervision patterns
•  Recovery & mitigation patterns
•  Anti-fragility patterns
•  Supporting patterns
•  A rich pattern family


A different approach than traditional enterprise software development
How do I integrate resilience into my software development process?
Steps to adopt resilient software design






1.  Create awareness: Go DevOps
2.  Create capability: Coach your developers
3.  Create sustainability: Inject errors
Related topics





Reactive
Anti-fragility
Fault-tolerant software design
Recovery-oriented computing
Wrap-up



•  Resilience is about availability
•  Crucial for today's complex systems
•  Not caring is a risk
•  Go DevOps to create awareness
Do not avoid failures. Embrace them!