STRICTLY CONFIDENTIAL
Application Resilience: Challenges & Good Practice
City Business Club - ISITC Europe | July 2019
Dr Aled Sage
aled@cloudsoft.io
© Cloudsoft Corporation 2019
About Aled Sage
VP Engineering @ Cloudsoft
PhD and 20 years’ experience with distributed systems
Expertise in Cloud, DevOps and Automation
Across different verticals, including financial services
@aledsage
aled@cloudsoft.io
linkedin.com/in/aledsage
github.com/aledsage
Agenda
1. What is resiliency?
2. Changing landscape
3. Case study
4. Problems & solutions
What is Resiliency
• Business applications: CRM, payroll, analysis, trading systems, settlement systems, …
• Service level objectives (SLOs) and service level agreements (SLAs)
• Aim is not 100% availability
  Google Site Reliability Engineers (SRE): “Striving for Imperfection: Using an error budget to move fast without compromising high reliability”
• “Availability” is not binary
• Design for failure
• Proactive and reactive
  Proactive: high availability (HA); avoid problems
  Reactive: know when something is wrong; know how to respond; repair; disaster recovery (DR)
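The error-budget idea in the Google SRE quote can be made concrete with a little arithmetic. The sketch below is illustrative only: the 99.9% SLO, the 30-day window and the function names are assumptions, not figures from the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    # An SLO of exactly 1.0 leaves no budget at all, which is why the
    # aim is not 100% availability; it is rejected here.
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over 30 days, with 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 21.6), 2))
```

A team might use the remaining fraction to decide whether to keep releasing or to pause and spend time on reliability work.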
Business Impact
Gartner: “Why Business Leaders Don’t Care About the Cost of Downtime”, 9 April 2019
Changing Landscape
• Private and public cloud
• Kubernetes and OpenShift (using containers)
• Bare-metal and mainframes
• Business wants agility
• Security and reliability still vital
• All while driving down costs
• Metro Bank, Monzo, Starling
• Facebook? Amazon?
Comparison with Disruptors
Incumbents | Disruptors
Hybrid environment | Everything is cloud-native
Chaos engineering? “No way!” | Kill servers in production
Wall Street bank: decrease VM provisioning to 9 hours | New container in seconds; new virtual machines in minute(s)
3-6 month releases | Small changes several times a day
100s of applications; many lines of business | Small product set
Legacy high-value apps | No legacy apps
Improve the basics:
● Improve resiliency and agility of existing apps
● Fit with old-world and the future
Case-study: Tier-1 Investment Bank
APPLICATION-CENTRIC APPROACH
✓ Empower and involve application architects and developers
✓ Codify best practices and use of the bank’s strategic tools
✓ Consistent way to automate resiliency
Self-healing applications!
Your application can automatically detect and fix problems
ACHIEVEMENTS UNLOCKED
✓ Application-level Resiliency at scale
✓ Meets business & regulatory needs
✓ Across multiple evolving platforms
Challenges
Manual: Manual procedures and tribal knowledge. Responses to degradation or failure not automated.
Bespoke: Wide variation in techniques and quality; bespoke processes.
Operations fragmented: Involves many systems. Fiefdoms & politics. Resiliency is a cross-cutting concern.
Rarely tested: Sufficient to meet auditors’ requirements. But little practice or variation in scenarios; manual testing is expensive.
Problems and Solutions
Problems | Solutions
Tribal knowledge, rarely tested | Gamedays
Processes, environments and applications are complicated and ever-changing | Post mortems & continual improvement
High availability and DR need an understanding of the application, but are infrastructure-focused | Application-centric approach
Bespoke / inconsistent approaches | Empower the application architects; codify and share best practices.
Problems and Solutions
Problems | Solutions
Manual runbooks | Automated recovery
Involves many systems | Encourage APIs and automation; “glue code” to codify their use; focus on common application architectures.
Runbooks and CMDB out-of-date | “Living model” of the application; deployment and in-life management codified in this model.
Automated Recovery
• Restart the web-servers (e.g. on an out-of-memory)
• Repave servers or clusters (e.g. when a restart fails, or an update is required)
• Auto-scale based on latency, load or time-of-day
• Manage DR within a region and/or across regions
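The restart-then-repave escalation on this slide can be sketched as a small policy object. This is a minimal illustration, not Cloudsoft AMP's API: `RecoveryPolicy`, `FakeServer`, the `max_restarts` default and the callback names are all hypothetical, chosen only to show how a manual runbook step might be codified.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryPolicy:
    """Hypothetical policy: try restarts first, then repave, then hand over to a human."""
    is_healthy: Callable[[], bool]
    restart: Callable[[], None]
    repave: Callable[[], None]
    max_restarts: int = 2  # assumed limit before escalating to a repave
    log: List[str] = field(default_factory=list)

    def recover(self) -> bool:
        """Return True if the component is healthy when the policy finishes."""
        if self.is_healthy():
            return True
        for attempt in range(1, self.max_restarts + 1):
            self.log.append(f"restart #{attempt}")
            self.restart()
            if self.is_healthy():
                return True
        # Restarts did not help: rebuild the server from scratch.
        self.log.append("repave")
        self.repave()
        healthy = self.is_healthy()
        if not healthy:
            self.log.append("escalate to operator")
        return healthy


class FakeServer:
    """Stand-in for a real server: restarts never fix it, a repave does."""
    def __init__(self) -> None:
        self.healthy = False

    def restart(self) -> None:
        pass  # simulate a restart that leaves the fault in place

    def repave(self) -> None:
        self.healthy = True


if __name__ == "__main__":
    server = FakeServer()
    policy = RecoveryPolicy(lambda: server.healthy, server.restart, server.repave)
    print(policy.recover(), policy.log)
```

Keeping the health check and the actions as plain callables means the same policy could wrap a cloud API, a container orchestrator or a mainframe job, which fits the hybrid estates described earlier in the talk.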
Self Managing Systems: Autonomics
Self configuring
• Automated configuration of components and systems follows high-level policies. Rest of system adjusts automatically and seamlessly
Self healing
• System automatically detects, diagnoses, and repairs software and hardware problems
Self optimizing
• Components and systems continually seek opportunities to improve their own performance and efficiency
Self protecting
• System automatically defends against malicious attacks or cascading failures. It uses early warning to anticipate and prevent system-wide failures
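One possible reading of "self optimizing" is the latency-based auto-scaling mentioned on the Automated Recovery slide. The sketch below uses a simple proportional rule (the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler); the function name, the latency target and the replica bounds are assumptions for illustration only.

```python
import math


def desired_replicas(current: int, p95_latency_ms: float,
                     target_latency_ms: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Scale the replica count in proportion to observed vs. target latency."""
    if current < 1 or target_latency_ms <= 0:
        raise ValueError("current must be >= 1 and target latency must be positive")
    # Proportional rule: twice the target latency suggests twice the replicas.
    raw = math.ceil(current * p95_latency_ms / target_latency_ms)
    # Clamp so that a noisy metric cannot scale to zero or without limit.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Assumed numbers: 4 replicas, 300 ms observed p95 against a 200 ms target.
    print(desired_replicas(4, 300, 200))
```

The lower bound keeps some redundancy for high availability even when load is light; the upper bound caps cost, in line with the "driving down costs" pressure noted earlier.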
Cloudsoft AMP
AMP streamlines development, operations and governance for any application on any cloud
Rapidly Scalable: Autonomic policy-based management
Cloud Enabling: Liberate applications from infrastructure
Powerful Tools: Consistent in-life management tooling and features
Improve Visibility: Unified management view of applications
Suggested Next Steps
This week
❏ Get the Gartner report: “Why Business Leaders Don’t Care About the Cost of Downtime” free of charge
❏ Talk to person(s) in charge of application resiliency
❏ Example post mortem from the last major incident; how was this knowledge shared and used?
❏ How are application authors involved in planning for and implementing resiliency?
❏ Are there consistent monitoring and recovery strategies (focusing on the 80:20 rule for existing apps)?
❏ How often is it tested? Are there gamedays?
Next 90 days
❏ Identify example app(s)
❏ Get a review from an expert of an example workload (“well-architected review”)
❏ Organise gamedays
❏ Introduce post mortems
Next 6 months
❏ Consider automation software at the application level for ‘low-hanging fruit’
❏ Involve and empower your application authors
get report for free
cloudsoft.io/report
e: aled@cloudsoft.io
w: cloudsoft.io
Thank You!
Any Questions?

