High Availability Explained
Maciej Lasyk
Kraków, devOPS meetup #2
2014-01-28

Maciej Lasyk, High Availability Explained

1/14
“Anything that can go wrong, will go wrong”
Murphy's law

An electrical explosion and fire Saturday at a Houston data
center operated by The Planet has taken the entire facility offline.
The company claimed power to the facility was interrupted when a
transformer exploded. Officials report that three walls were blown
down, causing a fire.

Three walls of the electrical equipment room on the first floor
blew several feet from their original position, and the underground
cabling that powers the first floor of H1 was destroyed.

High Availability is in the eye of the beholder
CEO: we don't lose sales
Sales: we can extend our offer based on the HA level
Account managers: we don't upset our customers (that often)
Developers: we can be proud – our services are working ;)
System engineers: we can sleep well (and fsck, we love to!)
Technical support: no calls? Back to WoW then.. ;)

So how many 9's?

Monthly (30 days): 1 hour of outage means 100% - 0.13888% ~= 99.86112% availability
Yearly (365 days): 1 hour of outage means 100% - 0.01142% ~= 99.98858% availability

Availability              Downtime (year)    Downtime (month)
90% (“one nine”)          36.5 days          72 hours
95%                       18.25 days         36 hours
97%                       10.95 days         21.6 hours
98%                       7.30 days          14.4 hours
99% (“two nines”)         3.65 days          7.2 hours
99.5%                     1.83 days          3.6 hours
99.8%                     17.52 hours        86.4 minutes
99.9% (“three nines”)     8.76 hours         43.2 minutes
99.99% (“four nines”)     52.56 minutes      4.32 minutes
99.999% (“five nines”)    5.26 minutes       25.9 seconds
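
The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes the same 365-day year and 30-day month as the formulas above.

```python
# Period lengths are assumptions matching the table: 365-day year, 30-day month.
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600


def allowed_downtime(availability_percent, period_seconds):
    """Return the downtime budget in seconds for a given availability level."""
    return (1 - availability_percent / 100) * period_seconds


def availability_after_outage(outage_seconds, period_seconds):
    """Return the availability (in percent) left after one outage in the period."""
    return 100 * (1 - outage_seconds / period_seconds)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        yearly = allowed_downtime(pct, SECONDS_PER_YEAR)
        monthly = allowed_downtime(pct, SECONDS_PER_MONTH)
        print(f"{pct}%: {yearly / 60:.2f} min/year, {monthly / 60:.2f} min/month")
```

Calling `availability_after_outage(3600, SECONDS_PER_MONTH)` reproduces the "1 hour of outage per month" line above.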

https://jazz.net/wiki/bin/view/Deployment/HighAvailability

HA terminology
RPO: Recovery Point Objective; how much data can we afford to lose?
RTO: Recovery Time Objective; how long may recovery take?
MTBF: Mean Time Between Failures; average time between failures
(failure density function -> reliability function)

https://en.wikipedia.org/wiki/Mean_time_between_failures

SLA: Service Level Agreement; formal definitions (customer <-> provider)
OLA: Operational Level Agreement; definitions within the organization that help us keep the SLAs we provide
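
To make the MTBF entry concrete, here is a hedged sketch of how it feeds a reliability function and an availability figure. It assumes a constant failure rate (exponential model), which the slides do not state. MTTR (mean time to repair) is a term introduced here for illustration, and the function names are hypothetical.

```python
import math


def reliability(t_hours, mtbf_hours):
    """Probability of running t_hours without a failure.

    Assumes a constant failure rate, i.e. R(t) = exp(-t / MTBF).
    """
    return math.exp(-t_hours / mtbf_hours)


def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Example figures are invented: fails on average every 999 h, 1 h to repair.
    print(f"{steady_state_availability(999, 1):.3%}")
    print(f"{reliability(24, 999):.4f}")
```

Under these assumptions, shortening the repair time (which relates to the RTO) raises availability just as effectively as stretching the MTBF.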

SLAs..
So what is written in SLAs?
Availability              Downtime (year)    Downtime (month)
90%                       36.5 days          72 hours
95%                       18.25 days         36 hours
97%                       10.95 days         21.6 hours
98%                       7.30 days          14.4 hours
99%                       3.65 days          7.2 hours
99.5% (EC2, EBS)          1.83 days          3.6 hours
99.8%                     17.52 hours        86.4 minutes
99.9% (SoftLayer, IBM)    8.76 hours         43.2 minutes
99.99%                    52.56 minutes      4.32 minutes
99.999%                   5.26 minutes       25.9 seconds

http://aws.amazon.com/ec2/sla/
http://www.softlayer.com/about/service-level-agreement

Availability figures stated in SLAs are only the service provider's targets
Usually, when a target is not met, the provider refunds (part of) the fees

How deep is this hole?
app layer (core, db, cache)
data storage
operating system
hardware
networking
location
So we would like to achieve 99.9999%, which is about 30s of downtime per year
Even a proof of concept is very hard to deliver: that is about 5s of downtime per layer per year!
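
The "5s per layer" figure can be checked with a short sketch. It assumes the six layers are stacked in series (all must be up) and fail independently, which is a simplification. The layer names come from the slide; the function names are illustrative.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

LAYERS = ["app layer", "data storage", "operating system",
          "hardware", "networking", "location"]


def combined_availability(layer_availabilities):
    """Availability of a serial stack: every layer must be up at the same time."""
    return math.prod(layer_availabilities)


def per_layer_budget_seconds(target_availability, layer_count):
    """Split the yearly downtime budget evenly across the layers."""
    total_budget = (1 - target_availability) * SECONDS_PER_YEAR
    return total_budget / layer_count


if __name__ == "__main__":
    budget = per_layer_budget_seconds(0.999999, len(LAYERS))
    print(f"per-layer budget: {budget:.2f} s/year")
    # Six layers at "six nines" each no longer add up to six nines overall.
    print(f"stack availability: {combined_availability([0.999999] * 6):.6%}")
```

The second print shows why the hole is deep: availability multiplies down the stack, so each layer has to be better than the overall target.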

Load-balancing and failover

LB: http://www.netdigix.com/linux-loadbalancing.php
Failover: http://www.simplefailover.com/

LB – 4th layer or 7th?

4th layer:
- high performance
- just do the LB work!
- reliable
- scalable

7th layer:
- low cost
- good for quickfixes / patches
- not that scalable
- low performance
- complex codebase
- custom code for protocols
- cookies? what about memcache..

Disaster Recovery

http://disasterrecovery.starwindsoftware.com/planning-disaster-recovery-for-virtualized-environments

Hot site: active synchronization, could already be serving traffic. Cost can be high
Warm site: periodic synchronization, DR tests needed. Low cost
Cold site: nothing here – just an echo and some place to spin up services; a nightmare

Planning for failure
Everything starts here - DNS:
- keep TTLs low (300s). Can't get below 60 min? That's bad!
- check the SLA of your DNS servers (see the dnsmadeeasy.com history)
- what do you know about your DNS servers?
- zero downtime here is a must!
- this can be achieved with complicated network abracadabra
- remember what 99.9999% means?
- round robin is load balancing, but without failover!
- GSLB (Global Server Load Balancing) – killed by OS / browser / server caching
- GlobalIP (SoftLayer etc.) – workaround for GSLB via routing
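
Since round robin only spreads load, failover has to happen somewhere else, for example in the client. Below is a minimal sketch, assuming the client may simply try the next address when one fails. `resolve_all` and `connect_with_failover` are hypothetical helper names; only the `socket` calls are real standard-library APIs.

```python
import socket


def resolve_all(host, port):
    """Return every (ip, port) pair the resolver gives for host.

    With DNS round robin this is usually a list of several addresses.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Deduplicate while keeping the order returned by the resolver.
    return list(dict.fromkeys(info[4][:2] for info in infos))


def connect_with_failover(addresses, timeout=2.0, connect=socket.create_connection):
    """Try each address in turn and return the first connection that succeeds."""
    last_error = None
    for address in addresses:
        try:
            return connect(address, timeout)
        except OSError as error:
            last_error = error  # this record points at a dead host, try the next
    raise ConnectionError(f"all {len(addresses)} addresses failed") from last_error
```

A caller would combine the two, e.g. `connect_with_failover(resolve_all(host, 80))`. Cached records still limit how quickly a dead address disappears, which is why the low TTL matters.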

E-mail servers:
- it's as simple as MX records (delivering)
- it's almost as simple as a complicated system of SMTP servers (sending)
- it's not that simple once IMAP locking over a DFS is involved (reading)

5 gmail-smtp-in.l.google.com.
10 alt1.gmail-smtp-in.l.google.com.
20 alt2.gmail-smtp-in.l.google.com.
30 alt3.gmail-smtp-in.l.google.com.
40 alt4.gmail-smtp-in.l.google.com.
When MXing – watch the spam!
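
How a sending server walks such an MX list can be sketched as follows. The records are the ones shown above; `delivery_order`, `deliver` and the `try_host` callback are hypothetical names for this illustration. Real mail servers also randomize among hosts with equal preference, which this sketch skips.

```python
# (preference, host) pairs copied from the slide; lower preference is tried first.
MX_RECORDS = [
    (5, "gmail-smtp-in.l.google.com."),
    (10, "alt1.gmail-smtp-in.l.google.com."),
    (20, "alt2.gmail-smtp-in.l.google.com."),
    (30, "alt3.gmail-smtp-in.l.google.com."),
    (40, "alt4.gmail-smtp-in.l.google.com."),
]


def delivery_order(mx_records):
    """Return the MX hosts sorted by ascending preference value."""
    return [host for _, host in sorted(mx_records)]


def deliver(mx_records, try_host):
    """Attempt delivery host by host; return the host that accepted, else None.

    try_host is a caller-supplied function returning True on success.
    """
    for host in delivery_order(mx_records):
        if try_host(host):
            return host
    return None
```

If the primary is down, `deliver` falls through to `alt1` and so on, which is the whole HA story for inbound mail. Backup MX hosts also tend to attract spam, hence the warning above.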

Web servers:
- it's as simple as a frontend load balancer
- did you really stick user sessions to a particular server? Memcache!
- LB balancing algorithm
- how many LBs?
- what if the LB goes down?
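
One possible balancing algorithm, round robin that skips unhealthy backends, can be sketched like this. The class and backend names are made up, and health checking itself is left to the caller. The sketch assumes sessions live in a shared store such as Memcache, so any backend can serve any request.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn, skipping the ones marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")
```

This object is itself a single point of failure, which is exactly the "what if the LB goes down?" question.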

DB servers:
- it's.. not that simple
- replication (master – master? The app should be aware..)
- replication ring? Complicated, works, but in case of failure...
- let's talk about MySQL:
  - no-SPOF solution: MySQL Cluster
  - Galera Cluster for MySQL – synchronous, active-active multi-master
  - master – master – simply works
  - failover? Yoshinori Matsunobu's mysql-master-ha
  - MySQL Utilities (http://www.clusterdb.com/mysql/mysql-utilities-webinar-qa-replay-now-available/)
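
The core decision any failover tool has to make, which replica to promote, can be sketched as below. This is a simplified illustration and not how mysql-master-ha is implemented. The `Replica` type and its `applied_position` field are hypothetical stand-ins for real replication coordinates.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    reachable: bool
    applied_position: int  # hypothetical: how far into the master's log it has applied


def choose_new_master(replicas):
    """Pick the reachable replica that has applied the most of the master's log."""
    candidates = [replica for replica in replicas if replica.reachable]
    if not candidates:
        raise RuntimeError("no reachable replica to promote")
    return max(candidates, key=lambda replica: replica.applied_position)


if __name__ == "__main__":
    fleet = [
        Replica("db2", reachable=True, applied_position=1040),
        Replica("db3", reachable=True, applied_position=1075),
        Replica("db4", reachable=False, applied_position=1090),
    ]
    print(choose_new_master(fleet).name)
```

Whatever the promoted replica had not yet applied is lost unless it can be recovered from elsewhere, which ties the choice back to the RPO.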

Caching servers:
- this is a cache, for God's sake – why would we use HA here?
- just use a proper architecture like... redundancy.

Load balancers:
- remember to fail over the IP addresses!

Storage – DFSes:
- GlusterFS – we'll see it in action in a minute
- NFS? Could be – over some SAN / NAS (high-cost solution)
- CephFS – just like GlusterFS – it's great and does the job
- DRBD – lower level, works at the block-device layer – slow...

GlusterFS:
- low cost (could be..)
- distributed volumes
- replicated volumes
- striped volumes
- and...
  - distributed striped volumes
  - distributed replicated volumes
  - distributed striped replicated volumes
- sounds good? :)

GlusterFS: replicated volumes vs Geo-replication
- replicated:
  - mirrors data
  - provides HA
  - synchronous replication
- Geo-replication:
  - mirrors data across geo-distributed clusters
  - ensures data is backed up for DR
  - asynchronous replication (periodic checks)

HA for virtualization solutions?
- it's really complicated, like...

Tools
The most important tool would be the conclusion from the picture below:

- DNS: round robin, GSLB, low TTLs, GlobalIP
- Load balancers (L7, stateless services): HAProxy, Pound, Nginx
- Failover (stateful services):
  - IP: Keepalived + sysctl
  - Managing: Pacemaker (manager) + Corosync (messaging)
- (almost) All-in-one: Linux Virtual Server
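
The idea behind IP failover can be sketched as a priority election: the highest-priority node that is still alive holds the virtual IP. This is only a toy model of that idea, not Keepalived's configuration or API. The node names and priorities are invented, and tie-breaking between equal priorities is not modelled.

```python
def elect_master(nodes):
    """Return the name of the alive node with the highest priority, or None.

    nodes maps a node name to a (priority, is_alive) tuple.
    """
    alive = {name: priority for name, (priority, is_alive) in nodes.items() if is_alive}
    if not alive:
        return None
    return max(alive, key=alive.get)


if __name__ == "__main__":
    cluster = {"lb1": (150, True), "lb2": (100, True)}
    print(elect_master(cluster))   # lb1 holds the virtual IP
    cluster["lb1"] = (150, False)  # lb1 stops answering
    print(elect_master(cluster))   # lb2 takes the virtual IP over
```

Clients keep talking to the same virtual IP throughout, so the switch needs no DNS change.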

Turn on HA thinking!
Main goal of HA? Improve user experience!
- keep the app fully functional
- keep the app resilient and fault-tolerant
- provide a method for a successful audit
- sleep well (anyone awake?) ;)

Thank you :)
High Availability Explained
Maciej Lasyk
Kraków, devOPS meetup #2
2014-01-28
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net

Maciej Lasyk, High Availability Explained

14/14

More Related Content

What's hot (20)

PPT
7 Stages of Scaling Web Applications
David Mitzenmacher
 
PPTX
Vmware training presentation
Amit Kapadia
 
PDF
Penetration Testing Tutorial | Penetration Testing Tools | Cyber Security Tra...
Edureka!
 
PDF
azure-security-overview-slideshare-180419183626.pdf
BenAissaTaher1
 
PPTX
Cloud Security
AWS User Group Bengaluru
 
PDF
MySQL Database Architectures - InnoDB ReplicaSet & Cluster
Kenny Gryp
 
PDF
Solving the Asset Management Challenge for Cybersecurity (It’s About Time)
Enterprise Management Associates
 
PPTX
Azure Security Overview
Allen Brokken
 
PDF
Virtualization Architecture & KVM
Pradeep Kumar
 
PDF
Cloud-Native Security
VMware Tanzu
 
PPTX
HSM (Hardware Security Module)
Umesh Kolhe
 
PDF
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE - ATT&CKcon
 
PDF
PHDays 2018 Threat Hunting Hands-On Lab
Teymur Kheirkhabarov
 
PPTX
Threat Hunting with Splunk Hands-on
Splunk
 
PPTX
Bypass pfsense
SalmenHAJJI1
 
PPTX
Adversary Emulation using CALDERA
Erik Van Buggenhout
 
PDF
Meraki SD-WAN.pdf
ssuser7ba9b8
 
PPT
Automated Penetration Testing With The Metasploit Framework
Tom Eston
 
PDF
SOC Architecture - Building the NextGen SOC
Priyanka Aash
 
PDF
Hacking identity: A Pen Tester's Guide to IAM
Jerod Brennen
 
7 Stages of Scaling Web Applications
David Mitzenmacher
 
Vmware training presentation
Amit Kapadia
 
Penetration Testing Tutorial | Penetration Testing Tools | Cyber Security Tra...
Edureka!
 
azure-security-overview-slideshare-180419183626.pdf
BenAissaTaher1
 
Cloud Security
AWS User Group Bengaluru
 
MySQL Database Architectures - InnoDB ReplicaSet & Cluster
Kenny Gryp
 
Solving the Asset Management Challenge for Cybersecurity (It’s About Time)
Enterprise Management Associates
 
Azure Security Overview
Allen Brokken
 
Virtualization Architecture & KVM
Pradeep Kumar
 
Cloud-Native Security
VMware Tanzu
 
HSM (Hardware Security Module)
Umesh Kolhe
 
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE - ATT&CKcon
 
PHDays 2018 Threat Hunting Hands-On Lab
Teymur Kheirkhabarov
 
Threat Hunting with Splunk Hands-on
Splunk
 
Bypass pfsense
SalmenHAJJI1
 
Adversary Emulation using CALDERA
Erik Van Buggenhout
 
Meraki SD-WAN.pdf
ssuser7ba9b8
 
Automated Penetration Testing With The Metasploit Framework
Tom Eston
 
SOC Architecture - Building the NextGen SOC
Priyanka Aash
 
Hacking identity: A Pen Tester's Guide to IAM
Jerod Brennen
 

Similar to High Availability (HA) Explained (20)

ODP
High Availability (HA) Explained - second edition
Maciej Lasyk
 
PPTX
Resiliency vs High Availability vs Fault Tolerance vs Reliability
jeetendra mandal
 
ODP
High availability in IT: AAAARGH
Mattias Geniar
 
PDF
Operating a Highly Available Cloud Service
Depankar Neogi
 
PDF
Hidden Costs of Chasing the Mythical 'Five Nines'
DevOpsDays DFW
 
PDF
Webinar slides: How to Measure Database Availability?
Severalnines
 
PPTX
04. availability-concepts
Muhammad Ahad
 
PDF
KoprowskiT_SQLSat152_Bulgaria_HighAvailabilityOfSQLintheContextOfSLA
Tobias Koprowski
 
PDF
High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC...
Twilio Inc
 
PPTX
HA & DR System Design - Concepts and Solution
Continuity and Resilience
 
PPTX
Designing High Available Cloud Applications
Giovanni Mazzeo
 
PPTX
Availability conceptin operating system.
nayabimran31
 
PPTX
PriyaDharshini distributed operating system
PriyadharshiniVS
 
PPTX
Vanmathy distributed operating system
PriyadharshiniVS
 
PDF
(Kishore Jalleda) Launching products at massive scale - the DevOps way
kjalleda
 
PDF
[@IndeedEng] Redundant Array of Inexpensive Datacenters
indeedeng
 
PDF
Serverless Days Helsinki 2019 Rolf Koski - Business Driven Availability
Rolf Koski
 
PDF
AVAILABILITY METRICS: UNDER CONTROLLED ENVIRONMENTS FOR WEB SERVICES
ijwscjournal
 
PDF
AVAILABILITY METRICS: UNDER CONTROLLED ENVIRONMENTS FOR WEB SERVICES
ijwscjournal
 
PDF
AVAILABILITY METRICS: UNDER CONTROLLED ENVIRONMENTS FOR WEB SERVICES
ijwscjournal
 
High Availability (HA) Explained - second edition
Maciej Lasyk
 
Resiliency vs High Availability vs Fault Tolerance vs Reliability
jeetendra mandal
 
High availability in IT: AAAARGH
Mattias Geniar
 
Operating a Highly Available Cloud Service
Depankar Neogi
 
Hidden Costs of Chasing the Mythical 'Five Nines'
DevOpsDays DFW
 
Webinar slides: How to Measure Database Availability?
Severalnines
 
04. availability-concepts
Muhammad Ahad
 
KoprowskiT_SQLSat152_Bulgaria_HighAvailabilityOfSQLintheContextOfSLA
Tobias Koprowski
 
High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC...
Twilio Inc
 
HA & DR System Design - Concepts and Solution
Continuity and Resilience
 
Designing High Available Cloud Applications
Giovanni Mazzeo
 
Availability conceptin operating system.
nayabimran31
 
PriyaDharshini distributed operating system
PriyadharshiniVS
 
Vanmathy distributed operating system
PriyadharshiniVS
 
(Kishore Jalleda) Launching products at massive scale - the DevOps way
kjalleda
 
[@IndeedEng] Redundant Array of Inexpensive Datacenters
indeedeng
 
Serverless Days Helsinki 2019 Rolf Koski - Business Driven Availability
Rolf Koski
 
AVAILABILITY METRICS: UNDER CONTROLLED ENVIRONMENTS FOR WEB SERVICES
ijwscjournal
 
AVAILABILITY METRICS: UNDER CONTROLLED ENVIRONMENTS FOR WEB SERVICES
ijwscjournal
 
AVAILABILITY METRICS: UNDER CONTROLLED ENVIRONMENTS FOR WEB SERVICES
ijwscjournal
 
Ad

More from Maciej Lasyk (20)

PDF
Rundeck & Ansible
Maciej Lasyk
 
PDF
Docker 1.11
Maciej Lasyk
 
ODP
Programowanie AWSa z CLI, boto, Ansiblem i libcloudem
Maciej Lasyk
 
ODP
Co powinieneś wiedzieć na temat devops?f
Maciej Lasyk
 
ODP
"Containers do not contain"
Maciej Lasyk
 
PDF
Git Submodules
Maciej Lasyk
 
ODP
Linux containers & Devops
Maciej Lasyk
 
PDF
Under the Dome (of failure driven pipeline)
Maciej Lasyk
 
PDF
Continuous Security in DevOps
Maciej Lasyk
 
ODP
About cultural change w/Devops
Maciej Lasyk
 
ODP
Orchestrating docker containers at scale (#DockerKRK edition)
Maciej Lasyk
 
ODP
Orchestrating docker containers at scale (PJUG edition)
Maciej Lasyk
 
PDF
Orchestrating Docker containers at scale
Maciej Lasyk
 
ODP
Ghost in the shell
Maciej Lasyk
 
ODP
Scaling and securing node.js apps
Maciej Lasyk
 
ODP
Node.js security
Maciej Lasyk
 
ODP
Monitoring with Nagios and Ganglia
Maciej Lasyk
 
PDF
Stop disabling SELinux!
Maciej Lasyk
 
ODP
RHEL/Fedora + Docker (and SELinux)
Maciej Lasyk
 
PPTX
Shall we play a game? PL version
Maciej Lasyk
 
Rundeck & Ansible
Maciej Lasyk
 
Docker 1.11
Maciej Lasyk
 
Programowanie AWSa z CLI, boto, Ansiblem i libcloudem
Maciej Lasyk
 
Co powinieneś wiedzieć na temat devops?f
Maciej Lasyk
 
"Containers do not contain"
Maciej Lasyk
 
Git Submodules
Maciej Lasyk
 
Linux containers & Devops
Maciej Lasyk
 
Under the Dome (of failure driven pipeline)
Maciej Lasyk
 
Continuous Security in DevOps
Maciej Lasyk
 
About cultural change w/Devops
Maciej Lasyk
 
Orchestrating docker containers at scale (#DockerKRK edition)
Maciej Lasyk
 
Orchestrating docker containers at scale (PJUG edition)
Maciej Lasyk
 
Orchestrating Docker containers at scale
Maciej Lasyk
 
Ghost in the shell
Maciej Lasyk
 
Scaling and securing node.js apps
Maciej Lasyk
 
Node.js security
Maciej Lasyk
 
Monitoring with Nagios and Ganglia
Maciej Lasyk
 
Stop disabling SELinux!
Maciej Lasyk
 
RHEL/Fedora + Docker (and SELinux)
Maciej Lasyk
 
Shall we play a game? PL version
Maciej Lasyk
 
Ad

Recently uploaded (20)

PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
Google I/O Extended 2025 Baku - all ppts
HusseinMalikMammadli
 
PDF
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
PPTX
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
PPTX
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
PDF
Market Insight : ETH Dominance Returns
CIFDAQ
 
PPTX
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
PDF
TrustArc Webinar - Navigating Data Privacy in LATAM: Laws, Trends, and Compli...
TrustArc
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
RAT Builders - How to Catch Them All [DeepSec 2024]
malmoeb
 
PDF
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
PDF
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
PPTX
AVL ( audio, visuals or led ), technology.
Rajeshwri Panchal
 
PDF
Researching The Best Chat SDK Providers in 2025
Ray Fields
 
PDF
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PDF
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Google I/O Extended 2025 Baku - all ppts
HusseinMalikMammadli
 
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
Market Insight : ETH Dominance Returns
CIFDAQ
 
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
TrustArc Webinar - Navigating Data Privacy in LATAM: Laws, Trends, and Compli...
TrustArc
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
RAT Builders - How to Catch Them All [DeepSec 2024]
malmoeb
 
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
AVL ( audio, visuals or led ), technology.
Rajeshwri Panchal
 
Researching The Best Chat SDK Providers in 2025
Ray Fields
 
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 

High Availability (HA) Explained

  • 1. High Availability Explained Maciej Lasyk Kraków, devOPS meetup #2 2014-01-28 Maciej Lasyk, High Availability Explained 1/14
  • 2. “Anything that can go wrong, will go wrong” Murphy's law Maciej Lasyk, High Availability Explained 2/14
  • 3. “Anything that can go wrong, will go wrong” Murphy's law Maciej Lasyk, High Availability Explained 2/14
  • 4. “Anything that can go wrong, will go wrong” Murphy's law An electrical explosion and fire Saturday at a Houston data center operated by The Planet has taken the entire facility offline. The company claimed power to the facility was interrupted when a transformer exploded. Official reports that three walls were blown down causing a fire. Maciej Lasyk, High Availability Explained 2/14
  • 5. “Anything that can go wrong, will go wrong” Murphy's law An electrical explosion and fire Saturday at a Houston data center operated by The Planet has taken the entire facility offline. The company claimed power to the facility was interrupted when a transformer exploded. Official reports that three walls were blown down causing a fire. Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed. Maciej Lasyk, High Availability Explained 2/14
  • 6. High Availability is in the eye of the beholder Maciej Lasyk, High Availability Explained 3/14
  • 7. High Availability is in the eye of the beholder CEO: we don't loose sales Maciej Lasyk, High Availability Explained 3/14
  • 8. High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Maciej Lasyk, High Availability Explained 3/14
  • 9. High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Maciej Lasyk, High Availability Explained 3/14
  • 10. High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Developers: we can be proud – our services are working ;) Maciej Lasyk, High Availability Explained 3/14
  • 11. High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Developers: we can be proud – our services are working ;) System engineers: we can sleep well (and fsck, we love to!) Maciej Lasyk, High Availability Explained 3/14
  • 12. High Availability is in the eye of the beholder CEO: we don't loose sales Sales: we can extend our offer basing on HA level Accounts managers: we don't upset our customers (that often) Developers: we can be proud – our services are working ;) System engineers: we can sleep well (and fsck, we love to!) Technical support: no calls? Back to WoW then.. ;) Maciej Lasyk, High Availability Explained 3/14
  • 13. So how many 9's? Maciej Lasyk, High Availability Explained 4/14
  • 14. So how many 9's? Maciej Lasyk, High Availability Explained 4/14
  • 15. So how many 9's? Monthly: 1 hour of outage means 100% - 0.13888 ~= 99.86112 of availability Maciej Lasyk, High Availability Explained 4/14
  • 16. So how many 9's? Monthly: 1 hour of outage means 100% - 0.13888 ~= 99.86112 of availability Yearly: 1 hour of outage means 100% - 0.01142 ~= 99.98858 of availability Maciej Lasyk, High Availability Explained 4/14
  • 17. So how many 9's? Monthly: 1 hour of outage means 100% - 0.13888 ~= 99.86112 of availability Yearly: 1 hour of outage means 100% - 0.01142 ~= 99.98858 of availability Availability Downtime (year) Downtime (month) 90% (“one nine”) 36.5 days 72 hours 95% 18.25 days 36 hours 97% 10.96 days 21.6 hours 98% 7.30 days 14.4 hours 99% (“two nines”) 3.65 days 7.2 hours 99.5% 1.83 days 3.6 hours 99.8% 17.52 hours 86.23 minutes 99.9% (“three nines”) 4.38 hours 21.56 minutes 99.99 (“four nines”) 52.56 minutes 4.32 minutes 99.999 (“five nines”) 5.26 minutes 25.9 seconds Maciej Lasyk, High Availability Explained 4/14
  • 18. So how many 9's? https://jazz.net/wiki/bin/view/Deployment/HighAvailability Maciej Lasyk, High Availability Explained 4/14
  • 19. HA terminology RPO: Recovery Point Objective; how much data can we loose? Maciej Lasyk, High Availability Explained 5/14
  • 20. HA terminology RPO: Recovery Point Objective; how much data can we loose? RTO: Recovery Time Objective; how long does it take to recover? Maciej Lasyk, High Availability Explained 5/14
  • 21. HA terminology RPO: Recovery Point Objective; how much data can we loose? RTO: Recovery Time Objective; how long does it take to recover? MTBF: Mean-Times-Between-Failures; time between failures (density fnc -> reliability fnc) https://en.wikipedia.org/wiki/Mean_time_between_failures Maciej Lasyk, High Availability Explained 5/14
  • 22. HA terminology SLA: Service Level Agreement; formal definitions (customer <-> provider) Maciej Lasyk, High Availability Explained 5/14
  • 23. HA terminology SLA: Service Level Agreement; formal definitions (customer <-> provider) OLA: Operational Level Agreement; definitions within organization; help us keeping provided SLAs Maciej Lasyk, High Availability Explained 5/14
  • 24. SLAs.. So what is written in SLAs? Availability Downtime (year) Downtime (month) 90% 36.5 days 72 hours 95% 18.25 days 36 hours 97% 10.96 days 21.6 hours 98% 7.30 days 14.4 hours 99% 3.65 days 7.2 hours 99.5% (EC2, EBS) 1.83 days 3.6 hours 99.8% 17.52 hours 86.23 minutes 99.9% (SoftLayer, IBM) 4.38 hours 21.56 minutes 99.99 52.56 minutes 4.32 minutes 99.999 5.26 minutes 25.9 seconds Maciej Lasyk, High Availability Explained 5/14
  • 25. SLAs.. So what is written in SLAs? Availability Downtime (year) Downtime (month) 90% 36.5 days 72 hours 95% 18.25 days 36 hours 97% 10.96 days 21.6 hours 98% 7.30 days 14.4 hours 99% 3.65 days 7.2 hours 99.5% (EC2, EBS) 1.83 days 3.6 hours 99.8% 17.52 hours 86.23 minutes 99.9% (SoftLayer, IBM) 4.38 hours 21.56 minutes 99.99 52.56 minutes 4.32 minutes 99.999 5.26 minutes 25.9 seconds http://aws.amazon.com/ec2/sla/ http://www.softlayer.com/about/service-level-agreement Maciej Lasyk, High Availability Explained 5/14
  • 26. SLAs.. Availability mentioned in SLAs are only goals of service provider Usually when it's not met than company pays off the fees Maciej Lasyk, High Availability Explained 5/14
  • 27. How deep is this hole? app layer (core, db, cache) data storage operating system hardware networking location So we would like to achieve 99,9999% which is about 30s of downtime per year Maciej Lasyk, High Availability Explained 6/14
  • 28. How deep is this hole? app layer (core, db, cache) data storage operating system hardware networking location Even Proof of Concept is very hard to provide: 5s of downtime per layer yearly! Maciej Lasyk, High Availability Explained 6/14
  • 31. LB: 4th layer or 7th?
      4th layer:
      - high performance
      - just do the LB work!
      - reliable
      - scalable
      7th layer:
      - low cost
      - good for quickfixes / patches
      - not that scalable
      - low performance
      - complex codebase
      - custom code for protocols
      - cookies? what about memcache..
  • 32. Disaster Recovery
  • 34. Disaster Recovery
      http://disasterrecovery.starwindsoftware.com/planning-disaster-recovery-for-virtualized-environments
      - Hot site: active synchronization, could be serving services. Cost can be high
      - Warm site: periodic synchronization, DR tests needed. Low costs
      - Cold site: nothing here, just echo and some place to spin up services; nightmare
  • 35. Planning for failure
  • 36. Planning for failure
      Everything starts here: DNS
      - keep TTLs low (300s). Can't get below 60min? That's bad!
      - check the SLA of your DNS servers (dnsmadeeasy.com history)
      - what do you know about DNSes?
      - zero downtime here is a must!
      - this can be achieved with complicated network abracadabra
      - remember what 99.9999% means?
      - round robin is a load balancer, but without failover!
      - GSLB (Global Server Load Balancing): killed by OS/browser/server caching
      - GlobalIP (SoftLayer etc.): workaround for GSLB via routing
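Why low TTLs alone do not deliver six nines can be shown with a little arithmetic. The sketch assumes the worst case, in which a client keeps a stale record for one full TTL after a DNS change; the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_budget(availability):
    """Downtime budget in seconds per year for an availability fraction."""
    return SECONDS_PER_YEAR * (1 - availability)


def failovers_within_budget(ttl_seconds, availability):
    """How many full-TTL outages fit into the yearly downtime budget."""
    return int(yearly_budget(availability) // ttl_seconds)


for availability in (0.9999, 0.999999):
    count = failovers_within_budget(300, availability)
    print(f"{availability:.4%} with TTL 300s: {count} DNS failovers per year")
```

Under these assumptions a single DNS-based failover with a 300s TTL already overshoots the 99.9999% budget, which is one way to read the "zero downtime here is a must" bullet.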
  • 37. Planning for failure
      E-mail servers:
      - it's as simple as MX records (delivering)
      - it's almost as simple as a complicated system of SMTP servers (sending)
      - it's not that simple with IMAP locking over a DFS (reading)
      Example MX records (preference, host):
         5 gmail-smtp-in.l.google.com.
        10 alt1.gmail-smtp-in.l.google.com.
        20 alt2.gmail-smtp-in.l.google.com.
        30 alt3.gmail-smtp-in.l.google.com.
        40 alt4.gmail-smtp-in.l.google.com.
      When MXing, watch the spam!
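How a sending server uses such a list can be sketched as follows: lower preference values are tried first, and the next host is used only when the previous one cannot be reached. The `is_reachable` callback is a hypothetical stand-in for a real SMTP connection attempt, and real senders also randomize among hosts with equal preference, which this sketch skips.

```python
MX_RECORDS = [
    (5, "gmail-smtp-in.l.google.com."),
    (10, "alt1.gmail-smtp-in.l.google.com."),
    (20, "alt2.gmail-smtp-in.l.google.com."),
    (30, "alt3.gmail-smtp-in.l.google.com."),
    (40, "alt4.gmail-smtp-in.l.google.com."),
]


def pick_mail_server(mx_records, is_reachable):
    """Return the reachable host with the lowest preference value, or None."""
    for _preference, host in sorted(mx_records):
        if is_reachable(host):
            return host
    return None


# Simulate the primary host being down.
down = {"gmail-smtp-in.l.google.com."}
print(pick_mail_server(MX_RECORDS, lambda host: host not in down))
```

This is why delivery is the easy part: the failover logic lives in every sender, so the receiving side only has to publish more than one record.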
  • 38. Planning for failure
      WEB servers:
      - it's as simple as some frontend load balancer
      - did you really stick user sessions to a particular server? Memcache!
      - LB balancing algorithm
      - how many LBs?
      - what if the LB goes down?
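As one possible balancing algorithm, here is a minimal round-robin sketch that skips backends marked as down. The class and method names are invented for illustration; real balancers such as HAProxy or Nginx run their own health checks and offer more algorithms.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotate over backends, skipping the ones currently marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")
print([lb.next_backend() for _ in range(4)])
```

Because any request may land on any backend, session data has to live outside the web servers (for example in Memcache), and the balancer itself still needs a standby, which is what the last two bullets ask about.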
  • 39. Planning for failure
      DB servers:
      - it's.. not that simple
      - replication (master-master? The app should be aware..)
      - replication ring? Complicated, works, but in case of failure...
      - let's talk about MySQL:
        - no-SPOF solution: MySQL Cluster
        - MySQL Galera cluster: synchronous, active-active multi-master
        - master-master: simply works
        - failover? Yoshinori Matsunobu's mysql-master-ha
        - MySQL utilities (http://www.clusterdb.com/mysql/mysql-utilities-webinar-qa-replay-now-available/)
  • 42. Planning for failure
      Caching servers:
      - this is cache, for God's sake; why would we use HA here?
      - just use a proper architecture like... redundancy.
      Load balancers:
      - remember to fail over the IP addresses!
      Storage (DFSes):
      - GlusterFS: we'll see it in action in a minute
      - NFS? Could be, over some SAN / NAS (high-cost solution)
      - CephFS: just like GlusterFS, it's great and does the work
      - DRBD: lower level, does the work at the block-device layer; slow...
  • 43. Planning for failure
      GlusterFS:
      - low cost (could be..)
      - distributed volumes
      - replicated volumes
      - striped volumes
      - and...
      - distributed-striped volumes
      - distributed-replicated volumes
      - distributed-striped-replicated volumes
      - sounds good? :)
  • 44. Planning for failure
      GlusterFS: replicated volumes vs Geo-replication
      - replicated:
        - mirrors data
        - provides HA
        - synchronous replication
      - Geo-replication:
        - mirrors data across geo-distributed clusters
        - ensures data is backed up for DR
        - asynchronous replication (periodic checks)
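The difference between the two modes can be illustrated with a toy model. This is not how GlusterFS is implemented; the classes below are hypothetical and only show why synchronous mirroring gives HA while asynchronous mirroring leaves a window of unreplicated data.

```python
class ReplicatedVolume:
    """Synchronous mirror: a write returns only after every replica has it."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, path, data):
        for replica in self.replicas:
            replica[path] = data


class GeoReplicatedVolume:
    """Asynchronous mirror: writes land locally, the remote site catches up later."""

    def __init__(self):
        self.local = {}
        self.remote = {}
        self.pending = []

    def write(self, path, data):
        self.local[path] = data
        self.pending.append((path, data))

    def sync(self):
        """Periodic job that ships queued changes to the remote site."""
        for path, data in self.pending:
            self.remote[path] = data
        self.pending.clear()


geo = GeoReplicatedVolume()
geo.write("/data/report.txt", "v1")
print("before sync:", geo.remote)   # the remote site does not have the file yet
geo.sync()
print("after sync:", geo.remote)
```

Anything written between two `sync()` runs would be lost if the local site disappeared, which is why the asynchronous mode is framed as backup for DR rather than as HA.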
  • 45. Planning for failure
      HA for virtualization solutions?
      - it's really complicated, like...
  • 47. Tools
      The most important tool would be the conclusion from the picture below:
  • 54. Tools
      - DNS: round robin, GSLB, low TTLs, GlobalIP
      - Load balancers (L7, stateless services): HAProxy, Pound, Nginx
      - Failover (stateful services):
        - IP: Keepalived + sysctl
        - Managing: Pacemaker (manager) + Corosync (messaging)
      - (almost) All-In-One: Linux Virtual Server
  • 55. Turn on HA thinking!
      Main goal of HA? Improve user experience!
      - keep the app fully functional
      - keep the app resistant and tolerant to faults
      - provide a method for a successful audit
      - sleep well (anyone awake?) ;)
  • 56. Thank you :)
      High Availability Explained
      Maciej Lasyk
      Kraków, devOPS meetup #2, 2014-01-28
      http://maciek.lasyk.info/sysop
      @docent-net