High Availability (HA) Explained

High Availability Explained
Maciej Lasyk
Kraków, devOPS meetup #2
2014-01-28


“Anything that can go wrong, will go wrong”
Murphy's law



Murphy's law

An electrical explosion and fire Saturday at a Houston data
center operated by The Planet has taken the entire facility offline.
The company claimed power to the facility was interrupted when a
transformer exploded. Official reports say that three walls were blown
down, causing a fire.

Three walls of the electrical equipment room on the first floor
blew several feet from their original position, and the underground
cabling that powers the first floor of H1 was destroyed.



High Availability is in the eye of the beholder


CEO: we don't lose sales
Sales: we can extend our offer based on the HA level
Account managers: we don't upset our customers (that often)
Developers: we can be proud – our services are working ;)
System engineers: we can sleep well (and fsck, we love to!)
Technical support: no calls? Back to WoW then.. ;)

So how many 9's?


Monthly: 1 hour of outage means 100% - 0.13888% ~= 99.86112% availability
Yearly: 1 hour of outage means 100% - 0.01142% ~= 99.98858% availability

Availability               Downtime (year)    Downtime (month)
90% (“one nine”)           36.5 days          72 hours
95%                        18.25 days         36 hours
97%                        10.96 days         21.6 hours
98%                        7.30 days          14.4 hours
99% (“two nines”)          3.65 days          7.2 hours
99.5%                      1.83 days          3.6 hours
99.8%                      17.52 hours        86.4 minutes
99.9% (“three nines”)      8.76 hours         43.2 minutes
99.99% (“four nines”)      52.56 minutes      4.32 minutes
99.999% (“five nines”)     5.26 minutes       25.9 seconds
(monthly figures assume a 30-day month)

https://jazz.net/wiki/bin/view/Deployment/HighAvailability
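
The downtime figures above follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year and a 30-day month (the function and constant names are illustrative, not from the talk):

```python
# Assumptions: non-leap year and 30-day month.
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600


def downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Allowed downtime (seconds) in a period for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * period_seconds


def availability_percent(outage_seconds: float, period_seconds: int) -> float:
    """Availability (percent) given the total outage time in a period."""
    return 100 * (1 - outage_seconds / period_seconds)


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        yearly = downtime_seconds(nines, SECONDS_PER_YEAR) / 60
        monthly = downtime_seconds(nines, SECONDS_PER_MONTH) / 60
        print(f"{nines}%: {yearly:.2f} min/year, {monthly:.2f} min/month")
```

The same helper answers the reverse question: `availability_percent(3600, SECONDS_PER_MONTH)` gives the availability left after a one-hour outage in a month.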



HA terminology
RPO: Recovery Point Objective; how much data can we lose?
RTO: Recovery Time Objective; how long may it take to recover?
MTBF: Mean Time Between Failures; average time between failures
(density function -> reliability function)
https://en.wikipedia.org/wiki/Mean_time_between_failures
SLA: Service Level Agreement;
formal definitions (customer <-> provider)
OLA: Operational Level Agreement; definitions within the organization;
they help us keep the SLAs we provide
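
RPO and RTO can be checked against a concrete incident. A minimal sketch, assuming we know three timestamps per incident; the `Incident` class and its field names are hypothetical and only illustrate the two definitions:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record of one outage; field names are assumptions.
    last_good_copy: datetime  # newest backup/replica point that survived
    failure: datetime         # when the service went down
    restored: datetime        # when the service was back


def data_loss_window(incident: Incident) -> timedelta:
    """Data written in this window is lost (compare with the RPO)."""
    return incident.failure - incident.last_good_copy


def recovery_time(incident: Incident) -> timedelta:
    """How long the recovery took (compare with the RTO)."""
    return incident.restored - incident.failure


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    """True when both the data loss and the recovery time are within limits."""
    return data_loss_window(incident) <= rpo and recovery_time(incident) <= rto
```

For example, an incident with the last good copy at 11:45, failure at 12:00 and restore at 12:40 loses 15 minutes of data and takes 40 minutes to recover, so it meets an RPO of 30 minutes and an RTO of 1 hour.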

SLAs..
So what is written in SLAs?
Availability               Downtime (year)    Downtime (month)
90%                        36.5 days          72 hours
95%                        18.25 days         36 hours
97%                        10.96 days         21.6 hours
98%                        7.30 days          14.4 hours
99%                        3.65 days          7.2 hours
99.5% (EC2, EBS)           1.83 days          3.6 hours
99.8%                      17.52 hours        86.4 minutes
99.9% (SoftLayer, IBM)     8.76 hours         43.2 minutes
99.99%                     52.56 minutes      4.32 minutes
99.999%                    5.26 minutes       25.9 seconds

http://aws.amazon.com/ec2/sla/
http://www.softlayer.com/about/service-level-agreement

The availability levels stated in SLAs are only the service provider's goals
Usually, when a goal is not met, the provider refunds part of the fees

How deep is this hole?
app layer (core, db, cache)
data storage
operating system
hardware
networking
location
So we would like to achieve 99.9999%, which is about 30s of downtime per year
Even a Proof of Concept is very hard to provide: 5s of downtime per layer yearly!
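
The "5s per layer" figure comes from splitting the yearly budget across the six layers. A rough sketch of that calculation, assuming the layers are in series (the service is up only when every layer is up) and fail independently; both are simplifying assumptions, and the function names are illustrative:

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year

LAYERS = ["app layer", "data storage", "operating system",
          "hardware", "networking", "location"]


def combined_availability(layer_availabilities):
    """Availability of layers in series: the product of all layers."""
    return math.prod(layer_availabilities)


def required_per_layer(target: float, layer_count: int) -> float:
    """Equal per-layer availability needed for the product to hit target."""
    return target ** (1 / layer_count)


def yearly_downtime_seconds(availability: float) -> float:
    """Downtime per year (seconds) for an availability given as a fraction."""
    return (1 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    per_layer = required_per_layer(0.999999, len(LAYERS))
    print(f"total budget: {yearly_downtime_seconds(0.999999):.1f} s/year")
    print(f"per layer:    {yearly_downtime_seconds(per_layer):.2f} s/year")
```

Under these assumptions, six layers sharing a 99.9999% target each get a little over 5 seconds of downtime per year.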

Load-balancing and failover

LB:
http://www.netdigix.com/linux-loadbalancing.php

Failover:
http://www.simplefailover.com/

LB – 4th layer or 7th?

4th layer:
- high performance
- low cost
- just do the LB work!
- reliable
- not that scalable

7th layer:
- good for quickfixes / patches
- scalable
- low performance
- complex codebase
- custom code for protocols
- cookies? what about memcache..

Disaster Recovery


http://disasterrecovery.starwindsoftware.com/planning-disaster-recovery-for-virtualized-environments

Hot site: active synchronization, could be serving services. Cost can be high
Warm site: periodic synchronization, DR tests needed. Low costs
Cold site: nothing here – just an echo and some place to spin up services; a nightmare

Planning for failure



Everything starts here - DNS:
- keep TTLs low (300s). Can't get them under 60min? That's bad!
- check the SLA of your DNS servers (dnsmadeeasy.com history)
- what do you know about DNS servers?
- zero downtime here is a must!
- this can be achieved with complicated network abracadabra
- remember what 99.9999% means?
- round robin is a load balancer, but without failover!
- GSLB (Global Server Load Balancing) – killed by OS/browser/server caching
- GlobalIP (SoftLayer etc) – workaround for GSLB via routing
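
Why round robin balances load but does not fail over can be shown with a toy model. This is a simplified sketch, not how any real resolver behaves: it assumes clients walk the address list in strict rotation and ignores caching. The addresses are documentation-range placeholders.

```python
from itertools import cycle


def round_robin_failures(addresses, down, requests):
    """Count requests that plain round robin sends to dead addresses.

    Nothing here checks health: the rotation keeps handing out every
    address, including the ones listed in `down`.
    """
    rotation = cycle(addresses)
    return sum(1 for _ in range(requests) if next(rotation) in down)


if __name__ == "__main__":
    # Hypothetical A records (192.0.2.0/24 is reserved for documentation).
    addrs = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    failed = round_robin_failures(addrs, down={"192.0.2.11"}, requests=300)
    print(f"{failed} of 300 requests hit the dead server")
```

With one of three servers down, a third of the requests still go to it until somebody changes the records and the low TTL expires.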

E-mail servers:
- it's as simple as MX records (delivering)
- it's almost as simple as a complicated system of SMTP servers (sending)
- it's not that simple with IMAP locking over a DFS (reading)

5 gmail-smtp-in.l.google.com.
10 alt1.gmail-smtp-in.l.google.com.
When MXing – watch the spam!
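
The two MX records above give failover on delivery: the sender tries the lowest preference first and falls back to the next one. A minimal sketch of that ordering; `try_host` is a hypothetical callback standing in for a real SMTP delivery attempt, and ties between equal preferences are simply sorted by name here, whereas real senders are expected to randomize them:

```python
def delivery_order(mx_records):
    """Return hosts sorted by MX preference (lower is tried first)."""
    return [host for _, host in sorted(mx_records)]


def deliver(mx_records, try_host):
    """Try each MX host in order; return the first that accepts, else None.

    try_host is an assumed callable: host -> bool (True on success).
    """
    for host in delivery_order(mx_records):
        if try_host(host):
            return host
    return None


if __name__ == "__main__":
    records = [(10, "alt1.gmail-smtp-in.l.google.com."),
               (5, "gmail-smtp-in.l.google.com.")]
    print(delivery_order(records))
```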



WEB servers:
- it's as simple as some frontend load balancer
- did you really stick the user session to a particular server? Memcache!
- LB balancing algorithm
- how many LBs?
- what if the LB goes down?
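
The point about sessions can be sketched in a few lines: when session data lives in a shared store instead of on one web server, any backend can serve any request and losing a backend does not lose the session. The classes below are hypothetical; `SharedSessionStore` is an in-memory stand-in for a shared cache such as memcached, not a client for it.

```python
from itertools import cycle


class SharedSessionStore:
    """In-memory stand-in (assumption) for a shared cache like memcached."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def set(self, session_id, session):
        self._data[session_id] = session


class Backend:
    """A web server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, item):
        # Read the session from the shared store, update it, write it back.
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.set(session_id, {"cart": cart})
        return self.name, cart


if __name__ == "__main__":
    store = SharedSessionStore()
    backends = cycle([Backend("web1", store), Backend("web2", store)])
    print(next(backends).handle("abc", "book"))
    print(next(backends).handle("abc", "pen"))
```

The second request lands on a different backend and still sees the item added by the first one, which is what makes sticky sessions unnecessary.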

DB servers:
- it's.. not that simple
- replication (master – master? The app should be aware..)
- replication ring? Complicated, works, but in case of failure...
- let's talk about MySQL:
  - no-SPOF solution: MySQL Cluster
  - MySQL Galera cluster – synchronous, active-active multi-master
  - master – master – simply works
  - Failover? mysql-master-ha by Yoshinori Matsunobu
  - MySQL utilities (http://www.clusterdb.com/mysql/mysql-utilities-webinar-qa-replay-now-available/)

Caching servers:
- this is a cache, for God's sake – why would we use HA here?
- just use proper architecture like... redundancy.

Load balancers:
- remember about failing over IP addresses!

Storage – DFSes:
- GlusterFS – we'll see it in action in a minute
- NFS? Could be – over some SAN / NAS (high-cost solution)
- CephFS – just like GlusterFS – it's great and does the work
- DRBD – lower level, does the work on the block-device layer – slow...

GlusterFS:
- low cost (could be..)
- distributed volumes
- replicated volumes
- striped volumes
- and...
- distributed – striped volumes
- distributed – replicated volumes
- distributed – striped – replicated volumes
- sounds good? :)

GlusterFS: replicated volumes vs Geo-replication
- replicated:
  - mirrors data
  - provides HA
  - synchronous replication
- Geo-replication:
  - mirrors data across geo-distributed clusters
  - ensures backing up data for DR
  - asynchronous replication (periodic checks)

HA for virtualization solutions?
- it's really complicated, like...



Tools
The most important tool would be the conclusion from the picture below:


- DNS: round robin, GSLB, low TTLs, GlobalIP
- Load balancers (L7, stateless services): HAProxy, Pound, Nginx
- Failover (stateful services):
  - IP: Keepalived + sysctl
- Managing: Pacemaker (manager) + Corosync (messaging)
- (almost) All-In-One: Linux Virtual Server

Turn on HA thinking!
Main goal of HA? Improve user experience!
- keep the app fully functional
- keep the app resistant and tolerant to faults
- provide a method for a successful audit
- sleep well (anyone awake?) ;)



Thank you :)
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net

