Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?

Flexible transactional scale for the connected world.
Challenges to Scaling MySQL:
Best Practices for Creating High Availability
Dave A. Anselmi @AnselmiDave
Director of Product Management

Questions for Today
PROPRIETARY & CONFIDENTIAL 2
o What is high availability and when is it needed?
o What’s the difference between high availability and fault tolerance?
o How is it possible to survive a multi-node failure in MySQL?
o What are the best practices for achieving high availability with MySQL?
o What are the costs of achieving HA? What can be the most cost-effective
strategy??

HA: As You Scale, Your Exposure GrowsSCALE
(GROWTH/SUCCESS)
T I M E
LAMP Stack
AWS, Azure,
RAX, GCE, etc
Private Cloud
REACH LIMIT
App too slow;
Lost users
REACH LIMIT
(AGAIN)
App too slow;
Lost users
Migrate
to Bigger
Machine
• Read slaves, then
sharding, etc:
• Add more hardware &
DBAs
• Refactor code
/hardwired app
 More Expensive
 Higher Risk
 Lost Revenue
ONGOING:
• Refactoring
hardware
• Data balancing
• Shard
maintenance
REPEAT
Migrate
to Bigger
Machine

What do we mean by “High Availability”?

Availability by the 9s
PROPRIETARY & CONFIDENTIAL 5https://www.percona.com/blog/2016/06/07/choosing-mysql-high-availability-solutions/

“High Availability” –vs- “Fault Tolerance”
o High Availability – Minimize system downtime
– Trade-off between as high “9s” level as can be budgeted
– Goal: least amount of data loss possible
o Fault Tolerance – System cannot go down
– Arrays of redundant hardware, and automated failover systems
– Cost is very high
ORCL:
– A high availability system minimizes the time when the system is down, or
unavailable and maximizes the time when it is running, or available.
IBM:
– A fault tolerant environment has no service interruption but a significantly
higher cost,
– A highly available environment has a minimal service interruption.

Fault Tolerance –vs– High Availability
o Fault Tolerance:
– Failover application processes,
including heartbeat
– Shared storage layer, multiple
participants
o High Availability:
– Multiple redundant shared-
nothing servers
– Replication to keep in sync

High Availability rather than Fault-Tolerance
o MySQL systems are rarely fault tolerant
– High cost of fault tolerance is prohibitive
o Most MySQL systems use replication for HA/DR
o Galera isn’t fault tolerant
– Certification replication provides synchronous replication
between nodes
– Availability is enforced over consistency: the write-set can be
committed on the local node before the rest of the cluster has
committed (Jepsen)

Challenges to Deploying HA Systems

Four Challenges to HA
1. Tech/DevOps always wants HA:
1. Throughput & uptime is their core metric/KPIs
2. Business/Finance demands justification of the costs:
1. Redundant servers reflect underutilized resources
2. Redundant servers are considered “wasted budget”
3. Cloud/PaaS/IaaS can imply more HA than they provide
– “Architected for 11x 9s”
4. Tension between scale and HA
1. Ideally, each new server would provide scale and redundancy
2. In practice, result is mixed; so choice is usually for scale

Realities of HA in the Cloud
o “Promise” –vs- “Reality” of the cloud
– Promise of the cloud: web scale
– Reality of the cloud: TANSTAAFL
o “Doesn’t the cloud provide HA automatically?”
– MBAs: literally taught “DevOps just wants to spend $$”
• “We don’t need redundancy: we’re on the cloud, & the cloud is 5x 9s, right?”
• “S3 is architected for 11x 9s, right?”
• “We’re on Amazon, it’s backed up”
o MUST deploy redundant hardware in the cloud
– If it’s not on your bill, you haven’t provisioned it
o “Success” of AWS Marketing  Exposed Workload
– 2/28/2017 4 Hour S3 outage: Even though “the cloud” has lots of hardware
that does NOT mean your systems are fault tolerant, let alone HA

“Obvious” Critical Workloads needing HA
o E-Commerce
– Black Friday/Cyber Monday, Single’s Day, Back to School, flash
sales, etc
– 80% of Revenue in 2 months
– Provisioning > 3x capacity for 2 months
o Finance
– System of Record
– “Money changing hands”
o Healthcare
– “Life/death decisions” & DSS

Assessing Your Workload’s Exposure
o Downtime: how much new business lost?
o How much does brand awareness/damage cost?
o Lost data = what kind of cost?
– Orders unfulfilled, unhappy customers
– Missing/stale reports, unhappy executives
o Not just e-commerce:
– Internal critical DSS Reports => top bank runs 2x 100+ node
sharded arrays
• DSS needs to be near-real time
• What if a shard fails, or the data is old?

Business Case for HA
The “insurance” of HA offsets multiple costs:
o Opportunity cost
– Each missed visitor was potentially a customer or referral
o Single sale cost
– Each missed sale is a tangible missed $-value
o Customer lifetime cost
– Unhappy customers who find sites they like better, won’t return
o Market/brand cost
– All customers use social media: communication “force multiplier”
– “If you make customers unhappy in the physical world, they might each tell six friends. If
you make customers unhappy on the internet, they can each tell 6,000.” – Jeff Bezos
– W. Edwards Deming said “5” and “20”…
– Call it “Customer Satisfaction at Web-Scale”

Strategies to Make MySQL Deployments HA

MySQL HA is usually Replication-based
o Redundant servers
– Goal: get HA and more scale
– Some level of consistency
o Read slave or DR – data is still ‘seconds behind master’
– Async or Semistrict
o Certification
– Strong consistency as long as only a single master accepts writes
o Group Replication
– Strong consistency as long as only a single master accepts writes

Consistency Ramifications to High Availability
o Async Replication (Master/Slave):
– Replication-based: latency between master and slaves
– Always some number of transactions which COMMIT on Master aren’t
represented on the Slave
– “Trade latency for throughput.” OK for your workload?
o Sync Replication:
– Certification Replication: certificate is transmitted, local master commits
before ACK, other nodes commit in background
– Cloud Spanner & CockroachDB: time-based optimization for replicated
partitions
o Strong Consistency
– Every node is in identical, global transactional state at all times
– All nodes (at least two) containing data associated with the transaction are
durably updated before application receives ACK

Different Replication Strategies for HA
Approach Details Pro’s Con’s
Read Slave(s) Add a “Slave” read-server(s) to
“Master” database server
(e.g. “DR” node or cluster)
• Easy setup
• Single-master simplicity
• Async == Slave is usually behind
master
• Eventually Consistent
Master/
Master
Both Masters are Slaves to each
other
• Allows updates to both masters • Async == Slave is usually behind
master
• Eventually Consistent
Certification
Replication
Multi-Master cluster using
synchronous Replication
• Allows multiple masters to be close
in state
• Sync == Other nodes need to commit
the certification. Window of skew
exists (much shorter than async)
Group
Replication
1. Single-Primary, with
automatic leader election
2. Multi-Primary, i.e. similar to
certification replication
• Allows multiple masters to be close
in state
• Sync == Other nodes need to commit
the certification. Window of skew
exists (much shorter than async)

MySQL Deployment Architectures
SHARDO4SHARDO1 SHARDO2 SHARDO3
A-G H-M N-S T-Z
DRDR DR DR
A-G H-M N-S T-Z

HA Strategies per Architecture
MySQL Deployment
Approach Single Node Read Slave(s) Master/Master Sharding
Read
Slave(s)
• Each read slave adds
read scale + HA
• Eventual consistency
N/A
• Secondary master is
effectively same state as a
read slave
• Each shard has a read
slave
Master/
Master
• No HA benefit over
Read Slave
• Secondary master is
effectively same state
as a read slave
N/A
• Each shard in
Master/Master
Certification
Replication
• Nodes are closer in
state than read slave
• Nodes are closer in
state than read slave
• Nodes are closer in state
than Master/Master
• Each shard in
Master/Master using
certification replication
Group
Replication
• Automatic Master
election
• Group members are
closer in state than read
slave
• Automatic Master
election
• Group members are
closer in state than
read slave
• Group members are closer
in state than Master/Master
• Each shard using group
Replication
• Automatic Master election

How ClustrixDB Provides High Availability

ClustrixDB:
ClustrixDB
ACID Compliant
Transactions & Joins
Optimized for OLTP
Built-In High Availability
Flex-Up and Flex-Down
Minimal DB Admin
o Write + Read Linear Scale-Out
o Automatically Highly Available
o MySQL-Compatible

Automatic High Availability
o Planned or Unplanned Outages
– Planned: “soft-fail” the node(s)
– Single minimal “database pause” to regain
quorum
o At least 2 instances of the data distributed
across all the nodes
– All data instances fully in sync at all times
o Data is automatically rebalanced across
the cluster
– Tables are online for reads and writes
– MVCC for lockless reads while writing
S1
S2
S3
S3
S4
S4
S5
S1
ClustrixDB
S2
S5

Questions for Today
o What is high availability and when is it needed?
– Redundancy to minimize downtimes
– Financial, health, and other critical workloads
o What’s the difference between high availability and fault tolerance?
– High availability: minimize downtime
– Fault tolerance: zero downtime
o How is it possible to survive a multi-node failure in MySQL?
– Multiple server redundancy
– Maintaining strong consistency requires synchronous data replication
between servers

Questions for Today
o What are the best practices for achieving high availability with
MySQL?
– Synchronous replication: can affect performance or scale
– Asynchronous replication: can affect data consistency
o What are the costs of achieving HA? What can be the most cost-
effective strategy??
– Redundancy of servers: CAPEX & OPEX for DevOps
– License/support costs: ramps up by # of servers
– Ideally: each server provides scale + HA

Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?

More Related Content

What's hot (20)

Similar to Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment? (20)

More from Clustrix (13)

Recently uploaded (20)

Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?

Editor's Notes