Flexible transactional scale for the connected world.
Challenges to Scaling MySQL:
Best Practices for Creating High Availability
Dave A. Anselmi @AnselmiDave
Director of Product Management
Questions for Today
PROPRIETARY & CONFIDENTIAL 2
o What is high availability and when is it needed?
o What’s the difference between high availability and fault tolerance?
o How is it possible to survive a multi-node failure in MySQL?
o What are the best practices for achieving high availability with MySQL?
o What are the costs of achieving HA? What can be the most cost-effective
strategy?
HA: As You Scale, Your Exposure Grows
[Figure: scale (growth/success) plotted against time. A LAMP stack on AWS, Azure, RAX, GCE, etc. or a private cloud reaches its limit (app too slow; lost users) and is migrated to a bigger machine, then reaches the limit again (app too slow; lost users). Next steps are read slaves, then sharding, etc.: add more hardware & DBAs and refactor code/hardwired app, which means more expensive, higher risk, lost revenue. Ongoing: refactoring hardware, data balancing, shard maintenance. Repeat: migrate to bigger machine.]
What do we mean by “High Availability”?
Availability by the 9s
Source: https://www.percona.com/blog/2016/06/07/choosing-mysql-high-availability-solutions/
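The downtime budget behind each “9s” level can be derived directly from the availability percentage. The following is a minimal sketch, assuming a 365-day year; the function names are illustrative and do not come from the talk.

```python
# Downtime budget implied by an availability target ("N nines").
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(nines: int) -> float:
    """Return the availability fraction for N nines, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum yearly downtime, in minutes, allowed by N nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nines: {availability(n):.5%} available, "
              f"{downtime_minutes_per_year(n):,.2f} minutes/year of downtime")
```

Under this assumption, five 9s leaves a little over five minutes of downtime per year, which is why each additional 9 tends to cost disproportionately more to achieve.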
“High Availability” vs. “Fault Tolerance”
o High Availability – Minimize system downtime
– Trade-off: as high a “9s” level as the budget allows
– Goal: least amount of data loss possible
o Fault Tolerance – System cannot go down
– Arrays of redundant hardware, and automated failover systems
– Cost is very high
ORCL:
– A high availability system minimizes the time when the system is down, or
unavailable and maximizes the time when it is running, or available.
IBM:
– A fault tolerant environment has no service interruption but a significantly
higher cost,
– A highly available environment has a minimal service interruption.
Fault Tolerance vs. High Availability
o Fault Tolerance:
– Failover application processes, including heartbeat
– Shared storage layer, multiple participants
o High Availability:
– Multiple redundant shared-nothing servers
– Replication to keep in sync
High Availability rather than Fault-Tolerance
o MySQL systems are rarely fault tolerant
– High cost of fault tolerance is prohibitive
o Most MySQL systems use replication for HA/DR
o Galera isn’t fault tolerant
– Certification replication provides synchronous replication
between nodes
– Availability is enforced over consistency: the write-set can be
committed on the local node before the rest of the cluster has
committed (Jepsen)
Challenges to Deploying HA Systems
Four Challenges to HA
1. Tech/DevOps always wants HA:
1. Throughput & uptime are their core metrics/KPIs
2. Business/Finance demands justification of the costs:
1. Redundant servers reflect underutilized resources
2. Redundant servers are considered “wasted budget”
3. Cloud/PaaS/IaaS can imply more HA than they provide
– “Architected for 11x 9s”
4. Tension between scale and HA
1. Ideally, each new server would provide scale and redundancy
2. In practice, the result is mixed, so the choice is usually scale
Realities of HA in the Cloud
o “Promise” vs. “Reality” of the cloud
– Promise of the cloud: web scale
– Reality of the cloud: TANSTAAFL
o “Doesn’t the cloud provide HA automatically?”
– MBAs: literally taught “DevOps just wants to spend $$”
• “We don’t need redundancy: we’re on the cloud, & the cloud is 5x 9s, right?”
• “S3 is architected for 11x 9s, right?”
• “We’re on Amazon, it’s backed up”
o MUST deploy redundant hardware in the cloud
– If it’s not on your bill, you haven’t provisioned it
o “Success” of AWS Marketing => Exposed Workload
– 2/28/2017 4-hour S3 outage: even though “the cloud” has lots of hardware,
that does NOT mean your systems are highly available, let alone fault tolerant
“Obvious” Critical Workloads needing HA
o E-Commerce
– Black Friday/Cyber Monday, Singles’ Day, Back to School, flash
sales, etc.
– 80% of Revenue in 2 months
– Provisioning > 3x capacity for 2 months
o Finance
– System of Record
– “Money changing hands”
o Healthcare
– “Life/death decisions” & DSS
Assessing Your Workload’s Exposure
o Downtime: how much new business lost?
o How much does brand awareness/damage cost?
o Lost data = what kind of cost?
– Orders unfulfilled, unhappy customers
– Missing/stale reports, unhappy executives
o Not just e-commerce:
– Internal critical DSS Reports => top bank runs 2x 100+ node
sharded arrays
• DSS needs to be near-real time
• What if a shard fails, or the data is old?
Business Case for HA
The “insurance” of HA offsets multiple costs:
o Opportunity cost
– Each missed visitor was potentially a customer or referral
o Single sale cost
– Each missed sale is a tangible missed $-value
o Customer lifetime cost
– Unhappy customers who find sites they like better won’t return
o Market/brand cost
– All customers use social media: communication “force multiplier”
– “If you make customers unhappy in the physical world, they might each tell six friends. If
you make customers unhappy on the internet, they can each tell 6,000.” – Jeff Bezos
– W. Edwards Deming said “5” and “20”…
– Call it “Customer Satisfaction at Web-Scale”
Strategies to Make MySQL Deployments HA
MySQL HA is usually Replication-based
o Redundant servers
– Goal: get HA and more scale
– Some level of consistency
o Read slave or DR – data is still ‘seconds behind master’
– Async or semisynchronous
o Certification
– Strong consistency as long as only a single master accepts writes
o Group Replication
– Strong consistency as long as only a single master accepts writes
Consistency Ramifications to High Availability
o Async Replication (Master/Slave):
– Replication-based: latency between master and slaves
– Always some number of transactions which COMMIT on Master aren’t
represented on the Slave
– “Trade latency for throughput.” OK for your workload?
o Sync Replication:
– Certification Replication: certificate is transmitted, local master commits
before ACK, other nodes commit in background
– Cloud Spanner & CockroachDB: time-based optimization for replicated
partitions
o Strong Consistency
– Every node is in identical, global transactional state at all times
– All nodes (at least two) containing data associated with the transaction are
durably updated before application receives ACK
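The difference between acknowledging before and after the replica is durable can be illustrated with a toy model. This is a sketch under stated assumptions, not a model of any specific MySQL replication implementation: the `Node` class, the `run` function, and the fixed `replica_lag` window are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Transaction ids that this node has durably applied.
    log: List[int] = field(default_factory=list)


def run(mode: str, n_txns: int, replica_lag: int) -> List[int]:
    """Commit n_txns on a primary, crash it, and fail over to the replica.

    mode="async": the application is ACKed as soon as the primary commits;
                  the replica trails by `replica_lag` transactions.
    mode="sync":  the application is ACKed only after the replica has
                  durably applied the transaction (replica_lag is ignored).
    Returns the ACKed transaction ids that are missing on the replica.
    """
    if mode not in ("async", "sync"):
        raise ValueError("mode must be 'async' or 'sync'")
    primary, replica = Node(), Node()
    acked: List[int] = []
    for txn in range(1, n_txns + 1):
        primary.log.append(txn)
        if mode == "sync":
            # Wait for the replica before acknowledging.
            replica.log.append(txn)
        else:
            # The replica has applied only what is older than the lag window.
            visible = max(0, len(primary.log) - replica_lag)
            replica.log = primary.log[:visible]
        acked.append(txn)
    # The primary crashes here and the replica is promoted.
    return [t for t in acked if t not in replica.log]


if __name__ == "__main__":
    print("async, lost after failover:", run("async", 10, 3))
    print("sync,  lost after failover:", run("sync", 10, 3))
```

In this model the asynchronous run loses exactly the transactions inside the lag window, while the synchronous run loses none; the price of the synchronous path is that every commit waits on the replica, which is the performance trade-off the slide describes.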
Different Replication Strategies for HA
o Read Slave(s)
– Details: Add “Slave” read server(s) to the “Master” database server (e.g. “DR” node or cluster)
– Pros: Easy setup; single-master simplicity
– Cons: Async == Slave is usually behind master; eventually consistent
o Master/Master
– Details: Both Masters are Slaves to each other
– Pros: Allows updates to both masters
– Cons: Async == Slave is usually behind master; eventually consistent
o Certification Replication
– Details: Multi-Master cluster using synchronous replication
– Pros: Allows multiple masters to be close in state
– Cons: Sync == Other nodes need to commit the certification. Window of skew exists (much shorter than async)
o Group Replication
– Details: (1) Single-Primary, with automatic leader election; (2) Multi-Primary, i.e. similar to certification replication
– Pros: Allows multiple masters to be close in state
– Cons: Sync == Other nodes need to commit the certification. Window of skew exists (much shorter than async)
MySQL Deployment Architectures
[Figure: four shards, SHARDO1 (A-G), SHARDO2 (H-M), SHARDO3 (N-S), and SHARDO4 (T-Z), each paired with a DR copy covering the same key range.]
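Range-based sharding like the A-G / H-M / N-S / T-Z split in the diagram implies that the application (or a proxy) must route each key to the right shard. A minimal sketch of that routing, assuming keys are routed by their first letter; the shard names and the `shard_for` function are hypothetical and not taken from the talk:

```python
import bisect

# Inclusive upper bound of each shard's key range, mirroring the
# A-G / H-M / N-S / T-Z split. Shard names are illustrative only.
SHARD_UPPER_BOUNDS = ["G", "M", "S", "Z"]
SHARD_NAMES = ["shard1", "shard2", "shard3", "shard4"]


def shard_for(key: str) -> str:
    """Return the shard responsible for `key`, based on its first letter."""
    first = key[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"cannot route key {key!r}: expected a leading A-Z letter")
    # bisect_left finds the first range whose upper bound is >= the letter.
    return SHARD_NAMES[bisect.bisect_left(SHARD_UPPER_BOUNDS, first)]


if __name__ == "__main__":
    for name in ["anselmi", "mysql", "Nginx", "zeta"]:
        print(name, "->", shard_for(name))
```

Because the ranges are hardwired, rebalancing means changing the bounds and moving data, which is the kind of ongoing shard maintenance the earlier scaling slide warns about.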
HA Strategies per Architecture
(Each HA approach applied to each MySQL deployment: Single Node, Read Slave(s), Master/Master, Sharding)
o Read Slave(s)
– Single Node: each read slave adds read scale + HA; eventual consistency
– Read Slave(s): N/A
– Master/Master: secondary master is effectively same state as a read slave
– Sharding: each shard has a read slave; eventual consistency
o Master/Master
– Single Node: no HA benefit over Read Slave
– Read Slave(s): secondary master is effectively same state as a read slave
– Master/Master: N/A
– Sharding: each shard in Master/Master; eventual consistency
o Certification Replication
– Single Node: nodes are closer in state than read slave
– Read Slave(s): nodes are closer in state than read slave
– Master/Master: nodes are closer in state than Master/Master
– Sharding: each shard in Master/Master using certification replication
o Group Replication
– Single Node: automatic Master election; group members are closer in state than read slave
– Read Slave(s): automatic Master election; group members are closer in state than read slave
– Master/Master: group members are closer in state than Master/Master
– Sharding: each shard using Group Replication; automatic Master election
How ClustrixDB Provides High Availability
ClustrixDB:
ClustrixDB
ACID Compliant
Transactions & Joins
Optimized for OLTP
Built-In High Availability
Flex-Up and Flex-Down
Minimal DB Admin
o Write + Read Linear Scale-Out
o Automatically Highly Available
o MySQL-Compatible
Automatic High Availability
o Planned or Unplanned Outages
– Planned: “soft-fail” the node(s)
– Single minimal “database pause” to regain
quorum
o At least 2 instances of the data distributed
across all the nodes
– All data instances fully in sync at all times
o Data is automatically rebalanced across
the cluster
– Tables are online for reads and writes
– MVCC for lockless reads while writing
[Figure: a ClustrixDB cluster in which each of the labels S1 through S5 appears twice, spread across the nodes.]
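The two ideas on this slide, keeping at least two copies of every piece of data on different nodes and requiring a quorum after a failure, can be sketched generically. This is not ClustrixDB's actual placement or quorum algorithm; the round-robin placement, the strict-majority rule, and all names below are assumptions made for illustration.

```python
from typing import Dict, List, Set


def place_replicas(slices: List[str], nodes: List[str],
                   copies: int = 2) -> Dict[str, List[str]]:
    """Round-robin placement: each slice gets `copies` replicas on distinct nodes."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    return {
        s: [nodes[(i + k) % len(nodes)] for k in range(copies)]
        for i, s in enumerate(slices)
    }


def has_quorum(nodes: List[str], failed: Set[str]) -> bool:
    """Assumed rule: a strict majority of nodes must still be up."""
    alive = len(nodes) - len(failed & set(nodes))
    return alive > len(nodes) // 2


def all_data_available(placement: Dict[str, List[str]], failed: Set[str]) -> bool:
    """True if every slice still has at least one replica on a live node."""
    return all(any(n not in failed for n in replicas)
               for replicas in placement.values())


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    placement = place_replicas(["S1", "S2", "S3", "S4", "S5"], nodes)
    for down in nodes:
        print(down, "down ->",
              "data available:", all_data_available(placement, {down}),
              "| quorum:", has_quorum(nodes, {down}))
```

With two copies on three nodes, any single node can fail without losing data or quorum. Surviving a multi-node failure under these assumptions takes more copies and more nodes, for example three copies on five nodes to tolerate two simultaneous failures.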
Questions for Today
o What is high availability and when is it needed?
– Redundancy to minimize downtime
– Financial, health, and other critical workloads
o What’s the difference between high availability and fault tolerance?
– High availability: minimize downtime
– Fault tolerance: zero downtime
o How is it possible to survive a multi-node failure in MySQL?
– Multiple server redundancy
– Maintaining strong consistency requires synchronous data replication
between servers
Questions for Today
o What are the best practices for achieving high availability with
MySQL?
– Synchronous replication: can affect performance or scale
– Asynchronous replication: can affect data consistency
o What are the costs of achieving HA? What can be the most cost-effective
strategy?
– Redundancy of servers: CAPEX & OPEX for DevOps
– License/support costs: ramp up with the number of servers
– Ideally: each server provides scale + HA
QUESTIONS?
THANK YOU!

Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?


Editor's Notes

  • #4: With each additional server or node, you add complexity and fragility
  • #6: Here’s what “5x 9’s” really means, etc. Typical production systems target 5x 9’s
  • #7: Let’s define some terms… ORCL: https://docs.oracle.com/cd/E17904_01/core.1111/e10106/intro.htm#ASHIA712 IBM: https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.1/com.ibm.powerha.concepts/ha_concepts_fault.htm
  • #9: https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster
  • #11: At the risk of making generalizations…