krzysztof@severalnines.com
Copyright 2018 Severalnines AB
Presenter
Krzysztof Książek, Senior Support Engineer @Severalnines
How to Manage Replication Failover
Processes for MySQL, MariaDB &
PostgreSQL
December 11th, 2018
I'm JJ from the Severalnines Team and I'm your host for
today's webinar!
Feel free to ask any questions in the Questions section
of this application or via the Chat box.
You can also contact me directly via the chat box or via
email: jj@severalnines.com during or after the
webinar.
Your host & some logistics
About Severalnines & ClusterControl
About ClusterControl
# Free to Download
# Initial 30 Days Enterprise Trial
# Reverts to Free Community Edition
# Enterprise / Paid Versions Available
ClusterControl Automation & Management
Deployment (Free Community)
# Deploy a Cluster in Minutes
○ On-Prem
○ Cloud (AWS/Azure/Google) - paid

Monitoring (Free Community)
# Systems View with 1 sec Resolution
# Agentless via SSH, or agent-based with Prometheus
# DB / OS stats & Performance Advisors
# Configurable Dashboards
# Query Analyzer
# Real-time / historical
Management (Paid Features)
# Backup Management
# Upgrades & Patching
# Security & Compliance
# Operational Reports
# Automatic Recovery & Repair
# Performance Management
# Automatic Performance Advisors
Supported Databases
Our Customers
•An introduction to failover - what, when, how
in MySQL / MariaDB
in PostgreSQL
•To automate or not to automate
•Understanding the failover process
•Orchestrating failover across the whole HA stack
•Difficult problems
Network partitioning
Missed heartbeats
Split brain
•From assisted to fully automated failover with ClusterControl
Demo
Agenda
An introduction to failover - what, when, how
•A switchover is the planned process of moving the master role to another server by promoting a slave while the old master is still available
•A failover is the process of moving the master role to another server by promoting a slave when the old master is not available or its availability is limited
This is the worse scenario, as you cannot assume all the slaves are in sync
•Today we will focus on the failover process
An introduction to replication failover - what, when, how
•A failover is performed when the old master becomes unavailable. In both MySQL and PostgreSQL replication, writes have to be sent to the master, so its crash makes the whole cluster unavailable for writes
•Importantly, you should verify master connectivity from the point of view of the slaves
It may happen that the monitoring node cannot reach the master while the slaves are happily replicating from it
Failover should be triggered only if the master is reachable neither by the application nor by the slaves
An introduction to replication failover - what, when, how
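The rule on this slide can be sketched as a small decision function. This is a minimal illustration, not the logic of any particular tool: the `Observation` type and the vantage-point names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What one (hypothetical) vantage point reports about the master."""
    source: str             # e.g. "monitor", "app", "slave1"
    master_reachable: bool


def should_failover(observations):
    """Return True only if no vantage point can reach the master."""
    if not observations:
        # No data at all is not evidence that the master is down.
        return False
    return not any(o.master_reachable for o in observations)
```

With this rule, a monitoring node that has lost its own connectivity cannot trigger a failover on its own as long as at least one slave still reports the master as reachable.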
•After a master crash you end up with one or more slaves
•Verify that the master is indeed not reachable
•Decide which slave is the most up to date and pick it as the master candidate
•Ensure there are no errant transactions on the master candidate
•Collect missing data from the master (if possible) and replay it on the master candidate
•Reslave all remaining slaves off the new master
•Ensure to the best of your abilities that the old master will not be started again before it can
be investigated
•Rebuild the old master as a slave using the data from the new master
Failover in MySQL
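One way to decide which slave is the most up to date, sketched under the assumption of position-based (non-GTID) replication where all slaves replicated from the same master. The input dictionary is hypothetical; its values are meant to be the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` fields collected from `SHOW SLAVE STATUS` on each slave.

```python
import re


def binlog_index(file_name):
    """Extract the numeric suffix from a name such as 'mysql-bin.000012'."""
    match = re.search(r"\.(\d+)$", file_name)
    if match is None:
        raise ValueError(f"unexpected binary log name: {file_name!r}")
    return int(match.group(1))


def pick_candidate(slave_status):
    """Return the host that has executed the furthest master position.

    slave_status maps host -> (Relay_Master_Log_File, Exec_Master_Log_Pos).
    """
    if not slave_status:
        raise ValueError("no slaves to choose from")
    return max(
        slave_status,
        key=lambda host: (binlog_index(slave_status[host][0]), slave_status[host][1]),
    )
```

The file index is compared before the position, so a slave on an older binary log never wins just because its offset within that file is larger.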
•After an active server crash you end up with one or more standby servers
•Verify that the active server is indeed not reachable
•Find the most advanced standby server
•Trigger the failover using either pg_ctl promote or the trigger_file
•Run pg_rewind on the remaining standby servers to bring them in sync with the new master
•Reslave remaining standby servers to the new master
•Ensure to the best of your abilities that the old master will not be started again before it can
be investigated
•Rebuild the old master as a slave using the data from the new master
Failover in PostgreSQL
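Finding the most advanced standby comes down to comparing WAL positions (LSNs). The sketch below assumes the LSN text has already been collected from each standby, for example with `pg_last_wal_receive_lsn()` on PostgreSQL 10 or later; the host names and the input dictionary are hypothetical.

```python
def parse_lsn(lsn):
    """Convert a textual LSN such as '16/B374D848' to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_standby(received_lsn):
    """Return the standby with the highest received LSN.

    received_lsn maps standby host -> LSN text it last received.
    """
    if not received_lsn:
        raise ValueError("no standby servers to choose from")
    return max(received_lsn, key=lambda host: parse_lsn(received_lsn[host]))
```

LSNs are converted to integers because comparing them as plain strings gives wrong answers: 'A/0' sorts after '10/0' as text although it is the older position.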
To automate or not to automate?
•As shown in the last two slides, a failover requires a number of steps
As usual, the more steps there are and the more complex they are, the higher the chance of human error
•Scripts can easily perform all the required tasks, run all the checks and do it much faster and more reliably than a human can
•Scripts are only as smart as we wrote them, though. Humans tend to be more flexible and can handle unpredictable situations better
•Should we automate the failover or not? That’s the question!
•Let’s go through some pros and cons of automated failover
To automate or not to automate?
•Pros
Way faster reaction to the issue
Higher reliability in typical situations
When configured correctly, may handle the majority of cases properly
Reduces on-call burnout - even though you page your staff, it’s not as critical given that the systems are up and running
•Cons
Limited situational awareness - does not understand the larger picture (it understands only what has been coded in)
Decisions made are not always correct
Requires intensive testing to ensure reliability
Has to be maintained (if it is your own script)
To automate or not to automate?
•The main differentiating factors are the reaction time and the lack of situational awareness
•Automated failover will be faster but may take actions a user would not take
•But the logic can be improved, and safety features like whitelists and blacklists can be used in an attempt to reduce incorrect behaviour
•Better visibility can also be implemented:
Access tests through multiple hosts (slaves, proxies)
Utilising a clustering protocol like Raft or Paxos for network split detection
•Don’t expect automated failover to correctly cover 100% of the cases though
•A third way may also be applicable - assisted failover
Does everything automatically but is initiated by the user, after an initial assessment
To automate or not to automate?
Understanding the failover process
•Ensuring that the master is indeed down is critical
•You never want to run two writable masters at the same time!
•You may want to implement some sort of STONITH (Shoot The Other Node In The Head) to ensure the dead master stays dead
•You can leverage data from multiple sources. Are slaves replicating? Do proxies see the
master?
Understanding the failover process

Ensure that the master is indeed down
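The ordering implied here, fence first and promote second, can be sketched as follows. Both callables are hypothetical placeholders for whatever fencing and promotion mechanisms are in use, and aborting when fencing fails is only one conservative policy, not the only reasonable one.

```python
def fail_over(fence_old_master, promote_candidate):
    """Promote only after the old master has been fenced.

    Both arguments are hypothetical callables supplied by the caller:
    fence_old_master() returns True when the old master is confirmed off,
    promote_candidate() performs the promotion.
    """
    if not fence_old_master():
        # Without a confirmed fence, promoting risks two writable masters.
        return "aborted: old master could not be fenced"
    promote_candidate()
    return "promoted"
```

The trade-off of this policy is availability: if the fencing mechanism itself is unreachable, the cluster stays without a master until a human decides.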
•Picking the correct slave as the master candidate is critical
•You want to use the most advanced slave to avoid data loss
•You want to ensure there are no errant transactions (in a GTID setup)
•You want to allow the slave to apply the events from its relay logs (as long as it does not take too long)
•You want to try to reach the master to see if there are non-replicated binary log events
A master failure does not always mean you cannot SSH there and parse binlogs for missing transactions
Understanding the failover process

Pick the correct slave as the master candidate
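An errant transaction check in a GTID setup amounts to subtracting one GTID set from another. The sketch below does this in plain Python for illustration only; inside MySQL the same result can be obtained with the `GTID_SUBTRACT()` function. It assumes the second argument is the last GTID set the master is known to have executed, and it expands intervals into sets of integers, which is fine for small examples but not for production-sized sets.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            numbers.update(range(int(first), int(last or first) + 1))
    return result


def errant_transactions(candidate_executed, master_executed):
    """Return transactions present on the candidate but unknown to the master."""
    candidate = parse_gtid_set(candidate_executed)
    master = parse_gtid_set(master_executed)
    errant = {}
    for uuid, numbers in candidate.items():
        extra = numbers - master.get(uuid, set())
        if extra:
            errant[uuid] = sorted(extra)
    return errant
```

An empty result means the candidate holds nothing the master did not have; a non-empty result, typically under the slave's own server UUID, points at writes made directly on that slave.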
•Correct usage of whitelists and blacklists is critical
•You may not want to promote just any slave that you have
•Better to stay within the same datacenter to avoid a split brain scenario with two masters
•Better to stay within the same datastore version for compatibility reasons
•Better to stay on the same hardware for performance reasons
•While executing a failover, use the standard procedures for marking masters and slaves
read_only and super_read_only: 0 or 1?
Understanding the failover process
Correct usage of whitelists and blacklists
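A whitelist and blacklist filter can be as simple as the sketch below. The precedence chosen here (the blacklist always wins, and a whitelist, when present, restricts the choice further) is an assumption made for the example; real tools may define it differently.

```python
def eligible_candidates(slaves, whitelist=None, blacklist=None):
    """Filter promotion candidates, keeping the original order.

    A host on the blacklist is never eligible. If a whitelist is given,
    only hosts on it are eligible.
    """
    blocked = set(blacklist or [])
    allowed = set(whitelist) if whitelist else None
    return [
        host for host in slaves
        if host not in blocked and (allowed is None or host in allowed)
    ]
```

For example, whitelisting only the slaves in the master's datacenter keeps a failover from promoting a remote or underpowered host.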
•Automated failover process can sometimes be augmented by the use of pre- or post-failover
actions
•Do you want to perform some action when the master failed?
•Do you need to reconfigure some application when a new master is promoted?
•Do you want to remove old master entry from your Consul key/value store?
•Most of the main tools that handle failover also support pre- and post-failover actions
MHA
Orchestrator
ClusterControl
Understanding the failover process
Pre- and post-failover actions
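How pre- and post-failover actions wrap the promotion can be sketched generically. This does not reproduce the hook interface of MHA, Orchestrator or ClusterControl; every callable here is a hypothetical placeholder, and letting a pre-hook veto the failover is an assumption of the example.

```python
def run_failover(promote, pre_hooks=(), post_hooks=()):
    """Run pre-failover hooks, promote, then run post-failover hooks.

    All arguments are hypothetical callables. A pre-hook returning False
    aborts the failover before anything is changed.
    """
    for hook in pre_hooks:
        if hook() is False:
            return "aborted by pre-failover hook"
    new_master = promote()
    for hook in post_hooks:
        # Post-hooks receive the new master, e.g. to update service discovery.
        hook(new_master)
    return f"promoted {new_master}"
```

A post-hook is the natural place for tasks such as reconfiguring an application or removing the old master entry from a key/value store.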
Orchestrating failover across the whole HA stack
•Databases do not exist in a vacuum; they are surrounded by other services to create a highly available environment
•Proxies need a way to distinguish between the master and a slave
In PostgreSQL streaming replication this is typically the existence of a recovery.conf file
In MySQL it can be, for example, the value of read_only and super_read_only: 1 or 0
•When a failover is happening, you have to make sure you manage the variable’s value correctly
You don’t want loadbalancers to send traffic to your databases while a failover is happening
Orchestrating failover across the whole HA stack
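For MySQL, the role detection a proxy performs can be sketched as a mapping from `read_only` values to writer and reader pools. The function and its `failover_in_progress` flag are hypothetical; they only illustrate the idea of withholding traffic while roles are being changed.

```python
def build_routing(read_only_by_host, failover_in_progress=False):
    """Split hosts into writers and readers from their read_only values.

    read_only_by_host maps host -> 0 or 1 (the MySQL read_only value).
    While a failover is in progress no traffic is routed at all.
    """
    if failover_in_progress:
        return {"writers": [], "readers": []}
    writers = sorted(h for h, ro in read_only_by_host.items() if ro == 0)
    readers = sorted(h for h, ro in read_only_by_host.items() if ro == 1)
    return {"writers": writers, "readers": readers}
```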
•All loadbalancers deployed by ClusterControl follow those rules
recovery.conf file on PostgreSQL
read_only value on MySQL
•ClusterControl ensures that the values in MySQL are set according to the stage of the process
In a switchover, the master is demoted by setting read_only=1. In a failover this cannot be done
Still, read_only=1 is set in the MySQL configuration on all nodes to minimise the chance of the old master returning as a writable host
The new master is marked with read_only=0
•This process works but it does not cover all the situations
Orchestrating failover across the whole HA stack
Difficult problems
•Networks can be unstable and packets may be lost in the transfer
•Replication itself is robust and it will work quite well even if there are network problems
•Health checks performed over the replication also have to take such conditions into consideration
•Make sure you do not take any actions based on just a single health check
•Make sure you do not take any actions based on just a single host’s point of view
•Expect network problems and try to understand their severity before any action is taken
Difficult problems - network issues
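Not acting on a single health check can be implemented by requiring several consecutive failures before a host is declared down. The class below is a minimal sketch; the default threshold of three is an arbitrary assumption and would need tuning against the check interval.

```python
class HealthTracker:
    """Declare a host down only after several consecutive failed checks."""

    def __init__(self, failures_required=3):
        self.failures_required = failures_required
        self.consecutive_failures = 0

    def record(self, check_succeeded):
        """Record one check result; return True if the host is considered down."""
        if check_succeeded:
            # Any success resets the streak, so isolated packet loss is ignored.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_required
```

Running one tracker per vantage point, and acting only when all of them report the host as down, covers the single-host rule as well.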
•Every cluster type has its own problems. For MySQL and PostgreSQL replication one of the biggest issues is the lack of cluster awareness and lack of quorum support
•Replication clusters are prone to network split issues
•Automated topology detection by proxies can make things even more tricky
•There’s no easy, standard way to avoid this problem
Difficult problems - network split
•A network split happens when there’s a lack of connectivity between one part of the cluster and the other part
For example, the master cannot reach the slaves and the slaves cannot reach the master
•The master is unavailable, therefore the cluster cannot handle writes
Failover should be performed to restore the cluster’s ability to handle traffic
•The master is still running though, so when the networks converge two writable hosts will show up
•Standard topology detection logic will not be enough. Two nodes will have read_only=0, two nodes will not have the recovery.conf file
Without additional measures to ensure the old master won’t get the traffic, a split brain is imminent
Difficult problems - network split
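The situation described here, two nodes with read_only=0 once the networks converge, can at least be detected. The sketch below is a hypothetical check for MySQL that flags a topology with more than one writable host; deciding what to do about it is left to the caller.

```python
def check_writable_hosts(read_only_by_host):
    """Classify a topology by how many hosts currently accept writes.

    read_only_by_host maps host -> 0 or 1 (the MySQL read_only value).
    Returns a (status, writable_hosts) tuple.
    """
    writable = sorted(h for h, ro in read_only_by_host.items() if ro == 0)
    if len(writable) > 1:
        return "split-brain risk", writable
    if not writable:
        return "no writable master", writable
    return "ok", writable
```

A proxy or monitoring script that sees "split-brain risk" should stop routing writes to at least one of the hosts rather than pick a writer arbitrarily.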
•Split brain is a condition in which two writable nodes take the traffic and, as a result, their
data sets drift apart
•There’s no easy way to recover from such a condition
Shut down the rogue master as soon as possible to minimise the data drift
Manual action will be required to converge the data sets
•Make sure that whatever solution you choose, it works
You can do better than GitHub!
Difficult problems - split brain
•There are numerous ways in which you can reduce (but not eliminate) the impact and probability of your data being affected by network issues
•Collect as much data as possible about the state of the replication topology before an action is taken
Utilize multiple nodes as points of view on the topology
•Try to implement STONITH to reduce the chance that the old master will show up
Some kind of lights-out solution (iLO for example) might work in a physical environment
Kill scripts (destroying the given virtual instance) may work in the cloud
•Modify the configuration of the proxies to remove the old master after it’s deemed dead
•No solution will be 100% bulletproof
You may not be able to reach all the proxies, the node itself or the cloud service to kill the master
Difficult problems - how to avoid them?
Demo
End of Year Promotion
Get Three Months Free (25% in Savings) with an Annual Contract
Just Sign By December 20th!
•Blogs that cover failover:
https://severalnines.com/blog/introduction-failover-mysql-replication-101-blog
https://severalnines.com/blog/failover-postgresql-replication-101
https://severalnines.com/blog/how-control-replication-failover-mysql-and-mariadb
https://severalnines.com/blog/controlling-replication-failover-mysql-and-mariadb-pre-or-post-failover-scripts
•To automate or not to automate?
https://severalnines.com/blog/failover-mysql-replication-and-others-should-it-be-automated
• Contact: jj@severalnines.com
Thank you!

Webinar slides: How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL

  • 1. [email protected] Copyright 2018 Severalnines AB Presenter Krzysztof Książek, Senior Support Engineer @Severalnines How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL December 11th, 2018
  • 2. Copyright 2018 Severalnines AB I'm JJ from the Severalnines Team and I'm your host for today's webinar! Feel free to ask any questions in the Questions section of this application or via the Chat box. You can also contact me directly via the chat box or via email: [email protected] during or after the webinar. Your host & some logistics
  • 3. Copyright 2019 Severalnines ABCopyright 2019 Severalnines AB About Severalnines & ClusterControl
  • 4. Copyright 2019 Severalnines ABCopyright 2017 Severalnines AB
  • 5. Copyright 2019 Severalnines ABCopyright 2017 Severalnines AB About ClusterControl # Free to Download # Initial 30 Days Enterprise Trial # Reverts to Free Community Edition # Enterprise / Paid Versions Available
  • 6. Copyright 2019 Severalnines ABCopyright 2017 Severalnines AB ClusterControl Automation & Management Deployment (Free Community) # Deploy a Cluster in Minutes ○ On-Prem ○ Cloud (AWS/Azure/Google) - paid
 Monitoring (Free Community) # Systems View with 1 sec Resolution # Agentless via SSH, or agent-based with Prometheus # DB / OS stats & Performance Advisors # Configurable Dashboards # Query Analyzer # Real-time / historical Management (Paid Features) # Backup Management # Upgrades & Patching # Security & Compliance # Operational Reports # Automatic Recovery & Repair # Performance Management # Automatic Performance Advisors
  • 7. Copyright 2019 Severalnines ABCopyright 2019 Severalnines AB Supported Databases
  • 8. Copyright 2019 Severalnines ABCopyright 2017 Severalnines AB Our Customers
  • 9. [email protected] Copyright 2018 Severalnines AB Presenter Krzysztof Książek, Senior Support Engineer @Severalnines How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL December 11th, 2018
  • 10. Copyright 2018 Severalnines AB •An introduction to failover - what, when, how in MySQL / MariaDB in PostgreSQL •To automate or not to automate •Understanding the failover process •Orchestrating failover across the whole HA stack •Difficult problems Network partitioning Missed heartbeats Split brain •From assisted to fully automated failover with ClusterControl Demo Agenda
  • 11. Copyright 2017 Severalnines AB Copyright 2018 Severalnines AB An introduction to failover - what, when, how
  • 12. Copyright 2018 Severalnines AB •A switchover is the process of switching a master role to another server through the process of a slave promotion •A failover is the process of switching a master role to another server through the process of a slave promotion. Old master is not available or its availability is limited This is worse scenario as you cannot assume all the slaves are in sync •Today the we will focus on the failover process An introduction to replication failover - what, when, how
  • 13. Copyright 2018 Severalnines AB •The failover is performed when the old master became unavailable. Both in MySQL and PostgreSQL replication, writes have to be sent to the master therefore its crash affects the whole cluster, making it not available •What is important, you should verify the master connectivity from the point of the slaves It may happen that the monitoring node cannot reach the master while slaves are happily replicating from it Failover should be triggered only if the master is indeed not reachable neither by the application nor by the slaves An introduction to replication failover - what, when, how
  • 14. Copyright 2018 Severalnines AB •After a master crash you end up with one or more slaves •Verify that the master is indeed not reachable •Decide which slave is the most up to date and pick it as master candidate •Ensure there are no errant transactions on the master candidate •Collect missing data from the master (if it is possible) and replay them on the master candidate •Reslave all remaining slaves off the new master •Ensure to the best of your abilities that the old master will not be started again before it can be investigated •Rebuild the old master as a slave using the data from the new master Failover in MySQL
  • 15. Copyright 2018 Severalnines AB •After an active server crash you end up with one or more standby servers •Verify that the active server is indeed not reachable •Find the most advanced standby server •Trigger the failover using either pg_ctl promote or the trigger_file •pg_rewind for remaining standby servers to make them in sync with the new master •Reslave remaining standby servers to the new master •Ensure to the best of your abilities that the old master will not be started again before it can be investigated •Rebuild the old master as a slave using the data from the new master Failover in PostgreSQL
  • 16. Copyright 2017 Severalnines AB Copyright 2018 Severalnines AB To automate or not to automate?
  • 17. Copyright 2018 Severalnines AB •As shown in last two slides, the failover requires couple of steps to be performed As usual, more steps and more complex they are, the higher chance for human error •Scripts can easily perform all the tasks required, run all the checks and do it way faster and more reliable than human can do •Scripts are as smart as we wrote them, though. Humans tend to be more flexible and can handle unpredictable situations better •Should we automate the failover or not? That’s the question! •Let’s go through some pros and cons of automated failover To automate or not to automate?
  • 18. To automate or not to automate?
    • Pros
      ○ Much faster reaction to the issue
      ○ Higher reliability in typical situations
      ○ When configured correctly, may handle the majority of cases properly
      ○ Reduces on-call burnout: even though you still page your staff, it’s not as critical given that the systems are up and running
    • Cons
      ○ Limited situational awareness: does not understand the big picture (it understands only what has been coded in)
      ○ Decisions made are not always correct
      ○ Requires intensive testing to ensure reliability
      ○ Has to be maintained (if it is your own script)
  • 19. To automate or not to automate?
    • The main differentiating factors are reaction time and the lack of situational awareness
    • Automated failover will be faster but may take actions the user would not take
    • The logic can be improved, though, and safety features like whitelists and blacklists can be used in an attempt to reduce incorrect behaviour
    • Better visibility can also be implemented:
      ○ Access tests through multiple hosts (slaves, proxies)
      ○ Utilising a clustering protocol like Raft or Paxos for network split detection
    • Don’t expect automated failover to handle 100% of the cases correctly, though
    • A third way may also be applicable - assisted failover
      ○ Does everything automatically but is initiated by the user, after the initial assessment
  • 20. Understanding the failover process
  • 21. Understanding the failover process - Ensure that the master is indeed down
    • Ensuring that the master is indeed down is critical
    • You never want to run two writable masters at the same time!
    • You may want to implement some sort of STONITH (Shoot The Other Node In The Head) to ensure the dead master stays dead
    • You can leverage data from multiple sources. Are the slaves replicating? Do the proxies see the master?
  • 22. Understanding the failover process - Pick the correct slave as the master candidate
    • Picking the correct slave as the master candidate is critical
    • You want to use the most advanced slave to avoid data loss
    • You want to ensure there are no errant transactions (in a GTID setup)
    • You want to allow the slave to apply the events from its relay logs (as long as it does not take too long)
    • You want to try to reach the master to see if there are non-replicated binary log events
      ○ A master failure does not always mean you cannot SSH in and parse the binlogs for missing transactions
  • 23. Understanding the failover process - Correct usage of whitelists and blacklists
    • Correct usage of whitelists and blacklists is critical
    • You may not want to promote just any slave that you have
    • Better to stay within the same datacenter to avoid a split brain scenario with two masters
    • Better to stay within the same datastore version for compatibility reasons
    • Better to stay on the same hardware for performance reasons
    • While executing a failover, use the standard procedures for marking masters and slaves
      ○ read_only and super_read_only = 0 or 1?
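Whitelist and blacklist filtering of promotion candidates can be sketched as follows. The semantics here are an assumption made for illustration (a blacklist always wins, and a missing whitelist allows every slave); they are not taken from any specific tool, and the function and host names are hypothetical.

```python
from typing import Iterable, List, Optional


def eligible_candidates(
    slaves: Iterable[str],
    whitelist: Optional[Iterable[str]] = None,
    blacklist: Iterable[str] = (),
) -> List[str]:
    """Return the slaves that may be promoted, preserving input order.

    Assumed semantics: a blacklisted slave is never eligible; if a
    whitelist is given, only slaves on it are eligible.
    """
    blocked = set(blacklist)
    allowed = None if whitelist is None else set(whitelist)
    return [
        s for s in slaves
        if s not in blocked and (allowed is None or s in allowed)
    ]


# Keep the remote-datacenter slave out of the promotion pool.
print(eligible_candidates(["dc1-db2", "dc1-db3", "dc2-db1"], blacklist=["dc2-db1"]))
```

The most advanced slave would then be picked from the filtered list only, so an up-to-date but unsuitable host never becomes master.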
  • 24. Understanding the failover process - Pre- and post-failover actions
    • The automated failover process can sometimes be augmented with pre- or post-failover actions
    • Do you want to perform some action when the master has failed?
    • Do you need to reconfigure some application when a new master is promoted?
    • Do you want to remove the old master’s entry from your Consul key/value store?
    • Most of the main tools that handle failover also support pre- and post-failover actions
      ○ MHA
      ○ Orchestrator
      ○ ClusterControl
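The idea of wrapping promotion in pre- and post-failover actions can be sketched generically in Python. This is not how MHA, Orchestrator or ClusterControl implement their hooks; the function `run_failover`, the hook signatures, and the rule that a failing pre-hook aborts the failover are all assumptions made for the example.

```python
from typing import Callable, Iterable


def run_failover(
    promote: Callable[[], None],
    pre_hooks: Iterable[Callable[[], bool]] = (),
    post_hooks: Iterable[Callable[[], None]] = (),
) -> bool:
    """Run pre-hooks, promote the new master, then run post-hooks.

    Assumption: a pre-hook returning False aborts the failover before
    the topology is touched. Returns True if promotion was performed.
    """
    for hook in pre_hooks:
        if not hook():
            return False
    promote()
    for hook in post_hooks:
        hook()
    return True


log = []
done = run_failover(
    promote=lambda: log.append("promote new master"),
    pre_hooks=[lambda: log.append("fence old master") is None],
    post_hooks=[lambda: log.append("update service discovery")],
)
print(done, log)
```

In practice the hooks would be external scripts, for example one that removes the old master's entry from a key/value store after promotion.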
  • 25. Orchestrating failover across the whole HA stack
  • 26. Orchestrating failover across the whole HA stack
    • Databases do not exist in a vacuum; they are surrounded by other services to create a highly available environment
    • Proxies need a way to distinguish between the master and a slave
      ○ In PostgreSQL streaming replication this is typically the existence of a recovery.conf file
      ○ In MySQL it can be, for example, the value of read_only and super_read_only: 1 or 0
    • When failover is happening, you have to make sure you manage the variable’s value correctly
      ○ You don’t want load balancers to send traffic to your databases while failover is happening
  • 27. Orchestrating failover across the whole HA stack
  • 28. Orchestrating failover across the whole HA stack
  • 29. Orchestrating failover across the whole HA stack
    • All load balancers deployed by ClusterControl follow these rules
      ○ recovery.conf file on PostgreSQL
      ○ read_only value on MySQL
    • ClusterControl ensures that the values in MySQL are set according to the stage of the process
      ○ In switchover, the master is demoted through read_only=1. In failover this cannot be done
      ○ Still, read_only=1 is set in the MySQL configuration on all nodes to minimise the chance of the old master returning as a writable host
      ○ The new master is marked with read_only=0
    • This process works, but it does not cover all situations
  • 30. Difficult problems
  • 31. Difficult problems - network issues
    • Networks can be unstable and packets may be lost in transfer
    • Replication itself is robust and will work quite well even if there are network problems
    • Health checks performed on the replication setup also have to take such conditions into consideration
    • Make sure you do not take any action based on just a single health check
    • Make sure you do not take any action based on just a single host’s point of view
    • Expect network problems and try to understand their severity before any action is taken
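One simple way to avoid acting on a single health check is to require several consecutive failures before declaring the master down. The sketch below is illustrative only: the class name `FailureCounter` and the threshold of 3 are assumptions, not values from the presentation.

```python
class FailureCounter:
    """Declare the master down only after N consecutive failed checks."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check; return True once the master counts as down."""
        if check_passed:
            # Any successful check resets the streak: a lost packet is not an outage.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


counter = FailureCounter(threshold=3)
results = [counter.record(ok) for ok in [False, False, True, False, False, False]]
print(results)
```

Keeping one counter per observing host, and acting only when all of them report the master as down, covers the single-host point of view as well.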
  • 32. Difficult problems - network split
    • Every cluster type has its own problems. For MySQL and PostgreSQL replication, one of the biggest issues is the lack of cluster awareness and quorum support
    • Replication clusters are prone to network split issues
    • Automated topology detection by proxies can make things even trickier
    • There’s no easy, standard way to avoid this problem
  • 33. Difficult problems - network split
    • A network split happens when one part of the cluster loses connectivity to the other part
      ○ For example, the master cannot reach the slaves and the slaves cannot reach the master
    • The master is unavailable, therefore the cluster cannot handle writes
      ○ Failover should be performed to restore the cluster’s ability to handle traffic
    • The master is still running, though: when the networks converge, two writable hosts will show up
    • Standard topology detection logic will not be enough. Two nodes will have read_only=0, two nodes will not have the recovery.conf file
      ○ Without additional measures to ensure the old master won’t get traffic, a split brain is imminent
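The situation on this slide, where two MySQL nodes report read_only=0 after the networks converge, can at least be detected. The sketch below assumes the `read_only` value of every node has already been collected; the function name, the status labels and the host names are hypothetical.

```python
from typing import Mapping


def classify_topology(read_only_by_node: Mapping[str, bool]) -> str:
    """Classify a replication topology from each node's read_only flag.

    Returns 'ok' for exactly one writable node, 'no-master' for none,
    and 'split-brain-risk' when more than one node accepts writes.
    """
    writers = [node for node, read_only in read_only_by_node.items() if not read_only]
    if len(writers) == 1:
        return "ok"
    if not writers:
        return "no-master"
    return "split-brain-risk"


# The old master came back after the split and still has read_only=0.
print(classify_topology({"old-master": False, "new-master": False, "slave1": True}))
```

Detection alone does not fix anything; it only gives proxies or operators a signal to stop routing writes until one of the writable nodes has been demoted or shut down.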
  • 34. Difficult problems - split brain
    • Split brain is a condition in which two writable nodes take traffic and, as a result, their data sets drift apart
    • There’s no easy way to recover from such a condition
      ○ Shut down the rogue master as soon as possible to minimise the data drift
      ○ Manual action will be required to converge the data sets
    • Make sure that whatever solution you choose, it works
      ○ You can do better than GitHub!
  • 35. Difficult problems - split brain
  • 36. Difficult problems - how to avoid them?
    • There are numerous ways to reduce (but not eliminate) the probability and impact of network issues on your data
    • Collect as much data as possible about the state of the replication topology before an action is taken
      ○ Utilize multiple nodes as points of view on the topology
    • Try to implement STONITH to reduce the chance that the old master will show up
      ○ Some kind of Lights-Out solution (iLO, for example) might work in a physical environment
      ○ Kill scripts (destroying the given virtual instance) may work in the cloud
    • Modify the configuration of the proxies to remove the old master once it is deemed dead
    • No solution will be 100% bulletproof
      ○ You may not be able to reach all the proxies, the node itself, or the cloud service to kill the master
  • 37. Demo
  • 38. End of Year Promotion: Get Three Months Free (25% in Savings) with an Annual Contract. Just Sign By December 20th!
  • 39. Thank you!
    • Blogs that cover failover:
      ○ https://severalnines.com/blog/introduction-failover-mysql-replication-101-blog
      ○ https://severalnines.com/blog/failover-postgresql-replication-101
      ○ https://severalnines.com/blog/how-control-replication-failover-mysql-and-mariadb
      ○ https://severalnines.com/blog/controlling-replication-failover-mysql-and-mariadb-pre-or-post-failover-scripts
    • To automate or not to automate?
      ○ https://severalnines.com/blog/failover-mysql-replication-and-others-should-it-be-automated
    • Contact: [email protected]