Autopsy of an automation disaster
Jean-François Gagné - Saturday, February 4, 2017
FOSDEM MySQL & Friends Devroom
2
To err is human
To really foul things up requires a computer[1]
(or a script)
[1]: http://quoteinvestigator.com/2010/12/07/foul-computer/
Booking.com
● Based in Amsterdam since 1996
● Online Hotel/Accommodation/Travel Agent (OTA):
● +1.134.000 properties in 225 countries
● +1.200.000 room nights reserved daily
● +40 languages (website and customer service)
● +13.000 people working in 187 offices worldwide
● Part of the Priceline Group
● And we use MySQL:
● Thousands (1000s) of servers, ~90% replicating
● >150 masters: ~30 with >50 slaves & ~10 with >100 slaves
3
Session Summary
1. MySQL replication at Booking.com
2. Automation disaster: external eye
3. Chain of events: analysis
4. Learning / takeaway
4
MySQL replication at Booking.com
● Typical MySQL replication deployment at Booking.com:
+---+
| M |
+---+
|
+------+-- ... --+---------------+-------- ...
| | | |
+---+ +---+ +---+ +---+
| S1| | S2| | Sn| | M1|
+---+ +---+ +---+ +---+
|
+-- ... --+
| |
+---+ +---+
| T1| | Tm|
+---+ +---+
5
MySQL replication at Booking.com’
● And we use Orchestrator:
6
MySQL replication at Booking.com’’
● Orchestrator allows us to:
● Visualize our replication deployments
● Move slaves for planned maintenance of an intermediate master
● Automatically replace an intermediate master in case of its unexpected failure
(thanks to pseudo-GTIDs when we have not deployed GTIDs)
● Automatically replace a master in case of a failure (failing over to a slave)
● But Orchestrator cannot replace a master alone:
● Booking.com uses DNS for master discovery
● So Orchestrator calls a homemade script to repoint DNS (and to do other magic); a defensive sketch of such a hook follows below
7
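
For illustration, here is a minimal, hypothetical sketch (in Python) of what a defensive DNS-repointing hook could look like. None of these names (resolve_master_cname, update_master_cname, freshness_of, is_alive) correspond to the real Booking.com script, which is not public; the point is only the kind of sanity check that "not defensive enough" refers to later in this deck.

def repoint_master_dns(cluster, promoted_host, resolve_master_cname,
                       update_master_cname, freshness_of, is_alive):
    """Point the cluster's master DNS entry at promoted_host, defensively.
    All callables are hypothetical and injected to keep the sketch self-contained."""
    current = resolve_master_cname(cluster)

    if current == promoted_host:
        return  # DNS already points at the promoted server: nothing to do.

    # Check 1: never steal the entry from a server that is alive and has
    # fresher data than the candidate (exactly what happened when B stole
    # the entry from X).
    if is_alive(current) and freshness_of(current) >= freshness_of(promoted_host):
        raise RuntimeError("refusing to repoint %s: %s is alive and at least as "
                           "up to date as %s" % (cluster, current, promoted_host))

    # Check 2: the candidate itself must be reachable before writes are sent to it.
    if not is_alive(promoted_host):
        raise RuntimeError("refusing to repoint %s: %s is unreachable"
                           % (cluster, promoted_host))

    update_master_cname(cluster, promoted_host)

In this incident, a check like the first one would probably have left DNS pointing at X and escalated to a human, instead of creating a split brain.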
Our subject database
● Simple replication deployment (in two data centers):
DNS (master) +---+
points here --> | A |
+---+
|
+------------------------+
| |
Reads +---+ +---+
happen here --> | B | | X |
+---+ +---+
|
+---+ And reads
| Y | <-- happen here
+---+
8
Split brain: 1st event
● A and B (two servers in same data center) fail at the same time:
DNS (master) +-/+
points here --> | A |
but accesses +/-+
are now failing
Reads +-/+ +---+
happen here --> | B | | X |
but accesses +/-+ +---+
are now failing |
+---+ And reads
| Y | <-- happen here
+---+
9
(I will cover how/why this happened later.)
Split brain: 1st event’
● Orchestrator fixes things:
+-/+
| A |
+/-+
Reads +-/+ +---+ Now, DNS (master)
happen here --> | B | | X | <-- points here
but accesses +/-+ +---+
are now failing |
+---+ Reads
| Y | <-- happen here
+---+
10
Split brain: disaster
● A few things happened during that day and night, and I woke up to this:
+-/+
| A |
+/-+
DNS +---+ +---+
points here --> | B | | X |
+---+ +---+
|
+---+
| Y |
+---+
11
Split brain: disaster’
● And to make things worse, reads are still happening on Y:
+-/+
| A |
+/-+
DNS (master) +---+ +---+
points here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
12
Split brain: disaster’’
● This is not good:
● When A and B failed, X was promoted as the new master
● Something made DNS point to B (we will see what later)
⇒ writes are now happening on B
● But B is outdated: none of the writes made to X (after the failure of A) reached B
● So we have data on X that cannot be read on B
● And we have new data on B that is not visible on Y
13
+-/+
| A |
+/-+
DNS (master) +---+ +---+
points here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
14
+-/+
| A |
+/-+
+-/+ +---+ DNS (master)
| B | | X | <-- points here
+/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
● And something caused a failure of A
15
+---+
| A |
+---+
|
+---+ +---+ DNS (master)
| B | | X | <-- points here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
● And something caused a failure of A
● But how did DNS end up pointing to B ?
● The failover to B called the DNS repointing script
● The script stole the DNS entry from X
and pointed it to B
16
+-/+
| A |
+/-+
+---+ +---+ DNS (master)
| B | | X | <-- points here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis
● Digging more in the chain of events, we find that:
● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
● So after their failures, A and B came back and formed an isolated replication chain
● And something caused a failure of A
● But how did DNS end up pointing to B ?
● The failover to B called the DNS repointing script
● The script stole the DNS entry from X
and pointed it to B
● But is that all: what made A fail ?
17
+-/+
| A |
+/-+
DNS (master) +---+ +---+
points here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B had come back before A, it could have been re-slaved to X
18
+---+
| A |
+---+
|
+---+ +---+ DNS (master)
| B | | X | <-- points here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B had come back before A, it could have been re-slaved to X
● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID events into B (a sketch of what gets injected follows this slide)
19
+-/+
| A |
+/-+
+---+ +---+ DNS (master)
| B | | X | <-- points here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
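
The slide above says that A, once back up, injected heartbeat and pseudo-GTID events into B (B was still replicating from A). For readers unfamiliar with it, a pseudo-GTID is simply a uniquely identifiable statement written periodically on a master, so that Orchestrator can locate equivalent positions in the binary logs of all servers without real GTIDs. A minimal sketch of such an injector follows; schema, table and credentials are placeholders, and pymysql is just an assumed driver.

import time
import uuid

import pymysql  # assumed MySQL driver; any DB-API driver would do the same job


def inject_pseudo_gtid(master_host, interval_seconds=5):
    """Periodically write a unique marker on the (supposed) master.
    Each statement, with its unique literal, ends up in every downstream
    binary log, which is what lets Orchestrator match positions across servers."""
    conn = pymysql.connect(host=master_host, user="meta", password="secret",
                           database="meta", autocommit=True)
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS pseudo_gtid ("
                    " anchor INT UNSIGNED NOT NULL PRIMARY KEY,"
                    " marker VARCHAR(64) NOT NULL)")
        while True:
            # Single-row table: cheap to write, and every REPLACE carries a new
            # unique value identifying "this point in time" in the binlog.
            cur.execute("REPLACE INTO pseudo_gtid (anchor, marker) VALUES (1, %s)",
                        (uuid.uuid4().hex,))
            time.sleep(interval_seconds)

Heartbeat injection (pt-heartbeat, for example) works the same way: a periodic write on the master. That is why a wrongly revived A "poisoned" B with writes that X never saw, and why B could no longer simply be re-slaved to X.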
Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B had come back before A, it could have been re-slaved to X
● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID events into B
● Then B could have been re-cloned without problems
20
+---+
| A |
+---+
|
+---+ +---+ DNS (master)
| B | | X | <-- points here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B had come back before A, it could have been re-slaved to X
● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID events into B
● Then B could have been re-cloned without problems
● But A was re-cloned instead (human error #1)
21
+---+
| A |
+---+
+-/+ +---+ DNS (master)
| B | | X | <-- points here
+/-+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Split-brain: analysis’
● What made A fail ?
● Once A and B came back up as a new replication chain, they had outdated data
● If B had come back before A, it could have been re-slaved to X
● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID events into B
● Then B could have been re-cloned without problems
● But A was re-cloned instead (human error #1)
● Why did Orchestrator not fail over right away ?
● B was promoted hours after A was brought down…
● Because A was downtimed for only 4 hours
(human error #2)
22
+-/+
| A |
+/-+
+---+ +---+ DNS (master)
| B | | X | <-- points here
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
Orchestrator anti-flapping
● Orchestrator has a failover throttling/acknowledgment mechanism[1]:
● Automated recovery will happen
● for an instance in a cluster that has not recently been recovered
● unless such recent recoveries were acknowledged.
● In our case:
● the recovery might have been acknowledged too early (human error #0 ?)
● or the “recently” timeout might have been too short
● and maybe Orchestrator should not have failed over the second time
[1]: https://github.com/github/orchestrator/blob/master/docs/topology-recovery.md#blocking-acknowledgments-anti-flapping
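
Restating the quoted anti-flapping rule as code may make it easier to see where things went wrong; this is only a sketch of the decision, not Orchestrator's actual implementation or configuration.

from datetime import datetime, timedelta


def should_auto_recover(last_recovery_at, last_recovery_acknowledged,
                        block_period=timedelta(hours=1), now=None):
    """Recover automatically only if the cluster has no *recent* recovery,
    or if that recent recovery was acknowledged by a human.
    block_period is illustrative; the real value is an Orchestrator setting."""
    now = now or datetime.utcnow()
    if last_recovery_at is None:
        return True                      # never recovered before: go ahead
    if now - last_recovery_at > block_period:
        return True                      # the previous recovery is no longer "recent"
    return last_recovery_acknowledged    # recent recovery: only if acknowledged

Read this way, either the early acknowledgment or a too-short block period is enough to re-arm automated failover, which is how the second failover (to B) became possible.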
Split brain: summary
● So in summary, this disaster was caused by:
1. A fancy failure: 2 servers failing in the same data center at the same time
2. A debatable premature acknowledgment in Orchestrator
and probably too short a timeout for recent failover
3. Edge-case recovery: both servers forming a new replication topology
4. Re-cloning of the wrong server (A instead of B)
5. Too short a downtime for the re-cloning
6. Orchestrator failing over something that it should not have
7. DNS repointing script not defensive enough
24
Fancy failure: more details
● Why did A and B fail at the same time ?
● Deployment error: the two servers in the same rack/failure domain ?
● And/or very unlucky ?
● Very unlucky because…
10 to 20 servers failed that day in the same data center
(because of human operations and sensitive hardware)
25
DNS (master) +-/+
points here ---> | A |
but now accesses +/-+
are failing
+-/+ +---+
| B | | X |
+/-+ +---+
|
+---+
| Y |
+---+
How to fix such a situation ?
● Fixing non-intersecting data on B and X is hard.
● Some solutions are:
● Kill B or X (and lose data)
● Replay writes from B on X (manually or with replication)
● But AUTO_INCREMENT values are in the way:
● values up to i were used on A before the 1st failover
● values i-n to j1 were used on X after the recovery
● values i to j2 were used on B after the 2nd failover
⇒ so the ranges used on X and on B overlap (a small worked example follows this slide)
26
+-/+
| A |
+/-+
DNS (master) +---+ +---+
points here --> | B | | X |
+---+ +---+
|
+---+ Reads
| Y | <-- happen here
+---+
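
A tiny worked example, with invented numbers, of why the overlapping AUTO_INCREMENT ranges hurt: the same primary-key values end up holding different rows on X and on B, so the writes of one side cannot simply be replayed on the other.

# Invented numbers: i = 1000, n = 10, j1 = 1500, j2 = 1200.
ids_on_X = set(range(991, 1501))   # X was ~n rows behind A, then wrote up to j1 as the new master
ids_on_B = set(range(1001, 1201))  # B resumed from ~i after DNS was stolen, then wrote up to j2

conflicts = ids_on_X & ids_on_B    # ids present on both sides, with (likely) different row content
print(sorted(conflicts)[:3], "...", len(conflicts), "conflicting ids")

# Replaying B's writes on X (the second option above) hits a duplicate-key error
# for every one of these ids, and each collision needs a per-row decision:
# neither "keep X's row" nor "keep B's row" is always the right answer.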
Takeaway
● Twisted situations happen
● Automation (including failover) is not simple:
⇒ code automation scripts defensively
● Be mindful of premature acknowledgments
● Downtime servers for longer rather than shorter
● Shut down slaves first
● Try something other than AUTO_INCREMENTs for PKs
(monotonically increasing UUIDs[1] [2] ? see the sketch below)
[1]: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
[2]: http://mysql.rjweb.org/doc.php/uuid
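
The "monotonically increasing UUID" idea from [1] can be sketched as follows (a hedged illustration, not the exact code from the post): take a time-based UUIDv1 and move its high time bits to the front, so that consecutive values sort, and therefore index, in roughly insert order; store the result in BINARY(16) rather than CHAR(36).

import uuid


def ordered_uuid_hex():
    """Rearranged UUIDv1: time_hi+version, then time_mid, then time_low,
    then clock sequence and node. Values generated on one host are roughly
    monotonically increasing, which keeps InnoDB primary-key inserts append-only,
    and the node part avoids the id collisions seen with AUTO_INCREMENT in a split brain."""
    h = uuid.uuid1().hex   # 32 hex chars: time_low | time_mid | time_hi | clock_seq | node
    return h[12:16] + h[8:12] + h[0:8] + h[16:]


# Two values generated back to back on one host compare in generation order:
a, b = ordered_uuid_hex(), ordered_uuid_hex()
print(a < b)   # True (barring the clock stepping backwards)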
Improvements in Orchestrator
● Orchestrator failed over something that it should not have
● Should Orchestrator be changed ?
● I do not know…
● Not easy to define what should be changed
● Suggestions welcome
https://github.com/github/orchestrator/issues
28
Links
● Booking.com Blog: https://blog.booking.com/
● GitHub Orchestrator: https://github.com/github/orchestrator/
● UUID as Primary Key:
● https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
● http://mysql.rjweb.org/doc.php/uuid/
● Myself:
● https://jfg-mysql.blogspot.com/
● https://www.linkedin.com/in/jfg956/
29
Thanks
Jean-François Gagné
jeanfrancois DOT gagne AT booking.com