Distributed Data Systems ©2016 LinkedIn Corporation. All Rights Reserved.
ESPRESSO Database Replication with Kafka
Tom Quiggle
Principal Staff Software Engineer
tquiggle@linkedin.com
www.linkedin.com/in/tquiggle
@TomQuiggle
 ESPRESSO Overview
– Architecture
– GTIDs and SCNs
– Per-instance replication (0.8)
– Per-partition replication (1.0)
 Kafka Per-Partition Replication
– Requirements
– Kafka Configuration
– Message Protocol
– Producer
– Consumer
 Q&A
Agenda
ESPRESSO Overview
ESPRESSO Database Replication with Kafka
 Hosted, scalable Data as a Service (DaaS) for LinkedIn’s online
structured data needs
 Databases are partitioned
 Partitions distributed across available hardware
 HTTP proxy routes requests to appropriate database node
 Apache Helix provides centralized cluster management
ESPRESSO¹
1. Elastic, Scalable, Performant, Reliable, Extensible, Stable, Speedy and Operational
ESPRESSO Architecture
[Diagram: HTTP clients send requests to a tier of Routers. Each Router consults a routing table and forwards the request to the appropriate Storage Node; every Storage Node runs an API Server on top of MySQL. Apache Helix, backed by ZooKeeper, provides cluster management and publishes the routing table. Data and control paths are shown separately.]
GTIDs and SCNs
MySQL 5.6 Global Transaction Identifier
 Unique, monotonically increasing identifier for each committed transaction
 GTID :== source_id:transaction_id
 ESPRESSO conventions
– source_id encodes the database name and partition number
– transaction_id is a 64-bit numeric value
 High-order 32 bits are the generation count
 Low-order 32 bits are the sequence within the generation
– The generation increments with every change in mastership
– The sequence increases with each transaction
– We refer to the transaction_id component as a Sequence Commit Number (SCN)
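The transaction_id layout above can be sketched in a few lines. This is a minimal illustration of the bit packing the slide describes; the helper names are hypothetical, not from the ESPRESSO codebase:

```python
def make_scn(generation: int, sequence: int) -> int:
    """Pack (generation, sequence) into a 64-bit transaction_id:
    high-order 32 bits = generation, low-order 32 bits = sequence."""
    assert 0 <= generation < 2**32 and 0 <= sequence < 2**32
    return (generation << 32) | sequence

def split_scn(scn: int) -> tuple:
    """Recover (generation, sequence) from a transaction_id."""
    return scn >> 32, scn & 0xFFFFFFFF

# Example: generation 3, sequence 104 -- written as 3:104 on later slides.
scn = make_scn(3, 104)
assert split_scn(scn) == (3, 104)
```

Later slides abbreviate an SCN as gen:seq, e.g. 3:104 means generation 3, sequence 104.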
GTIDs and SCNs
Example binlog transaction:
SET @@SESSION.GTID_NEXT= 'hash(db_part):(gen<<32 + seq)';
SET TIMESTAMP=<seconds_since_Unix_epoch>
BEGIN
Table_map: `db_part`.`table1` mapped to number 1234
Update_rows: table id 1234
BINLOG '...'
BINLOG '...'
Table_map: `db_part`.`table2` mapped to number 5678
Update_rows: table id 5678
BINLOG '...'
COMMIT
ESPRESSO: 0.8 Per-Instance Replication
[Diagram, repeated across three build slides: two replica groups of three storage nodes each. Within a group, one node masters all of the group's partitions (e.g. P1 P2 P3) while the other two nodes hold slave copies of the same partitions via native MySQL replication. Legend: Master / Slave / Offline.]
Issues with Per-Instance Replication
 Poor resource utilization – only 1/3 of nodes service application requests
 Partitions unnecessarily share fate
 Cluster expansion is an arduous process
 Upon node failure, 100% of the traffic is redirected to one node
ESPRESSO: 1.0 Per-Partition Replication
Per-Instance MySQL replication is replaced with Per-Partition replication through Kafka
[Diagram: Helix tracks LIVEINSTANCES (Node 1, Node 2, Node 3) and an EXTERNALVIEW mapping each partition to its master and slave nodes (e.g. P4: Master Node 1, Slave Node 3). Each node masters four partitions and slaves four others, and all replication traffic flows through Kafka.]
Cluster Expansion
Initial state with 12 partitions, 3 storage nodes, r=2
[Diagram: each node masters four partitions and slaves four others (e.g. P4: Master Node 1, Slave Node 3). Legend: Master / Slave / Offline.]
Cluster Expansion
Adding a node: Helix sends OfflineToSlave transitions for the new node's partitions
[Diagram: Node 4 appears in LIVEINSTANCES and begins bootstrapping its assigned partitions, which start in the Offline state (e.g. P4: Master Node 1, Slave Node 3, Offline Node 4).]
Cluster Expansion
Once a new partition replica is caught up, ownership is transferred and the old replica dropped
[Diagram: mastership of P4 moves to Node 4 (P4: Master Node 4, Slave Node 3) and Node 1 drops its copy.]
Cluster Expansion
Migration of master and slave partitions continues
[Diagram: further replicas migrate to Node 4 (e.g. P9: Master Node 3, Slave Node 1, Offline Node 4).]
Cluster Expansion
Rebalancing is complete after the last partition migration
[Diagram: the 12 partitions are evenly distributed across the four nodes (e.g. P9: Master Node 4, Slave Node 3).]
Node Failover
During failure or planned maintenance, promote slaves to master
[Diagram, two build slides: Node 3 disappears from LIVEINSTANCES; for each partition it hosted, a surviving replica is promoted and the failed node's copy is marked Offline (e.g. P9 goes from Master Node 4, Slave Node 3 to Master Node 4, Offline Node 3).]
Advantages of Per-Partition Replication
 Better hardware utilization
– All nodes service application requests
 Mastership hand-off done in parallel
 After node failure, can restore full replication factor in parallel
 Cluster expansion is as easy as:
– Add node(s) to cluster
– Rebalance
 Single platform for all Change Data Capture
– Internal replication
– Cross-colo replication
– Application CDC consumers
Kafka Per-Partition Replication
ESPRESSO Database Replication with Kafka
Kafka for Internal Replication
[Diagram: a client PUT/POST arrives at the master Storage Node, whose API Server executes SQL INSERT/UPDATE against local MySQL. Open Replicator tails the MySQL binlog, and a Kafka Producer publishes each binlog event as a Kafka Message to the partition's Kafka topic. On the slave Storage Node, a Kafka Consumer reads the messages and applies the same SQL INSERT/UPDATE to its local MySQL.]
Requirements
Delivery Must Be:
 Guaranteed
 In-Order
 Exactly Once (sort of)
Broker Configuration
 Replication factor = 3
(most LinkedIn clusters use 2)
 min.insync.replicas = 2
 Disable unclean leader elections
(unclean.leader.election.enable = false)
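The bullets above map onto concrete Kafka settings. A sketch of the corresponding topic/broker configuration, using the property names from Kafka's documentation and the values stated on the slide:

```properties
# Replication factor is set per topic at creation time, e.g.
#   kafka-topics.sh --create ... --replication-factor 3
min.insync.replicas=2
unclean.leader.election.enable=false
```

With acks=all on the producer, min.insync.replicas=2 means a write is acknowledged only after at least two replicas have it; disabling unclean leader election prevents an out-of-sync replica from becoming leader and silently discarding acknowledged writes.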
Message Protocol
B – Begin txn, E – End txn, C – Control
[Diagram: the master's producer writes each transaction's messages, tagged with SCN and flags, into the Kafka partition for DB_0 (e.g. 3:100 B,E; 3:101 B,E; 3:102 B; 3:102; 3:102 E; 3:103 B,E; 3:104 B; 3:104), while the slave's consumer applies them to its local MySQL. A single-message transaction carries both B and E; a multi-message transaction opens with B and closes with E.]
Message Protocol – Mastership Handoff
[Diagram: on promotion, the slave's producer appends a control message, 4:0 C, announcing the new generation 4 to the partition.]
Message Protocol – Mastership Handoff
[Diagram: the promoted slave continues consuming until it reads back its own 4:0 C control message, which guarantees it has applied every message the old master produced before the handoff.]
Message Protocol – Mastership Handoff
[Diagram: the promoted slave then enables writes under the new generation, producing 4:0 B as its first transaction.]
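The handoff sequence on these slides can be sketched as follows. The Partition class and handoff function are minimal in-memory stand-ins for illustration, not LinkedIn's actual producer/consumer classes:

```python
from collections import deque

class Partition:
    """Toy stand-in for a Kafka partition: an ordered log of messages."""
    def __init__(self):
        self.log = deque()

def handoff(partition: Partition, new_generation: int, apply_fn):
    """Sketch of the slide's protocol: publish a control message (gen:0, C),
    keep consuming and applying until our own control message comes back
    (so every message the old master produced has been applied), then
    report that writes may be enabled under the new generation."""
    control = (new_generation << 32, "C")
    partition.log.append(control)          # produce the control message
    while True:
        msg = partition.log.popleft()      # consume the next message
        if msg == control:
            return True                    # caught up: enable writes
        apply_fn(msg)                      # apply the old master's message
```

For example, a promoted slave that still has 3:103 and 3:104 unapplied in the partition will apply both before its own 4:0 C message returns.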
Kafka Producer Configuration
 acks = "all"
 retries = Integer.MAX_VALUE
 block.on.buffer.full = true
 max.in.flight.requests.per.connection = 1
 linger.ms = 0
 On a non-retryable exception:
– destroy the producer
– create a new producer
– resume from the last checkpoint
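The settings above can be collected into a config dict in the style of the Java producer's property names. This is an illustrative sketch; note that block.on.buffer.full was a real but since-deprecated setting (newer clients use max.block.ms for the same purpose):

```python
# Producer settings from the slide, keyed by the Java producer's
# property names. Values chosen for guaranteed, in-order delivery.
ESPRESSO_PRODUCER_CONFIG = {
    "acks": "all",                               # wait for all in-sync replicas
    "retries": 2**31 - 1,                        # Integer.MAX_VALUE
    "block.on.buffer.full": True,                # never drop on a full buffer
    "max.in.flight.requests.per.connection": 1,  # preserve order across retries
    "linger.ms": 0,                              # send immediately, no batching delay
}
```

Capping in-flight requests at 1 is what keeps retries from reordering messages; together with acks=all and unbounded retries, this trades throughput for the guaranteed, in-order delivery the requirements slide demands.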
Kafka Producer Checkpointing
Periodically writes (SCN, Kafka Offset) to a MySQL table
May only checkpoint an offset at the end of a complete transaction!
[Diagram: with 3:104 B sent but the transaction's E not yet produced, the producer must not checkpoint ("Can't Checkpoint Here").]
Kafka Producer Checkpointing
The producer checkpoint will lag the producer's current Kafka offset
The Kafka offset is obtained from the producer callback
[Diagram: the last checkpoint sits at the most recent transaction boundary, after 3:103 B,E, behind the in-flight 3:104 messages.]
Kafka Producer Checkpointing
[Diagram: a send() fails mid-transaction, after 3:104 B; the last checkpoint still points at the previous transaction boundary.]
Kafka Producer Checkpointing
Recreate the producer and resume from the last checkpoint
Messages will be replayed
[Diagram: the producer restarts at the checkpoint and re-sends everything from 3:102 B onward.]
Kafka Producer Checkpointing
The Kafka stream now contains replayed transactions (possibly including partial transactions)
[Diagram: the partition holds the original 3:102 … 3:104 B messages followed by the replayed 3:102 … 3:104 sequence; the producer may checkpoint again at the next complete transaction boundary ("Can Checkpoint Here").]
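The checkpointing rule these slides illustrate, namely record (SCN, Kafka offset) only when the last acknowledged message ends a transaction, can be sketched as below. The class and method names are hypothetical, and the in-memory fields stand in for the MySQL checkpoint table:

```python
class ProducerCheckpointer:
    """Sketch of the slides' rule: the producer may persist a
    (SCN, Kafka offset) checkpoint only at the end of a complete
    transaction, never while one is partially sent. Offsets arrive
    via the producer's send callback."""

    def __init__(self):
        self.safe_point = None   # last (scn, offset) at a txn boundary
        self.checkpoint = None   # last persisted (scn, offset)

    def on_ack(self, scn, offset, ends_txn):
        # Called from the Kafka producer callback for each acked message.
        # Only a message carrying the E flag advances the safe point.
        if ends_txn:
            self.safe_point = (scn, offset)

    def maybe_checkpoint(self):
        # Called periodically; persists only transaction-aligned points,
        # so the checkpoint lags the current offset mid-transaction.
        if self.safe_point is not None:
            self.checkpoint = self.safe_point
        return self.checkpoint
```

On failure, the producer is destroyed, recreated, and resumed from self.checkpoint, which is why replayed (and possibly partial) transactions can appear in the stream.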
Kafka Consumer
 Uses the Kafka low-level consumer
 Consumes the Kafka partitions slaved on the node
[Diagram: a single consumer thread (EspressoKafkaConsumer) polls Partitions 1–3 from the Kafka brokers and hands messages to per-partition EspressoReplicationApplier threads, which apply them to the local MySQL.]
Kafka Consumer
The slave updates its (SCN, Kafka Offset) row for every committed transaction
[Diagram: after committing transaction 3:101, the slave's row reads 3:101@2, i.e. SCN 3:101 at Kafka offset 2.]
Kafka Consumer
The consumer only applies messages with an SCN greater than the last committed
[Diagram: with the slave's row at 3:103@6, the consumer reaches the original 3:104 B and begins transaction 3:104, just before the replayed messages arrive.]
Kafka Consumer
The incomplete transaction is rolled back
[Diagram: when the replayed messages (starting with 3:102 B) arrive, the consumer rolls back the in-flight 3:104 transaction.]
Kafka Consumer
The consumer only applies messages with an SCN greater than the last committed
[Diagram: with the row still at 3:103@6, the replayed 3:102 … 3:103 messages are skipped.]
Kafka Consumer
[Diagram: the consumer begins transaction 3:104 again from the replayed messages and, on 3:104 E, commits it, so the transaction is applied exactly once.]
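The consumer behavior of the last few slides (roll back an incomplete transaction when a replay arrives, skip replayed SCNs at or below the last committed one, and re-apply the rest) can be sketched as follows. This is an illustrative model with hypothetical names; sequence numbers are shown as plain integers for readability:

```python
class ReplicationApplier:
    """Sketch of exactly-once apply: buffer messages into transactions
    using the B/E flags and filter replays by SCN. `committed_scn`
    stands in for the slave's (SCN, Kafka offset) row in MySQL."""

    def __init__(self, last_committed_scn=0):
        self.committed_scn = last_committed_scn
        self.pending = []          # rows of the in-flight transaction
        self.applied = []          # committed rows, for illustration

    def on_message(self, scn, flags, row=None):
        if scn <= self.committed_scn:
            self.pending = []                  # replay: roll back partial txn
            return "skip"                      # already applied, filter it
        if "B" in flags:
            self.pending = []                  # new txn begins: drop stale partial
        self.pending.append((scn, row))
        if "E" in flags:
            self.applied.extend(self.pending)  # commit the whole transaction
            self.pending = []
            self.committed_scn = scn
            return "commit"
        return "buffer"
```

Replaying the slides' scenario: with the row at SCN 103, the partial transaction 104 is rolled back when replayed message 102 arrives, the replayed 102 and 103 are skipped, and 104 is applied exactly once when its E flag finally arrives.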
Zombie Write Filtering
 What if a stalled master continues writing after the transition?
Zombie Write Filtering
[Diagram: the master stalls mid-transaction, after producing 3:104 B ("Master Stalled").]
Zombie Write Filtering
Helix sends a SlaveToMaster transition to one of the slaves
[Diagram: the promoted slave writes the 4:0 C control message while the old master is still stalled.]
Zombie Write Filtering
The slave becomes master and starts taking writes
[Diagram: the new master produces 4:1 B,E; 4:2 B; 4:2 E under generation 4, while the old master remains stalled.]
Zombie Write Filtering
The stalled master resumes and sends its remaining binlog entries to Kafka
[Diagram: the zombie messages 3:104 E and 3:105 B,E land in the partition after the generation-4 messages.]
Zombie Write Filtering
The former master goes into the ERROR state
Zombie writes are filtered by all consumers using the increasing-SCN rule
[Diagram: consumers drop 3:104 E and 3:105 B,E because their SCNs are below the already-seen 4:0; the new master continues with 4:3 B,E.]
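The increasing-SCN rule is easy to state in code: because the generation occupies the high-order bits of the SCN, once any message from generation 4 has been seen, every straggler from generation 3 compares lower and is dropped. A minimal sketch, with hypothetical names:

```python
def make_zombie_filter():
    """Sketch of the increasing-SCN rule: accept a message only if its
    64-bit SCN (high bits = generation, low bits = sequence) is greater
    than the highest SCN seen so far. A control message at gen:0 is
    enough to fence off every message from any earlier generation."""
    state = {"max_scn": -1}

    def accept(scn):
        if scn <= state["max_scn"]:
            return False              # zombie write (or replayed message)
        state["max_scn"] = scn
        return True

    return accept
```

Replaying the slides' scenario: after the 4:0 control message is observed, the stalled master's late 3:104 E and 3:105 B,E compare below it and are filtered by every consumer.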
Current Status
ESPRESSO Database Replication with Kafka
ESPRESSO Kafka Replication: Current Status
 Pre-Production integration environment migrated to Kafka replication
 8 production clusters migrated (as of 4/11)
 Migration will continue through Q3 of 2016
 Average replication latency < 90ms
Conclusions
 Configure Kafka for reliable, at-least-once delivery. See:
http://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
 Carefully control producer and consumer checkpoints along transaction boundaries
 Embed sequence information in the message stream to implement exactly-once application of messages
Even our workspace is Horizontally Scalable!
More Related Content

What's hot (20)

PDF
サーバPUSHざっくりまとめ
Yasuhiro Mawarimichi
 
PDF
Splunk Data Onboarding Overview - Splunk Data Collection Architecture
Splunk
 
PDF
AWS X-Rayによるアプリケーションの分析とデバッグ
Amazon Web Services Japan
 
PDF
ネットワークエンジニア的Ansibleの始め方
akira6592
 
PDF
Storage and Alfresco
Toni de la Fuente
 
PDF
Serf / Consul 入門 ~仕事を楽しくしよう~
Masahito Zembutsu
 
PDF
serverspecでサーバ環境のテストを書いてみよう
Daisuke Ikeda
 
PDF
俺とHashiCorp
Toru Makabe
 
PDF
脆弱性ハンドリングと耐える設計 -Vulnerability Response-
Tomohiro Nakashima
 
PPTX
MySQL Monitoring using Prometheus & Grafana
YoungHeon (Roy) Kim
 
PDF
Ingress on Azure Kubernetes Service
Toru Makabe
 
PDF
[D36] Michael Stonebrakerが生み出した列指向データベースは何が凄いのか? ~Verticaを例に列指向データベースのアーキテクチャ...
Insight Technology, Inc.
 
PPTX
Grafana
NoelMc Grath
 
PDF
GraphQL入門 (AWS AppSync)
Amazon Web Services Japan
 
PDF
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
confluent
 
PDF
MHA for MySQLとDeNAのオープンソースの話
Yoshinori Matsunobu
 
PDF
個人データ連携から見えるSociety5.0~法令対応に向けた技術的な活用事例について~
Scalar, Inc.
 
PDF
ゲームアーキテクチャパターン (Aurora Serverless / DynamoDB)
Amazon Web Services Japan
 
PDF
実環境にTerraform導入したら驚いた
Akihiro Kuwano
 
ODP
Introduction to OData
Mindfire Solutions
 
サーバPUSHざっくりまとめ
Yasuhiro Mawarimichi
 
Splunk Data Onboarding Overview - Splunk Data Collection Architecture
Splunk
 
AWS X-Rayによるアプリケーションの分析とデバッグ
Amazon Web Services Japan
 
ネットワークエンジニア的Ansibleの始め方
akira6592
 
Storage and Alfresco
Toni de la Fuente
 
Serf / Consul 入門 ~仕事を楽しくしよう~
Masahito Zembutsu
 
serverspecでサーバ環境のテストを書いてみよう
Daisuke Ikeda
 
俺とHashiCorp
Toru Makabe
 
脆弱性ハンドリングと耐える設計 -Vulnerability Response-
Tomohiro Nakashima
 
MySQL Monitoring using Prometheus & Grafana
YoungHeon (Roy) Kim
 
Ingress on Azure Kubernetes Service
Toru Makabe
 
[D36] Michael Stonebrakerが生み出した列指向データベースは何が凄いのか? ~Verticaを例に列指向データベースのアーキテクチャ...
Insight Technology, Inc.
 
Grafana
NoelMc Grath
 
GraphQL入門 (AWS AppSync)
Amazon Web Services Japan
 
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
confluent
 
MHA for MySQLとDeNAのオープンソースの話
Yoshinori Matsunobu
 
個人データ連携から見えるSociety5.0~法令対応に向けた技術的な活用事例について~
Scalar, Inc.
 
ゲームアーキテクチャパターン (Aurora Serverless / DynamoDB)
Amazon Web Services Japan
 
実環境にTerraform導入したら驚いた
Akihiro Kuwano
 
Introduction to OData
Mindfire Solutions
 

Viewers also liked (20)

PDF
Stream Processing with Kafka in Uber, Danny Yuan
confluent
 
PPTX
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
confluent
 
PPTX
Never at Rest - IoT and Data Streaming at British Gas Connected Homes, Paul M...
confluent
 
PDF
Kafka and Stream Processing, Taking Analytics Real-time, Mike Spicer
confluent
 
PDF
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
PPTX
The Rise of Real Time
confluent
 
PDF
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
confluent
 
PDF
Securing Kafka
confluent
 
PPTX
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
confluent
 
PDF
The Enterprise Service Bus is Dead! Long live the Enterprise Service Bus, Rim...
confluent
 
PDF
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
confluent
 
PDF
Ingesting Healthcare Data, Micah Whitacre
confluent
 
PPTX
Towards A Stream Centered Enterprise, Gabriel Commeau
confluent
 
PPTX
Kafka, the "DialTone for Data": Building a self-service, scalable, streaming ...
confluent
 
PDF
Kafka, Killer of Point-to-Point Integrations, Lucian Lita
confluent
 
PPTX
Databus - LinkedIn's Change Data Capture Pipeline
Sunil Nagaraj
 
PPTX
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
confluent
 
PPTX
Building an Event-oriented Data Platform with Kafka, Eric Sammer
confluent
 
PDF
A Practical Guide to Selecting a Stream Processing Technology
confluent
 
PDF
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
confluent
 
Stream Processing with Kafka in Uber, Danny Yuan
confluent
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
confluent
 
Never at Rest - IoT and Data Streaming at British Gas Connected Homes, Paul M...
confluent
 
Kafka and Stream Processing, Taking Analytics Real-time, Mike Spicer
confluent
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
The Rise of Real Time
confluent
 
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
confluent
 
Securing Kafka
confluent
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
confluent
 
The Enterprise Service Bus is Dead! Long live the Enterprise Service Bus, Rim...
confluent
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
confluent
 
Ingesting Healthcare Data, Micah Whitacre
confluent
 
Towards A Stream Centered Enterprise, Gabriel Commeau
confluent
 
Kafka, the "DialTone for Data": Building a self-service, scalable, streaming ...
confluent
 
Kafka, Killer of Point-to-Point Integrations, Lucian Lita
confluent
 
Databus - LinkedIn's Change Data Capture Pipeline
Sunil Nagaraj
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
confluent
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
confluent
 
A Practical Guide to Selecting a Stream Processing Technology
confluent
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
confluent
 
Ad

Similar to Espresso Database Replication with Kafka, Tom Quiggle (20)

PDF
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Amy W. Tang
 
PDF
OpenShift Multicluster
Juan Vicente Herrera Ruiz de Alejo
 
PPT
Ogf2008 Grid Data Caching
Jags Ramnarayan
 
PDF
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
Equnix Business Solutions
 
PPT
How an Enterprise Data Fabric (EDF) can improve resiliency and performance
gojkoadzic
 
PDF
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
PPTX
Using Kafka to scale database replication
Venu Ryali
 
PDF
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
PGConf APAC
 
PDF
Basics of the Highly Available Distributed Databases - teowaki - javier ramir...
javier ramirez
 
PDF
Everything you always wanted to know about highly available distributed datab...
Codemotion
 
PDF
Successful Architectures for Fast Data
Patrick McFadin
 
PDF
Crossing the Streams Mesos &lt;> Kubernetes
Timothy St. Clair
 
PDF
Intro to Databases
Sargun Dhillon
 
PDF
Scaling distributed data systems: A LinkedIn Case study
Sai Kiran Kanuri
 
PPTX
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking VN
 
PDF
OrientDB distributed architecture 1.1
Luca Garulli
 
PDF
Highly available distributed databases, how they work, javier ramirez at teowaki
javier ramirez
 
PDF
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
HostedbyConfluent
 
PPT
Handling Data in Mega Scale Web Systems
Vineet Gupta
 
PPTX
OpenEBS hangout #4
OpenEBS
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Amy W. Tang
 
OpenShift Multicluster
Juan Vicente Herrera Ruiz de Alejo
 
Ogf2008 Grid Data Caching
Jags Ramnarayan
 
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
Equnix Business Solutions
 
How an Enterprise Data Fabric (EDF) can improve resiliency and performance
gojkoadzic
 
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
Using Kafka to scale database replication
Venu Ryali
 
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
PGConf APAC
 
Basics of the Highly Available Distributed Databases - teowaki - javier ramir...
javier ramirez
 
Everything you always wanted to know about highly available distributed datab...
Codemotion
 
Successful Architectures for Fast Data
Patrick McFadin
 
Crossing the Streams Mesos &lt;> Kubernetes
Timothy St. Clair
 
Intro to Databases
Sargun Dhillon
 
Scaling distributed data systems: A LinkedIn Case study
Sai Kiran Kanuri
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking VN
 
OrientDB distributed architecture 1.1
Luca Garulli
 
Highly available distributed databases, how they work, javier ramirez at teowaki
javier ramirez
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
HostedbyConfluent
 
Handling Data in Mega Scale Web Systems
Vineet Gupta
 
OpenEBS hangout #4
OpenEBS
 
Ad

More from confluent (20)

PDF
Stream Processing Handson Workshop - Flink SQL Hands-on Workshop (Korean)
confluent
 
PPTX
Webinar Think Right - Shift Left - 19-03-2025.pptx
confluent
 
PDF
Migration, backup and restore made easy using Kannika
confluent
 
PDF
Five Things You Need to Know About Data Streaming in 2025
confluent
 
PDF
Data in Motion Tour Seoul 2024 - Keynote
confluent
 
PDF
Data in Motion Tour Seoul 2024 - Roadmap Demo
confluent
 
PDF
From Stream to Screen: Real-Time Data Streaming to Web Frontends with Conflue...
confluent
 
PDF
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
confluent
 
PDF
Data in Motion Tour 2024 Riyadh, Saudi Arabia
confluent
 
PDF
Build a Real-Time Decision Support Application for Financial Market Traders w...
confluent
 
PDF
Strumenti e Strategie di Stream Governance con Confluent Platform
confluent
 
PDF
Compose Gen-AI Apps With Real-Time Data - In Minutes, Not Weeks
confluent
 
PDF
Building Real-Time Gen AI Applications with SingleStore and Confluent
confluent
 
PDF
Unlocking value with event-driven architecture by Confluent
confluent
 
PDF
Il Data Streaming per un’AI real-time di nuova generazione
confluent
 
PDF
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
confluent
 
PDF
Break data silos with real-time connectivity using Confluent Cloud Connectors
confluent
 
PDF
Building API data products on top of your real-time data infrastructure
confluent
 
PDF
Speed Wins: From Kafka to APIs in Minutes
confluent
 
PDF
Evolving Data Governance for the Real-time Streaming and AI Era
confluent
 
Stream Processing Handson Workshop - Flink SQL Hands-on Workshop (Korean)
confluent
 
Webinar Think Right - Shift Left - 19-03-2025.pptx
confluent
 
Migration, backup and restore made easy using Kannika
confluent
 
Five Things You Need to Know About Data Streaming in 2025
confluent
 
Data in Motion Tour Seoul 2024 - Keynote
confluent
 
Data in Motion Tour Seoul 2024 - Roadmap Demo
confluent
 
From Stream to Screen: Real-Time Data Streaming to Web Frontends with Conflue...
confluent
 
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
confluent
 
Data in Motion Tour 2024 Riyadh, Saudi Arabia
confluent
 
Build a Real-Time Decision Support Application for Financial Market Traders w...
confluent
 
Strumenti e Strategie di Stream Governance con Confluent Platform
confluent
 
Compose Gen-AI Apps With Real-Time Data - In Minutes, Not Weeks
confluent
 
Building Real-Time Gen AI Applications with SingleStore and Confluent
confluent
 
Unlocking value with event-driven architecture by Confluent
confluent
 
Il Data Streaming per un’AI real-time di nuova generazione
confluent
 
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
confluent
 
Break data silos with real-time connectivity using Confluent Cloud Connectors
confluent
 
Building API data products on top of your real-time data infrastructure
confluent
 
Speed Wins: From Kafka to APIs in Minutes
confluent
 
Evolving Data Governance for the Real-time Streaming and AI Era
confluent
 

Recently uploaded (20)

PDF
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
PPTX
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
PDF
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PDF
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
PPTX
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
DOCX
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
PPTX
Big Data and Data Science hype .pptx
SUNEEL37
 
PDF
Biomechanics of Gait: Engineering Solutions for Rehabilitation (www.kiu.ac.ug)
publication11
 
PDF
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
PPTX
Worm gear strength and wear calculation as per standard VB Bhandari Databook.
shahveer210504
 
PPTX
Product Development & DevelopmentLecture02.pptx
zeeshanwazir2
 
PPTX
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
PPTX
Presentation 2.pptx AI-powered home security systems Secure-by-design IoT fr...
SoundaryaBC2
 
PPTX
Solar Thermal Energy System Seminar.pptx
Gpc Purapuza
 
PPTX
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
PDF
Halide Perovskites’ Multifunctional Properties: Coordination Engineering, Coo...
TaameBerhe2
 
PPTX
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
PPTX
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
Thermal runway and thermal stability.pptx
godow93766
 
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
Big Data and Data Science hype .pptx
SUNEEL37
 
Biomechanics of Gait: Engineering Solutions for Rehabilitation (www.kiu.ac.ug)
publication11
 
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
Worm gear strength and wear calculation as per standard VB Bhandari Databook.
shahveer210504
 
Product Development & DevelopmentLecture02.pptx
zeeshanwazir2
 
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
Presentation 2.pptx AI-powered home security systems Secure-by-design IoT fr...
SoundaryaBC2
 
Solar Thermal Energy System Seminar.pptx
Gpc Purapuza
 
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
Halide Perovskites’ Multifunctional Properties: Coordination Engineering, Coo...
TaameBerhe2
 
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 

Espresso Database Replication with Kafka, Tom Quiggle

  • 1. Distributed Data Systems 1©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO Database Replication with Kafka Tom Quiggle Principal Staff Software Engineer [email protected] www.linkedin.com/in/tquiggle @TomQuiggle
  • 2. Distributed Data Systems 2©2016 LinkedIn Corporation. All Rights Reserved.  ESPRESSO Overview – Architecture – GTIDs and SCNs – Per-instance replication (0.8) – Per-partition replication (1.0)  Kafka Per-Partition Replication – Requirements – Kafka Configuration – Message Protocol – Producer – Consumer  Q&A Agenda
  • 3. Distributed Data Systems 3©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO Overview ESPRESSO Database Replication with Kafka
  • 4. Distributed Data Systems 4©2016 LinkedIn Corporation. All Rights Reserved.  Hosted, Scalable, Data as a Service (DaaS) for LinkedIn’s Online Structured Data Needs  Databases are partitioned  Partitions distributed across available hardware  HTTP proxy routes requests to appropriate database node  Apache Helix provides centralized cluster management ESPRESSO1 1. Elastic, Scalable, Performant, Reliable, Extensible, Stable, Speedy and Operational
  • 5. Distributed Data Systems 5©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO Architecture Storage Node API Server MySQL Router Router Router Apache Helix ZooKeeper Storage Node API Server MySQL Storage Node API Server MySQL Storage Node API Server MySQL Data Control Routing Table r r r HTTP Client HTTP
  • 6. Distributed Data Systems 6©2016 LinkedIn Corporation. All Rights Reserved. GTIDs and SCNs MySQL 5.6 Global Transaction Identifier  Unique, monotonically increasing, identifier for each transaction committed  GTID :== source_id:transaction_id  ESPRESSO conventions – source_id encodes database name and partition number – transaction_id is a 64 bit numeric value  High Order 32 bits is generation count  Low order 32 bit are sequence within generation – Generation increments with every change in mastership – Sequence increases with each transaction – We refer to a transaction_id component as a Sequence Commit Number (SCN)
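The bit layout described on this slide can be sketched in a few lines of Python (the helper names are illustrative, not from the ESPRESSO codebase):

```python
# Sketch of ESPRESSO's transaction_id layout: the high 32 bits hold the
# mastership generation, the low 32 bits the sequence within that
# generation. Function names are illustrative.

def pack_scn(generation: int, sequence: int) -> int:
    """Combine generation and sequence into a 64-bit transaction_id."""
    assert 0 <= generation < 2**32 and 0 <= sequence < 2**32
    return (generation << 32) | sequence

def unpack_scn(txn_id: int) -> tuple:
    """Split a 64-bit transaction_id back into (generation, sequence)."""
    return txn_id >> 32, txn_id & 0xFFFFFFFF

# A generation bump (mastership change) makes every new SCN compare
# greater than any SCN from the previous generation, even at sequence 0.
old = pack_scn(3, 104)
new = pack_scn(4, 0)
assert new > old
assert unpack_scn(old) == (3, 104)
```

This ordering property is what later makes zombie-write filtering a simple numeric comparison.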
  • 7. Distributed Data Systems 7©2016 LinkedIn Corporation. All Rights Reserved. GTIDs and SCNs Example binlog transaction: SET @@SESSION.GTID_NEXT= 'hash(db_part):(gen<<32 + seq)'; SET TIMESTAMP=<seconds_since_Unix_epoch> BEGIN Table_map: `db_part`.`table1` mapped to number 1234 Update_rows: table id 1234 BINLOG '...' BINLOG '...' Table_map: `db_part`.`table2` mapped to number 5678 Update_rows: table id 5678 BINLOG '...' COMMIT
  • 8. Distributed Data Systems 8©2016 LinkedIn Corporation. All Rights Reserved. Node 1 P1 P2 P3 Node 2 P1 P2 P3 Node 3 P1 P2 P3 Node 1 P4 P5 P6 Node 2 P4 P5 P6 Node 3 P4 P5 P6 ESPRESSO: 0.8 Per-Instance Replication Master Slave Offline
  • 9. Distributed Data Systems 9©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO: 0.8 Per-Instance Replication Node 1 P1 P2 P3 Node 2 P1 P2 P3 Node 3 P1 P2 P3 Node 1 P4 P5 P6 Node 2 P4 P5 P6 Node 3 P4 P5 P6 Master Slave Offline
  • 10. Distributed Data Systems 10©2016 LinkedIn Corporation. All Rights Reserved. Node 1 P1 P2 P3 Node 2 P1 P2 P3 Node 3 P1 P2 P3 Node 1 P4 P5 P6 Node 2 P4 P5 P6 Node 3 P4 P5 P6 Master Slave Offline ESPRESSO: 0.8 Per-Instance Replication
  • 11. Distributed Data Systems 11©2016 LinkedIn Corporation. All Rights Reserved. Issues with Per-Instance Replication  Poor resource utilization – only 1/3 of nodes service application requests  Partitions unnecessarily share fate  Cluster expansion is an arduous process  Upon node failure, 100% of the traffic is redirected to one node
  • 12. Distributed Data Systems 12©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO: 1.0 Per-Partition Replication Per-Instance MySQL replication replaced with Per-Partition Kafka HELIX P4: Master: 1 Slave: 3 … EXTERNALVIEW Node 1 Node 2 Node 3 LIVEINSTANCES Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Kafka
  • 13. Distributed Data Systems 13©2016 LinkedIn Corporation. All Rights Reserved. Cluster Expansion Initial State with 12 partitions, 3 storage nodes, r=2 HELIX EXTERNALVIEW Node 1 Node 2 Node 3 LIVEINSTANCES Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Master Slave Offline P4: Master: 1 Slave: 3 …
  • 14. Distributed Data Systems 14©2016 LinkedIn Corporation. All Rights Reserved. Cluster Expansion Adding Node: Helix Sends OfflineToSlave for new partitions HELIX EXTERNALVIEW Node 1 Node 2 Node 3 Node 4 LIVEINSTANCES Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Node 4 P4 P8 P1 P12 P7 P9 P4: Master: 1 Slave: 3 Offline: 4 …
  • 15. Distributed Data Systems 15©2016 LinkedIn Corporation. All Rights Reserved. Cluster Expansion Once a new partition is ready, transfer ownership and drop old HELIX EXTERNALVIEW Node 1 Node 2 Node 3 Node 4 LIVEINSTANCES Node 1 P1 P2 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Node 4 P4 P8 P1 P12 P7 P9 P4: Master: 4 Slave: 3 …
  • 16. Distributed Data Systems 16©2016 LinkedIn Corporation. All Rights Reserved. Cluster Expansion Continue migration of master and slave partitions HELIX EXTERNALVIEW Node 1 Node 2 Node 3 Node 4 LIVEINSTANCES Node 1 P1 P2 P3 P5 P6 P9 P10 Node 2 P5 P6 P7 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Node 4 P4 P8 P1 P12 P7 P9 P9: Master: 3 Slave: 1 Offline: 4 …
  • 17. Distributed Data Systems 17©2016 LinkedIn Corporation. All Rights Reserved. Cluster Expansion Rebalancing is complete after last partition migration HELIX EXTERNALVIEW Node 1 Node 2 Node 3 Node 4 LIVEINSTANCES Node 1 Node 2 Node 3 Node 4 P4 P8 P1 P12 P7 P9 P5 P6 P2 P7 P11 P12 P9 P10 P3 P11 P4 P8 P1 P2 P5 P3 P6 P10 P9: Master: 4 Slave: 3 …
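The expansion walk-through above is driven entirely by Helix state transitions. A toy model of the legal per-partition moves (a simplification; real Helix uses a declarative state model with replica-count constraints):

```python
# Minimal sketch of the Master/Slave/Offline transitions Helix drives
# during rebalancing. This only encodes the legal moves shown in the
# slides; it is not Helix's actual state model definition.
LEGAL = {
    ("OFFLINE", "SLAVE"),   # OfflineToSlave: new replica catches up
    ("SLAVE", "MASTER"),    # SlaveToMaster: promote once caught up
    ("MASTER", "SLAVE"),    # MasterToSlave: hand off mastership
    ("SLAVE", "OFFLINE"),   # SlaveToOffline: drop the old replica
}

def transition(state: str, target: str) -> str:
    if (state, target) not in LEGAL:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

# Expansion hand-off for one partition: the new node goes
# Offline -> Slave, then is promoted to Master after catch-up.
s = transition("OFFLINE", "SLAVE")
s = transition(s, "MASTER")
assert s == "MASTER"
```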
  • 18. Distributed Data Systems 18©2016 LinkedIn Corporation. All Rights Reserved. Node Failover During failure or planned maintenance, promote slaves to master HELIX EXTERNALVIEW Node 1 Node 2 Node 3 Node 4 LIVEINSTANCES Node 1 Node 2 Node 3 Node 4 P4 P8 P1 P12 P7 P9 P5 P6 P2 P7 P11 P12 P9 P10 P3 P11 P4 P8 P1 P2 P5 P3 P6 P10 P9: Master: 4 Slave: 3 …
  • 19. Distributed Data Systems 19©2016 LinkedIn Corporation. All Rights Reserved. Node Failover During failure or planned maintenance, promote slaves to master HELIX EXTERNALVIEW Node 1 Node 2 Node 4 LIVEINSTANCES Node 1 Node 2 Node 3 Node 4 P4 P8 P1 P12 P7 P9 P5 P6 P2 P7 P11 P12 P9 P10 P3 P11 P4 P8 P1 P2 P5 P3 P6 P10 P9: Master: 4 Offline: 3 …
  • 20. Distributed Data Systems 20©2016 LinkedIn Corporation. All Rights Reserved. Advantages of Per-Partition Replication  Better hardware utilization – All nodes service application requests  Mastership hand-off done in parallel  After node failure, can restore full replication factor in parallel  Cluster expansion is as easy as: – Add node(s) to cluster – Rebalance  Single platform for all Change Data Capture – Internal replication – Cross-colo replication – Application CDC consumers
  • 21. Distributed Data Systems 21©2016 LinkedIn Corporation. All Rights Reserved. Kafka Per-Partition Replication ESPRESSO Database Replication with Kafka
  • 22. Distributed Data Systems 22©2016 LinkedIn Corporation. All Rights Reserved. Kafka for Internal Replication Storage Node MySQL Open Replicator Kafka Producer API Server binlog binlog event Kafka Consumer SQL INSERT..UPDATE SQL INSERT..UPDATE Storage Node MySQL Open Replicator Kafka Producer API Server binlog binlog event Kafka Consumer SQL INSERT..UPDATE SQL INSERT..UPDATE Kafka Partition Kafka Message Kafka Message Client HTTP PUT/POST
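Per the speaker notes, the consumer applies each consumed message as an idempotent MySQL upsert. A sketch of building such a statement (the helper and table names are illustrative, and in practice the values would be bound as parameters rather than inlined):

```python
# Sketch of turning a consumed row event into an idempotent MySQL
# upsert, one statement per message. Illustrative only; not ESPRESSO's
# actual applier code.
def to_upsert(table: str, row: dict) -> str:
    cols = ", ".join(row)
    placeholders = ", ".join(["%s"] * len(row))
    updates = ", ".join(f"{c} = VALUES({c})" for c in row)
    return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
            f"ON DUPLICATE KEY UPDATE {updates}")

sql = to_upsert("db_part.table1", {"id": 1, "name": "x"})
assert sql.startswith("INSERT INTO db_part.table1 (id, name)")
assert "ON DUPLICATE KEY UPDATE" in sql
```

Idempotent statements matter here: the same message may be applied more than once after a producer replay, and the upsert makes the second application a no-op.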
  • 23. Distributed Data Systems 23©2016 LinkedIn Corporation. All Rights Reserved. Requirements Delivery Must Be:  Guaranteed  In-Order  Exactly Once (sort of)
  • 24. Distributed Data Systems 24©2016 LinkedIn Corporation. All Rights Reserved. Broker Configuration  Replication factor = 3 (most LinkedIn clusters use 2)  min.isr=2  Disable unclean leader elections
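The slide's shorthand corresponds to the following Kafka config keys (this mapping is my reading of the slide: "min.isr" is `min.insync.replicas`, and unclean elections are disabled via `unclean.leader.election.enable`):

```python
# The broker/topic settings from the slide, mapped onto actual Kafka
# config key names. Values match the slide; the key names are my
# mapping, so verify them against your Kafka version's docs.
topic_config = {
    "replication.factor": 3,                  # "Replication factor = 3"
    "min.insync.replicas": 2,                 # "min.isr=2"
    "unclean.leader.election.enable": False,  # no unclean elections
}

# With RF=3 and min ISR 2, one broker can be down while acks=all writes
# still succeed, and a lagging replica can never be elected leader.
assert topic_config["min.insync.replicas"] < topic_config["replication.factor"]
```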
  • 25. Distributed Data Systems 25©2016 LinkedIn Corporation. All Rights Reserved. B – Begin txn E – End txn C – Control Message Protocol Master MySQL Producer Consumer Slave MySQL Producer Consumer 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 DB_0: 3:104 E
  • 26. Distributed Data Systems 26©2016 LinkedIn Corporation. All Rights Reserved. Message Protocol – Mastership Handoff Old Master MySQL Producer Consumer Promoted Slave MySQL Producer Consumer 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 DB_0: 3:104 E 4:0 C
  • 27. Distributed Data Systems 27©2016 LinkedIn Corporation. All Rights Reserved. Message Protocol – Mastership Handoff Master MySQL Producer Consumer Promoted Slave MySQL Producer Consumer Consumed own control message 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 DB_0: 3:104 E 4:0 C
  • 28. Distributed Data Systems 28©2016 LinkedIn Corporation. All Rights Reserved. Message Protocol – Mastership Handoff Old Master MySQL Producer Consumer Master MySQL Producer Consumer Enable writes with new generation 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 DB_0: 3:104 E 4:0 C 4:0 B
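The hand-off on slides 26-28 reduces to: the promoted slave publishes a control message proposing the next generation, consumes its own partition until it reads that message back, and only then enables writes. A toy model (names are illustrative, and a deque stands in for the Kafka log):

```python
# Sketch of the mastership hand-off: the promoted slave writes a
# control message ("C") at the new generation and must read it back
# from Kafka before accepting writes; everything before it in the log
# was committed by the old master. Illustrative, not ESPRESSO code.
from collections import deque

log = deque()  # stands in for the partition's Kafka log

def propose_generation(new_gen: int) -> None:
    log.append((new_gen, 0, "C"))  # e.g. the 4:0 C control message

def await_own_control(new_gen: int) -> bool:
    # Consume until we see our own control message.
    while log:
        gen, seq, kind = log.popleft()
        if kind == "C" and gen == new_gen:
            return True  # safe to enable writes at generation new_gen
    return False

propose_generation(4)
assert await_own_control(4)  # first write after this is 4:0 B
```

Reading back its own control message guarantees the new master has consumed every message the old master successfully produced before the hand-off.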
  • 29. Distributed Data Systems 29©2016 LinkedIn Corporation. All Rights Reserved. Kafka Producer Configuration  acks = “all”  retries = Integer.MAX_VALUE  block.on.buffer.full=true  max.in.flight.requests.per.connection=1  linger=0  On non-retryable exception: – destroy producer – create new producer – resume from last checkpoint
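The settings above map onto the Java producer's config keys roughly as follows (a sketch of the slide's shorthand; `block.on.buffer.full` existed in the 0.8/0.9-era producer and was later replaced by `max.block.ms`):

```python
# The producer settings from the slide, expressed as the config map
# you would pass to a 0.9-era KafkaProducer. Keys follow the Java
# client's names; "linger" on the slide is linger.ms.
producer_config = {
    "acks": "all",                               # wait for the full ISR
    "retries": 2**31 - 1,                        # Integer.MAX_VALUE
    "block.on.buffer.full": True,                # back-pressure, never drop
    "max.in.flight.requests.per.connection": 1,  # preserve order on retry
    "linger.ms": 0,                              # send immediately
}

# Ordering requires max.in.flight = 1: with retries enabled and more
# than one request in flight, a failed-then-retried batch could land
# after a later batch, reordering the replication stream.
assert producer_config["max.in.flight.requests.per.connection"] == 1
```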
  • 30. Distributed Data Systems 30©2016 LinkedIn Corporation. All Rights Reserved. Kafka Producer Checkpointing Master MySQL Producer Consumer Slave MySQL Producer Consumer 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 Can’t Checkpoint Here Periodically writes (SCN, Kafka Offset) to MySQL table May only checkpoint offset at end of valid transaction!
  • 31. Distributed Data Systems 31©2016 LinkedIn Corporation. All Rights Reserved. Kafka Producer Checkpointing Master MySQL Producer Consumer Slave MySQL Producer Consumer 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 Producer checkpoint will lag current producer Kafka Offset Kafka Offset obtained from callback Last Checkpoint Here
  • 32. Distributed Data Systems 32©2016 LinkedIn Corporation. All Rights Reserved. Kafka Producer Checkpointing Master MySQL Producer Consumer Slave MySQL Producer Consumer Last Checkpoint Here send() FAILS X 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 3:104
  • 33. Distributed Data Systems 33©2016 LinkedIn Corporation. All Rights Reserved. Kafka Producer Checkpointing Master MySQL Producer Consumer Slave MySQL Producer Consumer Recreate producer and resume from last checkpoint Resume From Checkpoint Messages will be replayed 3:102 B 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104
  • 34. Distributed Data Systems 34©2016 LinkedIn Corporation. All Rights Reserved. Kafka Producer Checkpointing Master MySQL Producer Consumer Slave MySQL Producer Consumer Kafka stream now contains replayed transactions (possibly including partial transactions) Can Checkpoint Here Replayed Messages 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104
  • 35. Distributed Data Systems 35©2016 LinkedIn Corporation. All Rights Reserved. Partition 3 Kafka Consumer  Uses Low Level Consumer  Consume Kafka partitions slaved on node Partition 1 Partition 2 Kafka Broker A Kafka Broker B Kafka Consumer poll() Consumer Thread EspressoKafkaConsumer EspressoReplicationApplier MySQL P1 P2 P3 Applier Threads
  • 36. Distributed Data Systems 36©2016 LinkedIn Corporation. All Rights Reserved. Kafka Consumer Master MySQL Producer Consumer Slave MySQL Producer Consumer 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 Slave updates (SCN, Kafka Offset) row for every committed txn 3:101@2
  • 37. Distributed Data Systems 37©2016 LinkedIn Corporation. All Rights Reserved. Kafka Consumer Master MySQL Producer Consumer Slave MySQL Producer Consumer Client only applies messages with SCN greater than last committed Replayed Messages 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 BEGIN Transaction 3:104 3:103@6
  • 38. Distributed Data Systems 38©2016 LinkedIn Corporation. All Rights Reserved. Kafka Consumer Master MySQL Producer Consumer Slave MySQL Producer Consumer Incomplete transaction is rolled back 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 ROLLBACK 3:104 Replayed Messages 3:103@6
  • 39. Distributed Data Systems 39©2016 LinkedIn Corporation. All Rights Reserved. Kafka Consumer Master MySQL Producer Consumer Slave MySQL Producer Consumer Client only applies messages with SCN greater than last committed 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 3:102 B 3:102 3:102 E 3:104 B 3:104 SKIP 3:102..3:103 Replayed Messages 3:103@6 3:103 B,E
  • 40. Distributed Data Systems 40©2016 LinkedIn Corporation. All Rights Reserved. Kafka Consumer Master MySQL Producer Consumer Slave MySQL Producer Consumer 3:101 B,E 3:102 B 3:102 3:102 E 3:100 B,E 3:103 B,E 3:104 B 3:104 3:102 B 3:102 3:102 E 3:104 B 3:104 Replayed Messages BEGIN 3:104 (again) 3:104 E 3:103@6 3:103 B,E
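The replay-handling rules on slides 36-40 can be condensed into a small sketch (pure Python and illustrative; the real consumer applies SQL transactions against the local MySQL):

```python
# Sketch of the consumer's replay handling: track the last committed
# SCN, treat a new Begin while a txn is open as a ROLLBACK of that
# incomplete txn, and skip any message at or below the watermark.
def apply_stream(messages, last_committed):
    """messages: (scn, begins, ends) tuples; returns SCNs committed."""
    committed, open_scn = [], None
    for scn, begins, ends in messages:
        if begins:
            # If a txn was already open, its End never arrived: the
            # replay implicitly rolls the incomplete txn back.
            open_scn = scn
        if scn <= last_committed:       # replayed message: SKIP
            if ends:
                open_scn = None
            continue
        if ends and open_scn == scn:    # full txn seen: COMMIT
            committed.append(scn)
            last_committed = scn
            open_scn = None
    return committed

# The stream from the slides: 3:104 begins, then the producer replay
# repeats 3:102..3:104. Only 3:104 commits, and exactly once.
stream = [(104, True, False),           # BEGIN 3:104 (later rolled back)
          (102, True, False), (102, False, False), (102, False, True),
          (103, True, True),            # replayed, skipped
          (104, True, False), (104, False, False), (104, False, True)]
assert apply_stream(stream, last_committed=103) == [104]
```

This is the "exactly once (sort of)" from the requirements slide: delivery is at-least-once, but the SCN watermark makes application of transactions exactly-once.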
  • 41. Distributed Data Systems 41©2016 LinkedIn Corporation. All Rights Reserved. Zombie Write Filtering  What if stalled master continues writing after transition?
  • 42. Distributed Data Systems 42©2016 LinkedIn Corporation. All Rights Reserved. Zombie Write Filtering MASTER MySQL Producer Consumer Slave MySQL Producer Consumer 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 Master Stalled
  • 43. Distributed Data Systems 43©2016 LinkedIn Corporation. All Rights Reserved. Zombie Write Filtering Master MySQL Producer Consumer Promoted Slave MySQL Producer Consumer 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 Master Stalled 4:0 C Helix sends SlaveToMaster transition to one of the slaves
  • 44. Distributed Data Systems 44©2016 LinkedIn Corporation. All Rights Reserved. Zombie Write Filtering Master MySQL Producer Consumer New Master MySQL Producer Consumer Master Stalled 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 4:0 C 4:1 B,E 4:2 B 4:2 E Slave becomes master and starts taking writes
  • 45. Distributed Data Systems 45©2016 LinkedIn Corporation. All Rights Reserved. Zombie Write Filtering Master MySQL Producer Consumer New Master MySQL Producer Consumer 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 4:0 C 4:1 B,E 4:2 B 4:2 E 3:104 E 3:105 B,E Stalled Master resumes and sends binlog entries to Kafka
  • 46. Distributed Data Systems 46©2016 LinkedIn Corporation. All Rights Reserved. Zombie Write Filtering ERROR MySQL Producer Consumer New Master MySQL Producer Consumer 3:102 B 3:102 3:102 E 3:103 B,E 3:104 B 3:104 4:0 C 4:1 B,E 4:2 B 4:2 E 3:104 E 3:105 B,E 4:3 B,E Former master goes into ERROR state Zombie writes filtered by all consumers based on increasing SCN rule
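Because the generation occupies the SCN's high 32 bits, the increasing-SCN rule the consumers already apply is enough to drop zombie writes, as this sketch illustrates (illustrative helper, not ESPRESSO code):

```python
# Sketch of zombie write filtering: late writes from a stalled former
# master carry the old generation (3) in the SCN's high bits, so they
# compare below every SCN of the new generation (4) and are dropped by
# the same increasing-SCN rule the consumer uses for replays.
def scn(gen: int, seq: int) -> int:
    return (gen << 32) | seq

last_applied = scn(4, 2)                 # new master has committed 4:2
zombies = [scn(3, 104), scn(3, 105)]     # late writes from old master
accepted = [s for s in zombies if s > last_applied]
assert accepted == []                    # every zombie write filtered
```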
  • 47. Distributed Data Systems 47©2016 LinkedIn Corporation. All Rights Reserved. Current Status ESPRESSO Database Replication with Kafka
  • 48. Distributed Data Systems 48©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO Kafka Replication: Current Status  Pre-Production integration environment migrated to Kafka replication  8 production clusters migrated (as of 4/11)  Migration will continue through Q3 of 2016  Average replication latency < 90ms
  • 49. Distributed Data Systems 49©2016 LinkedIn Corporation. All Rights Reserved. Conclusions  Configure Kafka for reliable, at-least-once delivery. See: http://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844  Carefully control producer and consumer checkpoints along txn boundaries  Embed sequence information in the message stream to implement exactly-once application of messages
  • 50. Distributed Data Systems Even our workspace is Horizontally Scalable!

Editor's Notes

  • #3: The first section is quick and intended to frame the requirements for Kafka internal replication. Seriously, this should take 5 minutes max
  • #5: ESPRESSO is a NoSQL, RESTful, HTTP document store
  • #6: Partition Placement and Replication Helix assigns partitions to nodes Initial deployments (0.8) used MySQL replication between nodes Evolving to (1.0) using Kafka for internal replication
  • #7: A couple of concepts that are key to how Espresso replication with Kafka works.
  • #9: Time to market dictated the 0.8 architecture: intra-cluster replication delegated to MySQL; replication at the instance level; rigid partition placement
  • #12: Graph of 3 hosts in a “slice” One node is performing 500 to 3K qps. The other two are performing exactly zero.
  • #13: Next we’ll explore the reasons for replacing MySQL replication with Kafka.
  • #20: Upon node failure, rather than 1 node getting 100% of the workload for the failed node, each of the surviving nodes gets an increase of 1/num_nodes load
  • #23: All subsequent examples show one partition. This is to simplify the diagrams. The same logic runs for every partition.
  • #24: Sounds like Kafka, right?
  • #26: Let’s look at the “happy path” for replication: each Kafka message contains the SCN of the commit and an indicator of whether it is the beginning and/or end of the transaction. When the consumer sees the first message in a txn, it starts a txn in the local MySQL; each message generates an “INSERT … ON DUPLICATE UPDATE …” statement; when the consumer processes the last message in a txn, it executes a COMMIT statement
  • #27: Old Master stops producing Helix sends SlaveToMaster transition to selected slave for partition Slave emits a control message to propose next generation
  • #28: Once slave has read its own control message, it updates generation in Helix Property Store – if successful, can start accepting writes
  • #29: Once slave has read its own control message, it updates generation in Helix Property Store – if successful, can start accepting writes
  • #32: Periodically writes (SCN, Kafka Offset) to per-partition MySQL table May only checkpoint offset at end of valid transaction
  • #33: Non retryable exception. We destroy the producer and restart from the last checkpoint.
  • #35: Next we will explore how the client handles these replayed messages.
  • #38: Here is the replication stream from our master reconnect example
  • #39: Here is the replication stream from our master reconnect example
  • #40: Here is the replication stream from our master reconnect example
  • #41: Here is the replication stream from our master reconnect example
  • #43: Stall may be due to a Garbage Collection event, a failing disk, a switch glitch, … Here the master is in the middle of a transaction
  • #44: Helix sends a SlaveToMaster transition to one of the slaves
  • #45: Slave becomes master and starts taking writes
  • #47: Helix has revoked mastership Node transitions to ERROR state We have the ability to replay binlogged events back into the top of the stack with Last Writer Wins conflict resolution
  • #49: Latency is measured from the time we send a message to Kafka until it is committed in the slave’s MySQL.