Consensus in Apache Kafka:
From Theory to Production
Guozhang Wang, Jason Gustafson
SIGMOD 2023
01 Kafka’s Control Plane Needs
02 The Quorum Controller: KRaft
03 KRaft Implementation
04 KRaft in Prod (in Cloud)
Apache Kafka: Streaming Platform
• Source-of-truth stream data storage
• De facto programming paradigm for real-time events
• Kafka’s architecture:
  • Data organized as partitioned topics
  • Partitions are replicated & log-structured
  • Clients produce to / consume from topics via sequential log I/O
Distributed Consensus: An Everlasting Tale
• Kafka needs consensus on:
  • Broker metadata
  • Topic metadata
  • Client metadata (offsets, transactions)
  • And of course, the replicated data itself
• Consensus access patterns vary:
  • Control metadata propagation: (relatively) low throughput, strict consistency
  • Data replication: high throughput, low latency
Kafka Circa 2013
• Apache ZooKeeper for metadata
  • Single controller elected to broadcast changes
  • Control operations executed as ZK writes
• Leader-follower replication for data [VLDB 2015]
  • Configurable latency / durability tradeoff
  • Leader (re-)selected from in-sync replicas
[Diagram: Controller, Brokers, ZooKeeper]
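The "leader (re-)selected from in-sync replicas" rule can be illustrated with a minimal sketch. It assumes the controller picks the first replica in assignment order that is both in the in-sync set and alive; the function name and the broker ids are hypothetical and are not taken from the slides.

```python
from typing import Optional, Sequence, Set


def select_leader(assigned: Sequence[int], isr: Set[int], live: Set[int]) -> Optional[int]:
    """Return the first assigned replica that is in-sync and alive, or None."""
    for broker_id in assigned:
        if broker_id in isr and broker_id in live:
            return broker_id
    # No eligible replica: the partition stays offline in this sketch.
    return None


# Hypothetical partition with replicas on brokers 1, 2 and 3.
assignment = [1, 2, 3]
leader_after_failure = select_leader(assignment, isr={1, 2}, live={2, 3})
no_leader = select_leader(assignment, isr={1}, live={2, 3})
```

Restricting the choice to in-sync replicas is what ties leader selection to the durability side of the tradeoff: a replica outside the set may be missing acknowledged writes.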
Challenges for the Cloud Scale
• Single-controller syndromes
  • Slow failover, ops latency, split-brain brokers, etc.
  • Listener-based metadata propagation limits
• Exploding metadata state machines [SIGMOD 2021]
  • New features == new metadata
  • Metadata scattered across multiple “sources”
• Yet another system to operate
  • Deployment and monitoring
  • Security, networking, interface evolution, etc.
[Diagram: Controller, Brokers, ZooKeeper]
How to scale Kafka clusters efficiently in the Cloud?
What do we really need for Consensus?
• A unified, locally replicable metadata LOG!
/brokers/topics/foo/partitions/0/state changed
/topics changed
/brokers/ids/0 changed
/config/topics/bar changed
/kafka-acl/group/grp1 changed
…
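The idea of one unified metadata log can be sketched as follows: every change becomes a record in a single append-only log, and any node that replays the same prefix of the log arrives at the same state. This is only an illustration; the `MetadataRecord` and `MetadataLog` types and the record values are hypothetical, and the paths are borrowed from the examples above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MetadataRecord:
    path: str             # e.g. "/brokers/ids/0"
    value: Optional[str]  # None marks a deletion


@dataclass
class MetadataLog:
    records: List[MetadataRecord] = field(default_factory=list)

    def append(self, path: str, value: Optional[str]) -> int:
        """Append a change record and return its offset."""
        self.records.append(MetadataRecord(path, value))
        return len(self.records) - 1


def replay(log: MetadataLog, upto: Optional[int] = None) -> Dict[str, str]:
    """Rebuild the metadata state from the first `upto` records (all if None)."""
    end = len(log.records) if upto is None else upto
    state: Dict[str, str] = {}
    for record in log.records[:end]:
        if record.value is None:
            state.pop(record.path, None)
        else:
            state[record.path] = record.value
    return state


log = MetadataLog()
log.append("/brokers/ids/0", "host=a")              # hypothetical value
log.append("/config/topics/bar", "retention.ms=1000")
log.append("/brokers/ids/0", None)                  # broker 0 deregistered

latest = replay(log)
earlier = replay(log, upto=2)
```

Because the state is a pure function of the log prefix, two replicas that have read up to the same offset cannot disagree about the metadata.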
Rethinking Kafka Control Plane on the LOG
• Why not have the local metadata changelog as the source of truth?
• Unified metadata replication APIs
  • Async, multiple in-flight log appends
  • Pull-based log reads
• Versioned metadata state machines
  • Local log offset == version number
  • Easy membership management and split-brain resolution
• Flexibility in consensus trade-offs
  • Quorum controllers vs. single controller
  • Selective metadata materialization
[Diagram: Metadata Quorum, Metadata Log, Metadata Listeners]
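A minimal sketch of pull-based reads with the log offset acting as the version number. The `Listener` class, the batch size, and the record contents are hypothetical; the point is only that each listener fetches at its own pace and that its next fetch offset tells you exactly how current its view is.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


class Listener:
    """Pulls records from a shared log and materializes them locally."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.next_offset = 0  # doubles as the version of the local state

    def poll(self, log: List[Record], max_records: int = 2) -> int:
        """Fetch up to max_records from next_offset, apply them, return the count."""
        batch = log[self.next_offset:self.next_offset + max_records]
        for key, value in batch:
            self.state[key] = value
        self.next_offset += len(batch)
        return len(batch)


log: List[Record] = [
    ("topic/foo", "partitions=1"),
    ("topic/bar", "partitions=3"),
    ("topic/foo", "partitions=2"),
]

fast, slow = Listener(), Listener()
while fast.poll(log):   # keep pulling until caught up
    pass
slow.poll(log)          # a single fetch leaves this listener one record behind
```

Comparing `fast.next_offset` with `slow.next_offset` shows which listener is stale and by how much, with no separate version bookkeeping.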
KRaft: Kafka’s Log of All Logs [Kafka Summit APAC 2021]
• Log-based leader election
  • No “split-brain” with multiple leaders
  • No “grid-locking” with no leader being elected
• Quorum-based replication
  • Favor latency over failure tolerance
  • O(1) controller failover
• Piggy-back on Kafka’s log replication utilities
  • Schema, NIO layer, log recovery algorithm
  • Batching / compression / indexing / segmentation, etc.
  • However, access is isolated from the data path: separate ports, queues, metrics
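Two Raft-style rules sit behind quorum-based replication and log-based election, and the sketch below illustrates them under simplifying assumptions. A record counts as committed once a majority of voters hold it, and a voter only supports a candidate whose log is at least as current as its own. The function names are hypothetical, and the real protocol has more rules than shown here.

```python
from typing import Sequence, Tuple


def high_watermark(end_offsets: Sequence[int]) -> int:
    """Largest offset that a majority of voters have replicated."""
    majority = len(end_offsets) // 2 + 1
    # After sorting in descending order, the entry at index majority-1 is
    # reached by at least `majority` voters.
    return sorted(end_offsets, reverse=True)[majority - 1]


def candidate_log_is_current(candidate: Tuple[int, int], voter: Tuple[int, int]) -> bool:
    """Each argument is (last_epoch, end_offset); compare epoch first, then offset."""
    return candidate >= voter


committed_3_voters = high_watermark([10, 7, 5])
committed_5_voters = high_watermark([10, 10, 3, 2, 1])
```

Waiting for a majority rather than for every replica is what favors latency: the slowest voter never sits on the commit path. Any two majorities overlap in at least one voter, which is why two leaders cannot both gather a quorum in the same epoch.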
Quorum Controller on top of KRaft Logs
[Diagram: Metadata Quorum, Metadata Log, Observers]
• Controllers run in a broker JVM or standalone
• Single-node Kafka cluster is possible
• Controller quorum can be isolated on the network
• Controller operations can be pipelined
• Brokers cache metadata read from the log
• Consistent snapshots
• Potential for clients to reason about consistent metadata as well
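How a cached view stays consistent with snapshots can be sketched like this: a snapshot is the state at a known log offset, so a broker that loads it and then replays only the records after that offset ends up with the same state as one that replayed the whole log. All names and record values below are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]           # (key, value)
Snapshot = Tuple[int, Dict[str, str]]  # (end_offset, state at that offset)


def apply_records(state: Dict[str, str], records: List[Record]) -> Dict[str, str]:
    """Return a new state with the records applied in order."""
    new_state = dict(state)
    for key, value in records:
        new_state[key] = value
    return new_state


def take_snapshot(log: List[Record], end_offset: int) -> Snapshot:
    """Materialize the state covering log[0:end_offset]."""
    return end_offset, apply_records({}, log[:end_offset])


def restore(snapshot: Snapshot, log: List[Record]) -> Dict[str, str]:
    """Load a snapshot, then replay only the records written after it."""
    end_offset, state = snapshot
    return apply_records(state, log[end_offset:])


log: List[Record] = [
    ("topic/foo", "partitions=1"),
    ("broker/0", "registered"),
    ("topic/foo", "partitions=2"),
]

snapshot = take_snapshot(log, end_offset=2)
from_snapshot = restore(snapshot, log)
from_scratch = apply_records({}, log)
```

Since the snapshot carries its offset, a restarting broker knows exactly where to resume fetching and never has to re-read the prefix of the log.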
KRaft Made Live
Hurdles to bringing KRaft to production:
• Model checking for correctness: TLA+
• Performance tuning: fsync, leader/broker session timeouts, broker forwarding
• Integration challenges: JBOD, SCRAM, delegation tokens, metadata versioning
• ZK migration path: dynamic configuration, API compatibility
• Robustness: client quotas, disaster recovery
• Hardening…
Production Incident
[Diagram sequence: Brokers, Controller Quorum, Broker Session (heartbeats)]
KRaft in Production
• Default for new clusters in all regions in AWS, GCP, and Azure
• 2000+ clusters
• 20% of all partitions
• ~50ms p99 metadata log latency
Kora: The Cloud Native Engine for Kafka [VLDB 2023]
• KRaft: simple metadata consensus for the control plane
• Tiered storage: low-cost, predictable-performance data plane
• Multi-tenant resource isolation and management
• Automated upgrade and mitigation
• Elasticity, observability, durability, and more
Thank you!
cnfl.io/meetups cnfl.io/slack
cnfl.io/blog
