Apache Kafka at LinkedIn
How LinkedIn Customizes Kafka to Work at the Trillion Scale
Jon Lee, Staff Software Engineer, LinkedIn
Wesley Wu, Senior Software Engineer, LinkedIn
Agenda
1 Apache Kafka @ LinkedIn
2 Development Workflow
3 Patch Examples
4 Release Process
Apache Kafka
• Distributed stream processing platform
• Publish and subscribe to persistent messages
• High throughput and low latency
• Developed at LinkedIn
• Top-level Apache project
Kafka @ LinkedIn: Ecosystem
[Diagram of ecosystem components: Brokers, Clients, REST Proxy, Schema Registry, Brooklin, Cruise Control, Completeness Audit, Usage Monitor]
Kafka @ LinkedIn: Running at Scale
• 7 trillion messages per day
• 100+ clusters, 4K+ brokers
• 100K+ topics, 7M+ partitions
• Constant scalability and operability challenges
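For a rough sense of what these figures imply, the arithmetic below turns them into averages. It is only a back-of-the-envelope sketch: the "+" figures are treated as lower bounds, and real traffic is unlikely to be spread evenly across clusters or topics.

```python
# Back-of-the-envelope averages derived from the figures on this slide.
# The "+" figures are lower bounds, so the results are rough estimates only.
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 7 * 10**12
clusters = 100
brokers = 4_000
topics = 100_000
partitions = 7_000_000

messages_per_second = messages_per_day / SECONDS_PER_DAY
brokers_per_cluster = brokers / clusters
partitions_per_topic = partitions / topics

print(f"{messages_per_second:,.0f} messages per second on average")
print(f"{brokers_per_cluster:.0f} brokers per cluster on average")
print(f"{partitions_per_topic:.0f} partitions per topic on average")
```

That works out to roughly 81 million messages per second, about 40 brokers per cluster, and about 70 partitions per topic.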
LinkedIn Kafka Release Branch
• Source of releases running in LinkedIn production
• Branched from an Apache Kafka release branch
• Contains hotfix patches and upstream cherry-picks
• Tailored to operations and scale at LinkedIn
Tracking Upstream Closely
“Upstream Everything”
Upstream First
• Commit to upstream first (file a KIP if necessary)
• Cherry-pick it onto the current LinkedIn release branch, or pick it up when a new branch containing the upstream patch is created
• Suitable for patches with low to medium urgency
LinkedIn First (a.k.a. hotfix approach)
• Commit to LinkedIn branch first
• Double-commit to upstream (best effort)
• Suitable for patches with high urgency
Tale of Three Patches
Cherry-pick
• Cherry-picked from upstream
• Kept until a new LinkedIn release branch containing the original upstream patch is created
Double-committed Hotfix
• Hotfix eventually committed to upstream
• Kept until a new LinkedIn branch containing the corresponding upstream patch is created
LinkedIn-private Hotfix
• Hotfix not of interest to upstream (e.g., temporary debug patches)
• OR double-commit attempted but not accepted by upstream
• Kept in LinkedIn branches until they are no longer needed
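The lifetime rules for the three patch types can be sketched as a small filter that decides which patches are carried onto a new LinkedIn release branch. This is an illustrative model only: the class and field names are hypothetical and are not taken from LinkedIn's tooling.

```python
from dataclasses import dataclass
from enum import Enum


class PatchKind(Enum):
    CHERRY_PICK = "cherry-pick"
    DOUBLE_COMMITTED_HOTFIX = "double-committed hotfix"
    PRIVATE_HOTFIX = "LinkedIn-private hotfix"


@dataclass(frozen=True)
class Patch:
    # Hypothetical fields, chosen only to mirror the rules on the slide.
    title: str
    kind: PatchKind
    in_new_upstream_base: bool = False  # the new Apache base already contains it
    still_needed: bool = True           # only consulted for private hotfixes


def patches_to_carry_forward(patches):
    """Return the patches that must be re-applied on a new LinkedIn branch."""
    carried = []
    for patch in patches:
        if patch.kind is PatchKind.PRIVATE_HOTFIX:
            # Private hotfixes live on until they are no longer needed.
            keep = patch.still_needed
        else:
            # Cherry-picks and double-committed hotfixes are dropped once the
            # new base branch contains the corresponding upstream patch.
            keep = not patch.in_new_upstream_base
        if keep:
            carried.append(patch)
    return carried


old_branch = [
    Patch("upstream fix", PatchKind.CHERRY_PICK, in_new_upstream_base=True),
    Patch("urgent fix", PatchKind.DOUBLE_COMMITTED_HOTFIX, in_new_upstream_base=False),
    Patch("debug logging", PatchKind.PRIVATE_HOTFIX, still_needed=False),
    Patch("site-specific tweak", PatchKind.PRIVATE_HOTFIX, still_needed=True),
]
carried_titles = [p.title for p in patches_to_carry_forward(old_branch)]
print(carried_titles)
```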
Close Look at a LinkedIn Release Branch
[Diagram: Apache Kafka trunk, an Apache Kafka release branch, and a LinkedIn release branch. Legend: upstream patch (before branching point), cherry-pick, hotfix (double-committed), hotfix (LinkedIn-private)]
Development Workflow
[Flowchart. Entry points: New Issue, New Feature. Decisions: Already fixed in upstream? Can be cherry-picked? Intend to commit to upstream? Commit to upstream first? KIP required? Actions: File upstream ticket; File KIP / upstream ticket; Upstream patching; Hotfix patching; Cherry-pick. Outcomes: Patch will be picked up at next rebase; Fixed in upstream and patch exists in LI branch; Patch exists only in LI branch. Upstream submissions may be rejected.]
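One possible reading of this workflow is sketched below as a routing function. The exact edges of the flowchart are not recoverable from the slide text, so the branching order, the `Change` fields, and the handling of rejected upstream submissions are all assumptions made for illustration. Only the three outcome strings come from the slide.

```python
from dataclasses import dataclass

# Outcome labels taken from the slide.
PICKED_UP_AT_NEXT_REBASE = "Patch will be picked up at next rebase"
FIXED_UPSTREAM_AND_IN_LI = "Fixed in upstream and patch exists in LI branch"
ONLY_IN_LI = "Patch exists only in LI branch"


@dataclass
class Change:
    # Hypothetical answers to the decision questions in the flowchart.
    already_fixed_upstream: bool = False
    can_be_cherry_picked: bool = True
    intend_to_upstream: bool = True
    commit_upstream_first: bool = True
    accepted_by_upstream: bool = True


def outcome(change: Change) -> str:
    """Route a new issue or feature to one of the three end states."""
    if not change.already_fixed_upstream:
        if not change.intend_to_upstream:
            return ONLY_IN_LI  # LinkedIn-private hotfix
        if not change.commit_upstream_first:
            # Hotfix patching: commit to the LinkedIn branch first, then
            # double-commit to upstream on a best-effort basis.
            if change.accepted_by_upstream:
                return FIXED_UPSTREAM_AND_IN_LI
            return ONLY_IN_LI
        if not change.accepted_by_upstream:
            # Assumption: a rejected upstream-first patch that is still
            # needed ends up as a LinkedIn-private patch.
            return ONLY_IN_LI
    # From here on the fix exists in upstream.
    if change.can_be_cherry_picked:
        return FIXED_UPSTREAM_AND_IN_LI
    return PICKED_UP_AT_NEXT_REBASE


print(outcome(Change(already_fixed_upstream=True, can_be_cherry_picked=False)))
print(outcome(Change(commit_upstream_first=False, accepted_by_upstream=False)))
```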
Scalability Support
• Challenges
  • 140+ brokers and 1M+ replicas on a single cluster
  • Controller failure leads to site unavailability
  • Slowness in bouncing a broker causes deployment delay
• Solutions
  • Reuse UpdateMetadataRequest object to reduce controller memory footprint
  • Improve broker shutdown time by reducing lock contention
  • Avoid excessive logging
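The memory saving from reusing a request object can be illustrated with a toy model. This is not Kafka's controller code (which is not written in Python), and the slide does not describe how the patch works. The `UpdateMetadataRequest` class here is a hypothetical stand-in that only shows the general idea of sharing one payload across brokers instead of building a copy per broker.

```python
class UpdateMetadataRequest:
    """Hypothetical stand-in for the controller's metadata update payload."""

    def __init__(self, partition_states):
        # Copy the state so each request object owns its own payload.
        self.partition_states = dict(partition_states)


def build_per_broker(broker_ids, partition_states):
    # One payload per broker: memory grows with brokers x partitions.
    return {b: UpdateMetadataRequest(partition_states) for b in broker_ids}


def build_shared(broker_ids, partition_states):
    # One payload shared by every broker: memory grows with partitions only.
    request = UpdateMetadataRequest(partition_states)
    return {b: request for b in broker_ids}


brokers = range(140)
states = {f"topic-{i}": "leader=0" for i in range(1_000)}

per_broker = build_per_broker(brokers, states)
shared = build_shared(brokers, states)

distinct_per_broker = len({id(r) for r in per_broker.values()})
distinct_shared = len({id(r) for r in shared.values()})
print(distinct_per_broker, distinct_shared)
```

Sharing is only safe if the payload is treated as read-only once built, which this sketch assumes.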
Operability Support
• Challenges
  • Broker removal for maintenance requires moving out all replicas
  • New replicas can get assigned to brokers that are going to be removed
• Solutions
  • Add a broker to maintenance broker list
  • New replicas do not get assigned to maintenance brokers
  • Integrated with Kafka Cruise Control to automate broker removal process
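The effect of a maintenance broker list on replica placement can be sketched with a simple round-robin assigner. This is a minimal illustration under stated assumptions: the function name and signature are hypothetical, and Kafka's real assignment logic considers more factors (rack awareness, for example) than this sketch does.

```python
def assign_replicas(num_partitions, replication_factor, brokers, maintenance_brokers):
    """Round-robin replica assignment that skips maintenance brokers."""
    eligible = sorted(set(brokers) - set(maintenance_brokers))
    if replication_factor > len(eligible):
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"{len(eligible)} eligible brokers"
        )
    assignment = {}
    for partition in range(num_partitions):
        # Consecutive eligible brokers, offset by partition to spread load.
        assignment[partition] = [
            eligible[(partition + i) % len(eligible)]
            for i in range(replication_factor)
        ]
    return assignment


# Broker 2 is on the maintenance list, so it receives no new replicas.
assignment = assign_replicas(
    num_partitions=4,
    replication_factor=2,
    brokers=[0, 1, 2, 3],
    maintenance_brokers=[2],
)
print(assignment)
```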
Features
• Observer for billing
  • Provide accounting information
• Enforce minimum replication factor
  • Minimize data loss risk in case of broker failure
• New offset reset policy
  • Help consumer navigate to the closest offset
We are considering (WIP):
• CPU optimization (e.g., using the OpenSSL library)
• Separate controller node from data broker nodes
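Two of the features above lend themselves to a short sketch. Stock Kafka consumers can reset an out-of-range position to `earliest` or `latest` (or fail with `none`). The slide says only that the new policy navigates to the closest offset, so the clamping rule below is an assumed interpretation, and both function names are hypothetical.

```python
def reset_to_closest(requested_offset, log_start_offset, log_end_offset):
    """Clamp an out-of-range offset to the nearest valid offset (assumed rule)."""
    if requested_offset < log_start_offset:
        return log_start_offset  # data was already deleted; start of log is closest
    if requested_offset > log_end_offset:
        return log_end_offset    # beyond the end; end of log is closest
    return requested_offset      # already in range, nothing to reset


def validate_replication_factor(requested, minimum):
    """Reject topic settings below a cluster-wide minimum replication factor."""
    if requested < minimum:
        raise ValueError(
            f"replication factor {requested} is below the minimum {minimum}"
        )
    return requested


print(reset_to_closest(5, log_start_offset=10, log_end_offset=100))
print(reset_to_closest(500, log_start_offset=10, log_end_offset=100))
print(validate_replication_factor(3, minimum=2))
```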
Direct Contributions to Upstream
• KIP-219: Improve quota communication
• KIP-291: Separating controller connections and requests from the data plane
• KIP-354: Add a maximum log compaction lag
• KIP-380: Detect outdated control requests and bounced brokers using broker generation
Creating a New LinkedIn Release Branch
[Diagram: Apache Kafka trunk with the Apache Kafka 2.0.0 and Apache Kafka 2.3.0 release points. Legend: cherry-pick, hotfix (LinkedIn-private)]
Certifying a Release
[Diagram: two certification clusters, Baseline and Release, each with Broker 0 through Broker N, and each with produce traffic and consume traffic]
• Identical setup with 30+ brokers
• Production traffic
• Automated compare run
• Detailed report
• Certification covers rebalance, deployment, rolling bounce, stability, and downgrade
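An automated compare run between the baseline and release clusters could look roughly like the sketch below. The slides do not describe the actual tooling, so the metric names, the 5% tolerance, and the "lower is better" assumption are all hypothetical.

```python
def compare_runs(baseline, release, tolerance=0.05):
    """Build a per-metric report comparing a release run against a baseline.

    Assumes every metric is positive and that lower values are better.
    A metric is flagged when the release is worse by more than `tolerance`.
    """
    report = {}
    for name, base_value in baseline.items():
        new_value = release[name]
        change = (new_value - base_value) / base_value
        report[name] = {
            "baseline": base_value,
            "release": new_value,
            "change": change,
            "regressed": change > tolerance,
        }
    return report


# Hypothetical metrics collected from the two certification clusters.
baseline_metrics = {"produce_p99_ms": 10.0, "cpu_percent": 40.0}
release_metrics = {"produce_p99_ms": 12.0, "cpu_percent": 40.5}

report = compare_runs(baseline_metrics, release_metrics)
for name, row in report.items():
    status = "REGRESSED" if row["regressed"] else "ok"
    print(f"{name}: {row['change']:+.1%} {status}")
```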
Please Check Out
• Source code available at GitHub: http://github.com/linkedin/kafka
• NOT a fork
• Branches are named <Apache Kafka Release>-li (e.g., 2.0-li and 2.3-li)
• We are not accepting external contributions. Please contribute directly to upstream.
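The branch naming convention can be expressed as a one-line mapping. The sketch assumes that `<Apache Kafka Release>` means the major.minor part of the Apache version, which matches the 2.0-li and 2.3-li examples; the function name is hypothetical.

```python
def linkedin_branch_name(apache_version: str) -> str:
    """Map an Apache Kafka version such as "2.3.0" to a branch like "2.3-li"."""
    # Assumption: only the major and minor components appear in the branch name.
    major, minor = apache_version.split(".")[:2]
    return f"{major}.{minor}-li"


print(linkedin_branch_name("2.0.0"))
print(linkedin_branch_name("2.3.0"))
```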
Thank You

