Jay Kreps
Introduction to Apache Kafka
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Apache Kafka
A brief history of Apache Kafka
Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
– Partitioning model
Kafka is about logs
What is a log?
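A minimal sketch of the log idea, based on the speaker notes: an append-only list of records indexed by position, where the contents of a record do not matter. The class name `Log`, its methods, and the sample records are hypothetical and only illustrate the abstraction; this is not Kafka's implementation or API.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Log:
    """Append-only sequence of records, each identified by its offset."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Add a record at the end and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


log = Log()
first = log.append({"user": 1, "action": "view_job"})  # hypothetical record
second = log.append({"user": 2, "action": "click"})    # hypothetical record

# Each reader tracks its own position, so readers do not affect one another.
reader_a_offset = 0
reader_b_offset = 1
batch_a = log.read(reader_a_offset)
batch_b = log.read(reader_b_offset)
```

Because reading never removes records, any number of subscribers can consume the same log at their own pace.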
Apache Kafka at LinkedIn
Logs: pub/sub done right
Partitioning
Nodes Host Many Partitions
Producers Balance Load
Consumers Divide Up Partitions
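The partitioning slides can be sketched roughly as follows. This is an illustration under stated assumptions, not Kafka's actual partitioner or group-assignment logic: `partition_for` uses a CRC32 hash chosen only for determinism, `assign_partitions` uses plain round-robin, and the consumer names are hypothetical.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition so equal keys land together."""
    # Assumption: CRC32 stands in for whatever hash a real client uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int,
                      consumers: List[str]) -> Dict[str, List[int]]:
    """Divide partitions among consumers round-robin."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Producer side: the same key always maps to the same partition.
p1 = partition_for("user-42", 6)
p2 = partition_for("user-42", 6)

# Consumer side: six partitions shared by three hypothetical consumers.
assignment = assign_partitions(6, ["consumer-a", "consumer-b", "consumer-c"])
```

Each partition has exactly one owner in the group, so ordering within a partition is preserved while the work is spread across consumers.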
End-to-End
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
Performance
• Producer (3x replication):
– Async: 786,980 records/sec (75.1 MB/sec)
– Sync: 421,823 records/sec (40.2 MB/sec)
• Consumer:
– 940,521 records/sec (89.7 MB/sec)
• End-to-end latency:
– 2 ms (median)
– 14 ms (99.9th percentile)
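As a quick check on the throughput figures, dividing bytes per second by records per second gives the implied record size. The sketch below assumes the slide's "MB" means 2**20 bytes; under that assumption all three results come out close to 100 bytes per record. The dictionary keys are labels chosen here for illustration.

```python
MIB = 1024 * 1024  # assumption: "MB" on the slide means 2**20 bytes

# (records/sec, MB/sec) copied from the slide
results = {
    "producer_async": (786_980, 75.1),
    "producer_sync": (421_823, 40.2),
    "consumer": (940_521, 89.7),
}


def bytes_per_record(records_per_sec: int, mb_per_sec: float) -> float:
    """Implied average record size for a throughput measurement."""
    return mb_per_sec * MIB / records_per_sec


sizes = {name: bytes_per_record(rps, mbps)
         for name, (rps, mbps) in results.items()}

for name, size in sizes.items():
    print(f"{name}: {size:.1f} bytes/record")
```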
Apache Kafka at LinkedIn
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Data Integration
Maslow’s Hierarchy For Data
New Types of Data
• Database data
– Users, products, orders, etc.
• Events
– Clicks, impressions, pageviews, etc.
• Application metrics
– CPU usage, requests/sec
• Application logs
– Service calls, errors
New Types of Systems
• Live Stores
– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs
• Offline
– Hadoop
– Teradata
Bad
Good
Example: User views job
Comparing Data Transfer Mechanisms
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Stream Processing
Stream processing is a generalization of batch processing
Stream Processing = Logs + Jobs
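One way to picture "logs + jobs": a job reads records from an input log and writes derived records to an output log. The sketch below counts events per fixed time window. The event schema, the function name `count_by_window`, and the sample data are hypothetical, and the input is a finite list for simplicity; a real stream job would emit results continuously as each window closes.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical event schema: (timestamp in seconds, event type)
Event = Tuple[int, str]


def count_by_window(events: Iterable[Event], window_secs: int) -> List[Dict]:
    """Count events per (window, event type) and return output records."""
    counts: Counter = Counter()
    for timestamp, kind in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - timestamp % window_secs
        counts[(window_start, kind)] += 1
    return [
        {"window_start": window, "event": kind, "count": n}
        for (window, kind), n in sorted(counts.items())
    ]


input_log = [
    (0, "click"), (3, "pageview"), (7, "click"),
    (12, "click"), (14, "pageview"),
]
output_log = count_by_window(input_log, window_secs=10)
```

With one window covering the whole input, this reduces to an ordinary batch count, which is the sense in which batch processing is a special case.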
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL
Frameworks Can Help
Samza Architecture
Log-centric Architecture
Kafka
http://kafka.apache.org
Samza
http://samza.incubator.apache.org
Log Blog
http://linkd.in/199iMwY
Benchmark
http://t.co/40fkKJvanx
Me
http://www.linkedin.com/in/jaykreps
@jaykreps

Apache Kafka at LinkedIn

Editor's Notes

  • #2: Who are you? What is this talk about? Exciting topic More
  • #4: Messaging system, like JMS (but different!). Producers, consumers distributed.
  • #5: Start with state at LinkedIn, describe each pipeline: 1 pipeline for database data; 1 pipeline for metrics; 1 pipeline for events; 1 JMS-based pipeline; no pipeline for application logs; 300 ActiveMQ brokers.
  • #6: 10,000 messages/sec * 100-byte messages = ~1 MB/sec
  • #7: The log is the fundamental abstraction Kafka provides. You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more.
  • #8: What is a log? Traditional uses? Non-traditional uses…
  • #9: Time-ordered. Semi-structured.
  • #10: Data structure, not a text file. List of changes. Contents of record don’t matter. Indexed by “time”. Not an application log (i.e. text file).
  • #11: Remotely accessible. State machine replication.
  • #12: Data model of Kafka: a topic. Partitions can be spread over machines, replicated.
  • #16: Path of a write. Leadership failover. Guarantees.
  • #21: AKA ETL. Many systems. Event data. Most important problem for data-centric companies. Integration >> ML.
  • #22: Maslow’s Hierarchy: Abraham Maslow, psychologist, 1943. Physiological – eat, drink, sleep. Safety – not being attacked. Love/Belonging – friends and family. Esteem – respect of others. Self-Actualization – morality, creativity, spontaneity.
  • #23: Want to do Deep Learning. Instead finding that their CSV data ALSO has commas in it. Copying files around. Ugh. The Caveman Data Warehousing has a bad reputation.
  • #24: Two exacerbating factors. 15 years ago, just the first one (transactional data). New categories are very high volume, maybe 100x the transactional data. Look like events. Internet of things.
  • #25: One-size fits all
  • #26: Tell story: Started with Hadoop, added arrows to get data there. Want to build fancy algorithms, need data (expectation: 90% of time for fancy, 10% for data). Holy shit this is hard! Data is missing, data is late, computation runs on wrong data. Hadoop without good data is just a very expensive space heater. Never get to full connectivity.
  • #27: Metcalfe’s law. Each new system connects to get/give data. All data in multi-subscriber, real-time logs. The company is a big distributed system. The data center is the distributed system.
  • #29: Three dims: throughput, guarantees, latency. Advantages over messaging: huge data backlog, order. Advantages over files: real-time. Advantage over both: principled notion of time.
  • #31: Whole organization is a big distributed system. Commit log = data transfer. Stream processing = triggers. Batch is the dominant paradigm for data processing, why?
  • #32: Service: one input = one output. Batch job: all inputs = all outputs. Stream computing: any window = output for that window.
  • #33: No different from batch processing flow (instead of files/tables, logs)
  • #35: Storm and Samza. About process management – both integrate with Kafka. MapReduce and HDFS.