Data Pipeline with Kafka by Peerapat A. is licensed under a Creative Commons Attribution 4.0 International License.
Data Pipeline with Kafka
Peerapat Asoktummarungsri
AGODA
Senior Software Engineer, Agoda.com
Contributor, Thai Java User Group (THJUG.com)
Contributor, Agile66
AGENDA
Big Data & Data Pipeline
Kafka Introduction
Quick Start
Monitoring
Data Pipeline for Search API
Hadoop integration with Camus
[Diagram: Big Data, Hadoop (HDFS + MapReduce), Information]
Pipeline
[Diagram: Website -> log -> Hadoop]
Growth
[Diagram: Website and Mobile -> log -> Hadoop]
Complex
[Diagram: Website and Mobile send log and message data to Hadoop and to realtime monitoring]
Features become the problem
[Diagram: Website, Mobile, API, Hadoop, realtime monitoring and Data Warehouse are wired directly to one another, and NEW systems keep being added]
Data Pipeline
[Diagram: Website, Mobile and API produce to the Data Pipeline; Hadoop, realtime monitoring and Data Warehouse consume from it]
Compare: Queue vs Topic
[Diagram: Queue: three consumers receive messages 1, 2 and 3 respectively. Topic: three consumers each receive message 1.]
General Topic Implementation
[Diagram: a topic with Consumer 1, Consumer 2 and Consumer 3; message 2 reaches only two of them]
The consumer that was not reached loses the message.
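The two delivery models compared above can be sketched in a few lines of Python. This is a toy in-memory model, not any real broker's API: the class names `Queue` and `PushTopic` and the consumer names `c1` to `c3` are made up for illustration, and the topic is assumed to keep no history, so a subscriber that is offline when a message is published never sees it.

```python
class Queue:
    """Point-to-point: each message is handed to exactly one consumer."""

    def __init__(self, consumers):
        self.inbox = {name: [] for name in consumers}
        self._names = list(consumers)
        self._turn = 0

    def send(self, msg):
        # Hand the message to the next consumer in turn.
        name = self._names[self._turn % len(self._names)]
        self.inbox[name].append(msg)
        self._turn += 1


class PushTopic:
    """Publish/subscribe without retention: only online subscribers get a copy."""

    def __init__(self, consumers):
        self.inbox = {name: [] for name in consumers}
        self.online = set(consumers)

    def publish(self, msg):
        # Nothing is stored for subscribers that are offline right now.
        for name in self.online:
            self.inbox[name].append(msg)


queue = Queue(["c1", "c2", "c3"])
for m in (1, 2, 3):
    queue.send(m)

topic = PushTopic(["c1", "c2", "c3"])
topic.publish(1)
topic.online.discard("c3")   # c3 goes offline
topic.publish(2)             # c3 misses message 2
topic.online.add("c3")       # coming back does not recover it
```

After this runs, each queue consumer holds one of the three messages, while `c3` holds only message 1 from the topic.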
Distributed by Design
Fast
Scalable - It can be elastically and transparently expanded without downtime.
Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.
[Diagram: a topic holding messages 1 to 7 is read by Consumer 1, Consumer 2 and Consumer 3, which all share gid = hadoop (gid = Group ID)]
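Consumers that share a group ID split a topic's messages among themselves. In Kafka the unit of that split is the partition: each partition is read by one consumer of the group at a time. The sketch below shows the idea with a simple round-robin spread. The function name `assign_partitions`, the consumer names and the partition count of 6 are assumptions for illustration, and Kafka's own assignment strategies may distribute partitions differently.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical topic with 6 partitions, read by group "hadoop".
group = ["consumer-1", "consumer-2", "consumer-3"]
assignment = assign_partitions(range(6), group)
```

Every partition ends up with exactly one owner, so no message is processed twice inside the group.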
[Diagram: one topic read by three consumer groups: hadoop (gid = hadoop), realtime monitoring (gid = rtmon) and data warehouse (gid = warehouse); gid = Group ID. Each group receives messages 1, 2 and 3.]
[Diagram: the groups hadoop, rtmon and warehouse have all reached message 9. A new consumer with gid = newconsumer joins and starts reading from messages 1, 2, 3.]
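The behaviour in the last two diagrams follows from two properties: the topic keeps its messages, and each group tracks its own read position. A minimal in-memory sketch of that idea is shown below. The `Log` class and its methods are hypothetical, not Kafka's client API, and the sketch assumes a new group starts from the oldest retained message, as the slide shows; in Kafka the starting point for a new group is configurable.

```python
class Log:
    """Append-only message log with one read position per group ID."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # group id -> index of the next message to read

    def append(self, msg):
        self.messages.append(msg)

    def poll(self, gid, max_messages=100):
        # An unknown group starts at the beginning of the retained log.
        start = self.offsets.get(gid, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[gid] = start + len(batch)
        return batch


log = Log()
for m in range(1, 10):          # messages 1..9
    log.append(m)

for gid in ("hadoop", "rtmon", "warehouse"):
    log.poll(gid)               # existing groups catch up to message 9

first_batch = log.poll("newconsumer", max_messages=3)
```

Adding the new group needs no change on the producer side and does not disturb the positions of the existing groups.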
Vagrant
Install Vagrant
Install VirtualBox
Clone https://github.com/stealthly/scala-kafka
vagrant up
BREW
brew update
brew install zookeeper kafka
Some Kafka Config
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0
# The port the socket server listens on
port=9092
# Zookeeper connection string (see zookeeper docs for details).
zookeeper.connect=localhost:2181
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
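To make the units in these settings concrete, the sketch below parses the same key/value pairs and converts two of them. The parser `parse_properties` is a hypothetical helper written for this example and only handles simple `key=value` lines and `#` comments.

```python
CONFIG = """\
# The id of the broker.
broker.id=0
port=9092
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
log.retention.hours=168
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(CONFIG)
retention_days = int(props["log.retention.hours"]) / 24
zk_timeout_seconds = int(props["zookeeper.connection.timeout.ms"]) / 1000
```

With the values above, log segments become eligible for deletion after 7 days, and the ZooKeeper connection timeout is 6 seconds.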
Kafka @ LinkedIn (2013)
10 billion message writes per day
55 billion messages delivered to real-time consumers
367 topics covering both user activity and operational data, the largest of which adds an average of 92 GB per day of batch-compressed messages
Messages are kept for 7 days, which averages about 9.5 TB of compressed messages across all topics.
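A quick back-of-the-envelope reading of these figures, as a sketch: it assumes the write and delivery counts cover the same period and that the 9.5 TB is a steady-state total for the 7-day retention window. Neither assumption is stated on the slide.

```python
writes_per_day = 10_000_000_000
deliveries_per_day = 55_000_000_000
retention_days = 7
retained_tb = 9.5

# How many times, on average, each written message is delivered to a consumer.
fan_out = deliveries_per_day / writes_per_day

# Average compressed volume added per day if the retained total is steady.
avg_tb_per_day = retained_tb / retention_days
```

Under those assumptions each message is read about 5.5 times, which is the multi-subscriber pattern the consumer group slides describe, and the cluster takes in roughly 1.36 TB of compressed data per day.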
KafkaOffsetMonitor
Download KafkaOffsetMonitor from GitHub: https://github.com/quantifind/KafkaOffsetMonitor
It ships as a single JAR file, KafkaOffsetMonitor-assembly-0.2.1.jar. Run it with:
java -cp KafkaOffsetMonitor-assembly-0.2.1.jar \
  com.quantifind.kafka.offsetapp.OffsetGetterWeb \
  --zk localhost \
  --port 8080 \
  --refresh 10.seconds \
  --retain 2.days
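The main number an offset monitor reports is consumer lag: how far a group's committed offset trails the end of the log on each partition. The sketch below shows that calculation only. The function `consumer_lag` and all offset values are made up for illustration and are not taken from KafkaOffsetMonitor's code or output.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet consumed by the group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for one topic with three partitions.
log_end = {0: 1200, 1: 1180, 2: 1250}
committed = {0: 1200, 1: 1100, 2: 1000}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
```

A lag that keeps growing on one partition usually means the consumer that owns it is slower than the producers writing to it.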
CHANGE
[Diagram components: Hotels, Hotel Manager, API (HTTP), Kafka (Produce Change), Price & Inventory Consumer, Cassandra, Search API (Calculate Price)]
CHANGE
[Diagram components: Hotels, Hotel Manager, API, Kafka, Price & Inventory Consumer, A Consumer, B Consumer]
Camus
Slides available here: http://www.slideshare.net/nuboat
Source code available here: https://github.com/nuboat/akkakafkaexam
REFERENCES
http://www.slideshare.net/charmalloc/developingwithapachekafka-29910685
http://www.infoq.com/articles/apache-kafka
http://kafka.apache.org/
https://github.com/stealthly/scala-kafka
https://github.com/quantifind/KafkaOffsetMonitor
Q & A
