Data Pipeline with Kafka


The document discusses the implementation of a data pipeline using Apache Kafka, highlighting components like big data integration, monitoring, and configuration details. It provides practical installation instructions, including Vagrant and Brew commands for setting up Kafka and Zookeeper. Additionally, the document features performance metrics from Kafka usage at LinkedIn and includes references for further exploration of Kafka-related resources.


Data Pipeline with Kafka by Peerapat A. is licensed under a Creative Commons Attribution 4.0 International License.
Data Pipeline with Kafka
Peerapat Asoktummarungsri
AGODA

Senior Software Engineer, Agoda.com
Contributor, Thai Java User Group (THJUG.com)
Contributor, Agile66

AGENDA
Big Data & Data Pipeline
Kafka Introduction
Quick Start
Monitoring
Data Pipeline for Search API
Hadoop integration with Camus

[Diagram labels: Big Data, Hadoop (HDFS + MapReduce), Information]

Pipeline
[Diagram labels: Website, log, hadoop]

Growth
[Diagram labels: Website, Mobile, log, hadoop]

Complex
[Diagram labels: Website, Mobile, log, message, hadoop, realtime monitoring]

Features become the problem
[Diagram labels: Website, Mobile, API, hadoop, realtime monitoring, Data Warehouse; three components are marked New]

Compare: Queue vs. Topic
[Diagram: with a Queue, three Consumers receive messages 1, 2, and 3 respectively; with a Topic, all three Consumers receive message 1.]
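The difference between the two delivery models can be sketched in a few lines of Python. This is only an illustration of the semantics on the slide, not of any broker's API; the function names and the round-robin choice for the queue are assumptions made for the example.

```python
from collections import defaultdict
from itertools import cycle


def deliver_queue(messages, consumers):
    """Point-to-point: each message goes to exactly one consumer (round-robin here)."""
    received = defaultdict(list)
    targets = cycle(consumers)
    for msg in messages:
        received[next(targets)].append(msg)
    return dict(received)


def deliver_topic(messages, consumers):
    """Publish-subscribe: every subscribed consumer gets its own copy of each message."""
    return {consumer: list(messages) for consumer in consumers}


consumers = ["consumer-1", "consumer-2", "consumer-3"]  # hypothetical consumer names

queue_result = deliver_queue([1, 2, 3], consumers)  # messages are spread across consumers
topic_result = deliver_topic([1], consumers)        # the same message reaches everyone

print(queue_result)
print(topic_result)
```

With a queue, adding consumers spreads the work; with a topic, adding consumers adds independent readers of the same stream.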

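General Topic Implementation
[Diagram: a Topic sends message 2 to Consumer 1, Consumer 2, and Consumer 3, but only two of them receive it. The remaining consumer is annotated: "This consumer will lose a message."]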

Distributed by Design
Fast
Scalable - It can be elastically and transparently expanded without downtime.
Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.

[Diagram: a Topic delivers messages 1 to 7 to Consumer 1, Consumer 2, and Consumer 3, which share one group (gid = hadoop). gid = Group ID.]

[Diagram: a Topic read by three consumer groups: hadoop (gid = hadoop), realtime monitoring (gid = rtmon), and data warehouse (gid = warehouse). Each group receives messages 1, 2, and 3. gid = Group ID.]
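The group-ID idea can be sketched as an append-only log with one read position per group. This is a simplified, single-partition model written for illustration; the `TopicLog` class and its methods are hypothetical, and the rule that an unknown group starts at the beginning of the log is an assumption of the sketch.

```python
class TopicLog:
    """Append-only message log with one read offset per consumer group."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # group id -> index of the next message to read

    def publish(self, msg):
        self.messages.append(msg)

    def poll(self, gid, max_messages=10):
        """Return unread messages for the group and advance that group's offset."""
        start = self.offsets.get(gid, 0)  # assumption: new groups start at offset 0
        batch = self.messages[start:start + max_messages]
        self.offsets[gid] = start + len(batch)
        return batch


log = TopicLog()
for msg in (1, 2, 3):
    log.publish(msg)

hadoop_batch = log.poll("hadoop", max_messages=2)  # this group reads 1 and 2 first
rtmon_batch = log.poll("rtmon")                    # an independent group sees all three
warehouse_batch = log.poll("warehouse")
hadoop_rest = log.poll("hadoop")                   # resumes from its own offset

print(hadoop_batch, hadoop_rest, rtmon_batch, warehouse_batch)
```

Because messages stay in the log and each group only moves its own offset, a slow or newly added group does not affect what the other groups receive.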

Vagrant
Install Vagrant
Install VirtualBox
Clone https://github.com/stealthly/scala-kafka
vagrant up

BREW
brew update
brew install zookeeper kafka

Some Kafka Config
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0
# The port the socket server listens on
port=9092
# Zookeeper connection string (see zookeeper docs for details).
zookeeper.connect=localhost:2181
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
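To make the retention setting concrete, the following sketch reads the properties shown above and converts the retention window into days. The `parse_properties` helper is hypothetical and only handles plain `key=value` lines; it is not part of Kafka.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# The same settings as on the slide, without the comments.
config_text = """
broker.id=0
port=9092
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
log.retention.hours=168
"""

props = parse_properties(config_text)
retention_days = int(props["log.retention.hours"]) / 24

print(f"retention: {retention_days} days")
```

A value of 168 hours works out to a 7-day retention window.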

Kafka @ LinkedIn (2013)
10 billion message writes per day
55 billion messages delivered to real-time consumers
367 topics that cover both user activity topics and operational data, the largest of which adds an average of 92 GB per day of batch-compressed messages
Messages are kept for 7 days, and these average at about 9.5 TB of compressed messages across all topics.
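A quick back-of-the-envelope calculation puts these daily totals into per-second terms. The sketch assumes the 55 billion deliveries are also a per-day figure, which the slide does not state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60

writes_per_day = 10_000_000_000      # "10 billion message writes per day"
deliveries_per_day = 55_000_000_000  # assumed to be a per-day figure as well

avg_writes_per_second = writes_per_day / SECONDS_PER_DAY
fan_out = deliveries_per_day / writes_per_day

print(f"average writes per second: {avg_writes_per_second:,.0f}")
print(f"average deliveries per write: {fan_out:.1f}")
```

Under that assumption, the cluster averaged roughly 115,700 writes per second, and each written message was delivered about 5.5 times.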

KafkaOffsetMonitor
java -cp KafkaOffsetMonitor-assembly-0.2.1.jar \
    com.quantifind.kafka.offsetapp.OffsetGetterWeb \
    --zk localhost \
    --port 8080 \
    --refresh 10.seconds \
    --retain 2.days
Download KafkaOffsetMonitor from GitHub: https://github.com/quantifind/KafkaOffsetMonitor
It is a single JAR file, KafkaOffsetMonitor-assembly-0.2.1.jar
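The main number to watch when monitoring offsets is consumer lag: how far a group's committed offset trails the end of the log. The sketch below shows only that arithmetic. It is not KafkaOffsetMonitor's code, and the function name and offset values are made up for illustration.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the group's committed offset."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }


log_end = {0: 1500, 1: 1420}    # hypothetical log end offsets per partition
committed = {0: 1500, 1: 1300}  # hypothetical offsets committed by one group

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())

print(lag, total_lag)
```

A lag that keeps growing over time suggests the group is consuming more slowly than producers are writing.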

[Diagram labels: Hotels, Hotel Manager, API, HTTP, Produce Change, Kafka, CHANGE, Price & Inventory Consumer, Calculate Price, Cassandra, Search API]
Hotels

[Diagram labels: Hotels, Hotel Manager, API, Kafka, CHANGE, Price & Inventory Consumer, A Consumer, B Consumer]

Camus

Slides available here: http://www.slideshare.net/nuboat
Source code available here: https://github.com/nuboat/akkakafkaexam

REFERENCES
http://www.slideshare.net/charmalloc/developingwithapachekafka-29910685
http://www.infoq.com/articles/apache-kafka
http://kafka.apache.org/
https://github.com/stealthly/scala-kafka
https://github.com/quantifind/KafkaOffsetMonitor

Q & A

Slide 9: Data Pipeline. [Diagram labels: Website, Mobile, API, Produce, Data Pipeline, Consume, hadoop, realtime monitoring, Data Warehouse]

Slide 15: [Diagram: a Topic read by the consumer groups hadoop (gid = hadoop), realtime monitoring (gid = rtmon), and data warehouse (gid = warehouse), each at message 9. A New Consumer (gid = newconsumer) receives messages 1, 2, 3. gid = Group ID.]
