Introduction to
Kafka and Zookeeper
June Hadoop Meetup
Rahul Jain
@rahuldausa
Who am I?
• Software Engineer
• Member of Core technology @ IVY Comptech, Hyderabad, India
• 6 years of programming experience
• Areas of expertise/interest
– High traffic web applications
– JAVA/J2EE
– Big data, NoSQL
– Information-Retrieval, Machine learning
Agenda
• Overview
• Zookeeper
• Messaging System (Basic Concepts)
• Kafka
• Q&A
Apache Zookeeper™
What is a Distributed System
“A Distributed system consists of multiple computers
that communicate and coordinate their actions by
passing messages. The components interact with each
other in order to achieve a common goal.”
- Wikipedia
What is Zookeeper
• An open source, high-performance coordination service
for distributed applications
• Centralized service for
– Configuration Management
– Locks and Synchronization, providing coordination
between distributed systems
– Naming service (Registry)
– Group Membership
• Features
– Hierarchical namespace
– Provides watchers on a znode
– Allows nodes to form a cluster
• Supports a large volume of requests for data retrieval and
update
• http://zookeeper.apache.org/
Source : http://zookeeper.apache.org
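The configuration-management and watcher bullets above can be illustrated with a small in-memory sketch. This is a toy model, not the ZooKeeper client API: `ToyConfigStore`, its methods, and the `/app/config/db_url` path are hypothetical names chosen for illustration. Watches are modelled as one-shot triggers, which is how standard ZooKeeper watches behave.

```python
from typing import Callable, Dict, List, Optional


class ToyConfigStore:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._watchers: Dict[str, List[Callable[[str], None]]] = {}

    def set(self, path: str, value: bytes) -> None:
        self._data[path] = value
        # Each watch fires once and is then discarded; a client that wants
        # further notifications registers a new watch on its next read.
        for callback in self._watchers.pop(path, []):
            callback(path)

    def get(self, path: str, watch: Optional[Callable[[str], None]] = None) -> bytes:
        if watch is not None:
            self._watchers.setdefault(path, []).append(watch)
        return self._data[path]


store = ToyConfigStore()
store.set("/app/config/db_url", b"db-1:5432")

events: List[str] = []
store.get("/app/config/db_url", watch=events.append)  # read and register a watch
store.set("/app/config/db_url", b"db-2:5432")          # fires the watch
store.set("/app/config/db_url", b"db-3:5432")          # no watch is registered any more
```

After the two updates, `events` holds a single notification, because the client did not re-register the watch after it fired.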
Zookeeper Use cases
• Configuration Management
– Cluster member nodes bootstrapping configuration from a central source
• Distributed Cluster Management
– Node join/leave
– Node status in real time
• Naming Service – e.g. DNS
• Distributed Synchronization – locks, barriers
• Leader election
• Centralized and highly reliable Registry
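As one way to picture the leader-election use case, the sketch below follows the common recipe in which every member registers under an increasing sequence number and the member with the lowest number is the leader. It is an in-memory toy, not ZooKeeper code: `ToyElection` and the node names are hypothetical, and `leave` stands in for a member's registration disappearing when it disconnects.

```python
from typing import Dict, Optional


class ToyElection:
    """Toy model of a sequence-number election (not the ZooKeeper API)."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._members: Dict[int, str] = {}  # sequence number -> member name

    def join(self, name: str) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq: int) -> None:
        # Stands in for a registration vanishing when its owner goes away.
        self._members.pop(seq, None)

    def leader(self) -> Optional[str]:
        if not self._members:
            return None
        return self._members[min(self._members)]


election = ToyElection()
a = election.join("node-a")
b = election.join("node-b")
c = election.join("node-c")

first_leader = election.leader()   # the lowest sequence number wins
election.leave(a)                  # the leader drops out
second_leader = election.leader()  # the next-lowest member takes over
```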
Zookeeper Data Model
• Hierarchical Namespace
• Each node is called a “znode”
• Each znode has data (stored as a byte[] array) and can have children
• Each znode
– Maintains a “Stat” structure with version numbers for data changes and ACL changes, plus timestamps
– The version number increases with each change
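The data model above can be sketched as a small in-memory tree. This is an illustration only, not ZooKeeper itself: `ToyTree`, `ZNode`, and the paths are hypothetical, and the `Stat` fields (`version`, `aversion`, `mtime`) only loosely follow ZooKeeper's naming. The sketch also assumes a conditional update that succeeds only when the caller passes the version it last saw.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Stat:
    version: int = 0   # bumped on every data change
    aversion: int = 0  # would be bumped on every ACL change (unused here)
    mtime: float = field(default_factory=time.time)


@dataclass
class ZNode:
    data: bytes = b""
    stat: Stat = field(default_factory=Stat)
    children: Dict[str, "ZNode"] = field(default_factory=dict)


class ToyTree:
    def __init__(self) -> None:
        self.root = ZNode()

    def _find(self, path: str) -> ZNode:
        node = self.root
        for part in filter(None, path.split("/")):
            node = node.children[part]
        return node

    def create(self, path: str, data: bytes = b"") -> None:
        parent_path, _, name = path.rpartition("/")
        parent = self._find(parent_path)
        if name in parent.children:
            raise KeyError(f"{path} already exists")
        parent.children[name] = ZNode(data=data)

    def set_data(self, path: str, data: bytes, expected_version: int) -> int:
        node = self._find(path)
        if node.stat.version != expected_version:
            raise ValueError("version mismatch")
        node.data = data
        node.stat.version += 1
        node.stat.mtime = time.time()
        return node.stat.version


tree = ToyTree()
tree.create("/app")
tree.create("/app/config", b"v1")
new_version = tree.set_data("/app/config", b"v2", expected_version=0)

try:
    tree.set_data("/app/config", b"v3", expected_version=0)  # stale version
    stale_write_rejected = False
except ValueError:
    stale_write_rejected = True
```

The second write is rejected because the data version has already moved on, which is what lets several clients update the same znode without silently overwriting each other.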
Let’s recall basic concepts of
Messaging System
Point to Point Messaging
(Queue)
Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
Publish-Subscribe Messaging
(Topic)
Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
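The difference between the two messaging models can be shown with a minimal in-memory sketch. The class names are hypothetical and no real broker is involved: in the point-to-point case each message is taken by exactly one receiver, while in the publish-subscribe case every subscriber sees every message.

```python
from collections import deque
from typing import Callable, Deque, List


class PointToPointQueue:
    """Each message is removed by the first receiver that takes it."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> str:
        return self._messages.popleft()


class PubSubTopic:
    """Each published message is handed to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, message: str) -> None:
        for callback in self._subscribers:
            callback(message)


# Queue: two workers share the messages between them.
queue = PointToPointQueue()
queue.send("m1")
queue.send("m2")
worker_a = [queue.receive()]
worker_b = [queue.receive()]

# Topic: both subscribers receive both messages.
topic = PubSubTopic()
sub_a: List[str] = []
sub_b: List[str] = []
topic.subscribe(sub_a.append)
topic.subscribe(sub_b.append)
topic.publish("m1")
topic.publish("m2")
```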
Apache Kafka
Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g.
logs, metrics collections
• Written in Scala
• Does not follow the JMS standard and does not use JMS APIs
• Features
– Persistent messaging
– High throughput
– Supports both queue and topic semantics
– Uses Zookeeper for forming a cluster of nodes
(producer/consumer/broker)
– and many more…
• http://kafka.apache.org/
How it works
Credit : http://kafka.apache.org/design.html
Real time transfer
[Diagram labels: Producer; Kafka Broker (two); Zookeeper, “Update Consumed Message offset”; Consumer1 and Consumer2 (Group1); Consumer3 and Consumer4 (Group2); Queue Topology; Topic Topology]
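The consumer groups in the diagram are how one system can offer both queue and topic semantics: every group receives every message, but within a group each message goes to only one consumer. The sketch below is a simplified toy under stated assumptions. `ToyBroker` is hypothetical, messages are split round-robin within a group (real Kafka assigns partitions to group members), and delivery is modelled as a direct append for brevity rather than as consumers polling.

```python
from collections import defaultdict
from typing import Dict, List


class ToyBroker:
    """Toy fan-out: every group gets each message, one consumer per group."""

    def __init__(self) -> None:
        self._groups: Dict[str, List[str]] = defaultdict(list)  # group -> consumers
        self._sent: Dict[str, int] = defaultdict(int)            # group -> messages delivered
        self.inbox: Dict[str, List[str]] = defaultdict(list)     # consumer -> messages

    def subscribe(self, group: str, consumer: str) -> None:
        self._groups[group].append(consumer)

    def publish(self, message: str) -> None:
        for group, consumers in self._groups.items():
            # Round-robin inside the group (an assumption of this sketch).
            target = consumers[self._sent[group] % len(consumers)]
            self.inbox[target].append(message)
            self._sent[group] += 1


broker = ToyBroker()
broker.subscribe("Group1", "Consumer1")
broker.subscribe("Group1", "Consumer2")
broker.subscribe("Group2", "Consumer3")
broker.subscribe("Group2", "Consumer4")

for i in range(4):
    broker.publish(f"m{i}")

group1_total = broker.inbox["Consumer1"] + broker.inbox["Consumer2"]
group2_total = broker.inbox["Consumer3"] + broker.inbox["Consumer4"]
```

Putting all consumers in one group gives queue behaviour; giving each consumer its own group gives topic behaviour.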
Design Elements
• Uses the filesystem cache
• Zero-copy transfer of messages
• Batching of messages
• Batch compression
• Automatic producer load balancing
• The broker does not push messages to consumers; consumers
poll messages from the broker.
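To see why batching and batch compression go together, the sketch below compresses 200 similar messages once as a batch and compares that with compressing each message on its own. It is illustrative only: `zlib` is a stand-in codec, the 4-byte length-prefix framing is invented for this example and is not Kafka's wire format, and the sample messages are made up.

```python
import zlib
from typing import List


def encode_batch(messages: List[bytes]) -> bytes:
    # Length-prefix each message so the batch can be split again,
    # then compress the whole batch in one pass.
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return zlib.compress(framed)


def decode_batch(blob: bytes) -> List[bytes]:
    framed = zlib.decompress(blob)
    messages: List[bytes] = []
    pos = 0
    while pos < len(framed):
        size = int.from_bytes(framed[pos:pos + 4], "big")
        pos += 4
        messages.append(framed[pos:pos + size])
        pos += size
    return messages


messages = [
    f'{{"event":"page_view","user":{i},"page":"/home"}}'.encode()
    for i in range(200)
]

batch_blob = encode_batch(messages)
batch_size = len(batch_blob)
individual_size = sum(len(zlib.compress(m)) for m in messages)
```

Because the messages repeat the same field names, the compressor finds far more redundancy across a batch than inside any single short message, so `batch_size` comes out well below `individual_size`.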
Design Elements (Contd.)
• Brokers and consumers form a cluster using Zookeeper,
– so more consumers and brokers can be added on the fly. Rebalancing
of the new cluster is coordinated through Zookeeper
• Data is persisted in the broker
– It is not removed on consumption (only after the retention period), so if a
consumer fails while consuming, the same message can be consumed
again later from the broker.
• Simplified storage mechanism for messages
– The broker does not keep state for each message per consumer.
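The retention and re-consumption points above can be sketched with a toy append-only log. All names here are hypothetical, timestamps are passed in explicitly so the example is deterministic, and expiring individual messages is a simplification. The point of the sketch is that the log keeps no per-consumer state: a consumer's position is just an offset that it advances itself.

```python
from typing import List, Tuple


class ToyLog:
    """Append-only log: reads do not remove messages, only retention does."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._entries: List[Tuple[int, float, str]] = []  # (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message: str, now: float) -> int:
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        # Drop entries older than the retention period.
        self._entries = [
            e for e in self._entries if now - e[1] < self.retention_seconds
        ]

    def poll(self, offset: int, max_messages: int = 10) -> List[Tuple[int, str]]:
        # The consumer pulls from the offset it chooses.
        return [(o, m) for o, _, m in self._entries if o >= offset][:max_messages]


log = ToyLog(retention_seconds=60)
for i, t in enumerate([0, 10, 20]):
    log.append(f"m{i}", now=t)

committed = 0
first_try = log.poll(committed)    # suppose the consumer fails before committing
second_try = log.poll(committed)   # the same messages are still available
committed = second_try[-1][0] + 1  # advance the offset only after processing

log.expire(now=65)                 # m0 is now older than the retention period
remaining = log.poll(0)
```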
Performance Numbers
Credit : http://research.microsoft.com/en-us/UM/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
[Charts: Producer Performance, Consumer Performance]
Questions ?
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
