Building Realtime data pipeline with
Apache Kafka
Nagarajan
Developer, ThoughtWorks
A pipeline?
Data pipeline
● Data flow between different systems/components
● Loosely coupled
● Automated
● Scalable
● Error recovery
Use case
Imagine you have a web application, and you want to see the number of visitors per page
over different time windows
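A minimal sketch of the computation behind this use case, assuming each page view arrives as a hypothetical (timestamp in seconds, page) pair. It counts views per page in fixed, non-overlapping time windows; the function name and sample events are illustrative only.

```python
from collections import Counter

def count_visits(events, window_seconds):
    """Count page visits per (window_start, page) using fixed-size windows."""
    counts = Counter()
    for timestamp, page in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, page)] += 1
    return counts

# Hypothetical page-view events: (seconds since start, page).
events = [(0, "/home"), (30, "/home"), (61, "/cart"), (65, "/home"), (125, "/home")]

per_minute = count_visits(events, 60)
per_hour = count_visits(events, 3600)
```

The same events feed both durations; only the window size changes, which is why keeping the raw event stream around is useful.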
Characteristics required
● High throughput ingestion
● Fault tolerant storage
● Highly available
● Scalable
● Support for concurrent processing and ordering guarantee
Enter Kafka
Apache Kafka® is a distributed streaming platform that:
● Publishes and subscribes to streams of records, similar to a message queue
or enterprise messaging system.
● Stores streams of records in a fault-tolerant durable way.
● Helps to process streams of records as they occur.
Kafka is a distributed, partitioned, replicated commit log service. It provides the
functionality of a messaging system, but with a unique design.
Kafka APIs
Messaging system
● Queue
● publish-subscribe
Topics & Partitions
Partition and message ordering
Each partition is an ordered, immutable sequence of records that is continually
appended to: a structured commit log.
Records in a partition are each assigned a sequential id number, called the
offset, that uniquely identifies each record within the partition.
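The append-only behaviour can be sketched with an in-memory list, where the offset is simply the record's position. This is an illustration rather than how Kafka stores data: the class name is hypothetical, and a real partition lives in segment files on disk and may drop old records.

```python
class PartitionLog:
    """Toy append-only log: records are never modified, only appended."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset assigned to it."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Read up to max_records starting at the given offset."""
        return self._records[offset:offset + max_records]

log = PartitionLog()
offsets = [log.append(r) for r in ["view:/home", "view:/cart", "view:/home"]]
replayed = log.read(1)
```

Because reads only need an offset, a consumer can re-read from any earlier position without affecting other consumers.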
How are partitions assigned?
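A rough sketch of the usual rule: a record with a key goes to hash(key) modulo the number of partitions, so the same key always lands in the same partition; a record without a key is spread across partitions. The class below is hypothetical. It uses CRC32 as a stand-in hash (the Java client uses murmur2) and plain round robin for keyless records (newer clients use a sticky strategy instead).

```python
import itertools
import zlib

class SketchPartitioner:
    """Illustrative partitioner; not the Kafka client's implementation."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key):
        if key is None:
            # No key: rotate through partitions to balance load.
            return next(self._counter) % self.num_partitions
        # Keyed: a stable hash keeps all records for a key in one partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

partitioner = SketchPartitioner(num_partitions=3)
keyless = [partitioner.partition(None) for _ in range(4)]
same_key = {partitioner.partition("user-42") for _ in range(5)}
```

Ordering is only guaranteed within a partition, so records that must stay in order relative to each other need to share a key.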
Data retention
● Log retention by time
○ log.retention.ms (or log.retention.minutes / log.retention.hours)
● Log retention by size
○ log.retention.bytes
● Log segments
○ log.segment.bytes
○ log.roll.ms (topic level: segment.ms)
● Log compaction
Log compaction
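A simplified sketch of what compaction achieves: for each key only the most recent record survives, and a key whose latest value is a tombstone (None here) disappears. The function name is hypothetical, and the sketch ignores details such as compaction running per segment and tombstones being kept for a grace period before removal.

```python
def compact(log):
    """Return (offset, key, value) for the latest record of each key.

    `log` is a list of (key, value) pairs in offset order.
    A value of None is treated as a tombstone (delete marker).
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    return sorted(survivors)  # original offsets are preserved, with gaps

log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 5)]
compacted = compact(log)
```

Offsets are not renumbered, so a compacted partition has gaps in its offset sequence.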
Broker, Cluster and Zookeeper
Replication
● Replica types
○ Leader replica
○ Follower replica
● In-sync replicas (ISR)
○ min.insync.replicas
● Replication factor
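How these settings interact on a write with acks=all can be sketched as follows: the write is rejected when fewer than min.insync.replicas replicas are in sync, otherwise it is copied to every in-sync replica. The function, exception name, and broker ids are hypothetical; real replication is asynchronous, with followers fetching from the leader.

```python
class NotEnoughReplicasError(Exception):
    """Raised when the ISR is smaller than min.insync.replicas."""

def produce_acks_all(record, replica_logs, isr, min_insync_replicas):
    """Append to all in-sync replicas, or reject if the ISR is too small."""
    if len(isr) < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {len(isr)} < min.insync.replicas {min_insync_replicas}"
        )
    for broker_id in isr:
        replica_logs[broker_id].append(record)
    return len(isr)

# Replication factor 3: one log per broker holding a replica.
replica_logs = {1: [], 2: [], 3: []}
acked_by_3 = produce_acks_all("r1", replica_logs, isr={1, 2, 3}, min_insync_replicas=2)
# Broker 3 falls behind and leaves the ISR; writes still succeed.
acked_by_2 = produce_acks_all("r2", replica_logs, isr={1, 2}, min_insync_replicas=2)
# Only the leader is left in the ISR; the write is rejected.
try:
    produce_acks_all("r3", replica_logs, isr={1}, min_insync_replicas=2)
    rejected = False
except NotEnoughReplicasError:
    rejected = True
```

With replication factor 3 and min.insync.replicas 2, the partition tolerates one broker failure without losing acknowledged writes or availability.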
Producer
● acks
● Batching
○ batch.size
○ linger.ms
● Send Async
● Load balancing
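The interplay of batch.size and linger.ms can be sketched with a small buffer driven by an explicit clock: a batch is sent when it reaches the size limit or when it has waited long enough, whichever comes first. The class is hypothetical and simplified; the real producer batches per partition and closes a batch before it would exceed the size limit.

```python
class BatchingBuffer:
    """Toy producer buffer illustrating batch.size and linger.ms."""

    def __init__(self, batch_size_bytes, linger_ms):
        self.batch_size_bytes = batch_size_bytes
        self.linger_ms = linger_ms
        self.pending = []
        self.pending_bytes = 0
        self.first_append_ms = None
        self.sent_batches = []

    def append(self, payload, now_ms):
        if not self.pending:
            self.first_append_ms = now_ms  # linger timer starts here
        self.pending.append(payload)
        self.pending_bytes += len(payload)
        if self.pending_bytes >= self.batch_size_bytes:
            self._flush()  # size limit reached: send immediately

    def tick(self, now_ms):
        """Called periodically; sends a batch that has waited linger_ms."""
        if self.pending and now_ms - self.first_append_ms >= self.linger_ms:
            self._flush()

    def _flush(self):
        self.sent_batches.append(self.pending)
        self.pending = []
        self.pending_bytes = 0
        self.first_append_ms = None

buffer = BatchingBuffer(batch_size_bytes=10, linger_ms=5)
buffer.append(b"aaaa", now_ms=0)
buffer.append(b"bbbb", now_ms=1)
buffer.append(b"cccc", now_ms=2)   # 12 bytes >= 10: flushed by size
buffer.append(b"dd", now_ms=10)
buffer.tick(now_ms=12)             # only 2 ms waited: not sent yet
buffer.tick(now_ms=15)             # 5 ms waited: flushed by linger
```

A larger linger.ms trades a little latency for fewer, bigger requests.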
Producer flow
Consumer and Consumer groups
● Push vs Pull
● Consumer position - offset maintenance
○ Auto commit
○ __consumer_offsets
● Replay
● Partition assignment
○ Range
○ Round robin
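The two assignment strategies can be compared with a short sketch, assuming every consumer in the group subscribes to every topic. Range works topic by topic and gives each consumer a contiguous block of partitions; round robin deals all partitions out one at a time. Function names, topic names, and consumer ids are illustrative.

```python
def range_assign(topics, consumers):
    """Per topic, give each consumer a contiguous range of partitions."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for topic in sorted(topics):
        per_consumer, extra = divmod(topics[topic], len(consumers))
        start = 0
        for i, consumer in enumerate(consumers):
            # The first `extra` consumers get one additional partition.
            count = per_consumer + (1 if i < extra else 0)
            assignment[consumer].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment

def round_robin_assign(topics, consumers):
    """Deal all topic-partitions out to consumers one at a time."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, topic_partition in enumerate(all_partitions):
        assignment[consumers[i % len(consumers)]].append(topic_partition)
    return assignment

topics = {"clicks": 3, "views": 3}   # topic name -> partition count
consumers = ["c1", "c2"]
by_range = range_assign(topics, consumers)
by_round_robin = round_robin_assign(topics, consumers)
```

With two topics of three partitions each, range gives c1 four partitions and c2 two, while round robin gives three each; the imbalance of range grows with the number of topics.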
Delivery semantics
● At most once
● At least once
● Exactly once
○ Idempotency (exactly once produce)
○ Transactions (end to end exactly once for Kafka-to-Kafka processing) - simplest with Kafka Streams
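The idea behind idempotent produce can be sketched as sequence-number deduplication: each producer tags its records with an increasing sequence number, and the partition drops a record it has already appended, so a retry after a lost acknowledgement does not create a duplicate. The class below is a hypothetical simplification of the broker-side check.

```python
class IdempotentPartition:
    """Toy partition that deduplicates retries by producer sequence number."""

    def __init__(self):
        self.log = []
        self._last_sequence = {}  # producer id -> last appended sequence

    def append(self, producer_id, sequence, record):
        """Return True if appended, False if it was a duplicate retry."""
        last = self._last_sequence.get(producer_id, -1)
        if sequence <= last:
            return False  # already written: the earlier ack was lost
        if sequence != last + 1:
            raise ValueError("out-of-order sequence number")
        self.log.append(record)
        self._last_sequence[producer_id] = sequence
        return True

partition = IdempotentPartition()
first = partition.append(producer_id=7, sequence=0, record="a")
retry = partition.append(producer_id=7, sequence=0, record="a")  # resend
second = partition.append(producer_id=7, sequence=1, record="b")
```

Without this check, the same retry would turn at-least-once delivery into a duplicated record in the log.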
Schemas
● Decouple Producers and Consumers
● Avro format
● Schema Registry
● Evolution and compatibility
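A minimal sketch of the evolution rule that makes compatibility work: a reader fills in a default for any field the writer did not know about, and ignores fields it does not know itself. This uses plain dicts as a stand-in, not the Avro library or a schema registry; the REQUIRED marker and field names are hypothetical.

```python
REQUIRED = object()  # marks a reader field that has no default

def read_record(writer_record, reader_schema):
    """Project a written record onto the reader's schema.

    reader_schema maps field name -> default value, or REQUIRED.
    """
    result = {}
    for field, default in reader_schema.items():
        if field in writer_record:
            result[field] = writer_record[field]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {field}")
        else:
            result[field] = default  # field added later, with a default
    return result

schema_v1 = {"page": REQUIRED, "visitor_id": REQUIRED}
schema_v2 = {"page": REQUIRED, "visitor_id": REQUIRED, "referrer": "unknown"}

old_record = {"page": "/home", "visitor_id": "u1"}
new_record = {"page": "/cart", "visitor_id": "u2", "referrer": "/home"}

# New consumer reading old data (backward compatible).
upgraded = read_record(old_record, schema_v2)
# Old consumer reading new data (forward compatible).
downgraded = read_record(new_record, schema_v1)
```

Adding a field without a default would break the first case, which is the kind of change a compatibility check is meant to reject.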
Fire fighting stories
● Consumer offsets retention
● Increase number of partitions
● Record without key
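Why increasing the number of partitions can hurt is easy to show with a sketch: since the partition is derived from hash(key) modulo the partition count, changing the count sends many existing keys to a different partition, so new records for a key no longer follow its older records. CRC32 is used here as a stand-in hash, and the keys are made up.

```python
import zlib

def partition_for(key, num_partitions):
    """Stand-in for a key-hash partitioner (Kafka's client uses murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(100)]
before = {k: partition_for(k, 3) for k in keys}   # topic with 3 partitions
after = {k: partition_for(k, 4) for k in keys}    # same topic grown to 4

# Keys whose records now land in a different partition than before.
moved = [k for k in keys if before[k] != after[k]]
```

Any consumer logic that relies on per-key ordering or per-key local state has to account for the keys in `moved`.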
Resources:
https://www.confluent.io/resources
https://kafka.apache.org/
Feedback
bit.ly/geeknight_cbe
Advanced
● Security
● Configuration
● Monitoring
● Transactions
● Kafka connect
● Schema registry
● Kafka streams
● KSQL
