What is Apache Kafka?
How it’s similar to the databases you know and love, and how it’s not.
Kenny Gorman
Founder and CEO
www.eventador.io
www.kennygorman.com
@kennygorman
I am a database nerd
I have done database foo for my whole career, going on 25 years.
Sybase, Oracle DBA, PostgreSQL DBA, MySQL aficionado, MongoDB early adopter; founded two companies based on data technologies.
Broke lots of stuff, lost data before, recovered said data, stayed up many nights, on-call shift horror stories.
Apache Kafka is really cool; as fellow database nerds, you will appreciate it.
‘02 had hair ^
Now… lol
Kafka
Comparison with the databases you are familiar with
Kafka
Apache Kafka is an open-source stream processing and pub/sub messaging platform developed by the Apache Software Foundation, written in Scala and Java. The project aims blah blah blah "pub/sub message queue architected as a distributed transaction log,"[3] blah blah blah to process streaming data. Blah blah blah.
The design is heavily influenced by transaction logs.[4]
Why Kafka?
- High Performance Streaming Data
- Persistent
- Distributed
- Fault Tolerant
- K.I.S.S.
- Many Modern Use Cases
Pub/Sub Messaging Attributes
- It’s a stream of data. A boundless stream of data.
Image: https://kafka.apache.org
{"temperature": 29}
{"temperature": 29}
{"temperature": 30}
{"temperature": 29}
{"temperature": 29}
{"temperature": 30}
{"temperature": 29}
{"temperature": 29}
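To make the "boundless stream" idea concrete, here is a minimal sketch in plain Python. It has nothing to do with the Kafka client itself: `temperature_stream` is a hypothetical helper that cycles over a few sample readings forever, so a reader only ever takes a slice of the stream and never sees "the end" of it.

```python
import itertools
import json


def temperature_stream(readings):
    """Yield JSON-encoded readings forever by cycling over the sample values."""
    for value in itertools.cycle(readings):
        yield json.dumps({"temperature": value})


# The stream never ends, so a reader takes as many messages as it needs.
stream = temperature_stream([29, 29, 30])
first_five = list(itertools.islice(stream, 5))
for message in first_five:
    print(message)
```

Calling `list(stream)` directly would never return, which is the practical difference from a query that yields a finite result set.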
Logical Data Organization
PostgreSQL   | MongoDB          | Kafka
Database     | Database         | Topic Files
Fixed Schema | Non Fixed Schema | Key/Value Message
Table        | Collection       | Topic
Row          | Document         | Message
Column       | Name/Value Pairs |
             | Shard            | Partition
Storage Architecture
PostgreSQL                     | MongoDB                          | Kafka
Stores data in files on disk   | Stores data in files on disk     | Stores data in files on disk
Has journal for recovery (WAL) | Has journal for recovery (Oplog) | Is a commit log
FS + Buffer Cache              | FS for caching *                 | FS for caching
Random Access, Indexing        | Random Access, Indexing          | Sequential access
Topics
- Core to design of Kafka
- Partitioning
- Consumers and Consumer Groups
- Offsets ~= High Water Mark
Image: https://kafka.apache.org
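One way to picture consumers and consumer groups: within a group, each partition of a topic is read by exactly one consumer. The sketch below is a toy round-robin assignment in plain Python. The function name and consumer names are hypothetical, and real Kafka clients ship their own assignment strategies, so treat this as an illustration of the idea only.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions round-robin across the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Five partitions shared by a two-member group (names are made up).
assignment = assign_partitions([0, 1, 2, 3, 4], ["consumer-a", "consumer-b"])
print(assignment)
```

Adding a third consumer and re-running the assignment spreads the same partitions over more readers, which is roughly what a group rebalance achieves.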
Performance
- Kafka topics are glorified distributed write-ahead logs
- Append only
- k/v pairs where the key decides the partition it lives in
- Sendfile system call optimization
- Client-controlled routing
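A rough sketch of "the key decides the partition" and "append only", in plain Python. The `Topic` class is hypothetical, and CRC32 is only a stand-in hash chosen because it ships with the standard library; real Kafka clients use their own partitioner. The point is that the client computes the partition from the key, and each partition is a log that only grows.

```python
import zlib


class Topic:
    """A toy topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash (CRC32); an assumption, not Kafka's actual partitioner.
        return zlib.crc32(key) % len(self.partitions)

    def append(self, key, value):
        """Append a key/value pair and return (partition, offset)."""
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1


topic = Topic(num_partitions=3)
p1, o1 = topic.append(b"sensor-1", b'{"temperature": 29}')
p2, o2 = topic.append(b"sensor-1", b'{"temperature": 30}')
print(p1 == p2, o1, o2)  # same key, same partition, consecutive offsets
```

Because the same key always lands in the same partition, messages for one key keep their relative order.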
Availability and Fault Tolerance
- Topics are replicated among any number of servers (brokers)
- Topics can be configured individually
- Topic partitions are the unit of replication
"The unit of replication is the topic partition. Under non-failure conditions, each partition in Kafka has a single leader and zero or more followers."
MongoDB: majority consensus (Raft-like in 3.2)
Kafka: ISR (in-sync replica) set vote, stored in ZooKeeper (ZK)
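The leader/follower arrangement can be sketched as a toy model in plain Python. Everything here is an assumption made for illustration: the `Partition` class, the broker ids, and the "promote the first remaining in-sync replica" policy. In a real cluster this bookkeeping is done by Kafka itself, not by application code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Partition:
    """Toy replication state for one topic partition."""
    leader: Optional[int]
    isr: List[int] = field(default_factory=list)  # in-sync broker ids, leader included

    def fail_broker(self, broker_id):
        """Drop a failed broker from the ISR and pick a new leader if needed."""
        self.isr = [b for b in self.isr if b != broker_id]
        if broker_id == self.leader:
            # Toy policy: promote the first remaining in-sync replica.
            self.leader = self.isr[0] if self.isr else None


partition = Partition(leader=1, isr=[1, 2, 3])
partition.fail_broker(1)
print(partition.leader, partition.isr)
```

The design point the sketch tries to capture is that the new leader comes from the set of replicas that were already caught up, so no acknowledged data is lost on failover.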
Application Programming Interfaces
Insert
  PostgreSQL:
    sql = "insert into mytable .."
    db.execute(sql)
    db.commit()
  MongoDB:
    db.mytable.save({"baz": 1})
  Kafka:
    producer.send("mytopic", "{'baz':1}")
Query
  PostgreSQL:
    sql = "select * from …"
    cursor = db.execute(sql)
    for record in cursor:
        print(record)
  MongoDB:
    db.mytable.find({"baz": 1})
  Kafka:
    consumer = get_from_topic("mytopic")
    for message in consumer:
        print(message)
Update
  PostgreSQL:
    sql = "update mytable set .."
    db.execute(sql)
    db.commit()
  MongoDB:
    db.mytable.update({"baz": 1}, {"baz": 2})
  Kafka:
    (no direct equivalent)
Delete
  PostgreSQL:
    sql = "delete from mytable .."
    db.execute(sql)
    db.commit()
  MongoDB:
    db.mytable.remove({"baz": 1})
  Kafka:
    (no direct equivalent)
Typical RDBMS
    import psycopg2.extras

    conn = database_connect()  # helper that returns a psycopg2 connection (not shown)
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cur.execute(
        """
        SELECT a.lastname, a.firstname, a.email,
               a.userid, a.password, a.username, b.orgname
        FROM users a, orgs b
        WHERE a.orgid = b.orgid
          AND a.orgid = %(orgid)s
        """,
        {"orgid": orgid},  # orgid is supplied by the caller
    )
    results = cur.fetchall()
    for result in results:
        print(result)
Publishing
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:1234')
    for _ in range(100):
        producer.send('foobar', b'some_message_bytes')
    producer.flush()  # block until buffered messages have been sent
- Flush frequency/batch
- Partition keys
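The "flush frequency/batch" point can be illustrated with a toy model in plain Python. `BatchingProducer` and its `transport` callable are hypothetical and are not the kafka-python API. The sketch only shows why a send call may return before anything leaves the process, and why an explicit flush matters before shutdown.

```python
class BatchingProducer:
    """Toy producer that buffers messages and hands them off in batches."""

    def __init__(self, batch_size, transport):
        self.batch_size = batch_size
        self.transport = transport  # callable that receives a list of messages
        self.buffer = []

    def send(self, message):
        """Buffer a message; hand off a batch once enough have accumulated."""
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Hand off whatever is buffered, even if the batch is not full."""
        if self.buffer:
            self.transport(list(self.buffer))
            self.buffer.clear()


sent_batches = []
toy_producer = BatchingProducer(batch_size=3, transport=sent_batches.append)
for i in range(7):
    toy_producer.send(f"message-{i}".encode())
pending = len(toy_producer.buffer)  # still buffered, not yet handed off
toy_producer.flush()
print(pending, [len(batch) for batch in sent_batches])
```

Larger batches mean fewer, bigger writes at the cost of latency; skipping the final flush would silently drop the buffered tail in this model.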
Subscribing (Consume)
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
    consumer.subscribe(['my-topic'])
    for msg in consumer:
        print(msg)
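Subscribing (Consume)
    # This loop appears to follow the confluent-kafka Consumer API; consumer,
    # running, MIN_COMMIT_COUNT and msg_process are defined elsewhere.
    try:
        msg_count = 0
        while running:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            msg_process(msg)  # application-specific processing
            msg_count += 1
            if msg_count % MIN_COMMIT_COUNT == 0:
                # 'async' is a reserved word in Python 3.7+; use 'asynchronous'
                consumer.commit(asynchronous=False)
    finally:
        # Shut down consumer
        consumer.close()
- Continuous ‘cursor’
- Offset management
- Partition assignment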
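Offset management is the part that differs most from a database cursor, so here is a small plain-Python sketch of it. `ToyConsumer` is hypothetical and models a single partition as a list. It shows a consumer that commits its position, "crashes" after reading one more message, and then resumes from the last committed offset.

```python
class ToyConsumer:
    """Reads one partition log sequentially and tracks a committed offset."""

    def __init__(self, log, committed=0):
        self.log = log
        self.position = committed   # next offset to read
        self.committed = committed  # last position recorded as processed

    def poll(self):
        """Return the next message, or None when caught up with the log."""
        if self.position >= len(self.log):
            return None
        message = self.log[self.position]
        self.position += 1
        return message

    def commit(self):
        """Record the current position so a restart can resume from it."""
        self.committed = self.position


log = [b"m0", b"m1", b"m2", b"m3"]

first = ToyConsumer(log)
first.poll()      # m0
first.poll()      # m1
first.commit()    # offsets 0 and 1 are now recorded as processed
first.poll()      # m2 is read, but the consumer "crashes" before committing

restarted = ToyConsumer(log, committed=first.committed)
redelivered = restarted.poll()
print(first.committed, redelivered)
```

The message read after the last commit is delivered again on restart, which is why processing usually needs to tolerate duplicates.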
Tooling
- No simple command console like psql or mongo shell
- BOFJCiS
- Kafkacat, jq
- Shell scripts, MirrorMaker, etc.
- PrestoDB
Settings and Tunables
PostgreSQL:
- Shared Buffers
- WAL/recovery
MongoDB (MMAPv1):
- directoryPerDB
- FS tuning
Kafka:
- Xmx ~ 90% memory
- log.retention.hours
https://kafka.apache.org/documentation
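To give a feel for what a time-based retention setting such as log.retention.hours controls, here is a plain-Python sketch. It assumes retention is applied per log segment, based on the newest record in each segment. The function, the segment names, and the timestamps are all made up for illustration.

```python
def expired_segments(segments, now, retention_hours):
    """Return names of segments whose newest record is older than the window.

    segments is a list of (name, newest_record_timestamp_in_seconds) pairs.
    """
    cutoff = now - retention_hours * 3600
    return [name for name, newest_ts in segments if newest_ts < cutoff]


now = 1_000_000  # arbitrary "current time" in seconds, for the example only
segments = [
    ("segment-a", now - 200 * 3600),  # newest record is 200 hours old
    ("segment-b", now - 100 * 3600),  # 100 hours old
    ("segment-c", now - 1 * 3600),    # 1 hour old
]
to_delete = expired_segments(segments, now, retention_hours=168)
print(to_delete)
```

Unlike a database, where rows live until someone deletes them, data in this model ages out on its own once it falls outside the window.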
Contact
We are hiring!
www.eventador.io
@kennygorman

