1
Why is My Stream Processing Job Slow?
Xavier Léauté, Software Engineer
Gwen Shapira, Principal Data Architect
2
Kafka 101
Distributed
Scalable
Fault-Tolerant
Partitioned + Replicated Log
Ordering guarantees
Consumers advance independently
Exactly-once delivery
Transactional commits
3
What people think of Stream Monitoring
4
What our typical experience is
Confidential 5
Real Customer Experiences
Client Side Broken Streaming Job / App
End-to-End Slow Replication
6
Your Kafka stream job stopped humming… now what?
Confidential 7
What we check
Consumer Lag
Partition Assignment
Partition Skew
Client Logs
GC Log
Metrics
Request Latencies
Commit Rates
Group Rebalancing
Basic Tuning
Batch Sizes
Commit Rate
Application Profiling
8
The Newbie - During an incident…
GC Logs? Metrics? How do I get those?
I’ll just change some configs and reboot everything.
9
Consumer Lag
Wait for me!
10
Bad Capacity Allocation
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader

TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID
fast-data  1          8661694         8703404         41710    myapp-1
fast-data  3          8577975         8616490         38515    myapp-2
fast-data  0          4902354         8741872         3839518  myapp-3
fast-data  2          4922614         8621757         3699143  myapp-3
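A quick way to read this output is to total partitions and lag per consumer. The sketch below is only an illustration: it parses the rows shown on the slide, and the helper name `lag_by_consumer` is hypothetical, not part of any Kafka tool.

```python
from collections import defaultdict

# Rows copied from the kafka-consumer-groups output shown on the slide.
DESCRIBE_OUTPUT = """\
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
fast-data 1 8661694 8703404 41710 myapp-1
fast-data 3 8577975 8616490 38515 myapp-2
fast-data 0 4902354 8741872 3839518 myapp-3
fast-data 2 4922614 8621757 3699143 myapp-3
"""


def lag_by_consumer(describe_output):
    """Return {consumer_id: (partition_count, total_lag)} (hypothetical helper)."""
    lines = describe_output.strip().splitlines()
    header = lines[0].split()
    lag_col = header.index("LAG")
    consumer_col = header.index("CONSUMER-ID")
    partitions = defaultdict(int)
    lag = defaultdict(int)
    for line in lines[1:]:
        fields = line.split()
        consumer = fields[consumer_col]
        partitions[consumer] += 1
        lag[consumer] += int(fields[lag_col])
    return {c: (partitions[c], lag[c]) for c in partitions}


summary = lag_by_consumer(DESCRIBE_OUTPUT)
for consumer, (count, total) in sorted(summary.items()):
    print(f"{consumer}: {count} partition(s), total lag {total}")
```

In this sample, myapp-3 owns two partitions and carries nearly all of the lag, which is the capacity problem the slide points at.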
11
Watch for Partition Skew
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader

TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID
fast-data  1          8661694         8703404         41710    myapp-1
fast-data  3          8577975         8616490         38515    myapp-2
fast-data  0          4902354         8741872         3839518  myapp-3
fast-data  2          4922614         8621757         3699143  myapp-3
12
Not all partitions are created equal
Important for
Keyed topics
Custom partitioned topics
Early warning signs
some partitions lagging
uneven CPU / Network usage
Typical cause
skewed key distribution in your data
bad joins (null keys)
imbalance across brokers
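To see how a skewed key distribution turns into a lagging partition, here is a minimal sketch. Everything in it is an assumption for illustration: the four-partition topic, the "user-hot" key, and the use of `zlib.crc32` as a stand-in hash (Kafka's default partitioner hashes the serialized key with murmur2, so real partition numbers will differ).

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical topic size


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stand-in hash for illustration only; not Kafka's murmur2 partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions=NUM_PARTITIONS):
    """Count how many records land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: one hot key produces 70% of the records.
keys = ["user-hot"] * 700 + [f"user-{i}" for i in range(300)]
counts = partition_counts(keys)
hot = partition_for("user-hot")
print("records per partition:", counts)
print("partition holding the hot key:", hot)
```

Whichever partition receives the hot key ends up with at least 70% of the traffic, no matter how many consumers are added to the group.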
13
Clients have metrics too!
Start with the basics GC / CPU / Network
General Slowness
Consumer or Producer Side?
Global Request Latencies
Some partitions still lagging
Per Broker metrics (bad node / network)
Per Topic metrics (data / tuning)
Buffer Size
Offset Commit
14
Turn up the log level
The logs took too much space, so we deleted them.
15
Time for some profiling
https://github.com/jvm-profiling-tools/async-profiler
https://github.com/brendangregg/FlameGraph
./profiler.sh -d 30 -f flamegraph.svg <pid>
To impress your coworkers
https://github.com/Netflix/flamescope
16
Here’s where your CPU cycles went
[Flame graph: % CPU Time (x-axis) by Stack (y-axis); highlighted regions: GC, RocksDB, Kafka poll() loop, Actual Processing Time]
17
Spark Streaming Clickstream Example (using Kafka)
[Flame graph; annotated regions: Scheduler, Event Loop, Shuffle Writes, 30% deserialization, Read from Kafka & Processing]
19
Maybe it’s your code
20
Let’s commit, just to be safe, right?
Common beginner mistake
Commit only as needed
keep recovery short
maximize throughput
Metrics to validate
commit-rate
commit-latency-avg
[Diagram labels: MESSAGES, COMMIT, MESSAGES]
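The trade-off between commit frequency and throughput can be sketched with simple counting. This is a toy model, not client code: `count_commits` is a hypothetical helper, and the message counts are made up. It assumes one commit after every `commit_every` processed messages, plus a final commit for any remainder.

```python
def count_commits(num_messages, commit_every):
    """Count offset commits when committing once per `commit_every` messages."""
    commits = 0
    uncommitted = 0
    for _ in range(num_messages):
        uncommitted += 1  # one message processed
        if uncommitted == commit_every:
            commits += 1
            uncommitted = 0
    if uncommitted:
        commits += 1  # final commit for the leftover messages
    return commits


for commit_every in (1, 100, 1000):
    commits = count_commits(10_000, commit_every)
    # After a crash, up to `commit_every` processed messages may be read again.
    print(f"commit every {commit_every}: {commits} commits, "
          f"replay at most {commit_every} messages after a crash")
```

Committing after every message costs one round trip per message; committing every 1000 cuts that by three orders of magnitude at the price of a longer replay on recovery, which is why commit-rate and commit-latency-avg are the metrics to validate against.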
21
Right-size your batches
Bigger Batches
increase throughput
improve compression
Small enough (<< 10MB) to keep GC low
batch.size + linger.ms (don’t forget!)
Watch
request-rate
request-latency-avg
compression-rate
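Why linger.ms matters alongside batch.size can be shown with a rough simulation. This is a simplified model under stated assumptions, not the producer's actual batching logic: records have a fixed size, arrive at a fixed interval, go to a single partition, and a batch is sent when the linger time has elapsed or the next record would not fit. All numbers are hypothetical.

```python
def count_batches(num_records, record_bytes, interarrival_ms, batch_size, linger_ms):
    """Count batches sent under a simplified batch.size + linger.ms model."""
    batches = 0
    batch_bytes = 0
    batch_start_ms = 0
    for i in range(num_records):
        now = i * interarrival_ms
        # Linger expired: send the open batch before accepting this record.
        if batch_bytes and now - batch_start_ms >= linger_ms:
            batches += 1
            batch_bytes = 0
        # Batch would overflow: send it before accepting this record.
        if batch_bytes and batch_bytes + record_bytes > batch_size:
            batches += 1
            batch_bytes = 0
        if batch_bytes == 0:
            batch_start_ms = now
        batch_bytes += record_bytes
    if batch_bytes:
        batches += 1  # flush the last open batch
    return batches


# 1000 records of 100 bytes arriving 1 ms apart (hypothetical workload).
print(count_batches(1000, 100, 1, batch_size=16384, linger_ms=0))   # no lingering
print(count_batches(1000, 100, 1, batch_size=16384, linger_ms=10))  # linger 10 ms
print(count_batches(1000, 100, 1, batch_size=500, linger_ms=1000))  # size-limited
```

In this model a large batch.size changes nothing when linger is zero, because every record is sent on its own; the two settings have to be tuned together.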
22
My app keeps rebalancing
Symptoms
low throughput
high network chatter
consumer logs galore
no progress
hanging
23
Kafka Consumer Group Rebalancing 101
[Animated diagram: Consumer A and Consumer B share Partitions 1 to 6. Consumer C arrives ("Hi!"). The sequence shown is Join Group, Join Response, Sync Group, Sync Response, then a New Assignment across Consumers A, B and C.]
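The effect of a new member joining can be sketched with a plain round-robin assignment. This is an illustrative simplification, not one of Kafka's actual assignor implementations; the function names are hypothetical, and the consumers and partitions mirror the diagram.

```python
def assign_round_robin(consumers, partitions):
    """Deal partitions out to consumers in turn (simplified sketch)."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


def owner_of(assignment, partition):
    """Return the consumer that owns `partition` in `assignment`."""
    for consumer, owned in assignment.items():
        if partition in owned:
            return consumer
    return None


partitions = [1, 2, 3, 4, 5, 6]
before = assign_round_robin(["A", "B"], partitions)
after = assign_round_robin(["A", "B", "C"], partitions)
moved = [p for p in partitions if owner_of(before, p) != owner_of(after, p)]

print("before:", before)
print("after: ", after)
print("partitions that changed owner:", moved)
```

In this sketch four of the six partitions change owner when Consumer C joins, which hints at why frequent rebalances hurt throughput.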
24
Restoring a Happy Balance
Timing Issues
long GC pauses (tens of seconds)
infrequent calls to poll()
timeouts too short?
flaky network
1 bad machine affects the entire group
Watch
join-rate
sync-rate
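"Infrequent calls to poll()" can be checked with back-of-the-envelope arithmetic. The sketch below assumes the consumer processes a full batch of records between two poll() calls and compares that time with max.poll.interval.ms. The function names and the numbers are illustrative assumptions; check the values your own client is configured with.

```python
def ms_between_polls(max_poll_records, per_record_ms):
    """Worst-case time spent processing one full batch between poll() calls."""
    return max_poll_records * per_record_ms


def risks_rebalance(max_poll_records, per_record_ms, max_poll_interval_ms):
    """True if a full batch takes longer than the allowed poll interval."""
    return ms_between_polls(max_poll_records, per_record_ms) > max_poll_interval_ms


# Hypothetical settings: 5-minute poll interval, 1 second of work per record.
MAX_POLL_INTERVAL_MS = 300_000
print(risks_rebalance(500, 1000, MAX_POLL_INTERVAL_MS))  # 500 s of work per batch
print(risks_rebalance(100, 1000, MAX_POLL_INTERVAL_MS))  # 100 s of work per batch
```

If the first case applies, lowering max.poll.records or raising max.poll.interval.ms are the two obvious levers, alongside making the per-record work faster.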
25
Competent Users
• Monitor Consumer Lag
• Look out for Partition Skew
• Commit Offsets Sparingly
• Collect Logs
• Understand how to tune Batch Sizes
26
Kafka Pros
• Watch Group Partition Assignment
• Monitor Client Metrics
• Understand Consumer Rebalancing
• Profile their applications
• Distinguish Client/App/Broker problems
27
Replication: Everything is Slow
28
Famous last words…
“You just consume, and produce. How hard can this be?”
29
Famous last words…
“We have a disaster in our main cluster. Can we fail over to secondary? We can’t lose more than 7 seconds of data.”
30
Monitor Replication Lag - In messages
31
Monitor Replication Lag - or in seconds…
Screenshot of replicator streams monitoring
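A lag in messages only becomes meaningful against a recovery target such as "7 seconds of data" once it is converted to time. The sketch below is a rough estimate under an assumption: the origin produces at a steady rate, so lag in seconds is approximately lag in messages divided by that rate. The function names and numbers are hypothetical.

```python
def approx_lag_seconds(lag_messages, produce_rate_per_s):
    """Rough time lag, assuming a steady produce rate on the origin cluster."""
    return lag_messages / produce_rate_per_s


def within_target(lag_messages, produce_rate_per_s, target_seconds):
    """True if the estimated time lag fits within the recovery target."""
    return approx_lag_seconds(lag_messages, produce_rate_per_s) <= target_seconds


# The same message lag means very different things at different rates.
print(approx_lag_seconds(35_000, 10_000), within_target(35_000, 10_000, 7))
print(approx_lag_seconds(35_000, 1_000), within_target(35_000, 1_000, 7))
```

A lag of 35,000 messages is comfortably inside a 7-second target on a busy topic and far outside it on a quiet one, which is the argument for monitoring lag in seconds as well.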
32
Simple and elegant design
[Diagram: replication from Origin to Destination through a Consumer, Buffers, and a Producer; block when buffer is full]
Producer metrics to watch:
io-ratio, io-wait-ratio, outgoing-byte-rate
batch-size-avg, batch-size-max
record-retry-rate, record-error-rate
waiting-threads, bufferpool-wait-time
Consumer metrics to watch:
io-ratio, io-wait-ratio, bytes-consumed-rate
fetch-size-avg, fetch-size-max, fetch-rate
records-lag-max
33
Simple and elegant design
[Diagram: Origin, Consumer, Buffer, Producer, Destination, annotated with what to suspect or tune]
network or destination Kafka performance
increase batch.size
destination Kafka issues
network or origin Kafka performance
fetch.max.bytes, fetch.min.bytes, fetch.max.wait.ms
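The "block when buffer is full" design can be sketched with a bounded queue between a consuming thread and a producing thread. This is a toy model with hypothetical names, not the replicator's implementation: the records are plain Python objects and no Kafka client is involved.

```python
import queue
import threading


def replicate(records, buffer_size):
    """Copy records through a bounded buffer; the reader blocks when it is full."""
    buffer = queue.Queue(maxsize=buffer_size)
    delivered = []
    done = object()  # sentinel marking the end of the stream

    def consume_from_origin():
        for record in records:
            buffer.put(record)  # blocks while the buffer is full (backpressure)
        buffer.put(done)

    def produce_to_destination():
        while True:
            record = buffer.get()
            if record is done:
                break
            delivered.append(record)  # stand-in for a send to the destination

    threads = [
        threading.Thread(target=consume_from_origin),
        threading.Thread(target=produce_to_destination),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return delivered


result = replicate(list(range(100)), buffer_size=4)
print(len(result), result[:5])
```

Because the reader stalls whenever the writer falls behind, a slow destination shows up as a slow consumer on the origin side, which is why the metrics on both ends need to be read together.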
34
Network Tuning
• WAN has high latency. We deal with it.
• Compute buffer size to match: https://www.switch.ch/network/tools/tcp_throughput/
• send.buffer.bytes and receive.buffer.bytes on producer and consumer (socket.send.buffer.bytes and socket.receive.buffer.bytes on brokers)
• OS tuning: https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
  net.core.rmem_default, net.core.rmem_max, net.core.wmem_default, net.core.wmem_max
• Enable logging to check if this had any effect:
  log4j.logger.org.apache.kafka.common.network.Selector=DEBUG
• Additional tips in our docs
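"Compute buffer size to match" usually means sizing socket buffers to the bandwidth-delay product of the link. A minimal sketch of that arithmetic follows; the link speeds and round-trip times are made-up examples, and `bdp_bytes` is a hypothetical helper name.

```python
def bdp_bytes(bandwidth_mbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    bytes_per_second = bandwidth_mbps * 1_000_000 / 8
    return int(bytes_per_second * rtt_ms / 1000)


# Hypothetical links: a 1 Gbps WAN with 80 ms RTT, and a 100 Mbps link with 10 ms RTT.
print(bdp_bytes(1000, 80))
print(bdp_bytes(100, 10))
```

The result is a starting point for send.buffer.bytes and receive.buffer.bytes; the OS limits (net.core.rmem_max, net.core.wmem_max) have to be at least as large for the setting to take effect.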
35
Competent users
• Monitor consumer lag
• Add processes when things are slow
• Automate deployment
36
Kafka Pros
• Monitor time lag
• Collect client metrics
• Know which side to blame
• Know which configs to tune
• Tune the network over the WAN
Resources and Next Steps
https://github.com/confluentinc/cp-demo
https://www.confluent.io/download/
https://slackpass.io/confluentcommunity
https://www.confluent.io/blog
Thank you!
@gwenshap

gwen@confluent.io
@xvrl

xavier@confluent.io

Why is My Stream Processing Job Slow? with Xavier Leaute

  • 1. 1 Why is My Stream Processing Job Slow? Xavier Léauté, Software Engineer Gwen Shapira, Principal Data Architect
  • 2. 2 Kafka 101 Distributed Scalable Fault-Tolerant Partitioned + Replicated Log Ordering guarantees Consumers advance independently Exactly-once delivery Transactional commits
  • 3. What people think of Stream Monitoring 3
  • 4. What our typical experience is 4
  • 6. Confidential 5 Real Customer Experiences Client Side Broken Streaming Job / App
  • 7. Confidential 5 Real Customer Experiences Client Side Broken Streaming Job / App End-to-End Slow Replication
  • 8. Your Kafka stream job stopped humming… now what? 6
  • 9. Confidential 7 What we check Consumer Lag Partition Assignment Partition Skew Client Logs GC Log Metrics Request Latencies Commit Rates Group Rebalancing Basic Tuning Batch Sizes Commit Rate Application Profiling
  • 10. 8 The Newbie - During an incident… GC Logs? Metrics?
 How do I get those? I’ll just change some configs and reboot everything.
  • 12. 10 Bad Capacity Allocation kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID fast-data 1 8661694 8703404 41710 myapp-1 fast-data 3 8577975 8616490 38515 myapp-2 fast-data 0 4902354 8741872 3839518 myapp-3 fast-data 2 4922614 8621757 3699143 myapp-3
  • 13. 10 Bad Capacity Allocation kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID fast-data 1 8661694 8703404 41710 myapp-1 fast-data 3 8577975 8616490 38515 myapp-2 fast-data 0 4902354 8741872 3839518 myapp-3 fast-data 2 4922614 8621757 3699143 myapp-3
  • 14. 10 Bad Capacity Allocation kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID fast-data 1 8661694 8703404 41710 myapp-1 fast-data 3 8577975 8616490 38515 myapp-2 fast-data 0 4902354 8741872 3839518 myapp-3 fast-data 2 4922614 8621757 3699143 myapp-3
  • 15. 11 Watch for Partition Skew kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID fast-data 1 8661694 8703404 41710 myapp-1 fast-data 3 8577975 8616490 38515 myapp-2 fast-data 0 4902354 8741872 3839518 myapp-3 fast-data 2 4922614 8621757 3699143 myapp-3
  • 16. 11 Watch for Partition Skew kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID fast-data 1 8661694 8703404 41710 myapp-1 fast-data 3 8577975 8616490 38515 myapp-2 fast-data 0 4902354 8741872 3839518 myapp-3 fast-data 2 4922614 8621757 3699143 myapp-3
  • 17. 12 Not all partitions are created equal Important for Keyed topics Custom partitioned topics Early warning signs some partitions lagging uneven CPU / Network usage Typical cause skewed key distribution in your data bad joins (null keys) imbalance across brokers
  • 18. 13 Clients have metrics too! Start with the basics GC / CPU / Network General Slowness Consumer or Producer Side? Global Request Latencies Some partitions still lagging Per Broker metrics (bad node / network) Per Topic metrics (data / tuning) Buffer Size Offset Commit
  • 19. 14 Turn up the log level The logs took too much space, so we deleted them.
  • 20. 15 Time for some profiling https://github.com/jvm-profiling-tools/async-profiler https://github.com/brendangregg/FlameGraph ./profiler.sh -d 30 -f flamegraph.svg <pid> To impress your coworkers https://github.com/Netflix/flamescope
  • 21. 16 Here’s where your CPU cycles went % CPU Time Stack
  • 22. 16 Here’s where your CPU cycles went GC % CPU Time Stack
  • 23. 16 Here’s where your CPU cycles went RocksDB % CPU Time Stack
  • 24. 16 Here’s where your CPU cycles went Kafka poll() loop % CPU Time Stack
  • 25. 16 Here’s where your CPU cycles went Actual Processing Time % CPU Time Stack
  • 26. 17 Spark Streaming Clickstream Example (using Kafka)
  • 27. 18 Spark Streaming Clickstream Example (using Kafka)
  • 28. 18 Spark Streaming Clickstream Example (using Kafka) Scheduler Event Loop
  • 29. 18 Spark Streaming Clickstream Example (using Kafka) Shuffle Writes Scheduler Event Loop
  • 30. 18 Spark Streaming Clickstream Example (using Kafka) 30% deserialization Shuffle Writes Scheduler Event Loop
  • 31. 18 Spark Streaming Clickstream Example (using Kafka) 30% deserialization Shuffle Writes Scheduler Event Loop Read from Kafka & Processing
  • 33. 20 Let’s commit, just to be safe, right? Common beginner mistake Commit only as needed keep recovery short maximize throughput Metrics to validate commit-rate commit-latency-avg MESSAGES COMMIT MESSAGES
  • 35. 21 Right-size your batches Bigger Batches increase throughput improve compression
  • 36. 21 Right-size your batches Bigger Batches increase throughput improve compression Small enough (<< 10MB) to keep GC low
  • 37. 21 Right-size your batches Bigger Batches increase throughput improve compression Small enough (<< 10MB) to keep GC low batch.size + linger.ms
  • 38. 21 Right-size your batches Bigger Batches increase throughput improve compression Small enough (<< 10MB) to keep GC low batch.size + linger.ms don’t forget!
  • 39. 21 Right-size your batches Bigger Batches increase throughput improve compression Small enough (<< 10MB) to keep GC low batch.size + linger.ms Watch request-rate request-latency-avg compression-rate don’t forget!
  • 40. 22 My app keeps rebalancing Symptoms low throughput high network chatter consumer logs galore no progress hanging
  • 41. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6
  • 42. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Hi!
  • 43. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Join Group Hi!
  • 44. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Join Group Hi!
  • 45. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Join Group Hi!
  • 46. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Join Response
  • 47. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Join Response
  • 48. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Sync Group
  • 49. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 Sync Response
  • 50. 23 Kafka Consumer Group Rebalancing 101 Consumer A Consumer B Consumer C Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Partition 6 New Assignment
  • 51. 24 Restoring a Happy Balance Timing Issues long GC pauses (tens of seconds) infrequent calls to poll() timeouts too short? flaky network 1 bad machine affects the entire group Watch join-rate sync-rate
  • 52. 25 Competent Users • Monitor Consumer Lag • Lookout for Partition Skew • Commit Offsets Sparingly • Collect Logs • Understand how to tune Batch Sizes
  • 53. 26 Kafka Pros • Watch Group Partition Assignment • Monitor Client Metrics • Understand Consumer Rebalancing • Profile their applications • Distinguish Client/App/Broker problems
  • 55. 28 Famous last words… “You just consume, and produce. How hard can this be?”
  • 56. 29 Famous last words… “We have a disaster in our main cluster. Can we fail over to secondary? We can’t lose more than 7 seconds of data.”
  • 57. 30 Monitor Replication Lag - In messages
  • 58. 31 Monitor Replication Lag - or in seconds… Screenshot of replicator streams monitoring
  • 59. Confidential 32 Simple and elegant design Origin Destination Consumer producer Buffer block when 
 buffer is full Buffer
  • 66. Confidential 32 Simple and elegant design: Origin, Destination, Consumer, Producer, Buffer (block when buffer is full). Producer metrics: io-ratio, io-wait-ratio, outgoing-byte-rate; batch-size-avg, batch-size-max; record-retry-rate, record-error-rate; waiting-threads, bufferpool-wait-time. Consumer metrics: io-ratio, io-wait-ratio, bytes-consumed-rate; fetch-size-avg, fetch-size-max, fetch-rate; records-lag-max
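The "block when buffer is full" behaviour in this design can be illustrated with a bounded queue. The sketch below is a toy stand-in, not the Kafka producer: `buffer`, `send`, and the tiny capacity of two records are assumptions chosen to make the blocking visible. In a real producer, callers that wait like this show up in metrics such as `waiting-threads` and `bufferpool-wait-time`.

```python
import queue

buffer = queue.Queue(maxsize=2)  # toy stand-in for the bounded producer buffer


def send(record, timeout_s=0.1):
    """Return True if the record was buffered, False if the buffer
    stayed full for the whole timeout."""
    try:
        buffer.put(record, timeout=timeout_s)  # blocks while the buffer is full
        return True
    except queue.Full:
        return False


# Nothing drains the buffer yet, so the third send blocks and then gives up.
results = [send(f"record-{i}") for i in range(3)]

# Once the "network" side drains one record, sending works again.
drained = buffer.get()
results.append(send("record-3"))

print(results, drained)
```

The blocking is the backpressure mechanism: when the destination is slow, the producer side stalls the consumer side instead of dropping data or growing memory without bound.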
  • 72. Confidential 33 Simple and elegant design: Origin, Destination, Consumer, Producer, Buffer (block when buffer is full). What to look at: network or destination Kafka performance; increase batch.size; destination Kafka issues; network or origin Kafka performance; fetch.max.bytes, fetch.min.bytes, fetch.max.wait.ms
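One way to keep these hints at hand during an incident is a small lookup from a suspicious client metric to the place to look next. The pairing below is an assumption based on the order in which the slides reveal the metric groups and the labels; the deck does not state the mapping explicitly, and `HINTS` and `hint_for` are hypothetical names.

```python
# Assumed pairing of metric groups with the labels from the slide.
HINTS = {
    "producer": {
        ("io-ratio", "io-wait-ratio", "outgoing-byte-rate"):
            "network or destination Kafka performance",
        ("batch-size-avg", "batch-size-max"):
            "increase batch.size",
        ("record-retry-rate", "record-error-rate"):
            "destination Kafka issues",
    },
    "consumer": {
        ("io-ratio", "io-wait-ratio", "bytes-consumed-rate"):
            "network or origin Kafka performance",
        ("fetch-size-avg", "fetch-size-max", "fetch-rate"):
            "tune fetch.max.bytes, fetch.min.bytes, fetch.max.wait.ms",
    },
}


def hint_for(side, metric):
    """Return the hint for a metric on the given side, or None if unknown.

    `side` is needed because io-ratio and io-wait-ratio exist on both clients.
    """
    for metrics, hint in HINTS.get(side, {}).items():
        if metric in metrics:
            return hint
    return None


print(hint_for("producer", "batch-size-avg"))
print(hint_for("consumer", "io-ratio"))
```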
  • 73. 34 Network Tuning • WAN has high latency. We deal with it. • Compute buffer size to match: https://www.switch.ch/network/tools/tcp_throughput/ • send.buffer.bytes and receive.buffer.bytes on producer, consumer, brokers (socket.send.buffer.bytes and socket.receive.buffer.bytes on brokers) • OS tuning: https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php net.core.rmem_default, net.core.rmem_max, net.core.wmem_default, net.core.wmem_max • Enable logging to check if this had any effect: log4j.logger.org.apache.kafka.common.network.Selector=DEBUG • Additional tips in our docs
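"Compute buffer size to match" usually means sizing the socket buffers to the bandwidth-delay product of the link, which is the amount of data in flight on a full pipe. The sketch below does that arithmetic for an example link; the 1 Gbit/s bandwidth and 100 ms round-trip time are made-up sample values, and `bdp_bytes` is a hypothetical helper name.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: bits/s * seconds / 8.

    Integer arithmetic (divide by 8 bits * 1000 ms) avoids float rounding.
    """
    return bandwidth_bits_per_s * rtt_ms // 8000


# Example WAN link: 1 Gbit/s with a 100 ms round trip
buffer_size = bdp_bytes(1_000_000_000, 100)

# Candidate client settings derived from it
client_config = {
    "send.buffer.bytes": buffer_size,
    "receive.buffer.bytes": buffer_size,
}

print(buffer_size)
for key, value in client_config.items():
    print(f"{key}={value}")
```

The OS limits listed on the slide (`net.core.rmem_max`, `net.core.wmem_max`) need to be at least this large as well, otherwise the kernel caps the requested socket buffer size.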
  • 74. 35 Competent users • Monitor consumer lag • Add processes when things are slow • Automate deployment
  • 75. 36 Kafka Pros • Monitor time lag • Collect client metrics • Know which side to blame • Know which configs to tune • Tune the network over the WAN
  • 76. Resources and Next Steps https://github.com/confluentinc/cp-demo https://www.confluent.io/download/ https://slackpass.io/confluentcommunity https://www.confluent.io/blog