What’s slowing down your Kafka pipeline?
A no-effort approach using eBPF
October 5, 2022
Ruizhe (Ryan) Cheng, John P. Stevenson
A CNCF sandbox project
Ruizhe (Ryan) Cheng
Software Engineer at New Relic (Pixie team)
Hi! 👋
Kafka as the centerpiece
Kafka Observability
CPU, Memory, Network IO etc.
Pods, Nodes, Deployments etc.
Threads, Memory, GC etc.
Topics, partitions, offsets etc.
Kafka Observability
● JMX: Java Management Extensions
● Agent collects data from JMX
● Data exported for visualization and storage
Monitoring needed on
● Producers
● Consumers
● Brokers
Agenda
● What is eBPF, and why monitor Kafka with eBPF?
● Monitor Kafka with Pixie’s eBPF network tracer
● Find CPU bottlenecks in Kafka with Pixie’s CPU profiler
What is eBPF? An Analogy
An eBPF probe is like a breakpoint in a debugger
● Interrupts execution when breakpoint is reached
● But, unlike a breakpoint, a small eBPF program runs
● Decouples your application and the monitoring software
Pixie: an open source, eBPF-based observability platform.
What Pixie provides:
● Metrics: CPU, memory, network, JVM stats, etc.
● Network tracing: Kafka, HTTP, MySQL, etc.
○ No instrumentation
● CPU profiling (flamegraphs)
○ Always on
Monitor Kafka using Pixie’s network traffic tracer
Demo
Demo: Pixie Kafka Overview
Kafka Rebalances
Eager rebalancing protocol
Stops the world!
Kafka rebalancing Protocols
[1] https://www.confluent.io/blog/cooperative-rebalancing-in-kafka-streams-consumer-ksqldb/
Incremental cooperative rebalancing protocol
Stops only some consumers as needed
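The difference between the two protocols can be sketched in a few lines of Python. This is a toy model, not Kafka's actual assignor code: the consumer names, the partition numbers, and the helper functions `revoked_eager` and `revoked_cooperative` are all hypothetical and exist only to show which partitions stop being consumed during a rebalance.

```python
from typing import Dict, Set

# consumer id -> set of partitions it owns (hypothetical data model)
Assignment = Dict[str, Set[int]]


def revoked_eager(old: Assignment) -> Assignment:
    """Eager protocol: every consumer gives up all of its partitions first."""
    return {consumer: set(parts) for consumer, parts in old.items()}


def revoked_cooperative(old: Assignment, new: Assignment) -> Assignment:
    """Cooperative protocol: a consumer gives up only partitions that move away."""
    return {
        consumer: parts - new.get(consumer, set())
        for consumer, parts in old.items()
    }


# Hypothetical scenario: "c3" joins a group of two consumers on six partitions.
old = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
new = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}

eager = revoked_eager(old)
coop = revoked_cooperative(old, new)
print("eager revokes:      ", eager)
print("cooperative revokes:", coop)
```

In this toy scenario the eager protocol pauses all six partitions, while the cooperative one pauses only the two that change owner.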
Kafka Consumer Lag
Measured in
● Partition offset
● Wall clock time
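Both lag measures reduce to simple arithmetic once the offsets and timestamps are known. The sketch below assumes a hypothetical `PartitionState` record; the field names and sample values are illustrative and are not taken from Pixie or from the Kafka client API. Time lag is defined here as the gap between the newest produced record and the newest consumed record, which is one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition for one consumer group."""
    log_end_offset: int      # offset the next produced record will receive
    committed_offset: int    # next offset the consumer group will read
    last_produced_ts: float  # timestamp of the newest record, in seconds
    last_consumed_ts: float  # timestamp of the newest consumed record, in seconds


def offset_lag(p: PartitionState) -> int:
    """Lag in records: how many records the group still has to read."""
    return max(p.log_end_offset - p.committed_offset, 0)


def time_lag(p: PartitionState) -> float:
    """Lag in seconds: how far behind the newest record the group is."""
    return max(p.last_produced_ts - p.last_consumed_ts, 0.0)


# Illustrative values only.
state = PartitionState(
    log_end_offset=1_500,
    committed_offset=1_380,
    last_produced_ts=1000.0,
    last_consumed_ts=988.5,
)
print("offset lag:", offset_lag(state), "records")
print("time lag:  ", time_lag(state), "seconds")
```

The two numbers answer different questions: offset lag says how much work is queued, while time lag says how stale the consumer's view of the data is.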
Demo: Consumer rebalances + Consumer lag
Find CPU bottlenecks in Kafka with Pixie’s CPU profiler
John P. (Pete) Stevenson
Software Engineer at New Relic (Pixie team)
Hi! 👋
Flamegraphs:
visualizing performance profiling data.
Flamegraphs:
width (on x-axis) represents time.
Processor::run()
sits in a “wide bar.”
Our program spends more time here.
ThreadPoolExecutor::runWorker()
sits in a “narrow bar.”
Using less time here.
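How a bar gets its width can be illustrated with a short Python sketch. It assumes a sampling profiler that records one call stack per sample; the sample data and the `Thread::run` root frame are made up for illustration, and `frame_widths` is a hypothetical helper, not part of Pixie.

```python
from collections import Counter
from typing import Dict, List, Tuple

Stack = Tuple[str, ...]  # outermost frame first

# Hypothetical samples; a real profiler would collect many thousands.
samples: List[Stack] = [
    ("Thread::run", "Processor::run", "poll"),
    ("Thread::run", "Processor::run", "poll"),
    ("Thread::run", "Processor::run", "processNewResponses"),
    ("Thread::run", "ThreadPoolExecutor::runWorker"),
]


def frame_widths(stacks: List[Stack]) -> Dict[Stack, float]:
    """Return, for each call path, the fraction of samples that include it."""
    counts: Counter = Counter()
    for stack in stacks:
        # A frame is on the stack for every sample taken in it or in its callees,
        # so each prefix of the stack receives one count.
        for depth in range(1, len(stack) + 1):
            counts[stack[:depth]] += 1
    total = len(stacks)
    return {path: n / total for path, n in counts.items()}


widths = frame_widths(samples)
for path, width in sorted(widths.items()):
    print(f"{width:5.0%}  {';'.join(path)}")
```

With these samples, `Processor::run` appears in three of four stacks and so spans 75% of the width, while `ThreadPoolExecutor::runWorker` spans 25%.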
Flamegraphs:
stack traces (on y-axis) show what was happening.
An example stack trace from the
Kafka broker.
run() called poll()
... which called pollSelectionKeys()
... which called write()
... which called IOUtil.write()
... which called writev0()
... which called into libc,
... which called into the kernel.
This gap above /lib/ld-musl-x86_64.so,
... this little blank space,
is time attributed to libc.
System call overhead.
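The blank space above a frame corresponds to what profilers usually call self time: samples where that frame was the innermost one on the stack. A minimal Python sketch follows, using the frames from the example trace; the sample counts and the `[kernel]` label are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

Stack = Tuple[str, ...]  # outermost frame first

# Frames from the example stack trace; "[kernel]" is a placeholder label.
to_libc: Stack = (
    "run", "poll", "pollSelectionKeys", "write",
    "IOUtil.write", "writev0", "/lib/ld-musl-x86_64.so",
)
to_kernel: Stack = to_libc + ("[kernel]",)

# Hypothetical counts: 8 samples landed in the kernel, 2 in libc itself.
samples: List[Stack] = [to_kernel] * 8 + [to_libc] * 2


def total_and_self(stacks: List[Stack]) -> Tuple[Counter, Counter]:
    """Count samples that include each call path, and samples that end in it."""
    total: Counter = Counter()
    self_only: Counter = Counter()
    for stack in stacks:
        for depth in range(1, len(stack) + 1):
            total[stack[:depth]] += 1
        self_only[stack] += 1  # the innermost frame was running at sample time
    return total, self_only


total, self_only = total_and_self(samples)
libc_gap = self_only[to_libc] / total[to_libc]
print(f"libc bar: {total[to_libc]} samples, {self_only[to_libc]} with nothing above")
print(f"gap above libc: {libc_gap:.0%} of the libc bar")
```

In the rendered flamegraph, the kernel bar would cover 8 of the 10 samples in the libc bar, leaving a gap for the remaining 2.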
Java profiling landscape
Native
- Runs in the JVM.
- Native access to code symbols.
- Safepoint bias issue.
External
- Uses Linux kernel for sampling.
- Symbols not directly available.
- No safepoint bias; provides full system context.
May be difficult to set up & use:
● App. redeploy
● Hard to exfiltrate profiling data from prod.
Stack traces are not useful without symbols!
For Kubernetes, Pixie to the rescue
Fully automated, always on, in prod.
● No app. redeploy (just start the Pixie daemonset).
● Supports compiled languages (C/C++/Go/Rust).
● External profiler (eBPF).
● Automatic Java symbols (using JVMTI).
Kafka Data Compression
Compression Type   Compression Ratio   CPU Usage   Network / Disk
Gzip               High                High        Low
Snappy             Medium              Moderate    Medium
Zstd               Medium              Moderate    Medium
Lz4                Low                 Low         High
Demo: Broker data compression
Gzip: 26.8% CPU
Lz4: 7.7% CPU
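The trade-off between compression ratio and CPU cost can be explored offline with a rough Python sketch. Python's standard library has no Lz4 or Snappy codec, so this example substitutes zlib (the algorithm behind Gzip) at a fast level and a thorough level as stand-ins; it does not reproduce the demo's numbers, and the payload is synthetic.

```python
import json
import time
import zlib
from typing import Dict, Tuple


def measure(payload: bytes, level: int) -> Tuple[float, float]:
    """Return (compression ratio, CPU seconds) for one zlib level."""
    start = time.process_time()
    compressed = zlib.compress(payload, level)
    cpu_seconds = time.process_time() - start
    if zlib.decompress(compressed) != payload:
        raise ValueError("round trip failed")
    return len(payload) / len(compressed), cpu_seconds


# Synthetic, repetitive JSON records standing in for Kafka message batches.
records = [
    {"user_id": i % 50, "event": "page_view", "path": f"/items/{i % 200}"}
    for i in range(20_000)
]
payload = json.dumps(records).encode("utf-8")

# Level 1 favors speed; level 9 favors ratio.
results: Dict[int, Tuple[float, float]] = {
    level: measure(payload, level) for level in (1, 9)
}
for level, (ratio, cpu) in results.items():
    print(f"zlib level {level}: ratio {ratio:.1f}x, {cpu * 1000:.1f} ms CPU")
```

Timings vary by machine and payload, so the output is best read as a relative comparison between the two levels rather than as absolute figures.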
Conclusion
Kafka monitoring is challenging
Important Kafka metrics:
○ Consumer rebalance events
○ Consumer lag
○ Data compression
Pixie provides eBPF-based observability on K8s
○ that is fully decoupled and automated,
○ including Kafka protocol tracing and parsing,
○ including CPU performance flamegraphs,
○ with symbols for Java automatically populated.
Thank you!...Questions?
github.com/pixie-io/pixie
https://slackin.px.dev
Blog post:
https://blog.px.dev/cpu-profiling-java/
@pixie_run

What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson | Current 2022