Gwen Shapira
Product Manager @ Confluent
And Apache Kafka Committer
@gwenshap
Metrics Are Not
Enough
Monitoring Apache Kafka and
Streaming Applications
#1 Kafka Monitoring Tip:
Please Monitor Apache
Kafka.
Breaking tuning
production databases
since 1997
Apache Kafka PMC
Product Manager for a
monitoring product
Tweeting a lot
@gwenshap
Apache Kafka is a distributed system and has many components
Controller
Things to keep an eye on
• Broker health
• Message delivery
• Performance
• Capacity
Broker Health Monitoring
Kafka & Metrics Reporters
• Pluggable interface
• JMX reporter built in
• TONS of metrics
TONS of Metrics
• Broker throughput
• Topic throughput
• Disk utilization
• Unclean leader elections
• Network pool usage
• Request pool usage
• Request latencies – 30 request types, 5 phases each
• Topic partition status counts: online, under replicated,
offline
• Log flush rates
• ZK disconnects
• Garbage collection pauses
• Message delivery
• Consumer groups reading from topics
• …​
#2 Kafka Monitoring Tip:
Have a dashboard
that lets you know
“is everything ok?”
in one look
Is the broker process up?
• ps -ef | grep kafka
• Do we receive metrics?
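Both checks on this slide can be scripted. The following is a minimal Python sketch, not an official tool. It assumes the broker listens on a host and port you can reach, and that your metrics pipeline can report the timestamp of the newest sample it received. The function names and the 60-second staleness threshold are illustrative assumptions.

```python
import socket
import time


def port_is_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def metrics_are_fresh(last_sample_ts, now, max_age_s=60.0):
    """Return True if the newest metric sample is at most max_age_s old."""
    return (now - last_sample_ts) <= max_age_s


def broker_looks_up(host, port, last_sample_ts, now=None):
    """A broker counts as up only if it accepts connections AND still reports metrics."""
    now = time.time() if now is None else now
    return port_is_open(host, port) and metrics_are_fresh(last_sample_ts, now)
```

A process can be alive and still useless, which is why the sketch requires both signals rather than either one.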
Under-replicated partitions
• If you can monitor just one thing…
• Is it a specific broker?
• Cluster wide:
• Out of resources
• Imbalance
• Broker:
• Hardware
• Noisy neighbor
• Configuration
• Garbage collection
Drill Down into Broker and Topic: Do we see a problem right here?
Check partition placement - is the issue
specific to one broker?
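One way to answer "is it a specific broker?" is to compare each under-replicated partition's assigned replicas with its in-sync replicas (ISR) and count which brokers are missing. The Python sketch below assumes you already have that list, for example parsed from a describe of under-replicated partitions. The dictionary layout and the function names are assumptions made for illustration.

```python
from collections import Counter


def out_of_sync_brokers(partitions):
    """Count, per broker, the partitions it is assigned to but not in sync for.

    Each partition is a dict with "replicas" and "isr" lists of broker ids.
    """
    missing = Counter()
    for partition in partitions:
        for broker in set(partition["replicas"]) - set(partition["isr"]):
            missing[broker] += 1
    return missing


def diagnose(partitions):
    """Return "ok", "broker <id>" when one broker explains everything, else "cluster-wide"."""
    missing = out_of_sync_brokers(partitions)
    if not missing:
        return "ok"
    if len(missing) == 1:
        (broker,) = missing
        return f"broker {broker}"
    return "cluster-wide"
```

If every out-of-sync replica lives on the same broker, look at that machine first (hardware, noisy neighbor, configuration, garbage collection). If the gaps are spread across brokers, suspect resources or imbalance.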
#3 Kafka Monitoring Tip:
Monitor
Under-replicated
Partitions
Canary
• Produce an event
• Try to consume the event 1s later
• Did you get it?
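The canary steps above can be sketched in a few lines of Python. To stay independent of any particular Kafka client, the sketch takes `produce` and `consume` as plain callables. Both are hypothetical hooks that you would wire to your own producer and consumer, and the 1-second default wait mirrors the slide.

```python
import time
import uuid


def run_canary(produce, consume, wait_s=1.0, sleep=time.sleep):
    """Produce a unique event, wait, then report whether it came back.

    produce(message: bytes) sends one message to the canary topic.
    consume() returns an iterable of messages read from the same topic.
    """
    marker = uuid.uuid4().hex.encode()  # unique payload, so stale events never match
    produce(marker)
    sleep(wait_s)
    return any(message == marker for message in consume())
```

Usage with an in-memory stand-in for the topic: `queue = []` followed by `run_canary(queue.append, lambda: list(queue))` returns `True`. A `False` result means the event was lost or took longer than `wait_s` to arrive.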
Other important health metrics
• Active Controller
• ZK Disconnects
• Unclean leader elections
• ISR shrink/expand
• # brokers
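Several of these metrics have a clear "healthy" value, so they are easy to turn into alert rules. The Python sketch below covers a subset of the list. It assumes the inputs were already collected from your metrics system (for instance, `active_controllers` as the sum of each broker's active-controller count). The function name and message wording are illustrative.

```python
def health_problems(active_controllers, offline_partitions,
                    unclean_elections, live_brokers, expected_brokers):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if active_controllers != 1:
        # Zero means nobody is coordinating; more than one means split brain.
        problems.append(
            f"expected exactly 1 active controller, found {active_controllers}")
    if offline_partitions > 0:
        problems.append(f"{offline_partitions} offline partitions")
    if unclean_elections > 0:
        problems.append(f"{unclean_elections} unclean leader elections")
    if live_brokers < expected_brokers:
        problems.append(
            f"only {live_brokers} of {expected_brokers} brokers are up")
    return problems
```

Feeding the returned list into an alerting channel means nobody has to stare at a dashboard to notice a problem.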
#3 Kafka Monitoring Tip:
Don’t Watch the
Dashboard
Few tips on broker health
• Before 0.11.0.0 – restarting a broker is risky
• Even after…
• Before 1.0.0 – restarting a broker is slow
• Especially with 5000+ partitions per broker
Lesson:
Only restart if you know why this will fix the issue.
Message Delivery
Delivery Guarantees
• At most once
• At least once
• Exactly once
• … and within N milliseconds
Are you meeting your guarantees, right now?
#4 Kafka Monitoring Tip:
Monitoring brokers isn’t
enough.
You need to monitor
events
Every Service that uses Kafka is a Distributed System
(Diagram labels: Orders Service, Stock Service, Fulfilment Service, Fraud Detection Service, Mobile App, Kafka)
How to monitor?
The infamous LinkedIn “Audit”:
• Count messages when they are produced
• Count messages when they are consumed
• Check timestamps when they are consumed
• Compare the results
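The counting-and-comparing idea can be illustrated with a short Python sketch. This is not LinkedIn's implementation. It assumes every message carries its produce timestamp in milliseconds, that both sides bucket by that same timestamp, and that one-minute buckets are fine for the comparison. All names are illustrative.

```python
from collections import Counter


def bucket_counts(timestamps_ms, bucket_ms=60_000):
    """Count messages per time bucket, keyed by the bucket's start time."""
    return Counter(ts // bucket_ms * bucket_ms for ts in timestamps_ms)


def audit(produced_ts, consumed_ts, bucket_ms=60_000):
    """Compare produced and consumed counts bucket by bucket.

    Both lists hold the produce timestamp of each message, so a bucket
    describes the same set of messages on both sides.
    Returns {bucket_start: (produced, consumed, status)}.
    """
    produced = bucket_counts(produced_ts, bucket_ms)
    consumed = bucket_counts(consumed_ts, bucket_ms)
    report = {}
    for bucket in sorted(set(produced) | set(consumed)):
        diff = consumed[bucket] - produced[bucket]
        status = "under" if diff < 0 else "over" if diff > 0 else "ok"
        report[bucket] = (produced[bucket], consumed[bucket], status)
    return report
```

A bucket marked "under" points at lost or skipped messages, and one marked "over" points at duplicates. Comparing the consume time against the produce timestamp in the same pass gives the end-to-end latency.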
Under Consumption
• Reasons for under consumption:
• Producers not handling errors and retries correctly
• Misbehaving consumers; perhaps the consumer did not follow the shutdown
sequence
• Real-time apps intentionally skipping messages
#5 Kafka Monitoring Life Tips:
• producer.close();
• retries > 0
• Handle send() exceptions.
• Use new consumer (0.9 and up)
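The first and third tips boil down to a pattern: never ignore a failed send, and always close the producer on the way out. The Python sketch below shows the pattern only. The `producer` object is a hypothetical stand-in with `send(topic, value)` and `close()` methods, not the API of a specific Kafka client library. In the Java client, `close()` waits for previously sent records to complete, which is why skipping it can lose buffered messages.

```python
import logging

log = logging.getLogger("producer-example")


def send_all(producer, topic, values):
    """Send every value, collect the ones that failed, and always close.

    `producer` is a hypothetical object exposing send(topic, value), which
    raises on failure, and close(). Adapt the calls to your real client.
    """
    failed = []
    try:
        for value in values:
            try:
                producer.send(topic, value)
            except Exception as exc:
                # Handle the error instead of silently dropping the message.
                log.error("send to %s failed: %s", topic, exc)
                failed.append(value)
    finally:
        # Runs even if something unexpected goes wrong above.
        producer.close()
    return failed
```

The returned list lets the caller decide what to do with failed messages (retry, park, or alert) rather than discovering the gap later as under consumption.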
Over Consumption
• Reasons for over consumption
• Consumers may be processing a set of messages more than once
• Latency may be higher
Slow Consumers
• Identify consumers and consumer groups that are not keeping up
with data production
• Compare a slow, lagging consumer (left) to a good consumer (right)
Other important client metrics
• Producer retries
• Producer errors
• Consumer message lag – especially trends
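Lag is the distance between the newest offset in a partition and the offset the consumer group has committed. The trend matters more than any single reading. The Python sketch below computes both. It assumes you sample the two offsets periodically yourself, and the function names are illustrative.

```python
def consumer_lag(log_end_offset, committed_offset):
    """Messages the consumer still has to read; never negative."""
    return max(log_end_offset - committed_offset, 0)


def lag_slope(samples):
    """Least-squares slope of lag over time, in messages per second.

    samples is a list of (timestamp_seconds, lag) pairs. A positive slope
    means the consumer is falling further behind.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0
    return sum((t - mean_t) * (lag - mean_lag) for t, lag in samples) / denom
```

A large but shrinking lag (negative slope) is a consumer catching up. A small but steadily growing lag is the one to alert on.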
Performance Tuning
#5 Kafka Monitoring Tip:
Tune the parts
that take time
Wrong:
“OMG! Log Flush Time is
14ms”
Is this high?
How often do we flush logs?
Is it blocking?
Who cares?
Right:
“OMG! We only process 50,000
requests per second. We need
10 times this”
Why? IO threads are busy.
Why? Waiting for flush.
Why are we flushing so often?
Log segments keep rotating.
Let's configure larger segments!
Measure performance → Break down time spent → Change one thing
Lifecycle of a request
• Client sends request to broker
• Network thread gets request and puts on queue
• IO thread / handler picks up request and processes
• Read / Write from/to local “disk”
• Wait for other brokers to ack messages
• Put response on queue
• Network thread sends response to client
Produce and Fetch Request Latencies
• Break down produce and fetch latencies
through the entire request lifecycle
• Each time segment corresponds to a
metric
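Once the per-phase times are available, finding the part worth tuning is simple arithmetic. The Python sketch below assumes you have already read the average time per phase for one request type, in milliseconds. The five keys are shorthand chosen for this example and map onto the lifecycle steps: waiting in the request queue, local processing, waiting on other brokers, waiting in the response queue, and sending the response.

```python
def dominant_phase(phase_ms):
    """Return (phase_name, share_of_total) for the slowest phase.

    phase_ms maps a phase name to its average time in milliseconds.
    """
    total = sum(phase_ms.values())
    name = max(phase_ms, key=phase_ms.get)
    share = (phase_ms[name] / total) if total else 0.0
    return name, share


produce_latency = {
    "request_queue": 1.0,   # waiting for an IO thread
    "local": 2.0,           # read/write on the leader
    "remote": 15.0,         # waiting for other brokers to ack
    "response_queue": 1.0,  # waiting for a network thread
    "response_send": 1.0,   # sending the response to the client
}
print(dominant_phase(produce_latency))
```

In this made-up sample, three quarters of the time is spent waiting on other brokers, so adding IO threads would not help. That is the point of the breakdown: tune the phase that actually takes the time.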
How to make it faster?
• Are you network-, CPU-, or disk-
bound?
• Do you need more threads?
• Where do the threads spend
their time?
Capacity Planning
Key metrics that indicate a cluster is near
capacity:
• CPU
• Network and thread pool usage
• Request latencies
• Network utilization
• Disk utilization
#6 Kafka Monitoring Tip:
By the time you reach
90% utilization,
it is too late.
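Acting before 90% means watching the trend, not only the current value. The Python sketch below does a simple linear projection from the first and last utilization samples. It assumes growth is roughly linear, which is a simplification, and the function name and sample format are illustrative.

```python
def days_until(threshold_pct, samples):
    """Estimate days until utilization reaches threshold_pct.

    samples is a list of (day, utilization_pct) pairs in time order.
    Returns 0.0 if already at or past the threshold, and None when there
    is too little data or utilization is not growing.
    """
    if len(samples) < 2:
        return None
    (first_day, first_pct), (last_day, last_pct) = samples[0], samples[-1]
    if last_pct >= threshold_pct:
        return 0.0
    if last_day == first_day:
        return None
    growth_per_day = (last_pct - first_pct) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (threshold_pct - last_pct) / growth_per_day
```

For example, disk usage that went from 50% to 60% over ten days reaches 90% in about 30 days, which is the lead time you have to order hardware or rebalance.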
Summary:
Please Monitor Kafka
Few things to remember…
• Alert on what’s important: Under-Replicated Partitions is a good
start
• DON’T JUST FIDDLE WITH STUFF
• AND DON’T RESTART KAFKA FOR LOLS
• If you don’t know what you are doing, it is ok. There’s support (and
Cloud) for that.
Thank You!
Find me on Twitter: @gwenshap

Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)


Editor's Notes

  • #3: Let's start with the obvious: if I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.
  • #11: There are two ways to know whether Kafka is up: check the process directly, or check whether the metrics you are receiving are up to date.
  • #22: Monitoring Kafka isn’t enough – Kafka is part of a system and producers and consumers are involved