Introduction to Monitoring
Confluent Platform
Akhilesh Dubey | Confluent
Ishan Dwivedi | Citibank
Agenda
2
01
Confluent Platform Monitoring
What can you monitor in Confluent
Platform?
02
Monitoring using Control Center
Overview of Monitoring through Confluent
Control Center
03
JMX Monitoring
Overview of JMX metrics and 3rd party
monitoring stacks - AppDynamics &
Prometheus/Grafana
04
Alerting
Alerting ability available through Confluent
Control Center and ITRS
01. Confluent Platform
Components
Confluent Platform Components
4
https://www.confluent.io/whitepaper/confluent-enterprise-reference-architecture/
[Reference architecture diagram: applications and clients (Kafka Streams apps, microservices) reach REST Proxy instances through a sticky load balancer; Kafka brokers (each with the rebalancer) coordinate through a ZooKeeper ensemble; Schema Registry runs as a leader/follower pair; Kafka Connect workers host connectors or Replicator; ksqlDB servers and Confluent Control Center complete the platform.]
What components need monitoring?
● Resources (CPU, DISK, Memory, Network I/O)
● JVM
● Kafka Brokers
● Zookeeper
● Connect
● Schema Registry
● REST Proxy
● Clients (producers/consumers)
Where do I even
start?
Start with the basics:
● Do I have a monitoring solution today (agents, storage, dashboards)?
● Most components emit JMX metrics. These can be collected and exported to a JMX collector (AppDynamics, Prometheus, etc.) for alerting or visualization
● Resources (alert at 60% so you have time to investigate; a minimal sketch for enabling JMX and GC logging on a broker follows below):
○ CPU
○ Disk free (Kafka cannot run if your disk is full)
○ Network I/O
○ Open file handles
○ JVM (enable and monitor garbage collection times)
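A minimal sketch of the last two bullets for a broker started with the stock scripts, which honor the JMX_PORT, KAFKA_JMX_OPTS and KAFKA_GC_LOG_OPTS environment variables (port, paths and log settings are illustrative):

# expose JMX so a collector (JMX Exporter, AppDynamics agent, ...) can read broker metrics
export JMX_PORT=9999
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
# enable GC logging so collection times can be monitored (JDK 9+ unified logging syntax)
export KAFKA_GC_LOG_OPTS="-Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M"
bin/kafka-server-start etc/kafka/server.properties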
Where do I even
start?
You can use Control Center for an opinionated view of what
is happening right now
Brokers generate many metrics using JMX MBeans.
● Under Replicated Partitions
● Offline Partitions
● TotalTimeMs (request latency)
● ISR Shrink Rate
● and many more
(https://support.confluent.io/hc/en-us/articles/230419288-Monitoring-Kafka)
Brokers
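For a quick spot check without a full monitoring stack, these MBeans can be read directly with the JmxTool class that ships with Kafka; a sketch, assuming the broker exposes JMX on port 9999:

$ bin/kafka-run-class kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
    --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
    --one-time true
# prints a timestamped value; anything above 0 deserves investigation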
ZooKeeper is crucial to the operation of a Kafka cluster
Four-letter words give a quick status (ruok, mntr, stat)
ZooKeeper also generates many metrics using JMX MBeans:
● AvgRequestLatency (per node)
● OutstandingRequests (per node)
Monitor which ZooKeeper nodes are leaders (they tend to be the busiest)
How many clients and watchers:
● NumAliveConnections
● WatchCount
https://zookeeper.apache.org/doc/current/zookeeperJMX.html
Zookeeper
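The mntr four-letter word is a quick way to pull those numbers without JMX (on ZooKeeper 3.5+ it must be whitelisted first); a sketch with illustrative output values:

$ echo mntr | nc localhost 2181
zk_version                3.5.9
zk_avg_latency            0
zk_outstanding_requests   0
zk_num_alive_connections  12
zk_watch_count            48
zk_server_state           leader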
Very important to monitor producers and consumers too!
Confluent monitoring interceptors are available to see end-to-end lag in Control Center (a configuration sketch follows below)
JMX metrics are also available in producers and consumers:
Clients
Consumers:
● records-lag / records-lag-max
● bytes-consumed-rate
● records-consumed-rate
● fetch-rate
Producers:
● request-rate
● request-latency-avg
● response-rate
● outgoing-byte-rate
● io-wait-time-ns-avg
● batch-size-avg
● compression-rate-avg
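A sketch of both points above: the Confluent Monitoring Interceptor is enabled purely through client configuration, and the same JMX metric values can also be read in-process through the client's metrics() method. The broker address, client settings and printed metric names are illustrative, and the interceptor class needs Confluent's monitoring-interceptors jar on the classpath.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class MonitoredProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // enables end-to-end (stream) monitoring data for Control Center
        props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
                "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the metrics exposed over JMX are also available in-process:
            producer.metrics().forEach((name, metric) -> {
                if (name.name().equals("request-rate") || name.name().equals("request-latency-avg")) {
                    System.out.printf("%s = %s%n", name.name(), metric.metricValue());
                }
            });
        }
    }
}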
02. Monitoring using Control
Center
● Confluent Platform is the central nervous system for a business, and potentially a Kafka-based single
source of truth.
● Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time. They need to identify and triage problems in order to solve them before they affect end users. As a result, monitoring your Kafka deployments is an operational must-have.
● Monitoring provides assurance that all your services are working properly, meeting SLAs and addressing business needs.
● Here are some common business-level questions:
1. Are applications receiving all data?
2. Are my business applications showing the latest data?
3. Why are the applications running slowly?
4. Do we need to scale up?
5. Can any data get lost?
6. Will there be service interruptions?
7. Are there assurances in case of a disaster event?
We will see how Control Center can help answer all of these questions, and where and when you need an additional monitoring stack.
Why do we monitor?
11
12
You can deploy Confluent Control Center for
out-of-the-box Kafka cluster monitoring so you
don’t have to build your own monitoring system.
Control Center makes it easy to manage the
entire Confluent Platform.
Control Center is a web-based application that
allows you to manage your cluster, to monitor
Kafka clusters in predefined dashboards and to
alert on triggers.
● Kafka exposes hundreds of JMX metrics. Some of them are per broker,
per client, per topic and per partition, and so the number of metrics
scales up as the cluster grows. For an average-size Kafka cluster, the number of metrics can very quickly grow into the thousands!
● A common pitfall of generic monitoring tools is to import pretty much all
available metrics. But even with a comprehensive list of metrics, there is
a limit to what can be achieved with no Kafka context or Kafka expertise
to determine which metrics are important and which ones are not.
○ People end up referring to just the two or three charts that they actually understand.
○ Meanwhile, they ignore all the other charts because they don't understand them.
○ This generates a lot of noise as people spend time chasing "issues" that don't impact the service, or worse, it obscures real problems.
● Control Center was designed to help operators identify the most
important things to monitor in Kafka, including the cluster and the
client applications producing messages to and consuming messages
from the cluster
The metrics swamp
13
Control Center
A walkthrough of the features
15
● Cluster Overview provides insight into the well-being of the Kafka cluster from the cluster perspective, and allows you to drill down to broker-level, topic-level, Connect-cluster-level and ksqlDB-level perspectives
● Multiple clusters can be monitored with a single Control Center, and it also supports Multi-Cluster Schema Registry
● Requires the Confluent Metrics Reporter to be installed and enabled on the brokers (a configuration sketch follows below)
Cluster Overview
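For reference, the reporter is enabled through broker-side properties along these lines (a sketch; the bootstrap servers value is a placeholder, and the reporter jar is already on the classpath in Confluent Platform packages):

# etc/kafka/server.properties
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=broker1:9092
confluent.metrics.reporter.topic.replicas=3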
16
● Brokers Overview provides a succinct view
of essential Kafka metrics for brokers in a
cluster:
○ Throughput for production and
consumption
○ Broker uptime
○ Partitions replicas status (including
URP)
○ Apache ZooKeeper status
○ Active Controller
○ Disk usage and distribution
○ System metrics for network and
request pool usage
● Clicking on a panel gives a historical view of the metric
Brokers Overview
17
● The Brokers Metrics page provides historical data for the following panels:
○ Production metrics
○ Consumption metrics
○ Broker uptime metrics
○ Partition replicas metrics
○ System usage
○ Disk usage
Brokers Metrics page
18
● You can add, view, edit, and delete topics
using the Control Center topic management
interface
● Message Browser
● Manage Schemas for Topics
○ Avro, JSON-Schema and Protobuf
○ ⚠ Options to view and edit schemas
through the user interface are available
only for schemas that use the default
TopicNameStrategy
○ Multi-Cluster Schema Registry
● Metrics:
○ Production Throughput and Failed
production requests
○ Consumption throughput and failed consumption requests, % messages consumed (requires Monitoring Interceptors) and end-to-end latency (requires Monitoring Interceptors)
○ Availability (URP and Out of Sync
followers and observers)
○ Consumer Lag
Topics
19
● Provides the convenience of managing
connectors for multiple Kafka Connect
clusters.
● Use Control Center to:
○ Add a connector by completing UI
fields. Note: specific procedure when
RBAC is used.
○ Add a connector by uploading a
connector configuration file
○ Download connector configuration files
to reuse in another connector or cluster,
or to use as a template.
○ Edit a connector configuration and
relaunch it.
○ Pause a running connector; resume a
paused connector.
○ Delete a connector.
○ View the status of connectors in
Connect clusters.
Connect
20
● Control Center provides the convenience of
running streaming queries on one or more
ksqlDB clusters within its graphical user
interface
● Use ksqlDB to:
○ View a summary of all ksqlDB
applications connected to Control
Center.
○ Search for a ksqlDB application being
managed by the Control Center
instance.
○ Browse topic messages.
○ View the number of running queries,
registered streams, and registered
tables for each ksqlDB application.
○ Navigate to the ksqlDB Editor, Streams,
Tables, Flow View and Running Queries
for each ksqlDB application.
ksqlDB
21
● View all consumer groups for all topics in a
cluster
● Use Consumers menu to:
○ View all consumer groups for a cluster
in the All consumer groups page
○ View consumer lag across all topics in a
cluster
○ View consumption metric for a
consumer group (only available if
monitoring interceptors are set)
○ Set up consumer group alerts
Consumers
22
● You can set up alerts in Control Center based on 4
component triggers:
○ Broker
■ Bytes in
■ Bytes out
■ Fetch request latency
■ Production request count
■ Production request latency
○ Cluster
■ Cluster down
■ Leader election rate
■ Offline topic partitions
■ Unclean election count
■ Under replicated topic partitions
■ ZooKeeper status
■ ZooKeeper expiration rate
○ Consumer Group
■ Average latency (ms)
■ Consumer lag
■ Consumer lead
■ Consumption difference
■ Maximum latency (ms)
○ Topic
■ Bytes in
■ Bytes out
■ Out of sync replica count
■ Production request count
■ Under-replicated topic partitions
● Notifications are possible via email, PagerDuty or
Slack
Alerts
23
● Cluster settings
○ Change cluster name (also possible
using configuration file)
○ Update dynamic settings without any
restart required
○ Download broker configuration
● Status and License menu
○ Processing status: status of Control Center (Running or Not Running); consumption data and broker data (message throughput) are shown in real time for the last 30 minutes
○ Set or update license
And more...
03. JMX Metrics and
Monitoring Stacks
Overview of JMX metrics and 3rd party monitoring stacks
● Kafka brokers and Java client applications (Kafka Connect, Kafka Streams, producers/consumers, etc.) expose hundreds of internal JMX (Java Management Extensions) metrics
● Important JMX metrics to monitor:
○ Broker metrics
○ ZooKeeper metrics
○ Producer metrics
○ Consumer metrics
○ ksqlDB & Kafka Streams metrics
○ Kafka Connect metrics
● It’s key to have a dashboard that lets you answer “is everything OK?” at a glance
● Multiple monitoring stacks are available. Choose the one that is already in use at your company
JMX metrics
25
26
Java: Client JMX metrics
• Java Kafka Client applications expose some internal JMX (Java Management Extensions) metrics
• Many users run JMX exporters to feed these metrics into their monitoring systems (AppDynamics, Grafana, etc.)
• Important Client JMX metrics to monitor
General producer metrics and producer throttling-time
Consumer metrics
ksqlDB & Kafka Streams metrics
Kafka Connect metrics
• Prometheus is a popular open-source monitoring solution which uses the JMX Exporter to extract the metrics. The exporter can be configured to extract and forward only the metrics desired.
• Here is a demo of JMX-Exporter/Prometheus/Grafana
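A minimal sketch of attaching the JMX Exporter to a Java client so Prometheus can scrape it (jar path, port and the whitelist patterns are illustrative):

# kafka_client.yml - exporter config, whitelisting only the client MBeans you need
lowercaseOutputName: true
whitelistObjectNames:
  - "kafka.producer:type=producer-metrics,client-id=*"
  - "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*"
rules:
  - pattern: ".*"

# attach the exporter as a Java agent when starting the client application
java -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:kafka_client.yml -jar my-producer.jar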
27
Typical data pipeline pattern(s) for client metrics: clients emitting JMX → JMX client (e.g. JMX Exporter) → Prometheus → observability app

Example pipeline:
● Java producer: running in a JVM, producing to the Kafka cluster
● JMX Exporter: jmx_prometheus_javaagent configured as an agent on the JVM, exposing the producer's /metrics endpoint
● Prometheus: configured with a job that scrapes the producer's /metrics endpoint
● Grafana: configured with Prometheus as a datasource

Alternative pipeline:
● Connect: running in a JVM, producing to the Kafka cluster
● JMX client: e.g. appdynamics-agent
● AppDynamics
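The Prometheus piece of that pipeline is just a scrape job pointed at the exporter's /metrics endpoint; a sketch (host, port and interval are placeholders):

# prometheus.yml
scrape_configs:
  - job_name: "kafka-producer"
    scrape_interval: 15s
    static_configs:
      - targets: ["producer-host:7071"]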
28
Client Throttling
• Depending on your cluster configuration, you may be restricted to specific throughputs for your client
application
• If your client applications exceed these rates, the quotas on the brokers will detect it and the client
application requests will be throttled by the brokers.
• If your clients are being throttled, consider two options:
○ Modify your application to optimize its throughput, if possible (read the section Optimizing for Throughput for more details)
○ Upgrade to a cluster configuration with higher limits
• ℹ The Metrics API can give some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side.
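For context, the quotas that drive this throttling are defined on the brokers; this is roughly what a per-client quota looks like with the kafka-configs CLI (a sketch, byte rates and client id are illustrative):

$ bin/kafka-configs --bootstrap-server broker1:9092 --alter \
    --entity-type clients --entity-name my-app \
    --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152'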
29
Client Throttling
To get throttling metrics per producer and consumer, monitor the following client JMX metrics:

● kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-avg: the average time in ms that a request was throttled by a broker
● kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-max: the maximum time in ms that a request was throttled by a broker
● kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg: the average time in ms that a broker spent throttling a fetch request
● kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max: the maximum time in ms that a broker spent throttling a fetch request
30
AppDynamics
• AppDynamics provides the ability to do JMX monitoring of Java applications
• Machine and application server
monitoring can be combined to generate
and monitor relevant Confluent Platform
component metrics.
AppDynamics: KaaS Application Flow
31
AppDynamics: KaaS JMX Metrics Drill Down View
32
AppDynamics: KaaS Availability
33
AppDynamics: KaaS Availability contd..
34
35
Prometheus/Grafana
• Prometheus is a popular open-source
monitoring solution which uses
JMX-Exporter to extract the metrics. The
exporter can be configured to extract and
forward only the metrics desired.
• An example of
JMX-Exporter/Prometheus/Grafana
monitoring stack deployed on top of
Confluent cp-demo is available here
Prometheus exporter
(JMX-Exporter)
Prometheus/Grafana: Broker (with cp-demo)
36
Prometheus/Grafana: JAVA producer demo
37
Prometheus/Grafana: JAVA consumer demo
38
• JMX metrics are only available for Java-based clients.
• Librdkafka-based applications can be configured to emit internal metrics at a fixed interval (disabled by default) by setting the statistics.interval.ms configuration property to a value > 0 and registering a stats_cb (or the equivalent, depending on the language binding)
• All statistics described here
• Statistics are emitted as a JSON object string:
Librdkafka: Client statistics
39
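A sketch of the configuration side; the property name is librdkafka's, while the way the statistics callback is registered depends on the language binding (stats_cb in C, a statistics handler in the higher-level clients):

# librdkafka client configuration (passed to the producer/consumer constructor)
statistics.interval.ms=60000   # emit the stats JSON every 60 s; 0 (the default) disables it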
Using prometheus-net/prometheus-net, start up a MetricServer to export metrics to Prometheus
Prometheus/Grafana: Librdkafka: .NET example
40
Prometheus/Grafana: .NET Client demo
41
Monitor Consumer Lag
All different ways to monitor consumer lag
● It is important to monitor your application’s consumer lag, which is the number of records for any partition by which the consumer is behind the end of the log
● For "real-time" consumer applications, where the consumer
is meant to be processing the newest messages with as little
latency as possible, consumer lag should be monitored
closely.
● Most "real-time" applications will want little-to-no consumer
lag, because lag introduces end-to-end latency.
Monitoring Consumer Lag
43
Consumer lag is available in Consumers section from navigation bar:
#1: Using Control Center
44
If you use Java consumers, you can capture JMX metrics and monitor records-lag-max
Note: the consumer’s records-lag-max JMX metric calculates lag by comparing the offset most recently
seen by the consumer to the most recent offset in the log, which is a more real-time measurement.
#2: Using JMX (Java client only)
45
Metric: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max
Description: The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
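The same value can be read in-process from a Java consumer, which is convenient for exposing lag through an application health endpoint; a minimal sketch (bootstrap servers, group and topic are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));   // placeholder topic
            consumer.poll(Duration.ofSeconds(5));      // poll once so lag metrics are populated
            consumer.metrics().forEach((name, metric) -> {
                if ("records-lag-max".equals(name.name())) {
                    System.out.printf("records-lag-max = %s%n", metric.metricValue());
                }
            });
        }
    }
}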
Refer to this Knowledge Base article for full details
Create a properties file containing your security details
Example:
#3: Using kafka-consumer-groups CLI
46
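A sketch of what that can look like against a secured cluster (the security settings, file name and group are illustrative; adapt them to your environment):

# client.properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<user>" password="<password>";

$ bin/kafka-consumer-groups --bootstrap-server broker1:9092 \
    --command-config client.properties \
    --describe --group my-group
# the output lists CURRENT-OFFSET, LOG-END-OFFSET and LAG per topic-partition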
47
#4: Using
kafka-lag-exporter and
Prometheus/Grafana
• lightbend/kafka-lag-exporter is a 3rd-party tool (not supported by Confluent) that uses Kafka's Admin API describeConsumerGroups() method to get consumer lag and export it to Prometheus.
• An out-of-the-box Grafana dashboard is available
04. Alerting
Overview of Alerting capabilities through Confluent Control Center and ITRS
49
Alerts
● As seen earlier, setting up alerts can be done
through Control Center, but also using your
monitoring stack based on JMX metrics
● Alert on what’s important: Under-replicated
partitions is a good start
● Alerting on SLAs is even better: especially
when measured from a client point of view
Key Alerts
50
Cluster/Broker:
• UnderReplicatedPartitions > 0 *
• OfflinePartitionsCount > 0 *
• UnderMinIsrPartitionCount > 0
• ActiveControllerCount != 1
• AtMinIsrPartitionCount > 0
• RequestHandlerAvgIdlePercent < 40%
• NetworkProcessorAvgIdlePercent < 40%
• RequestQueueSize (establish the baseline during normal/peak production load and alert if a deviation occurs)
• TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}

OS:
• Disk usage > 60% (minor), > 80-90% (major)
• CPU usage > 60% over 5 minutes (generally caused by SSL connections or old clients causing down-conversions)
• Network I/O usage > 60%
• File handle usage > 60%

JVM monitoring:
• G1 YoungGeneration CollectionTime
• G1 OldGeneration CollectionTime
• GC time > 30%

Connect:
• connector=(*) status
• connector=(*),task=(.*) status

ZooKeeper:
• AvgRequestLatency > 10 ms over 30 seconds (disk latency is high; run `iostat -x` and look at the await time)
• NumAliveConnections - make sure you are not close to the maximum set with maxClientCnxns
• OutstandingRequests - should be below 10 in general

The four-letter words mntr and ruok (enable with -Dzookeeper.4lw.commands.whitelist=*):
$ echo ruok | nc localhost 2181
imok

* these alerts can also be set with Control Center
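If these broker metrics are exported to Prometheus (for example via the JMX Exporter), the key alerts can be encoded as alerting rules; a sketch for under-replicated partitions, where the metric name is an assumption that depends on your exporter's rename rules:

# alerts.yml
groups:
  - name: kafka-broker
    rules:
      - alert: UnderReplicatedPartitions
        # assumed metric name produced by a JMX Exporter rule; adjust to your setup
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Broker {{ $labels.instance }} has under-replicated partitions"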
C3 Alerts: Configuring an Alert Trigger
51
ITRS: Topic Monitoring
52
ITRS: Broker Monitoring
53
ITRS: Rule Engine
54