Introduction to Monitoring
Confluent Platform
Akhilesh Dubey | Confluent
Ishan Dwivedi | Citibank
Agenda
2
01
Confluent Platform Monitoring
What can you monitor in Confluent
Platform?
02
Monitoring using Control Center
Overview of Monitoring through Confluent
Control Center
03
JMX Monitoring
Overview of JMX metrics and 3rd party
monitoring stacks - AppDynamics &
Prometheus/Grafana
04
Alerting
Alerting ability available through Confluent
Control Center and ITRS
01. Confluent Platform
Components
Confluent Platform Components
4
https://www.confluent.io/whitepaper/confluent-enterprise-reference-architecture/
[Reference architecture diagram: applications and clients (Kafka Streams apps, microservices) reach REST Proxy instances through a sticky load balancer; Kafka brokers (each with the rebalancer) coordinate through a ZooKeeper ensemble; Schema Registry runs as a leader/follower pair; Kafka Connect workers host connectors or Replicator; ksqlDB servers and Confluent Control Center complete the platform.]
What components need monitoring?
● Resources (CPU, DISK, Memory, Network I/O)
● JVM
● Kafka Brokers
● Zookeeper
● Connect
● Schema Registry
● REST Proxy
● Clients (producers/consumers)
Where do I even
start?
Start with the basics:
● Do I have a monitoring solution today (agents, storage, dashboards)?
● Most components emit JMX metrics. These can be collected and exported to a JMX collector (AppDynamics, Prometheus, etc.) for alerting or visualization
● Resources (alert at 60% so you have time to investigate; a minimal sketch for enabling JMX and GC logging on a broker follows below):
○ CPU
○ Disk free (Kafka cannot run if your disk is full)
○ Network I/O
○ Open file handles
○ JVM (enable and monitor garbage collection times)
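A minimal sketch of the last two bullets for a broker started with the stock scripts, which honor the JMX_PORT, KAFKA_JMX_OPTS and KAFKA_GC_LOG_OPTS environment variables (port, paths and log settings are illustrative):

# expose JMX so a collector (JMX Exporter, AppDynamics agent, ...) can read broker metrics
export JMX_PORT=9999
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
# enable GC logging so collection times can be monitored (JDK 9+ unified logging syntax)
export KAFKA_GC_LOG_OPTS="-Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M"
bin/kafka-server-start etc/kafka/server.properties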
Where do I even
start?
You can use Control Center for an opinionated view of what
is happening right now
Brokers generate many metrics using JMX MBeans.
● Under Replicated Partitions
● Offline Partitions
● TotalTimeMs (request latency)
● ISR Shrink Rate
● and many more
(https://support.confluent.io/hc/en-us/articles/230419288-Monitoring-Kafka)
Brokers
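For a quick spot check without a full monitoring stack, these MBeans can be read directly with the JmxTool class that ships with Kafka; a sketch, assuming the broker exposes JMX on port 9999:

$ bin/kafka-run-class kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
    --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
    --one-time true
# prints a timestamped value; anything above 0 deserves investigation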
ZooKeeper is crucial to the operation of a Kafka cluster
Four-letter words give a quick status (ruok, mntr, stat)
ZooKeeper also generates many metrics using JMX MBeans:
● AvgRequestLatency (per node)
● OutstandingRequests (per node)
Monitor which ZooKeeper nodes are leaders (they tend to be the busiest)
How many clients and watchers:
● NumAliveConnections
● WatchCount
https://zookeeper.apache.org/doc/current/zookeeperJMX.html
Zookeeper
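The mntr four-letter word is a quick way to pull those numbers without JMX (on ZooKeeper 3.5+ it must be whitelisted first); a sketch with illustrative output values:

$ echo mntr | nc localhost 2181
zk_version                3.5.9
zk_avg_latency            0
zk_outstanding_requests   0
zk_num_alive_connections  12
zk_watch_count            48
zk_server_state           leader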
Very important to monitor producers and consumers too!
Confluent monitoring interceptors are available to see end-to-end lag in Control Center (a configuration sketch follows below)
JMX metrics are also available in producers and consumers:
Clients
Consumers:
● records-lag / records-lag-max
● bytes-consumed-rate
● records-consumed-rate
● fetch-rate
Producers:
● request-rate
● request-latency-avg
● response-rate
● outgoing-byte-rate
● io-wait-time-ns-avg
● batch-size-avg
● compression-rate-avg
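A sketch of both points above: the Confluent Monitoring Interceptor is enabled purely through client configuration, and the same JMX metric values can also be read in-process through the client's metrics() method. The broker address, client settings and printed metric names are illustrative, and the interceptor class needs Confluent's monitoring-interceptors jar on the classpath.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class MonitoredProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // enables end-to-end (stream) monitoring data for Control Center
        props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
                "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the metrics exposed over JMX are also available in-process:
            producer.metrics().forEach((name, metric) -> {
                if (name.name().equals("request-rate") || name.name().equals("request-latency-avg")) {
                    System.out.printf("%s = %s%n", name.name(), metric.metricValue());
                }
            });
        }
    }
}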
02. Monitoring using Control
Center
● Confluent Platform is the central nervous system for a business, and potentially a Kafka-based single
source of truth.
● Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time. They need to identify and triage problems in order to solve them before they affect end users. As a result, monitoring your Kafka deployments is an operational must-have.
● Monitoring provides assurance that all your services are working properly, meeting SLAs and addressing business needs.
● Here are some common business-level questions:
1. Are applications receiving all data?
2. Are my business applications showing the latest data?
3. Why are the applications running slowly?
4. Do we need to scale up?
5. Can any data get lost?
6. Will there be service interruptions?
7. Are there assurances in case of a disaster event?
We will see how Control Center can help answer all of these questions, and where and when you need an additional monitoring stack.
Why do we monitor?
11
12
You can deploy Confluent Control Center for
out-of-the-box Kafka cluster monitoring so you
don’t have to build your own monitoring system.
Control Center makes it easy to manage the
entire Confluent Platform.
Control Center is a web-based application that
allows you to manage your cluster, to monitor
Kafka clusters in predefined dashboards and to
alert on triggers.
● Kafka exposes hundreds of JMX metrics. Some of them are per broker,
per client, per topic and per partition, and so the number of metrics
scales up as the cluster grows. For an average-size Kafka cluster, the number of metrics can very quickly grow into the thousands!
● A common pitfall of generic monitoring tools is to import pretty much all
available metrics. But even with a comprehensive list of metrics, there is
a limit to what can be achieved with no Kafka context or Kafka expertise
to determine which metrics are important and which ones are not.
○ People end up referring to just the two or three charts that they actually understand.
○ Meanwhile, they ignore all the other charts because they don't understand them.
○ This generates a lot of noise as people spend time chasing "issues" that don't impact the service, or worse, it obscures real problems.
● Control Center was designed to help operators identify the most
important things to monitor in Kafka, including the cluster and the
client applications producing messages to and consuming messages
from the cluster
The metrics swamp
13
Control Center
A walkthrough of the features
15
● Cluster Overview provides insight into the well-being of the Kafka cluster from the cluster perspective, and allows you to drill down to broker-level, topic-level, Connect-cluster-level and ksqlDB-level perspectives
● Multiple clusters can be monitored with a single Control Center, and it also supports Multi-Cluster Schema Registry
● Requires the Confluent Metrics Reporter to be installed and enabled on the brokers (a configuration sketch follows below)
Cluster Overview
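For reference, the reporter is enabled through broker-side properties along these lines (a sketch; the bootstrap servers value is a placeholder, and the reporter jar is already on the classpath in Confluent Platform packages):

# etc/kafka/server.properties
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=broker1:9092
confluent.metrics.reporter.topic.replicas=3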
16
● Brokers Overview provides a succinct view
of essential Kafka metrics for brokers in a
cluster:
○ Throughput for production and
consumption
○ Broker uptime
○ Partitions replicas status (including
URP)
○ Apache ZooKeeper status
○ Active Controller
○ Disk usage and distribution
○ System metrics for network and
request pool usage
● Clicking on a panel gives a historical view of the metric
Brokers Overview
17
● The Brokers Metrics page provides historical data for the following panels:
○ Production metrics
○ Consumption metrics
○ Broker uptime metrics
○ Partition replicas metrics
○ System usage
○ Disk usage
Brokers Metrics page
18
● You can add, view, edit, and delete topics
using the Control Center topic management
interface
● Message Browser
● Manage Schemas for Topics
○ Avro, JSON-Schema and Protobuf
○ ⚠ Options to view and edit schemas
through the user interface are available
only for schemas that use the default
TopicNameStrategy
○ Multi-Cluster Schema Registry
● Metrics:
○ Production Throughput and Failed
production requests
○ Consumption throughput and failed consumption requests, % messages consumed (requires Monitoring Interceptors) and end-to-end latency (requires Monitoring Interceptors)
○ Availability (URP and Out of Sync
followers and observers)
○ Consumer Lag
Topics
19
● Provides the convenience of managing
connectors for multiple Kafka Connect
clusters.
● Use Control Center to:
○ Add a connector by completing UI
fields. Note: specific procedure when
RBAC is used.
○ Add a connector by uploading a
connector configuration file
○ Download connector configuration files
to reuse in another connector or cluster,
or to use as a template.
○ Edit a connector configuration and
relaunch it.
○ Pause a running connector; resume a
paused connector.
○ Delete a connector.
○ View the status of connectors in
Connect clusters.
Connect
20
● Control Center provides the convenience of
running streaming queries on one or more
ksqlDB clusters within its graphical user
interface
● Use ksqlDB to:
○ View a summary of all ksqlDB
applications connected to Control
Center.
○ Search for a ksqlDB application being
managed by the Control Center
instance.
○ Browse topic messages.
○ View the number of running queries,
registered streams, and registered
tables for each ksqlDB application.
○ Navigate to the ksqlDB Editor, Streams,
Tables, Flow View and Running Queries
for each ksqlDB application.
ksqlDB
21
● View all consumer groups for all topics in a
cluster
● Use Consumers menu to:
○ View all consumer groups for a cluster
in the All consumer groups page
○ View consumer lag across all topics in a
cluster
○ View consumption metric for a
consumer group (only available if
monitoring interceptors are set)
○ Set up consumer group alerts
Consumers
22
● You can set up alerts in Control Center based on 4
component triggers:
○ Broker
■ Bytes in
■ Bytes out
■ Fetch request latency
■ Production request count
■ Production request latency
○ Cluster
■ Cluster down
■ Leader election rate
■ Offline topic partitions
■ Unclean election count
■ Under replicated topic partitions
■ ZooKeeper status
■ ZooKeeper expiration rate
○ Consumer Group
■ Average latency (ms)
■ Consumer lag
■ Consumer lead
■ Consumption difference
■ Maximum latency (ms)
○ Topic
■ Bytes in
■ Bytes out
■ Out of sync replica count
■ Production request count
■ Under-replicated topic partitions
● Notifications are possible via email, PagerDuty or
Slack
Alerts
23
● Cluster settings
○ Change cluster name (also possible
using configuration file)
○ Update dynamic settings without any
restart required
○ Download broker configuration
● Status and License menu
○ Processing status: status of Control Center (Running or Not Running); consumption data and broker data (message throughput) are shown in real time for the last 30 minutes
○ Set or update license
And more...
03. JMX Metrics and
Monitoring Stacks
Overview of JMX metrics and 3rd party monitoring stacks
● Kafka brokers and Java client applications (Kafka Connect, Kafka Streams, producers/consumers, etc.) expose hundreds of internal JMX (Java Management Extensions) metrics
● Important JMX metrics to monitor:
○ Broker metrics
○ ZooKeeper metrics
○ Producer metrics
○ Consumer metrics
○ ksqlDB & Kafka Streams metrics
○ Kafka Connect metrics
● It’s key to have a dashboard that lets you answer “is everything OK?” at a glance
● Multiple monitoring stacks are available. Choose the one that is already in use at your company
JMX metrics
25
26
Java: Client JMX metrics
• Java Kafka Client applications expose some internal JMX (Java Management Extensions) metrics
• Many users run JMX exporters to feed these metrics into their monitoring systems (AppDynamics, Grafana, etc.)
• Important Client JMX metrics to monitor
General producer metrics and producer throttling-time
Consumer metrics
ksqlDB & Kafka Streams metrics
Kafka Connect metrics
• Prometheus is a popular open-source monitoring solution which uses the JMX Exporter to extract the metrics. The exporter can be configured to extract and forward only the metrics desired.
• Here is a demo of JMX-Exporter/Prometheus/Grafana
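A minimal sketch of attaching the JMX Exporter to a Java client so Prometheus can scrape it (jar path, port and the whitelist patterns are illustrative):

# kafka_client.yml - exporter config, whitelisting only the client MBeans you need
lowercaseOutputName: true
whitelistObjectNames:
  - "kafka.producer:type=producer-metrics,client-id=*"
  - "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*"
rules:
  - pattern: ".*"

# attach the exporter as a Java agent when starting the client application
java -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:kafka_client.yml -jar my-producer.jar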
27
Typical data pipeline pattern(s) for client metrics: clients emitting JMX → JMX client (e.g. JMX Exporter) → Prometheus → observability app

Example pipeline:
● Java producer: running in a JVM, producing to the Kafka cluster
● JMX Exporter: jmx_prometheus_javaagent configured as an agent on the JVM, exposing the producer's /metrics endpoint
● Prometheus: configured with a job that scrapes the producer's /metrics endpoint
● Grafana: configured with Prometheus as a datasource

Alternative pipeline:
● Connect: running in a JVM, producing to the Kafka cluster
● JMX client: e.g. appdynamics-agent
● AppDynamics
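The Prometheus piece of that pipeline is just a scrape job pointed at the exporter's /metrics endpoint; a sketch (host, port and interval are placeholders):

# prometheus.yml
scrape_configs:
  - job_name: "kafka-producer"
    scrape_interval: 15s
    static_configs:
      - targets: ["producer-host:7071"]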
28
Client Throttling
• Depending on your cluster configuration, you may be restricted to specific throughputs for your client
application
• If your client applications exceed these rates, the quotas on the brokers will detect it and the client
application requests will be throttled by the brokers.
• If your clients are being throttled, consider two options:
○ Modify your application to optimize its throughput, if possible (read the section Optimizing for Throughput for more details)
○ Upgrade to a cluster configuration with higher limits
• ℹ The Metrics API can give some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side.
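For context, the quotas that drive this throttling are defined on the brokers; this is roughly what a per-client quota looks like with the kafka-configs CLI (a sketch, byte rates and client id are illustrative):

$ bin/kafka-configs --bootstrap-server broker1:9092 --alter \
    --entity-type clients --entity-name my-app \
    --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152'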
29
Client Throttling
To get throttling metrics per producer and consumer, monitor the following client JMX metrics:

● kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-avg: the average time in ms that a request was throttled by a broker
● kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-max: the maximum time in ms that a request was throttled by a broker
● kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg: the average time in ms that a broker spent throttling a fetch request
● kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max: the maximum time in ms that a broker spent throttling a fetch request
30
AppDynamics
• AppDynamics provides the ability to do JMX monitoring of Java applications
• Machine and application server
monitoring can be combined to generate
and monitor relevant Confluent Platform
component metrics.
AppDynamics: KaaS Application Flow
31
AppDynamics: KaaS JMX Metrics Drill Down View
32
AppDynamics: KaaS Availability
33
AppDynamics: KaaS Availability contd..
34
35
Prometheus/Grafana
• Prometheus is a popular open-source
monitoring solution which uses
JMX-Exporter to extract the metrics. The
exporter can be configured to extract and
forward only the metrics desired.
• An example of
JMX-Exporter/Prometheus/Grafana
monitoring stack deployed on top of
Confluent cp-demo is available here
Prometheus exporter
(JMX-Exporter)
Prometheus/Grafana: Broker (with cp-demo)
36
Prometheus/Grafana: JAVA producer demo
37
Prometheus/Grafana: JAVA consumer demo
38
• JMX metrics are only available for Java-based clients.
• Librdkafka-based applications can be configured to emit internal metrics at a fixed interval (disabled by default) by setting the statistics.interval.ms configuration property to a value > 0 and registering a stats_cb (or the equivalent, depending on the language binding)
• All statistics described here
• Statistics are emitted as a JSON object string:
Librdkafka: Client statistics
39
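A sketch of the configuration side; the property name is librdkafka's, while the way the statistics callback is registered depends on the language binding (stats_cb in C, a statistics handler in the higher-level clients):

# librdkafka client configuration (passed to the producer/consumer constructor)
statistics.interval.ms=60000   # emit the stats JSON every 60 s; 0 (the default) disables it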
Using prometheus-net/prometheus-net, start up a MetricServer to export metrics to Prometheus
Prometheus/Grafana: Librdkafka: .NET example
40
Prometheus/Grafana: .NET Client demo
41
Monitor Consumer Lag
All different ways to monitor consumer lag
● It is important to monitor your application’s consumer lag, which is the number of records for any partition by which the consumer is behind the end of the log
● For "real-time" consumer applications, where the consumer
is meant to be processing the newest messages with as little
latency as possible, consumer lag should be monitored
closely.
● Most "real-time" applications will want little-to-no consumer
lag, because lag introduces end-to-end latency.
Monitoring Consumer Lag
43
Consumer lag is available in Consumers section from navigation bar:
#1: Using Control Center
44
If you use Java consumers, you can capture JMX metrics and monitor records-lag-max
Note: the consumer’s records-lag-max JMX metric calculates lag by comparing the offset most recently
seen by the consumer to the most recent offset in the log, which is a more real-time measurement.
#2: Using JMX (Java client only)
45
Metric: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max
Description: The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
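The same value can be read in-process from a Java consumer, which is convenient for exposing lag through an application health endpoint; a minimal sketch (bootstrap servers, group and topic are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));   // placeholder topic
            consumer.poll(Duration.ofSeconds(5));      // poll once so lag metrics are populated
            consumer.metrics().forEach((name, metric) -> {
                if ("records-lag-max".equals(name.name())) {
                    System.out.printf("records-lag-max = %s%n", metric.metricValue());
                }
            });
        }
    }
}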
Refer to this Knowledge Base article for full details
Create a properties file containing your security details
Example:
#3: Using kafka-consumer-groups CLI
46
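A sketch of what that can look like against a secured cluster (the security settings, file name and group are illustrative; adapt them to your environment):

# client.properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<user>" password="<password>";

$ bin/kafka-consumer-groups --bootstrap-server broker1:9092 \
    --command-config client.properties \
    --describe --group my-group
# the output lists CURRENT-OFFSET, LOG-END-OFFSET and LAG per topic-partition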
47
#4: Using
kafka-lag-exporter and
Prometheus/Grafana
• lightbend/kafka-lag-exporter is a 3rd-party tool (not supported by Confluent) that uses Kafka's Admin API describeConsumerGroups() method to get consumer lag and export it to Prometheus.
• An out-of-the-box Grafana dashboard is available
04. Alerting
Overview of Alerting capabilities through Confluent Control Center and ITRS
49
Alerts
● As seen earlier, setting up alerts can be done
through Control Center, but also using your
monitoring stack based on JMX metrics
● Alert on what’s important: Under-replicated
partitions is a good start
● Alerting on SLAs is even better: especially
when measured from a client point of view
Key Alerts
50
Cluster/Broker:
• UnderReplicatedPartitions > 0 *
• OfflinePartitionsCount > 0 *
• UnderMinIsrPartitionCount > 0
• ActiveControllerCount != 1
• AtMinIsrPartitionCount > 0
• RequestHandlerAvgIdlePercent < 40%
• NetworkProcessorAvgIdlePercent < 40%
• RequestQueueSize (establish the baseline during normal/peak production load and alert if a deviation occurs)
• TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}

OS:
• Disk usage > 60% (minor), > 80-90% (major)
• CPU usage > 60% over 5 minutes (generally caused by SSL connections or old clients causing down-conversions)
• Network I/O usage > 60%
• File handle usage > 60%

JVM monitoring:
• G1 YoungGeneration CollectionTime
• G1 OldGeneration CollectionTime
• GC time > 30%

Connect:
• connector=(*) status
• connector=(*),task=(.*) status

ZooKeeper:
• AvgRequestLatency > 10 ms over 30 seconds (disk latency is high; run `iostat -x` and look at the await time)
• NumAliveConnections - make sure you are not close to the maximum set with maxClientCnxns
• OutstandingRequests - should be below 10 in general

The four-letter words mntr and ruok (enable with -Dzookeeper.4lw.commands.whitelist=*):
$ echo ruok | nc localhost 2181
imok

* these alerts can also be set with Control Center
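If these broker metrics are exported to Prometheus (for example via the JMX Exporter), the key alerts can be encoded as alerting rules; a sketch for under-replicated partitions, where the metric name is an assumption that depends on your exporter's rename rules:

# alerts.yml
groups:
  - name: kafka-broker
    rules:
      - alert: UnderReplicatedPartitions
        # assumed metric name produced by a JMX Exporter rule; adjust to your setup
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Broker {{ $labels.instance }} has under-replicated partitions"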
C3 Alerts: Configuring an Alert Trigger
51
ITRS: Topic Monitoring
52
ITRS: Broker Monitoring
53
ITRS: Rule Engine
54