Kafka monitoring and metrics

5/16/2017
Kafka Metrics and Monitoring
With Prometheus, Grafana, Prometheus-jmx-exporter and graf-db, based on Docker
Touraj Ebrahimi

Introduction:
In general, Kafka, Zookeeper and Kafka Connect provide many metrics for monitoring, which can be
exposed and collected through JMX. The metrics fall into several categories, the most important of which are:
• System Metrics
• Zookeeper Metrics
• Consumer Metrics
• Producer Metrics
• Connect Metrics
• Kafka-Server Metrics
• Kafka-Cluster Metrics
• Kafka-log Metrics
• Kafka-Network Metrics
Among the metrics above, we focus on those that give us tangible information and, in particular, can be
used for a health check of the system.
Below we describe several metrics that are important for a health check and specify the conditions under
which the state of the system is not acceptable.
The following metrics are also suggested for a Kafka health check:
Metric: UnderReplicatedPartitions
MBean: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Description: An alert should be emitted when the value is > 0.

Metric: IsrShrinksPerSec, IsrExpandsPerSec
MBean: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
MBean: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
Description: The in-sync replica set should not shrink often. Investigate if it shrinks frequently.

Metric: Request rate
MBean before version 0.9: kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestRateAndTimeMs,clientId=([-.\w]+)
MBean after version 0.9: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
Description: Average number of requests sent per second.

Metric: BytesPerSec
MBean before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=BytesPerSec,clientId=([-.\w]+)
MBean after version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
Description: Bytes consumed per second.

Metric: MessagesPerSec
MBean before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+)
MBean after version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
Description: Messages consumed per second.

Metric: MinFetchRate
MBean before version 0.9: kafka.consumer:type=ConsumerFetcherManager,name=MinFetchRate,clientId=([-.\w]+)
MBean after version 0.9: attribute fetch-rate of kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
Description: Minimum rate at which a consumer sends fetch requests to the broker.

The health-check metrics suggested above are described as follows.
Suggested by: Gwen Shapira, System Architect at Confluent
UnderReplicatedPartitions: In a healthy cluster, the number of in sync replicas (ISRs) should be
exactly equal to the total number of replicas. If partition replicas fall too far behind their leaders, the
follower partition is removed from the ISR pool, and you should see a corresponding increase in
IsrShrinksPerSec. Since Kafka’s high-availability guarantees cannot be met without replication,
investigation is certainly warranted should this metric value exceed zero for extended time periods.
IsrShrinksPerSec/IsrExpandsPerSec: The number of in-sync replicas (ISRs) for a particular
partition should remain fairly static; the only exceptions are when you are expanding your broker
cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a
minimum number of ISRs for failover. A replica could be removed from the ISR pool for a couple of
reasons: it is too far behind the leader’s offset (user-configurable by setting the
replica.lag.max.messages configuration parameter, which was removed in version 0.9), or it has not
contacted the leader for some time (configurable with the replica.lag.time.max.ms parameter). No
matter the reason, an increase in IsrShrinksPerSec without a corresponding increase in
IsrExpandsPerSec shortly thereafter is cause for concern and requires user intervention. The Kafka
documentation provides a wealth of information on the user-configurable parameters for brokers.
Request rate: The request rate is the rate at which producers send data to brokers. Of course, what
constitutes a healthy request rate will vary drastically depending on the use case. Keeping an eye on
peaks and drops is essential to ensure continuous service availability. If rate-limiting is not enabled
(version 0.9+), in the event of a traffic spike brokers could slow to a crawl as they struggle to process
a rapid influx of data.
BytesPerSec: As with producers and brokers, you will want to monitor your consumer network
throughput. For example, a sudden drop in MessagesPerSec could indicate a failing consumer, but if
its BytesPerSec remains constant, it’s still healthy, just consuming fewer, larger-sized messages.
Observing traffic volume over time, in the context of other metrics, is important for diagnosing
anomalous network usage.
MessagesPerSec: The rate of messages consumed per second may not strongly correlate with the
rate of bytes consumed because messages can be of variable size. Depending on your producers
and workload, in typical deployments you should expect this number to remain fairly constant. By
monitoring this metric over time, you can discover trends in your data consumption and create a
baseline against which you can alert. Again, the shape of this graph depends entirely on your use
case, but in many cases, establishing a baseline and alerting on anomalous behavior is possible.
MinFetchRate: The fetch rate of a consumer can be a good indicator of overall consumer health. A
minimum fetch rate approaching a value of zero could potentially signal an issue on the consumer.
In a healthy consumer, the minimum fetch rate will usually be non-zero, so if you see this value
dropping, it could be a sign of consumer failure.
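
The health-check rules described above can be sketched as a small function. This is only a minimal illustration and not part of the original setup: the `BrokerSnapshot` fields, the `check_health` name and the `min_fetch_floor` threshold are assumptions made for the example, and the values would come from whatever collector reads the JMX MBeans.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerSnapshot:
    """One reading of the health-check metrics (field names are hypothetical)."""
    under_replicated_partitions: int
    isr_shrinks_per_sec: float
    isr_expands_per_sec: float
    min_fetch_rate: float


def check_health(s: BrokerSnapshot, min_fetch_floor: float = 0.1) -> List[str]:
    """Return a list of warnings; an empty list means no rule fired."""
    warnings = []
    # Rule from the metric list: alert when UnderReplicatedPartitions > 0.
    if s.under_replicated_partitions > 0:
        warnings.append("under-replicated partitions present")
    # ISR shrinking without a matching expansion is cause for concern.
    if s.isr_shrinks_per_sec > s.isr_expands_per_sec:
        warnings.append("ISR shrinking faster than expanding")
    # A minimum fetch rate approaching zero may signal a consumer problem.
    # The floor value is an arbitrary assumption, not a Kafka default.
    if s.min_fetch_rate < min_fetch_floor:
        warnings.append("consumer fetch rate near zero")
    return warnings


healthy = BrokerSnapshot(0, 0.0, 0.0, 5.0)
degraded = BrokerSnapshot(3, 0.4, 0.0, 0.0)
print(check_health(healthy))
print(check_health(degraded))
```

In practice the comparison of shrink and expand rates would be made over a time window rather than on a single reading; the single-snapshot form is kept here for brevity.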

Monitoring System Health:
To get a better view of the metrics and their categories, we configured Kafka, Zookeeper and Kafka Connect
in their Docker containers with the following JMX ports and JMX host:
Zookeeper :: JMXPORT=55001 :: JMXHOST=172.16.159.95
Kafka :: JMXPORT=55002 :: JMXHOST=172.16.159.95
Kafka Connect :: JMXPORT=55003 :: JMXHOST=172.16.159.95
We can now connect to them remotely with JConsole, monitor them, and inspect the available metrics.
To do this, use the MBeans tab in JConsole.

Grafana Suggested Dashboard for Monitoring Kafka:
Download link: https://grafana.com/api/dashboards/721/revisions/1/download
Download link: https://github.com/rama-nallamilli/kafka-prometheus-monitoring/blob/master/dashboards/Kafka.json

To configure Prometheus, JMX Exporter, Zookeeper, Kafka and Grafana, we can borrow from the following
workflow, which is in fact a Docker Compose file, and run it:

We can configure prometheus.yml to scrape metrics from Prometheus-jmx-exporter (named projmxexpo
here) as follows:
prometheus.yml
global:
  scrape_interval: 10s
  evaluation_interval: 10s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
          - projmxexpo:5556

The following config.yml should be provided to Prometheus-jmx-exporter (either by mounting it with the
docker -v option or by manually editing the default file inside the container):
config.yml
lowercaseOutputName: true
jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
rules:
  - pattern : kafka.network<type=Processor, name=IdlePercent, networkProcessor=(.+)><>Value
  - pattern : kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>OneMinuteRate
  - pattern : kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value
  - pattern : kafka.server<type=ReplicaFetcherManager, name=MaxLag, clientId=(.+)><>Value
  - pattern : kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>OneMinuteRate
  - pattern : kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate
  - pattern : kafka.server<type=Produce><>queue-size
  - pattern : kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)
  - pattern : kafka.server<type=controller-channel-metrics, broker-id=(.+)><>(.*)
  - pattern : kafka.server<type=socket-server-metrics, networkProcessor=(.+)><>(.*)
  - pattern : kafka.server<type=Fetch><>queue-size
  - pattern : kafka.server<type=SessionExpireListener, name=(.+)><>OneMinuteRate
  - pattern : kafka.controller<type=KafkaController, name=(.+)><>Value
  - pattern : kafka.controller<type=ControllerStats, name=(.+)><>OneMinuteRate
  - pattern : kafka.cluster<type=Partition, name=UnderReplicated, topic=(.+), partition=(.+)><>Value
  - pattern : kafka.utils<type=Throttler, name=cleaner-io><>OneMinuteRate
  - pattern : kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
  - pattern : java.lang<type=(.*)>
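
To see what one of these rules selects, the matching step can be approximated with an ordinary regular expression. The sketch below assumes that the exporter presents each attribute as `domain<key=value, ...><>attribute` and searches the pattern unanchored. The sample MBean strings and the `first_match` helper are made up for illustration, and the exporter's own metric naming is not reproduced here.

```python
import re

# Two rule patterns copied from the config.yml above.
RULES = [
    r"kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)",
    r"kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value",
]


def first_match(bean_attribute: str):
    """Return (rule_index, captured_groups) for the first matching rule, else None."""
    for index, pattern in enumerate(RULES):
        found = re.search(pattern, bean_attribute)  # unanchored search
        if found:
            return index, found.groups()
    return None


# Hypothetical attribute strings in the assumed "domain<props><>attribute" form.
samples = [
    "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value",
    "kafka.log<type=Log, name=LogEndOffset, topic=events, partition=0><>Value",
    "kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count",
]
for sample in samples:
    print(sample, "->", first_match(sample))
```

The third sample is not selected by either rule because its attribute is `Count`, which shows why the attribute part of each pattern matters as much as the MBean name.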

Example for JMXURL:
jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
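
The same URL shape applies to the other JMX endpoints configured earlier in this document. A small helper, shown only as a sketch (the function name `jmx_service_url` and the dictionary keys are illustrative), makes the pattern explicit:

```python
def jmx_service_url(host: str, port: int) -> str:
    """Build an RMI-based JMX service URL for a remote JVM."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


# Host and ports as listed in the "Monitoring System Health" section.
JMX_HOST = "172.16.159.95"
JMX_PORTS = {"zookeeper": 55001, "kafka": 55002, "kafka-connect": 55003}

for service, port in JMX_PORTS.items():
    print(service, jmx_service_url(JMX_HOST, port))
```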
Docker Commands:
Prometheus-jmx-exporter:
docker run -d --name projmxexpo -p 5556:5556 \
  -v "/root/config.yml:/opt/jmx_exporter/config.yml" \
  --link kafka:kafka --link zookeeper:zookeeper \
  quay.io/toraj58/pro-jmx-exporter
Prometheus:
docker run -d --name prometheus -p 9090:9090 \
  -v "/root/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --link projmxexpo:projmxexpo \
  quay.io/toraj58/prometheus
Grafana:
docker run -d --name grafanarc -p 3000:3000 \
  --link prometheus:prometheus \
  quay.io/toraj58/grafanarc

Prometheus:
After starting the Prometheus Docker container, we can open its UI at the following URL:
http://172.16.159.95:9090
We can then add graphs there to monitor the desired metrics.
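
Besides the UI, Prometheus also answers queries over its HTTP API. As a hedged sketch, the helper below only builds the URL for an instant query and does not send it. The `instant_query_url` name is illustrative, and the query uses the built-in `up` series together with the `kafka` job name from prometheus.yml.

```python
from urllib.parse import urlencode

PROMETHEUS = "http://172.16.159.95:9090"  # address used in this document


def instant_query_url(base_url: str, expression: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expression})}"


# 'up' reports whether the last scrape of a target succeeded.
print(instant_query_url(PROMETHEUS, 'up{job="kafka"}'))
```

Opening the printed URL in a browser or fetching it with any HTTP client should return a JSON document, provided the Prometheus container is reachable at that address.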

Prometheus-jmx-exporter:
After starting the Prometheus-jmx-exporter Docker container and exposing port 5556 to the host, we can
open the following URL to see the metrics:
http://172.16.159.95:5556/metrics
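
The page at this URL is plain text in the Prometheus exposition format, one sample per line. The following sketch parses such text into a dictionary. It is a simplified reader written for illustration, and the metric names in `SAMPLE_TEXT` are hypothetical, not names the exporter is known to emit.

```python
import re

# name, optional {labels}, then the value (a trailing timestamp is ignored)
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> dict:
    """Map (metric_name, sorted_label_pairs) to a float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        matched = LINE.match(line)
        if not matched:
            continue
        name, raw_labels, value = matched.groups()
        labels = tuple(sorted(LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


# Hypothetical page content for demonstration only.
SAMPLE_TEXT = """\
# HELP example_under_replicated_partitions Hypothetical metric
# TYPE example_under_replicated_partitions gauge
example_under_replicated_partitions 0.0
example_bytes_in_rate{topic="events"} 1024.5
"""

for key, value in parse_metrics(SAMPLE_TEXT).items():
    print(key, value)
```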

Grafana:
After starting the containers and configuring the whole system with the .yml and JSON files
described in this document, we can see the customized Grafana dashboard for Kafka monitoring.

Running the docker ps and docker images commands gives an overview of the containers and images
we have configured for the monitoring system.
[Figure: Grafana configured for monitoring our event bus with Kafka]

References:
https://github.com/rama-nallamilli/kafka-prometheus-monitoring
https://www.robustperception.io/monitoring-kafka-with-prometheus/
https://grafana.net/dashboards/721
https://blog.serverdensity.com/how-to-monitor-kafka/
https://www.serverdensity.com/
http://docs.confluent.io/3.0.0/kafka/monitoring.html
http://debezium.io/docs/monitoring/
http://126kr.com/article/6kaq7meq2pf
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
