5/16/2017
Kafka Metrics and Monitoring
With Prometheus, Grafana, Prometheus-jmx-exporter and graf-db, based on Docker
Touraj Ebrahimi
Introduction:
In general, Kafka, Zookeeper, and Kafka Connect provide many metrics for monitoring, which can be
exposed and collected via JMX. The metrics fall into several categories, the most important of which are:
• System Metrics
• Zookeeper Metrics
• Consumer Metrics
• Producer Metrics
• Connect Metrics
• Kafka-Server Metrics
• Kafka-Cluster Metrics
• Kafka-log Metrics
• Kafka-Network Metrics
Of the metrics above, we focus on those that give us tangible information and, in particular, can be
used for system health checks.
Below we describe several important health-check metrics and specify the conditions under which the
system state is not healthy.
The following metrics are also recommended for Kafka health checks:
Metric: UnderReplicatedPartitions
  MBean: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  Description: An alert should be raised when the value is greater than 0.

Metric: IsrShrinksPerSec / IsrExpandsPerSec
  MBeans: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
          kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
  Description: The in-sync replica set should not shrink often; investigate if it shrinks frequently.

Metric: Request rate
  Before version 0.9: kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestRateAndTimeMs,clientId=([-.\w]+)
  After version 0.9: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
  Description: Average number of requests sent per second.

Metric: BytesPerSec
  Before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=BytesPerSec,clientId=([-.\w]+)
  After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
  Description: Bytes consumed per second.

Metric: MessagesPerSec
  Before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+)
  After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
  Description: Messages consumed per second.

Metric: MinFetchRate
  Before version 0.9: kafka.consumer:type=ConsumerFetcherManager,name=MinFetchRate,clientId=([-.\w]+)
  After version 0.9: attribute fetch-rate of kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
  Description: Minimum rate at which a consumer sends fetch requests to the broker.
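The client-ID part of the MBean names above is a regular expression (written `([-.\w]+)` in the Kafka documentation) that stands for the actual client ID. As a minimal sketch, the following Python snippet shows how such a pattern picks the client ID out of a concrete MBean name. The helper name `extract_client_id` and the client ID `consumer-1` are made up for illustration and are not part of Kafka.

```python
import re

# Pattern for the post-0.9 consumer fetch-manager MBean. The capture group
# ([-.\w]+) accepts letters, digits, underscores, dots, and hyphens.
CONSUMER_FETCH_MBEAN = re.compile(
    r"kafka\.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)"
)


def extract_client_id(mbean_name):
    """Return the client ID embedded in an MBean name, or None if it does not match."""
    match = CONSUMER_FETCH_MBEAN.fullmatch(mbean_name)
    return match.group(1) if match else None


if __name__ == "__main__":
    # "consumer-1" is a hypothetical client ID used only for illustration.
    name = "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1"
    print(extract_client_id(name))
```

The same approach applies to the producer MBean by swapping the literal prefix of the pattern.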
The metrics recommended above for health checks are described as follows.
Recommended by: Gwen Shapira, System Architect at Confluent
UnderReplicatedPartitions: In a healthy cluster, the number of in-sync replicas (ISRs) should be
exactly equal to the total number of replicas. If partition replicas fall too far behind their leaders, the
follower partition is removed from the ISR pool, and you should see a corresponding increase in
IsrShrinksPerSec. Since Kafka’s high-availability guarantees cannot be met without replication,
investigation is certainly warranted should this metric value exceed zero for extended time periods.
IsrShrinksPerSec/IsrExpandsPerSec: The number of in-sync replicas (ISRs) for a particular
partition should remain fairly static; the only exceptions are when you are expanding your broker
cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a
minimum number of ISRs for failover. A replica could be removed from the ISR pool for a couple of
reasons: it is too far behind the leader’s offset (user-configurable by setting the
replica.lag.max.messages configuration parameter, which was removed in Kafka 0.9 in favor of the
time-based replica.lag.time.max.ms), or it has not contacted the leader for some time
(configurable with the replica.socket.timeout.ms parameter). No matter the reason, an increase in
IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter is cause
for concern and requires user intervention. The Kafka documentation provides a wealth of
information on the user-configurable parameters for brokers.
Request rate: The request rate is the rate at which producers send data to brokers. Of course, what
constitutes a healthy request rate will vary drastically depending on the use case. Keeping an eye on
peaks and drops is essential to ensure continuous service availability. If rate limiting (available in
version 0.9+) is not enabled, brokers could slow to a crawl during a traffic spike as they struggle to
process a rapid influx of data.
BytesPerSec: As with producers and brokers, you will want to monitor your consumer network
throughput. For example, a sudden drop in MessagesPerSec could indicate a failing consumer, but if
its BytesPerSec remains constant, it’s still healthy, just consuming fewer, larger-sized messages.
Observing traffic volume over time, in the context of other metrics, is important for diagnosing
anomalous network usage.
MessagesPerSec: The rate of messages consumed per second may not strongly correlate with the
rate of bytes consumed because messages can be of variable size. Depending on your producers
and workload, in typical deployments you should expect this number to remain fairly constant. By
monitoring this metric over time, you can discover trends in your data consumption and create a
baseline against which you can alert. Again, the shape of this graph depends entirely on your use
case, but in many cases, establishing a baseline and alerting on anomalous behavior is possible.
MinFetchRate: The fetch rate of a consumer can be a good indicator of overall consumer health. A
minimum fetch rate approaching a value of zero could potentially signal an issue on the consumer.
In a healthy consumer, the minimum fetch rate will usually be non-zero, so if you see this value
dropping, it could be a sign of consumer failure.
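The rules of thumb above can be sketched as a small check over one snapshot of metric values. This is only an illustration under stated assumptions: the function name `evaluate_health`, the dictionary keys, and the `min_fetch_rate_floor` threshold of 0.1 are hypothetical, and comparing shrinks with expands in a single snapshot simplifies the "shortly thereafter" condition described in the text.

```python
def evaluate_health(metrics, min_fetch_rate_floor=0.1):
    """Return a list of warning strings for one snapshot of metric values.

    `metrics` is a plain dict such as {"UnderReplicatedPartitions": 0, ...};
    how the values are collected (JMX, Prometheus, ...) is out of scope here.
    """
    warnings = []

    # Any under-replicated partition is worth investigating.
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        warnings.append("UnderReplicatedPartitions is above zero")

    # ISR shrinks should be followed by matching expands.
    shrinks = metrics.get("IsrShrinksPerSec", 0.0)
    expands = metrics.get("IsrExpandsPerSec", 0.0)
    if shrinks > expands:
        warnings.append("ISR shrinks are not matched by ISR expands")

    # A minimum fetch rate near zero may indicate a failing consumer.
    if metrics.get("MinFetchRate", float("inf")) < min_fetch_rate_floor:
        warnings.append("MinFetchRate is approaching zero")

    return warnings


if __name__ == "__main__":
    snapshot = {
        "UnderReplicatedPartitions": 2,
        "IsrShrinksPerSec": 1.0,
        "IsrExpandsPerSec": 0.0,
        "MinFetchRate": 4.5,
    }
    for warning in evaluate_health(snapshot):
        print(warning)
```

In practice the thresholds depend on the workload, so they are better derived from an observed baseline than hard-coded.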
Monitoring System Health:
To get a better view of the metrics and their categories, we configured Kafka, Zookeeper, and Kafka
Connect in their Docker containers with the following JMX ports and JMX hosts:
Zookeeper :: JMXPORT=55001 :: JMXHOST=172.16.159.95
Kafka :: JMXPORT=55002 :: JMXHOST=172.16.159.95
Kafka Connect :: JMXPORT=55003 :: JMXHOST=172.16.159.95
We can now connect to them remotely with JConsole, monitor them, and inspect the available metrics.
To do this, use the MBeans tab in JConsole:
[Figures: JConsole screenshots]
Grafana Suggested Dashboard for Monitoring Kafka:
Download link: https://grafana.com/api/dashboards/721/revisions/1/download
Download link: https://github.com/rama-nallamilli/kafka-prometheus-monitoring/blob/master/dashboards/Kafka.json
To configure Prometheus, JMX Exporter, Zookeeper, Kafka, and Grafana, we can draw on the following
workflow, which is in fact a Docker Compose file, and run it:
[Figure: Docker Compose workflow]
We can configure prometheus.yml to scrape metrics from the Prometheus-jmx-exporter (named
projmxexpo here) as follows:
prometheus.yml
global:
  scrape_interval: 10s
  evaluation_interval: 10s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
        - projmxexpo:5556
The following config.yml should be supplied to the Prometheus-jmx-exporter (either by mounting it with
docker -v or by manually editing the default file inside the Docker container):
config.yml
lowercaseOutputName: true
jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
rules:
  - pattern: kafka.network<type=Processor, name=IdlePercent, networkProcessor=(.+)><>Value
  - pattern: kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>OneMinuteRate
  - pattern: kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value
  - pattern: kafka.server<type=ReplicaFetcherManager, name=MaxLag, clientId=(.+)><>Value
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>OneMinuteRate
  - pattern: kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate
  - pattern: kafka.server<type=Produce><>queue-size
  - pattern: kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)
  - pattern: kafka.server<type=controller-channel-metrics, broker-id=(.+)><>(.*)
  - pattern: kafka.server<type=socket-server-metrics, networkProcessor=(.+)><>(.*)
  - pattern: kafka.server<type=Fetch><>queue-size
  - pattern: kafka.server<type=SessionExpireListener, name=(.+)><>OneMinuteRate
  - pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
  - pattern: kafka.controller<type=ControllerStats, name=(.+)><>OneMinuteRate
  - pattern: kafka.cluster<type=Partition, name=UnderReplicated, topic=(.+), partition=(.+)><>Value
  - pattern: kafka.utils<type=Throttler, name=cleaner-io><>OneMinuteRate
  - pattern: kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
  - pattern: java.lang<type=(.*)>
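To see what a rule's pattern matches, it can help to try it against a sample string. The jmx_exporter documentation describes the matched input as `domain<beanProperty=value, ...><key1, key2, ...>attrName: value` and notes that patterns are not anchored. The sketch below mimics this with Python's `re.search`. It is an illustration rather than the exporter's own code: the exporter uses Java regular expressions (which behave the same for these simple patterns), and the sample bean strings are hand-written assumptions.

```python
import re

# One of the rules from config.yml, used unchanged as a Python regex.
RULE = re.compile(
    r"kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)"
)


def match_rule(bean_string):
    """Return the captured (name, attribute) pair, or None if the rule does not match."""
    found = RULE.search(bean_string)  # search() because the pattern is unanchored
    return found.groups() if found else None


if __name__ == "__main__":
    # Hand-written sample in the exporter's "domain<props><keys>attr: value" shape.
    sample = "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value: 0"
    print(match_rule(sample))
```

Testing patterns this way before mounting config.yml makes it easier to spot a rule that silently matches nothing.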
Example for JMXURL:
jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
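The jmxUrl value follows a fixed shape in which only the host and port vary. As a small sketch, the helper below (the name `jmx_url` is hypothetical) builds the URL for each of the three JMX endpoints configured earlier in this document:

```python
def jmx_url(host, port):
    """Build the RMI-based JMX service URL for a host and JMX port."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


# Host and ports as configured for the containers in this document.
JMX_ENDPOINTS = {
    "zookeeper": ("172.16.159.95", 55001),
    "kafka": ("172.16.159.95", 55002),
    "kafka-connect": ("172.16.159.95", 55003),
}

if __name__ == "__main__":
    for service, (host, port) in JMX_ENDPOINTS.items():
        print(service, jmx_url(host, port))
```

Since the config.yml shown here takes a single jmxUrl, scraping Zookeeper or Kafka Connect as well would presumably need one exporter instance per endpoint.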
Docker Commands:
Prometheus-jmx-exporter:
docker run -d --name projmxexpo -p 5556:5556 \
  -v "/root/config.yml:/opt/jmx_exporter/config.yml" \
  --link kafka:kafka --link zookeeper:zookeeper \
  quay.io/toraj58/pro-jmx-exporter
Prometheus:
docker run -d --name prometheus -p 9090:9090 \
  -v "/root/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --link projmxexpo:projmxexpo \
  quay.io/toraj58/prometheus
Grafana:
docker run -d --name grafanarc -p 3000:3000 \
  --link prometheus:prometheus \
  quay.io/toraj58/grafanarc
Prometheus:
After starting the Prometheus Docker container, we can open its UI at the following URL:
http://172.16.159.95:9090
We can then add any number of graphs to monitor the desired metrics.
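Besides the UI, Prometheus exposes an HTTP API whose instant-query endpoint is `/api/v1/query`. The sketch below builds such a query URL and extracts values from a response body; it performs no network calls, so fetching the URL is left to the reader. The helper names are hypothetical, and the query `up{job="kafka"}` assumes the job name from the prometheus.yml in this document (`up` is the scrape-health series that Prometheus records for every target).

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for Prometheus's instant-query endpoint /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_text):
    """Return (labels, value) pairs from an instant-query JSON response body."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        return []
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in body["data"]["result"]
    ]


if __name__ == "__main__":
    print(instant_query_url("http://172.16.159.95:9090", 'up{job="kafka"}'))
```

A value of 1 for `up` means the last scrape of the exporter succeeded, which makes it a convenient first check before looking at Kafka-specific series.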
Prometheus-jmx-exporter:
After starting the Prometheus-jmx-exporter Docker container and exposing port 5556 to the host, we can
open the following URL to see the metrics:
http://172.16.159.95:5556/metrics
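The /metrics page is served in the Prometheus text exposition format: comment lines start with `#`, and every other line is a series followed by its value. A minimal parsing sketch is shown below. It assumes lines carry no trailing timestamp, and the metric names in the sample are invented for illustration rather than copied from a real exporter.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        # Split on the last space so label values containing spaces stay intact.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    # Hypothetical sample; real output depends on the exporter rules in config.yml.
    sample = "\n".join([
        "# HELP example_threads Example gauge",
        "# TYPE example_threads gauge",
        "example_threads 42.0",
        'example_metric{topic="events"} 3.0',
    ])
    for series, value in parse_metrics(sample).items():
        print(series, value)
```

This is enough for quick inspection in a script; for anything beyond that, a maintained client library is a safer choice than a hand-rolled parser.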
Grafana:
After starting the containers and configuring the whole system with the .yml files, JSON files, etc.
described in this document, we can see the customized Grafana dashboard for Kafka monitoring.
If we run the docker ps and docker images commands, the output gives us an overview of the
containers and images we have configured for the monitoring system.
Configured Grafana for monitoring our event bus with Kafka:
References:
https://github.com/rama-nallamilli/kafka-prometheus-monitoring
https://www.robustperception.io/monitoring-kafka-with-prometheus/
https://grafana.net/dashboards/721
https://blog.serverdensity.com/how-to-monitor-kafka/
https://www.serverdensity.com/
http://docs.confluent.io/3.0.0/kafka/monitoring.html
http://debezium.io/docs/monitoring/
http://126kr.com/article/6kaq7meq2pf
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
