What is the State of my Kafka Streams Application?
Unleashing Metrics.
Neil Buesing
Kafka Summit Europe 2021
@nbuesing nbuesing
Takeaways
1. See how to extract metrics from your application using existing JMX tooling.

2. Walk through how to build a dashboard for observing those metrics.

3. Explore options for adding additional JMX resources and Kafka Streams
metrics to your application.

4. Verify you built your dashboard correctly by creating a control data set to
validate it.

5. Go beyond what you can collect from the Kafka Streams metrics.

6. An analysis and overview of the provided metrics, including the new end-to-end
metrics of Kafka Streams 2.7.
So Many Metrics
• Kafka Broker

• JVM

• Kafka Client

• Producer

• Consumer

• Kafka Streams
Kafka Streams Metrics
• Kafka Clients

• Producer

• Consumer 

• Admin

• Restore Consumer

• Client
• Processing
• Thread

• Task (sub-topology)

• Processor Node

• Persistence Metrics
• State Store

• RocksDB

• Record Cache
Kafka Streams Metrics
(diagram: two JVM instances, each running threads that own tasks, processors, and state stores)
• Job (Application)

• Instance (JVM)

• Thread (Worker)

• Task (sub-topology)

• Partition

• State Stores
Desire even distribution when scaling?
Consider 12 or 24 partitions:
12 -> divides evenly across 1/2/3/4/6/12
24 -> divides evenly across 1/2/3/4/6/8/12/24
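A quick sketch of why those counts help: only thread (or instance) counts that divide the partition count evenly give every worker the same number of partitions.

int partitions = 12;
for (int threads = 1; threads <= 12; threads++) {
    System.out.printf("%2d threads -> %s%n", threads,
        partitions % threads == 0 ? (partitions / threads) + " partitions each" : "uneven");
}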
Let's Build A Dashboard
https://pixy.org/4796106/
What's Needed
• Access to the Metrics



• Time Series Database



• Visualization
https://pixabay.com/images/id-309118/
Getting the Metrics
• MetricsReporter

• Built-in JMXReporter

• RMI - JConsole (UI), jmxterm (CLI)

• Java Agents - JMX Exporter (Prometheus), Jolokia

• Frameworks

• 3rd Party Metric Reporters

• Write Your Own
JMX
is the reason for the warning
when using . and _ in topic names:
. gets converted to _ in JMX metric names.
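The same MBeans that JConsole browses can also be read programmatically; a minimal sketch using the standard javax.management API, meant to run inside the same JVM as the Streams application (the MBean pattern matches the stream-thread-metrics group used in the exporter rules below):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

static void dumpThreadMBeans() throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    for (ObjectName name : server.queryNames(new ObjectName("kafka.streams:type=stream-thread-metrics,*"), null)) {
        // every metric in the group is exposed as an MBean attribute, e.g. process-rate
        System.out.println(name + " process-rate=" + server.getAttribute(name, "process-rate"));
    }
}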
Prometheus JMX Exporter
streams-config.yml
• Give Me Everything
lowercaseOutputName: true
rules:
- pattern: (.*)<type=(.*)><>(.*)
• http://{hostname}:7071
-javaagent:/jmx_prometheus_javaagent.jar=7071:/streams-config.yml
Prometheus JMX Exporter
streams-config.yml
• Give Me Just What I Want
lowercaseOutputName: true
rules:
- pattern: java.lang<type=(.*)>
- pattern: kafka.streams<type=stream-thread-metrics, thread-id=(.+)><>(.*)
- pattern: kafka.streams<type=stream-task-metrics, thread-id=(.+),
task-id=(.+)><>(.*)
- pattern: kafka.consumer.*<client-id=(.+)><>(.*)
- pattern: kafka.producer.*<client-id=(.+)><>(.*)
time curl -s http://localhost:7071 | wc -l

2244 | 0.179 total time
time curl -s http://localhost:7071 | wc -l
4243 | 0.490 total time
Prometheus JMX Exporter
streams-config.yml
• Give Me Just What I Want And the Way I Want It
lowercaseOutputName: true
rules:
- pattern: kafka.streams<type=stream-record-cache-metrics, thread-id=(.+),
task-id=(.+)_(.+), record-cache-id=(.+)><>(.+):.+
name: kafka_streams_stream_record_cache_metrics
labels:
thread_id: "$1"
task_id: "$2"
partition_id: "$3"
record_cache_id: "$4"
metric: "$5"
See streams/docker/streams-config.yml
in the GitHub project
for a complete example
Time Series Database
• Prometheus
relabeling strips the port from the address,
so the dashboard's instance label is just the hostname
- job_name: targets
file_sd_configs:
- files:
- /etc/prometheus/targets.json
relabel_configs:
- source_labels: [__address__]
regex: '(.*):(.*)'
target_label: instance
replacement: '$1'
Dashboard
Variables
Dashboard
• ~5 / second?

• what's wrong?

• Grafana computed?
• Name the topology processors — otherwise you get auto-generated names like

KSTREAM-PEEK-0000000027

• Client property client.id and thread_id

• Utilize the JMX Prometheus Exporter

• Grafana autocomplete

• =~ (regex match)

• Breakdown Graph & Total
Let's Build A Dashboard: Takeaways
Baseline Application
Application State

• Users - KTable

• Store - Global KTable

• Products - KTable

• Assembly (Purchase Order) - Aggregate
(topology diagram: orders-purchase → repartition → attach user & store → attach line item pricing → assemble → orders-pickup)
4 Tasks (sub-topologies)
builder.<String, PurchaseOrder>stream(options.getPurchaseTopic(), Consumed.as("purchase-order-source"))
.selectKey((k, v) -> v.getUserId(), Named.as("purchase-order-keyByUserId"))
.join(users, (purchaseOrder, user) -> {
purchaseOrder.setUser(user);
return purchaseOrder;
}, Joined.as("purchase-order-join-user"))
.join(stores, (k, v) -> v.getStoreId(), (purchaseOrder, store) -> {
purchaseOrder.setStore(store);
return purchaseOrder;
}, Named.as("purchase-order-join-store"))
.flatMap((k, v) -> v.getItems().stream().map(item -> KeyValue.pair(item.getSku(), v)).collect(Collectors.toList()),
Named.as("purchase-order-products-flatmap"))
.join(products, (purchaseOrder, product) -> {
purchaseOrder.getItems().stream().filter(item -> item.getSku().equals(product.getSku())).forEach(item -> item.setPrice(product.getPrice()));
return purchaseOrder;
}, Joined.as("purchase-order-join-product"))
.groupBy((k, v) -> v.getOrderId(), Grouped.as("pickup-order-groupBy-orderId"))
.reduce((incoming, aggregate) -> {
if (aggregate == null) {
aggregate = incoming;
} else {
final PurchaseOrder purchaseOrder = aggregate;
incoming.getItems().stream().forEach(item -> {
if (item.getPrice() != null) {
purchaseOrder.getItems().stream().filter(i -> i.getSku().equals(item.getSku())).forEach(i -> i.setPrice(item.getPrice()));
}
});
}
return aggregate;
}, Named.as("pickup-order-reduce"), Materialized.as("pickup-order-reduce-store"))
.filter((k, v) -> v.getItems().stream().allMatch(i -> i.getPrice() != null), Named.as("pickup-order-filtered"))
.toStream(Named.as("pickup-order-reduce-tostream"))
.to(options.getPickupTopic(), Produced.as("pickup-orders"));
KTable
GlobalKTable
KTable
Aggregate
Kafka Streams Metrics
• configurations

• metrics.recording.level - info
• info, debug, & trace
• metrics.reporters
• metrics.sample.window.ms - 30 seconds

• metrics.num.samples - 2

• built.in.metrics.version - latest
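As a sketch, those settings in code (application id and bootstrap servers are placeholders; the defaults noted in the comments match the list above):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-dashboards");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "debug");       // info (default), debug, trace
props.put(StreamsConfig.METRICS_SAMPLE_WINDOW_MS_CONFIG, 30_000);       // default 30 seconds
props.put(StreamsConfig.METRICS_NUM_SAMPLES_CONFIG, 2);                 // default
props.put(StreamsConfig.BUILT_IN_METRICS_VERSION_CONFIG, "latest");     // default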
Thread Metrics
• operation

• commit

• poll

• process

• metric

• total

• rate
• job (application)

• instance (JVM)

• thread (worker)
Thread Metrics
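These thread-level numbers are also available in-process from KafkaStreams#metrics(); a sketch that dumps the stream-thread-metrics group (group name as seen in the exporter patterns earlier):

import org.apache.kafka.streams.KafkaStreams;

void logThreadMetrics(final KafkaStreams streams) {
    streams.metrics().forEach((name, metric) -> {
        if ("stream-thread-metrics".equals(name.group())) {
            // e.g. process-rate, poll-rate, commit-total, tagged with thread-id
            System.out.printf("%s%s = %s%n", name.name(), name.tags(), metric.metricValue());
        }
    });
}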
Task Metrics
• operation

• commit

• process

• metric

• total

• rate
• job (application)

• instance (JVM)

• thread (worker)

• task

• subtopology

• partition
Task Metrics
Processor Metrics
• operation

• process

• metric

• total

• rate

• e2e latency
• job (application)

• instance (JVM)

• thread (worker)

• task

• subtopology

• partition

• process node
ProcessorNodeMetrics.e2ELatencySensor()
KAFKA-9983: task-level metrics exposed for all source and terminal nodes
Processor Metrics
Processor Metrics - e2e
State Store Metrics
• operation

• put {-if-absent/-all}

• get

• delete

• flush

• restore

• all

• metric

• total

• rate

• latency-{avg|max}

in nanoseconds
• job (application)

• instance (JVM)

• thread (worker)

• task

• subtopology

• partition

• store type
• store name
State Store Metrics
Additional RocksDB Metrics

Property Metrics

num-immutable-mem-table
cur-size-active-mem-table
cur-size-all-mem-tables
size-all-mem-tables
num-entries-active-mem-table
num-entries-imm-mem-tables
num-deletes-active-mem-table
num-deletes-imm-mem-tables
mem-table-flush-pending
num-running-flushes
compaction-pending
num-running-compactions
estimate-pending-compaction-bytes
total-sst-files-size
live-sst-files-size
num-live-versions
block-cache-capacity
block-cache-usage
block-cache-pinned-usage
estimate-num-keys
estimate-table-readers-mem
background-errors

Statistical Metrics

bytes-written-rate
bytes-read-rate
memtable-bytes-flushed-rate
memtable-hit-ratio
block-cache-data-hit-ratio
block-cache-index-hit-ratio
block-cache-filter-hit-ratio
write-stall-duration-avg
write-stall-duration-total
bytes-read-compaction-rate
number-open-files
number-file-errors-total
Record Cache Metrics
• metric

• hit-ratio-min

• hit-ratio-max

• hit-ratio-avg
• job (application)

• instance (JVM)

• thread (worker)

• task

• subtopology

• partition
• store name
Record Cache Metrics
Explore The Metrics
• 1 Application (Job)

• 2 Instances (JVMs)

• 2 Threads/Instance

• 4 Partitions - even distribution
(topology diagram: orders-purchase → repartition → attach user & store → attach line item pricing → assemble → orders-pickup)
• 10 orders/second & 3 line-items/order

• Scenarios
• network-issues for 1 instance

• 100ms delay to brokers

• processing issue

• 100ms pricing delay
Traffic Control
tc Linux command
100ms network latency 'stream'
100ms pricing processing
No Issues Pricing (#4) 100ms delay
Explore The Metrics: Takeaways
• Understand how Operations will use the Dashboards

• Baseline / Control Application

• Chaos Monkey
Alternate Collection Example
• Access to the Metrics

• Custom Metric
• Custom Reporter
• Time Series Database

• Apache Druid
• Visualization

• Druid SQL Query
Custom Metric
.transformValues(() -> new ValueTransformerWithKey<String, PurchaseOrder, PurchaseOrder>() {
private Sensor sensor;
public void init(ProcessorContext context) {
sensor = createSensor(Thread.currentThread().getName(),
context.taskId().toString(), "purchase-order-lineitem-counter", (StreamsMetricsImpl) context.metrics());
}
public PurchaseOrder transform(String readOnlyKey, PurchaseOrder value) {
sensor.record(value.getItems().size());
return value;
}
public Sensor createSensor(final String threadId, final String taskId, final String processorNodeId, final StreamsMetricsImpl sm) {
// nodeLevelSensor, addAvgAndMinAndMaxToSensor, and PROCESSOR_NODE_LEVEL_GROUP are internal StreamsMetricsImpl APIs
final Sensor sensor = sm.nodeLevelSensor(threadId, taskId, processorNodeId, processorNodeId + "-lineitems", Sensor.RecordingLevel.INFO);
addAvgAndMinAndMaxToSensor(sensor, PROCESSOR_NODE_LEVEL_GROUP, sm.nodeLevelTagMap(threadId, taskId, processorNodeId),
"lineitems", "avg-doc", "min-doc", "max-doc");
return sensor;
}
public void close() { }
}
StreamsMetricsImpl
Custom Reporter
private ObjectNode jsonNode(final KafkaMetric metric) {
ObjectNode objectNode = JsonNodeFactory.instance.objectNode();
... // populate objectNode with immutable data from metric
return objectNode;
}
public void run() {
map.forEach((k, v) -> {
final KafkaMetric metric = v.getKey();
final ObjectNode node = v.getValue();
node.put("value", v.getKey().value());
node.put("timestamp", System.currentTimeMillis());
producer.send(
new ProducerRecord<>(topic,
metric.metricName().name(), serialize(node)));
});
}
@Override
public void init(final List<KafkaMetric> metrics) {
metrics.forEach(metric -> {
map.put(metric.metricName(),
Pair.of(metric, jsonNode(metric)));
});
executor.scheduleAtFixedRate(runnable, 5L, 5L,
TimeUnit.SECONDS);
}
@Override
public void metricChange(final KafkaMetric metric) {
map.put(metric.metricName(),
Pair.of(metric, jsonNode(metric)));
}
@Override
public void metricRemoval(KafkaMetric metric) {
map.remove(metric.metricName());
}
@Override
public void close() {
this.executor.shutdownNow();
}

If the reporter uses a Kafka producer, be sure to
config.remove("metric.reporters")
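Why: the producer is built from the application's config, and leaving metric.reporters in place would make that producer instantiate the reporter again. Registering the reporter itself goes through the same config; a sketch (the reporter class name and the extra topic property are hypothetical):

props.put(StreamsConfig.METRIC_REPORTER_CLASSES_CONFIG, KafkaTopicMetricsReporter.class.getName()); // hypothetical reporter class
props.put("metrics.topic", "_metrics-kafka-streams"); // assumed custom property read in the reporter's configure()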
Containerized Druid and configuration
part of GitHub project
Druid
{
"type": "kafka",
"spec": {
"ioConfig": {...},
"dataSchema": {
"dataSource": "_metrics-kafka-streams",
"granularitySpec": {
"type": "uniform",
"queryGranularity": "fifteen_minute",
"segmentGranularity": "fifteen_minute",
"rollup": true
},
"timestampSpec": {
"column": "timestamp",
"format": "millis"
},
"dimensionsSpec": {
"dimensions": [
<<COLUMNS>>
]
},
"metricsSpec": [
{
"name": "count",
"type": "count"
},
{
"name": "value",
"type": "doubleLast",
"fieldName": "value"
}
]
}
}
}
Takeaways
• 2.5 - Metric Overhaul (KIP-444)

• 2.7 - Added e2e Latency (KIP-613)

• "Discovery" Dashboards

• Validate 

• `-total` metrics great for validation 

• Extensible
Thank you
https://github.com/nbuesing/kafka-streams-dashboards



(Docker and Java 11 required)

./scripts/startup.sh
@nbuesing nbuesing
