Performance Analysis and
Optimizations for Kafka Streams
Guozhang Wang
Kafka Summit London, May 14, 2019
Outline
• Streams application execution: from 1000 to 400 feet
• Processor topology generation: an optimization problem
• Kafka Streams topology optimization framework
So you want to write a Streams app?
Step I: Read the demo
streams/examples/src/main/java/o.a.k.streams/examples/wordcount/WordCountDemo.java
Step II: Modify the demo code to yours
Step III: Run It!
The fancy way:
kubectl create -f my-kafka-streams-deployment.yaml
Deployment "my-kafka-streams-app" created
kubectl scale deployment my-kafka-streams-app --replicas=2
Deployment "my-kafka-streams-app" scaled
The not-so-fancy way:
java -cp /my/app -Dlog4j.configuration=file:/my/app/conf/log4j.properties \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=17072 \
  -XX:+UseGCLogFileRotation -XX:GCLogFileSize=10M \
  -Xloggc:/my/app/logs/GC.log \
  my.streams.MyStreamsApp /my/app/conf/config.properties
[Kafka Summit SF 2018, Shapira & Sax]
Wait, what’s this?
Kafka Streams Execution (from 1000 feet)
[Figure labels: Kafka Streams, Kafka, StreamThread, JVM]
Kafka Streams Execution (from 900 feet)
[Figure labels: Kafka Streams, Kafka, ProcessorTopology]
Define your Processor Topology in DSL
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.groupBy(…).aggregate(…);
aggregated.toStream().to("topic3");
Define your Processor Topology in DSL
topology.addSource("Source1", "topic1")
        .addSource("Source2", "topic2")
        .addProcessor("Join", LeftJoin::new, "Source1", "Source2")
        .addProcessor("Aggregate", Aggregate::new, "Join")
        .addStateStore(Stores.persistent(…).build(), "Aggregate")
        .addSink("Sink", "topic3", "Aggregate");
Examine your Processor Topology
Topology topology = builder.build();
System.out.println(topology.describe());
Topologies:
  Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [topic1])
      --> KSTREAM-WINDOWED-0000000002
    Source: KSTREAM-SOURCE-0000000001 (topics: [topic2])
      --> KSTREAM-WINDOWED-0000000003
    ……
    Processor: KSTREAM-KEY-SELECT-0000000007 (stores: [])
      --> KSTREAM-FILTER-0000000011
      <-- KSTREAM-MERGE-0000000006
    ……
    Sink: KSTREAM-SINK-0000000010 (topic: MyAggStore-repartition)
      <-- KSTREAM-FILTER-0000000011
  Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000012 (topics: [MyAggStore-repartition])
      --> KSTREAM-AGGREGATE-0000000009
    Processor: KSTREAM-AGGREGATE-0000000009 (stores: [MyAggStore])
      --> KTABLE-TOSTREAM-0000000013
      <-- KSTREAM-SOURCE-0000000012
    ……
    Sink: KSTREAM-SINK-0000000014 (topic: topic3)
      <-- KTABLE-TOSTREAM-0000000013
That looks familiar..
[Kafka Streams Topology Visualizer. https://zz85.github.io/kafka-streams-viz/]
Hmm, that looks bizarre?
Data Parallelism 101 (specially tailored for Kafka)
[Figure labels: Topic 1, Partitions, Producers, Consumers]
groupBy(…).count(…)
[Figure labels: count, 3, 3, 3, 3]
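Why groupBy(…).count(…) needs a shuffle can be illustrated with a small simulation. This is only a sketch with hypothetical names (ShuffleSketch, countPartition, repartitionByKey) and made-up input; it hashes keys with String.hashCode, which is an assumption for illustration and not how Kafka's producer partitioner hashes serialized keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    /** Counts occurrences of each key within a single partition only. */
    static Map<String, Integer> countPartition(List<String> partition) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : partition) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Routes every record to the partition chosen by hashing its grouping key. */
    static List<List<String>> repartitionByKey(List<List<String>> partitions, int numPartitions) {
        List<List<String>> shuffled = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            shuffled.add(new ArrayList<>());
        }
        for (List<String> partition : partitions) {
            for (String key : partition) {
                // Illustrative hash only; all records of one key land in one partition.
                int target = Math.floorMod(key.hashCode(), numPartitions);
                shuffled.get(target).add(key);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        // Hypothetical input: each grouping key is spread over all three partitions.
        List<List<String>> input = List.of(
                List.of("red", "green", "blue"),
                List.of("green", "blue", "red"),
                List.of("blue", "red", "green"));
        // Without a shuffle, a consumer of one partition sees only partial counts.
        System.out.println("partial: " + countPartition(input.get(0)));
        // After the shuffle, every key is counted completely in exactly one partition.
        for (List<String> partition : repartitionByKey(input, 3)) {
            System.out.println("shuffled: " + countPartition(partition));
        }
    }
}
```

Counting before the shuffle yields one occurrence per key in each partition; counting after it yields the full count of three per key, each key appearing in exactly one partition.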
Shuffling in Streams: Repartition Topics
Repartition Topic (partition key = grouped key)
Topologies:
  Sub-topology: 0
    ……
    Sink: KSTREAM-SINK-0000000010 (topic: MyAggStore-repartition)
      <-- KSTREAM-FILTER-0000000011
  Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000012 (topics: [MyAggStore-repartition])
      --> KSTREAM-AGGREGATE-0000000009
    ……
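Spotting repartition topics in a topology description can be automated with plain string matching. The sketch below assumes the description has already been obtained as a String (for example from topology.describe().toString()); the class and method names are hypothetical, and the regular expression is written against the sink line format shown in the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepartitionTopicFinder {
    // Matches lines such as "Sink: KSTREAM-SINK-0000000010 (topic: MyAggStore-repartition)".
    private static final Pattern SINK_TOPIC =
            Pattern.compile("Sink:\\s+\\S+\\s+\\(topic:\\s+(\\S+)\\)");

    /** Returns the sink topics whose name ends with "-repartition", in order of appearance. */
    static List<String> repartitionSinkTopics(String description) {
        List<String> topics = new ArrayList<>();
        Matcher matcher = SINK_TOPIC.matcher(description);
        while (matcher.find()) {
            String topic = matcher.group(1);
            if (topic.endsWith("-repartition")) {
                topics.add(topic);
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the description printed on the slide.
        String description = String.join("\n",
                "Topologies:",
                "  Sub-topology: 0",
                "    Sink: KSTREAM-SINK-0000000010 (topic: MyAggStore-repartition)",
                "      <-- KSTREAM-FILTER-0000000011",
                "  Sub-topology: 1",
                "    Source: KSTREAM-SOURCE-0000000012 (topics: [MyAggStore-repartition])",
                "      --> KSTREAM-AGGREGATE-0000000009",
                "    Sink: KSTREAM-SINK-0000000014 (topic: topic3)",
                "      <-- KTABLE-TOSTREAM-0000000013");
        System.out.println(repartitionSinkTopics(description));
    }
}
```

Each topic this returns is a shuffle boundary between two sub-topologies, so the length of the list is a quick proxy for how many times the data is rewritten to Kafka.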
Kafka Streams Execution (from 500 feet)
[Figure labels: Kafka Streams, Kafka, State]
Stream Partitions and Tasks
[Figure labels: Kafka Topic A, Kafka Topic B, Task1, Task2]
Kafka Streams Execution (from 400 feet)
[Figure labels: Kafka Streams, Kafka, State, State, instance-1, instance-2, instance-3]
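The relationship between stream partitions, tasks, and instances can be sketched as follows. Kafka Streams creates one task per partition index of a sub-topology's source topics, so the task count equals the largest partition count among them. The round-robin placement below is a deliberate simplification and not the real assignor; the class name, method names, and partition counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaskSketch {
    /** One task per partition index: the largest partition count among the source topics. */
    static int numTasks(Map<String, Integer> partitionsPerTopic) {
        return Collections.max(partitionsPerTopic.values());
    }

    /** Spreads task ids over instances round-robin (a simplification of the real assignor). */
    static Map<String, List<Integer>> assign(int numTasks, List<String> instances) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            String owner = instances.get(task % instances.size());
            assignment.get(owner).add(task);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical partition counts for the two source topics of one sub-topology.
        Map<String, Integer> partitions = Map.of("topicA", 4, "topicB", 4);
        int tasks = numTasks(partitions);
        // Task i reads partition i of every source topic and owns its own state.
        System.out.println(assign(tasks, List.of("instance-1", "instance-2", "instance-3")));
    }
}
```

Adding instances beyond the task count leaves the extra instances idle in this model, which is why the partition count bounds the achievable parallelism.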
Repartition Topics for Shuffling
[Figure labels: State, State, instance-1, instance-2, instance-3, …]
Case #1: Duplicate Repartitioning
KStream source = builder.stream("topic1");
KStream mapped = source.map(..);
KTable counts = mapped.groupByKey().aggregate(..);
KStream sink = mapped.leftJoin(counts, ..);
[Figure labels: Map, State, Join, Agg, map-join-repartition, agg-repartition]
Case #1: Duplicate Repartitioning
KStream source = builder.stream("topic1");
KStream shuffled = source.map(..)
                         .through("topic2");
KTable counts = shuffled.groupByKey().count(..);
KStream sink = shuffled.leftJoin(counts, ..);
[Figure labels: Map, State, Join, Agg, topic2]
Key-Changing Operations
Key-Value        Value Only
map              mapValues
flatMap          flatMapValues
transform        transformValues
flatTransform    flatTransformValues
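The distinction in the table above matters because only the key-value operations can invalidate the existing partitioning. A minimal sketch of that bookkeeping, with hypothetical class and method names and operations represented as plain strings rather than real DSL calls:

```java
import java.util.List;
import java.util.Set;

public class KeyChangeSketch {
    // Left column of the table above: operations that may change the record key.
    static final Set<String> KEY_CHANGING =
            Set.of("map", "flatMap", "transform", "flatTransform");

    /** Returns true if any upstream operation may have changed the key. */
    static boolean repartitionRequired(List<String> upstreamOps) {
        for (String op : upstreamOps) {
            if (KEY_CHANGING.contains(op)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A value-only chain keeps the key, so a later groupByKey needs no shuffle.
        System.out.println(repartitionRequired(List.of("mapValues", "flatMapValues")));
        // A single key-changing step forces a shuffle before any key-based operation.
        System.out.println(repartitionRequired(List.of("map", "mapValues")));
    }
}
```

Preferring the value-only variant whenever the key is untouched is therefore the cheapest way to avoid an extra repartition topic.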
Case #2: Clumsy Repartitioning
KStream source = builder.stream("topic1");
KTable aggregated = source.groupBy(..).count(..);
[Figure labels: SelectKey, Agg, agg-repartition, State]
Case #2: Clumsy Repartitioning
KStream source = builder.stream("topic1");
KStream projected = source.mapValues(..);
KTable aggregated = projected.groupBy(..).count(..);
[Figure labels: SelectKey, mapValues, Agg, agg-repartition, State]
Topology Generation (an optimization problem)
Define: write DSL code
Examine: topology.describe
Refine: modify DSL code
Examine: topology.describe
…
Root Cause: Step-by-step Translation
Do I really have to learn all this? No! (beyond Kafka 2.1)
Key Idea: Replace step-by-step translation with logical plan compilation
Topology Optimization Framework
Step-by-step translation: User DSL code, then Parsing and Generation, then Processor Topology
Logical plan compilation: User DSL code, then Parsing, then Logical Plan: Operator Graph (with Logical plan optimization), then Compilation, then Physical Plan: Processor Topology
Logical Plan Optimization
• Currently rule based
• Repartitioning push-up and consolidation
• Sharing topics for source / changelog
• Logical views for table materialization [2.2+]
• etc.
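To make the "repartitioning push-up and consolidation" rule concrete, here is a toy rewrite over a tiny operator tree. It is a sketch under stated assumptions, not Kafka Streams' internal implementation: the Node and Kind types are hypothetical, the plan is modeled as a tree (so the edge from the aggregation result into the join of Case #1 is ignored), and the rule fires only when more than one repartition would otherwise be created below a key-changing operator.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalPlanSketch {
    enum Kind { SOURCE, KEY_CHANGING, REPARTITION, KEY_DEPENDENT }

    static final class Node {
        final String name;
        final Kind kind;
        final List<Node> children = new ArrayList<>();

        Node(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Counts repartition topics: explicit ones plus one per key-dependent operator below a key change. */
    static int countRepartitions(Node node, boolean keyChanged) {
        int count = 0;
        boolean dirty = keyChanged || node.kind == Kind.KEY_CHANGING;
        if (node.kind == Kind.REPARTITION || (node.kind == Kind.KEY_DEPENDENT && dirty)) {
            count++;
            dirty = false; // data is correctly partitioned again after the shuffle
        }
        for (Node child : node.children) {
            count += countRepartitions(child, dirty);
        }
        return count;
    }

    /** Rewrite rule: replace several downstream shuffles with one shared repartition node. */
    static void consolidate(Node node) {
        if (node.kind == Kind.KEY_CHANGING) {
            int needed = 0;
            for (Node child : node.children) {
                needed += countRepartitions(child, true);
            }
            if (needed > 1) {
                Node shared = new Node(node.name + "-repartition", Kind.REPARTITION);
                shared.children.addAll(node.children);
                node.children.clear();
                node.children.add(shared);
            }
        }
        for (Node child : new ArrayList<>(node.children)) {
            consolidate(child);
        }
    }

    /** Builds the simplified plan of Case #1: source, then map, then both Agg and Join. */
    static Node caseOnePlan() {
        Node map = new Node("map", Kind.KEY_CHANGING)
                .add(new Node("agg", Kind.KEY_DEPENDENT))
                .add(new Node("join", Kind.KEY_DEPENDENT));
        return new Node("source", Kind.SOURCE).add(map);
    }

    public static void main(String[] args) {
        Node plan = caseOnePlan();
        System.out.println("before: " + countRepartitions(plan, false));
        consolidate(plan);
        System.out.println("after: " + countRepartitions(plan, false));
    }
}
```

On this plan the count drops from two repartition topics to one, which mirrors the manual through("topic2") fix shown for Case #1.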
Case #3: Unnecessary Materialization
KTable table1 = builder.table("topic1");
KTable table2 = builder.table("topic2");
table1.filter(..).join(table2, ..);
[Figure labels: filter, State, State, State, join]
Case #3 with Optimization Enabled
KTable table1 = builder.table("topic1");
KTable table2 = builder.table("topic2");
table1.filter(..).join(table2, ..);
[Figure labels: filter, State, View, State, join]
Enable Topology Optimization
config: topology.optimization = all (default = none) [KIP-295]
code: StreamsBuilder#build(props)
upgrade: watch out for compatibility
[https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/]
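A minimal sketch of the configuration step, using only java.util.Properties so that it runs without the Kafka Streams library on the classpath. The class name, application id, and broker address are placeholders; the config key and value are the ones given above.

```java
import java.util.Properties;

public class OptimizationConfigSketch {
    /** Builds a configuration that opts in to topology optimization. */
    static Properties optimizedConfig(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Config name and value as given on the slide; the default is "none".
        props.put("topology.optimization", "all");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder application id and broker address.
        Properties props = optimizedConfig("my-kafka-streams-app", "localhost:9092");
        // In a real application the same Properties object would be passed to
        // StreamsBuilder#build(props) so the optimizer can act while the topology
        // is generated, and then again to the KafkaStreams constructor.
        System.out.println(props.getProperty("topology.optimization"));
    }
}
```

Setting the property alone is not enough: the optimization happens while the topology is built, so the properties have to reach the build call as well.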
What’s next
[KIP-372, KIP-307]
• Extensible optimization framework
• More re-write rules!
• Beyond all-or-nothing config
• Compatibility support for optimized topology
[KAFKA-6034]
Take-aways
• Optimize your topology for better performance and a smaller footprint
System.out.println(topology.describe());
• It’s OK if you forget the first bullet point: Kafka Streams will help do that for you!
THANKS!
Guozhang Wang | guozhang@confluent.io | @guozhangwang
We are Hiring!

