DEEP DIVE INTO / INTERNALS OF KAFKA STREAMS
KAFKA STREAMS 2.2.0
@JACEKLASKOWSKI / STACKOVERFLOW / GITHUB
THE "INTERNALS" BOOKS: KAFKA STREAMS / APACHE KAFKA
MY VERY FIRST #KAFKASUMMIT!
Jacek Laskowski is a freelance IT consultant
Core competencies in Spark, Kafka, Kafka Streams, Scala
Development | Consulting | Training
Among contributors to Apache Spark
Contact me at jacek@japila.pl
Follow @JacekLaskowski on twitter for more #ApacheSpark, #ApacheKafka, #KafkaStreams
Jacek is best known for the online "Internals" books:
1. The Internals of Apache Spark
2. The Internals of Spark SQL
3. The Internals of Spark Structured Streaming
4. The Internals of Kafka Streams
5. The Internals of Apache Kafka
Jacek is "active" on StackOverflow
AGENDA
1. Kafka Streams
2. Main Development Entities
   1. Topology
   2. KafkaStreams
3. Main Execution Entities
   1. StreamThread
   2. TaskManager
   3. StreamTask and StandbyTask
   4. StreamsPartitionAssignor
   5. RebalanceListener
KAFKA STREAMS (1 OF 2)
1. Kafka Streams is a client library for stream processing applications that process records in Kafka topics
   Low-level Processor API
2. Stream processing primitives, e.g. KStream and KTable
   High-level Streams DSL
3. Topology to describe the processing flow
   » One record at a time «
4. Wrapper around the Kafka Producer and Consumer APIs
5. Supports fault-tolerant local state for stateful operations (e.g. windowed aggregations and joins)
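To make the high-level Streams DSL a little more concrete, here is a minimal sketch. It assumes the Scala API for Kafka Streams 2.2.0 (kafka-streams-scala) is on the classpath; the topic names demo-input and demo-output and the object name DslSketch are made up for illustration.

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object DslSketch {
  // Builds a topology that reads demo-input (hypothetical topic),
  // drops null or empty values, upper-cases the rest
  // and writes the result to demo-output (hypothetical topic).
  def build(): Topology = {
    val builder = new StreamsBuilder
    builder
      .stream[String, String]("demo-input")
      .filter((_: String, value: String) => value != null && value.nonEmpty)
      .mapValues((value: String) => value.toUpperCase)
      .to("demo-output")
    builder.build()
  }
}
```

Nothing is executed at this point. The result is only a Topology (a logical description), which can be inspected with println(DslSketch.build().describe()).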
KAFKA STREAMS (2 OF 2)
1. "Groundbreaking" facts (which changed my life):
   1. When started, a topology processes one record at a time
   2. A Kafka Streams application can be run in multiple instances (e.g. as Docker containers)
   3. Increasing stream processing power is about increasing the number of threads or even instances
MAIN DEVELOPMENT ENTITIES
1. As a Kafka Streams developer you work with the two main developer-facing entities:
   1. Topology
   2. KafkaStreams
TOPOLOGY
1. Represents the stream processing logic of a Kafka Streams application
2. Directed Acyclic Graph of Processors (Stream Processing Nodes)
3. Logical representation
4. Created directly or indirectly (using Streams DSL)
5. Topology API
   Adding sources, processors, sinks, state stores
6. Can be described (and printed out to stdout)
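The "Topology API" point can be sketched as follows. This is an illustration only, assuming the Processor API as it looked in Kafka Streams 2.2.0 (org.apache.kafka.streams.processor; later releases reworked this API). UpperCaseProcessor, ProcessorApiSketch and all node and topic names are hypothetical.

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.processor.{AbstractProcessor, Processor, ProcessorSupplier}

// Hypothetical processor: forwards every record downstream with an upper-cased value.
class UpperCaseProcessor extends AbstractProcessor[String, String] {
  override def process(key: String, value: String): Unit =
    context().forward(key, value.toUpperCase)
}

object ProcessorApiSketch {
  // Wires source -> processor -> sink by node name.
  def build(): Topology = {
    val supplier = new ProcessorSupplier[String, String] {
      override def get(): Processor[String, String] = new UpperCaseProcessor
    }
    new Topology()
      .addSource("demo-source", "demo-input")                   // node name, topic
      .addProcessor("demo-upper-case", supplier, "demo-source")  // node name, supplier, parent
      .addSink("demo-sink", "demo-output", "demo-upper-case")    // node name, topic, parent
  }
}
```

Each node refers to its parents by name, which is how the directed acyclic graph is formed. Calling println(ProcessorApiSketch.build().describe()) prints the resulting graph.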
EXAMPLE: CREATING TOPOLOGY
// Creating directly
import org.apache.kafka.streams.Topology
val topology = new Topology

// Creating indirectly using Streams DSL (StreamsBuilder API)
// Scala API for Kafka Streams
import org.apache.kafka.streams.scala._
import ImplicitConversions._
import Serdes._
val builder = new StreamsBuilder
// A separate name so both snippets can live in one file
val dslTopology = builder.build
EXAMPLE: DESCRIBING TOPOLOGY
scala> println(topology.describe)
Topologies:
   Sub-topology: 0 for global store (will not generate tasks)
    Source: demo-source-processor (topics: [demo-topic])
      --> demo-processor-supplier
    Processor: demo-processor-supplier (stores: [in-memory-key-v
      --> none
      <-- demo-source-processor
KAFKASTREAMS
1. Manages execution of a topology of a Kafka Streams application
   Start, close (shut down), state
2. Consumes messages from and produces processing results to Kafka topics
3. Acceptable to create multiple KafkaStreams instances per Kafka Streams application
4. For better performance, consider creating multiple KafkaStreams instances as separate instances of the Kafka Streams application
EXAMPLE: CREATING KAFKASTREAMS
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig, Topology}
val topology: Topology = ...
val config: StreamsConfig = ...
val ks = new KafkaStreams(topology, config)
ks.start // <-- starts execution
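A slightly fuller sketch of the same lifecycle, under a few assumptions: it uses the KafkaStreams constructor that takes java.util.Properties, and the object name, method names and the values passed in (application id, broker address) are placeholders rather than anything from the talk.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig, Topology}

object KafkaStreamsLifecycleSketch {
  // The two settings a Kafka Streams application cannot start without.
  def minimalConfig(applicationId: String, bootstrapServers: String): Properties = {
    val props = new Properties
    props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, applicationId)
    props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    props
  }

  // Starts executing the topology and closes it when the JVM shuts down.
  def run(topology: Topology, props: Properties): KafkaStreams = {
    val ks = new KafkaStreams(topology, props)
    sys.addShutdownHook(ks.close())
    ks.start()
    ks
  }
}
```

For example, KafkaStreamsLifecycleSketch.run(topology, KafkaStreamsLifecycleSketch.minimalConfig("demo-app", "localhost:9092")) would start the application against a local broker, if one is running at that address.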
MAIN EXECUTION ENTITIES
1. StreamThread
2. TaskManager
3. StreamTask and StandbyTask
4. StreamsPartitionAssignor
5. RebalanceListener
STREAMTHREAD (1 OF 2)
1. Stream Processor Thread
   Runs the main record processing loop
   Thread of execution (java.lang.Thread)
2. StreamThread uses a Kafka consumer (to poll for records)
   Subscribes to source topics
   Think of Consumer Group
3. num.stream.threads configuration property (default: 1)
4. Uses Kafka Consumer "tools" for operation
   Registers StreamsPartitionAssignor
   Uses RebalanceListener to intercept changes to the partitions assigned
5. Uses TaskManager to manage processing tasks (on next slide)
STREAMTHREAD (2 OF 2)
TASKMANAGER (1 OF 2)
1. Task manager of a StreamThread
2. Manages active and standby tasks
   Only active StreamTasks process records
3. Creates processor tasks for assigned partitions
   RebalanceListener.onPartitionsAssigned
TASKMANAGER (2 OF 2)
STREAM PROCESSOR TASKS (1 OF 2)
1. Managed by TaskManager to run a topology of stream processors
2. StreamTask - the default processor task
   As many as partitions assigned
   Managed as a group as AssignedStreamsTasks
   Only active StreamTasks process records
   Use FIFO RecordQueues (per Kafka TopicPartition)
3. StandbyTask - a backup processor task
   "Ghost" tasks for active StreamTasks
   Default: 0 standby tasks
   Managed as a group as AssignedStandbyTasks
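Both the number of stream threads and the number of standby tasks are controlled through configuration. The sketch below shows one way to set them, assuming the StreamsConfig constants num.stream.threads and num.standby.replicas; the object name, method name and the values in the usage note are hypothetical.

```scala
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

object TaskTuningSketch {
  // Returns the two parallelism-related settings as Properties.
  // Defaults in Kafka Streams: 1 stream thread, 0 standby replicas.
  def parallelismSettings(threads: Int, standbyReplicas: Int): Properties = {
    require(threads >= 1, "at least one StreamThread is needed")
    require(standbyReplicas >= 0, "standby replicas cannot be negative")
    val props = new Properties
    props.setProperty(StreamsConfig.NUM_STREAM_THREADS_CONFIG, threads.toString)
    props.setProperty(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, standbyReplicas.toString)
    props
  }
}
```

TaskTuningSketch.parallelismSettings(4, 1) would ask for four StreamThreads per instance and one standby copy of each stateful task. These entries would be merged with the required application.id and bootstrap.servers settings before creating KafkaStreams.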
STREAM PROCESSOR TASKS (2 OF 2)
STREAMSPARTITIONASSIGNOR (1 OF 2)
1. Custom PartitionAssignor from the Kafka Consumer API
   Used for dynamic partition assignment and distribution across the members of a consumer group
   partition.assignment.strategy / ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG configuration property
2. Group management protocol
   group membership
   state synchronization
3. Assigns partitions dynamically across the instances
   Required application.id / StreamsConfig.APPLICATION_ID_CONFIG
STREAMSPARTITIONASSIGNOR (2 OF 2)
REBALANCELISTENER (1 OF 2)
1. Custom ConsumerRebalanceListener from the Kafka Consumer API
   Callback interface for custom actions when the set of partitions assigned to the consumer changes
   Partition re-assignment will be triggered any time the members of the consumer group change
2. Intercepts changes to the partitions assigned to a single StreamThread
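To show what the underlying callback interface looks like, here is a small sketch of a ConsumerRebalanceListener from the plain Kafka Consumer API. It is not the internal RebalanceListener of Kafka Streams; TrackingRebalanceListener is a hypothetical class that only keeps track of the current assignment.

```scala
import java.util.{Collection => JCollection}
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

// Hypothetical listener that records which partitions this consumer currently owns.
class TrackingRebalanceListener extends ConsumerRebalanceListener {
  @volatile private var current: Set[TopicPartition] = Set.empty

  def assigned: Set[TopicPartition] = current

  // Called when a rebalance takes partitions away from this consumer.
  override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit =
    current = current -- partitions.asScala

  // Called when a rebalance hands partitions to this consumer.
  override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
    current = current ++ partitions.asScala
}
```

With a plain KafkaConsumer such a listener is passed to subscribe together with the topics, e.g. consumer.subscribe(java.util.Arrays.asList("demo-input"), new TrackingRebalanceListener), where demo-input is again a made-up topic name.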
REBALANCELISTENER (2 OF 2)
RECAP
1. Kafka Streams
2. Main Development Entities
   1. Topology
   2. KafkaStreams
3. Main Execution Entities
   1. StreamThread
   2. TaskManager
   3. StreamTask and StandbyTask
   4. StreamsPartitionAssignor
   5. RebalanceListener
QUESTIONS?
Read The Internals of Kafka Streams
Read The Internals of Apache Kafka
Follow @jaceklaskowski on twitter (DMs open)
Upvote my questions and answers on StackOverflow
Contact me at jacek@japila.pl
© 2019 / Jacek Laskowski / @JacekLaskowski / jacek@japila.pl

Deep Dive Into Kafka Streams (and the Distributed Stream Processing Engine) (Jacek Laskowski, Consultant) Kafka Summit London 2019
