Kafka Streams
The easiest way to start with stream processing
Yaroslav Tkachenko
Stream processing
• Web and mobile analytics (clicks, page views, etc.)
• IoT sensors
• Metrics, logs and telemetry
• ...
Modern streams of data
Stream processing
{
  "user_id": 1234567890,
  "action": "click",
  ...
}
Canada: 12314
USA: 32495
...
• Batch processing:
• Slow, expensive, not very flexible, etc.
• Mostly for reporting
• Way to go, historically
• Stream processing:
• Balance between latency and throughput, easy to redeploy
• Near-realtime features
• Last 5-10 years
Data processing styles
Stream processing
• Validation, transformation, enrichment, deduplication, ...
• Aggregations
• Joins
• Windowing
• Integrations and Storage
Stream processing operations (“Streaming ETL”)
Stream processing
• Delivery guarantees
• Latency / Throughput
• Fault tolerance
• Backpressure
• Event-time vs. Processing-time
Stream processing challenges
Stream processing
• One-message-at-a-time OR Micro-batches
• State management (in-memory, on-disk, replicated)
Stream processing techniques
Stream processing
Kafka
Kafka Streams
Kafka Streams is a client library for building applications and microservices, where the input
and output data are stored in Kafka clusters. It combines the simplicity of writing and
deploying standard Java and Scala applications on the client side with the benefits of Kafka's
server-side cluster technology.
Kafka Streams
Stream processing
• Available since Kafka 0.10 (May 2016)
• Java/Scala support
• Heavily relies on underlying Kafka cluster
• Need to integrate with external persistent systems? Use Kafka Connect
Diagram: a Kafka Streams App consuming Topic[s] A from the Kafka Cluster and producing to Topic B
Kafka Streams
KStreamBuilder builder = new KStreamBuilder();
KStream<byte[], String> textLines = builder.stream("TextLinesTopic");
textLines
.mapValues((textLine) -> textLine.toUpperCase())
.to("UppercasedTextLinesTopic");
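To actually run this topology, it needs a small driver; a minimal sketch, with placeholder application id and broker address:
Properties settings = new Properties();
settings.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");    // placeholder id
settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder broker
KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(settings));
streams.start();
// close cleanly on shutdown
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));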
Kafka Streams
Kafka Streams
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> textLines = builder.stream("streams-plaintext-input");
KTable<String, Long> wordCounts = textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, value) -> value)
.count();
wordCounts.toStream().to("streams-wordcount-output");
Kafka Streams
Syntax
Kafka Streams
Streams DSL
• Declarative
• Functional
• Implicit state store management
• Stateless or stateful
Low-level Processor API
• Imperative
• Explicit state store management
• Usually stateful
Kafka Streams doesn’t require YARN, Mesos, Zookeeper, HDFS, etc.
Just a Kafka cluster*
*That you probably already have
Kafka Streams
Kafka Streams
kafka.brokers = "broker1:9092,broker2:9092,broker3:9092,..."
kafka.topics = [
  {from: "topic-a", to: "topic-b"},
  {from: "topic-c", to: "topic-d"},
  ...
]
streams {
  threads = 8
  replication-factor = 3
  producer {
    acks = all
  }
}
consumer.auto.offset.reset = latest
Config example (HOCON)
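A sketch of how such a config might be wired into Kafka Streams properties, assuming the Typesafe Config library (the key names mirror the HOCON example above):
Config config = ConfigFactory.load();
Properties props = new Properties();
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getString("kafka.brokers"));
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, config.getInt("streams.threads"));
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, config.getInt("streams.replication-factor"));
// prefixed keys are passed through to the embedded producer/consumer clients
props.put(StreamsConfig.PRODUCER_PREFIX + "acks", config.getString("streams.producer.acks"));
props.put(StreamsConfig.CONSUMER_PREFIX + "auto.offset.reset", config.getString("consumer.auto.offset.reset"));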
Every Kafka Streams application must provide SerDes (Serializer/Deserializer) for
the data types of record keys and record values (e.g. java.lang.String or Avro
objects) to materialize the data when necessary.
You can provide SerDes by using either of these methods:
• By setting default SerDes via a StreamsConfig instance.
• By specifying explicit SerDes when calling the appropriate API methods, thus
overriding the defaults.
Serializers/Deserializers
Kafka Streams
Serializers/Deserializers
Kafka Streams
Properties settings = new Properties();
settings.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.Long().getClass().getName());
StreamsConfig config = new StreamsConfig(settings);
Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();
KStream<String, Long> userCountByRegion = ...;
userCountByRegion.to("RegionCountsTopic", Produced.with(stringSerde, longSerde));
• Used for any stateful operation, implicitly or explicitly
• Backed by local RocksDB databases AND a replicated changelog topic in Kafka
• Application’s entire state is spread across the local state stores (following the
same partitioning rules)
• Can be queried with standard API (even remotely)
• Support for key-value, window and custom stores
State stores
Kafka Streams
Kafka Streams
// writing
KStreamBuilder builder = ...;
KStream<String, String> textLines = ...;
textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, word) -> word, Serialized.with(stringSerde, stringSerde))
.count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("CountsKeyValueStore"));
KafkaStreams streams = new KafkaStreams(builder, getSettings());
streams.start();
// reading
ReadOnlyKeyValueStore<String, Long> keyValueStore = streams.store("CountsKeyValueStore",
QueryableStoreTypes.keyValueStore());
System.out.println("count for hello:" + keyValueStore.get("hello"));
KeyValueIterator<String, Long> range = keyValueStore.all();
while (range.hasNext()) {
KeyValue<String, Long> next = range.next();
System.out.println("count for " + next.key + ": " + next.value);
}
• Aggregate
• Reduce
• Count
All support windows (a sketch follows below).
Aggregations
Kafka Streams
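A sketch of reduce and aggregate on a grouped stream; the value types, store serdes and names here are illustrative:
KGroupedStream<String, Long> grouped = ...;
// reduce: combines two values of the same type
KTable<String, Long> summed = grouped.reduce((aggValue, newValue) -> aggValue + newValue);
// aggregate: the result type may differ from the input type
KTable<String, String> concatenated = grouped.aggregate(
    () -> "",                                // initializer
    (key, value, agg) -> agg + "," + value,  // aggregator
    Materialized.with(Serdes.String(), Serdes.String()));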
Windowing
Kafka Streams
Window name          | Behavior      | Short description
Tumbling time window | Time-based    | Fixed-size, non-overlapping, gap-less windows
Hopping time window  | Time-based    | Fixed-size, overlapping windows
Sliding time window  | Time-based    | Fixed-size, overlapping windows that work on differences between record timestamps
Session window       | Session-based | Dynamically-sized, non-overlapping, data-driven windows
Windowing
Kafka Streams
KStream<String, GenericRecord> pageViews = ...;
KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
.groupByKey()
.windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)))
.count();
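Switching the same count to a hopping window only changes the windowedBy() call; a sketch assuming 5-minute windows advancing every minute:
KTable<Windowed<String>, Long> hoppingPageViewCounts = pageViews
.groupByKey()
.windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5))
    .advanceBy(TimeUnit.MINUTES.toMillis(1)))  // hop every minute -> overlapping windows
.count();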
• Event-time: “user”-defined time, generated by the application
that uses Producer API
• Processing-time: Time when the record is being consumed
(pretty much anytime)
• Ingestion-time: generated by the Kafka brokers when the record is
appended to the topic, embedded in the message
Timestamp extractors can be used to achieve event-time semantics (see the sketch below).
Time
Kafka Streams
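A minimal sketch of such a timestamp extractor; MyEvent and its getTimestamp() accessor are hypothetical:
public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        MyEvent event = (MyEvent) record.value();  // hypothetical event type
        long timestamp = event.getTimestamp();     // hypothetical embedded event-time field
        // fall back to the previously extracted timestamp if the event carries none
        return timestamp > 0 ? timestamp : previousTimestamp;
    }
}
// register as the default extractor (default.timestamp.extractor)
settings.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimeExtractor.class.getName());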
Joins
Kafka Streams
Join operands           | Type         | (INNER) JOIN  | LEFT JOIN     | OUTER JOIN
KStream-to-KStream      | Windowed     | Supported     | Supported     | Supported
KTable-to-KTable        | Non-windowed | Supported     | Supported     | Supported
KStream-to-KTable       | Non-windowed | Supported     | Supported     | Not Supported
KStream-to-GlobalKTable | Non-windowed | Supported     | Supported     | Not Supported
KTable-to-GlobalKTable  | N/A          | Not Supported | Not Supported | Not Supported
Joins
Kafka Streams
KStream<String, Long> left = ...;
KTable<String, Double> right = ...;
KStream<String, String> joined = left.join(right,
(leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue
);
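A KStream-to-KStream join, in contrast, must be windowed (see the table above); a sketch assuming a 5-minute join window:
KStream<String, Long> leftStream = ...;
KStream<String, Double> rightStream = ...;
KStream<String, String> joinedStream = leftStream.join(rightStream,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue,
    JoinWindows.of(TimeUnit.MINUTES.toMillis(5))  // records must fall within 5 minutes of each other
);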
Example
Kafka Streams
builder.stream("play-events", Consumed.with(Serdes.String(), playEventSerde))
// group by key so we can count by session windows
.groupByKey(Serialized.with(Serdes.String(), playEventSerde))
// window by session
.windowedBy(SessionWindows.with(TimeUnit.MINUTES.toMillis(30)))
// count play events per session
.count(Materialized.<String, Long, SessionStore<Bytes, byte[]>>as("PlayEventsPerSession")
.withKeySerde(Serdes.String())
.withValueSerde(Serdes.Long()))
// convert to a stream so we can map the key to a string
.toStream()
// map key to a readable string
.map((key, value) -> new KeyValue<>(key.key() + "@" + key.window().start() + "->" +
key.window().end(), value))
// write to play-events-per-session topic
.to("play-events-per-session", Produced.with(Serdes.String(), Serdes.Long()));
Example
Kafka Streams
jo@1484823406597->1484823406597 = 1
bill@1484823466597->1484823466597 = 1
sarah@1484823526597->1484823526597 = 1
jo@1484825207597->1484825207597 = 1
bill@1484823466597->1484825206597 = 2
sarah@1484827006597->1484827006597 = 1
jo@1484823406597->1484825207597 = 3
bill@1484828806597->1484828806597 = 1
sarah@1484827006597->1484827186597 = 2
...
Summary
• Rich Streams DSL provides a very expressive language; the low-level
Processor API gives a lot of flexibility
• Seamless Kafka integration including exactly-once semantics
• No external dependencies
• Great fault-tolerance, “Cloud-ready” features, very easy to scale
• Stream/Table duality just makes sense!
• Built-in backpressure using Kafka Consumer API
• Easy to monitor with tons of metrics exposed over JMX
Pros
Kafka Streams
• Still a very young framework (make sure to use 0.11+)
• Only supports Kafka topics as sources and sinks (add external
systems using Kafka Connect)
• Only supports one Kafka cluster
• No true Batch API
• No ML functionality (but can easily integrate any JVM library)
• KSQL is in development preview
• Streams DSL can create A LOT of internal topics
• No Async IO support
• Scalability is limited (parallelism is capped by the total number of
partitions across the input topics)
Cons
Kafka Streams
No, if you’re happy with your
Spark/Flink environments
Maybe, if you just need to read & write to Kafka
Yes, if you’re comfortable with
JVM and want to start using
stream processing
So, should I start
using Kafka
Streams now?
Kafka Streams
Questions?
@sap1ens
• https://docs.confluent.io/current/streams/index.html
• https://kafka.apache.org/documentation/streams/
• https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
• https://www.confluent.io/blog/enabling-exactly-kafka-streams/
• https://github.com/confluentinc/kafka-streams-examples
• http://mkuthan.github.io/blog/2017/11/02/kafka-streams-dsl-vs-processor-api/
• https://sap1ens.com/blog/2018/01/03/message-enrichment-with-kafka-streams/
Resources