BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Spark (Structured) Streaming vs.
Kafka Streams
Two stream processing platforms compared
Guido Schmutz
3.12.2018
@gschmutz guidoschmutz.wordpress.com
Agenda
1. Introducing Stream Processing
2. Spark Streaming vs. Kafka Streams – Overview
3. Spark Structured Streaming vs. Kafka Streams – in Action
4. Summary
Guido Schmutz
Working at Trivadis for more than 22 years
Oracle Groundbreaker Ambassador & Oracle ACE Director
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
135th edition
Introducing Stream Processing
“Data at Rest” vs. “Data in Motion”
[Figure: two panels, Data at Rest and Data in Motion, each built from the steps Store, Act and Analyze]
Architectures of Big Data Applications
Reference Architecture for Modern Data Analytics
[Figure: architecture diagram. Labelled components: Bulk Source (DB, Extract, File), Event Source (Location, IoT Data, Mobile Apps, Social, Telemetry), Edge Node with Rules, Event Hub, Event Stream, Data Flow, Change Data Capture, File Import / SQL Import, Big Data (Hadoop Cluster) with Raw and Refined Storage and Parallel Processing, SQL Export, Stream Processor (State, API), Stream Analytics, Results DB, Microservices (State, API), Enterprise Apps (Logic, API), Search Service, Search / Explore, Enterprise Data Warehouse, BI Tools]
Two Types of Stream Processing
(from Gartner)
Stream Data Integration
• Primarily cover streaming ETL
• Integration of data source and data sinks
• Filter and transform data
• (Enrich data)
• Route data
Stream Analytics
• calculating aggregates & detecting patterns
to generate higher-level, more relevant
summary information (complex events =>
used to be CEP)
• Complex events may signify threats or
opportunities that require a response
Stream Processing & Analytics Ecosystem
Stream Analytics
Event Hub
Open Source Closed Source
Stream Data Integration
Source: adapted from Tibco
Edge
Introduction to Stream Processing
Example Use Case
[Figure: pipeline diagram with the elements Truck-1, Truck-2, Truck-3, truck_position, detect_dangerous_driving, Truck Driver, jdbc-source, join_dangerous_driving_driver, dangerous_driving_driver, Count By Event Type with Window (1m, 30s), count_by_event_type]
Spark Streaming vs. Kafka Streams
- Overview
Spark (Structured) Streaming
Spark Streaming
• 1st generation
• one of the first APIs to enable stream
processing using high-level functional
operators like map and reduce
• Like the RDD API, the DStreams API is based
on relatively low-level operations on
Java/Python objects
• Used by many organizations in production
Spark Structured Streaming
• 2nd generation
• Structured API through DataFrames /
Datasets rather than RDDs
• Easier code reuse between batch and
streaming
• marked production ready in Spark 2.2.0
• Support for Java, Scala, Python, R and SQL
• Focus of this talk
Apache Spark Streaming as part of Spark Stack
Spark (Structured) Streaming
Resilient Distributed Dataset (RDD)
Spark
Standalone
MESOS /
Kubernetes
YARN HDFS S3
RDBMS &
NoSQL
Kafka
Libraries
Low Level API
Cluster Resource Managers Data Sources / Data Sinks
Advanced Analytics Libraries & Ecosystem
Data Frame
Structured API
Datasets SQL
Distributed Variables
Kafka Streams – part of Kafka Core
• Designed as a simple and lightweight
library in Apache Kafka
• no external dependencies on systems
other than Apache Kafka
• Part of open source Apache Kafka,
introduced in 0.10+
• Leverages Kafka as its internal
messaging layer
• Support for Java and SQL (KSQL)
Spark Structured Streaming vs.
Kafka Streams – in Action
Infrastructure
Spark Structured Streaming:
• Runs as part of a full Spark stack
• Cluster can be either Spark
Standalone, YARN-based or
container-based
• Many cloud options
Kafka Streams:
• Just a Java library
• Runs anywhere Java runs: Web
Container, Java Application, Container-based …
Main Abstractions
Dataset/Data Frame API
• DataFrames and Datasets can represent
static, bounded data, as well as streaming,
unbounded data
• Use readStream() instead of read()
Transformation & Actions
• Almost all transformations from Spark
bounded data processing (Batch) are also
usable for streaming
Input Sources and Sinks
Triggers
• triggers define when data is output
• As soon as last group is finished
• Fixed interval between micro-batches
• One-time micro-batch
Output Mode
• Define how data is output
• Append – only add new records to
output
• Update – update changed records in
place
• Complete – rewrite full output
Main Abstractions
Topology
// schema of the incoming records
val schema = new StructType()
  .add(...)
// Input (I)
val inputDf = spark
  .readStream
  .format(...)
  .option(...)
  .load()
// Filter (F)
val filteredDf = inputDf.where(...)
// Output (O)
val query = filteredDf
  .writeStream
  .format(...)
  .option(...)
  .start()
Main Abstractions
Stream Processing Application
• program that uses Kafka Streams library
Topology
• logic that needs to be performed by stream
processing
• functional DSL or low-level Processor API
Stream Processor
• a node in the processor topology
KStream
• Abstraction of a stream of records
• Interpreted as events
KTable
• Abstraction of a change log stream
• Interpreted as update of same record (by
key)
GlobalKTable
• Like KTable, but not partitioned => all data
is available on all parallel application
instances
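The difference between the KStream and KTable readings can be shown without Kafka at all. The following is a minimal plain-Java sketch, not the Kafka Streams API: the class `StreamTableDuality`, the record type `Rec` and the sample keys are hypothetical. It feeds the same three records through both interpretations: as independent events (values add up) and as a changelog (the latest value per key wins).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of record-stream vs. changelog semantics (no Kafka involved). */
class StreamTableDuality {

    /** A keyed record; class and field names are hypothetical. */
    static final class Rec {
        final String key;
        final int value;

        Rec(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** KStream-like reading: every record is an independent event, so values per key add up. */
    static Map<String, Integer> sumAsEvents(List<Rec> records) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Rec r : records) {
            sums.merge(r.key, r.value, Integer::sum);
        }
        return sums;
    }

    /** KTable-like reading: a record replaces the previous value stored for its key. */
    static Map<String, Integer> latestAsUpdates(List<Rec> records) {
        Map<String, Integer> latest = new LinkedHashMap<>();
        for (Rec r : records) {
            latest.put(r.key, r.value);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            new Rec("alice", 1), new Rec("bob", 5), new Rec("alice", 3));
        System.out.println(sumAsEvents(records));     // {alice=4, bob=5}
        System.out.println(latestAsUpdates(records)); // {alice=3, bob=5}
    }
}
```

The input is identical in both cases; only the interpretation changes, which is why the same topic can be read either as a KStream or as a KTable.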
Main Abstractions
Topology
public static void main(String[] args) {
  Properties streamsConfiguration = new Properties();
  streamsConfiguration.put(...);
  final StreamsBuilder builder = new StreamsBuilder();
  // Input (I)
  KStream<..,..> stream = builder.stream(...);
  // Filter (F)
  KStream<..,..> filtered = stream.filter(...);
  // Output (O)
  filtered.to(...);
  KafkaStreams streams = new KafkaStreams(
    builder.build(), streamsConfiguration);
  streams.start();
}
Streaming Data Sources
• File Source
• Reads files as a stream of data
• Supports text, csv, json, orc, parquet
• Files must be atomically placed
• Kafka Source
• Reads from Kafka Topic
• Supports Kafka broker 0.10.0 or higher
• Socket Source (for testing)
• Reads UTF8 text from socket
connection
• Rate Source (for testing)
• Generate data at specified number of
rows per second
val rawDf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("subscribe", "truck_position")
.load()
Streaming Data Sources
"Kafka only"
KStream from Topic
KTable from Topic
Use Kafka Connect for reading
other data sources into Kafka
first
KStream<String, TruckPosition> positions =
builder.stream("truck_position"
, Consumed.with(Serdes.String()
, truckPositionSerde));
KTable<String, Driver> driver =
builder.table("trucking_driver"
, Consumed.with(Serdes.String()
, driverSerde)
, Materialized.as("driver-store"));
Streaming Sinks
• File Sink – stores output to a directory
• Kafka Sink – publishes to Kafka
• Foreach Sink - Runs arbitrary computation on the records in the output
• Console Sink – for debugging, prints output to console
• Memory Sink – for debugging, stores output in-memory table
val query = jsonTruckPlusDriverDf
.selectExpr("to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("topic", "dangerous_driving")
.option("checkpointLocation", "/tmp")
.start()
Streaming Sinks
"Kafka only"
Use Kafka Connect for
writing out to other targets
KStream<String, TruckPositionDriver> posDriver = ..
posDriver.to("dangerous_driving"
, Produced.with(Serdes.String()
, truckPositionDriverSerde));
For testing only:
// print to system output
posDriver.print(Printed.toSysOut());
// shortcut for
posDriver.foreach((key,value) ->
System.out.println(key + "=" + value));
Processing Model: Event-at-a-time vs. Micro Batch
Micro-Batch Processing
• Splits incoming stream in
small batches
• Higher latency
• Fault tolerance easier
Event-at-a-time Processing
• Events processed as they arrive
• low-latency
• fault tolerance expensive
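The latency difference between the two models follows from simple arithmetic. The sketch below is a back-of-the-envelope model, not code from Spark or Kafka Streams: the class `ProcessingModelLatency`, the 1-second batch interval and the arrival times are assumptions, and processing cost is ignored. An event-at-a-time engine can emit on arrival, while a micro-batch engine can emit only once the batch interval containing the event has closed.

```java
/** Back-of-the-envelope model of when a result can be emitted under each processing model. */
class ProcessingModelLatency {

    /** Event-at-a-time: the event is handled on arrival (processing cost is ignored here). */
    static long eventAtATimeEmitMs(long arrivalMs) {
        return arrivalMs;
    }

    /** Micro-batch: the event waits until the batch interval it arrived in has closed. */
    static long microBatchEmitMs(long arrivalMs, long batchIntervalMs) {
        return (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
    }

    public static void main(String[] args) {
        long batchIntervalMs = 1_000L; // hypothetical trigger interval
        long[] arrivalsMs = {100L, 950L, 1_000L, 2_300L};
        for (long arrival : arrivalsMs) {
            long waitMs = microBatchEmitMs(arrival, batchIntervalMs) - eventAtATimeEmitMs(arrival);
            System.out.println("arrival=" + arrival + "ms, extra wait in micro-batch=" + waitMs + "ms");
        }
    }
}
```

In this model the extra wait is bounded by the batch interval, so the interval length is the main knob trading latency against per-batch overhead.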
Stateless Operations – Selection & Projection
Spark Structured Streaming:
Most common operations on
DataFrame/Dataset are supported for
streaming as well
select, filter, map, flatMap, …
val filteredDf =
truckPosDf.where(
"eventType != 'Normal'")
Kafka Streams:
KStream and KTable interfaces support
a variety of transformation operations
filter, filterNot, map, mapValues,
flatMap, flatMapValues, branch,
selectKey, groupByKey …
KStream<String, TruckPosition> filtered =
positions.filter((key,value) ->
!value.eventType.equals("Normal")
);
Stateful Operations – Aggregations
Spark Structured Streaming:
State is held in distributed memory with option to
spill to disk (fault tolerant through
checkpointing to Hadoop-like FS)
Output modes: Complete, Append,
Update
count, sum, mapGroupsWithState,
flatMapGroupsWithState, reduce ...
val c = source
.withWatermark("timestamp"
, "10 minutes")
.groupBy()
.count()
Kafka Streams:
Requires a state store, which can be in-memory,
RocksDB or a custom implementation (fault
tolerant through Kafka topics)
Result of an aggregation is a KTable
count, sum, avg, reduce, aggregate
...
KTable<..> c = stream
.groupByKey(..)
.count(...);
Stateful Operations – Time Abstraction
[Figure: timeline of events 1 to 5 against a clock, showing Event Time, Ingestion Time and Processing Time; adapted from Matthias Niehoff (Codecentric)]
Stateful Operations – Time Abstraction
Spark Structured Streaming:
Event Time
• New with Spark Structured Streaming
• Extracted from the message (payload)
Ingestion Time
• for sources which capture ingestion time
Processing Time
• “Old” Spark Streaming only supported
processing time
• generate the timestamp upon processing
df.withColumn("processingTime"
,current_timestamp())
.option("includeTimestamp", true)
Kafka Streams:
Event Time
• Point in time when event occurred
• Extracted from the message (payload or
header)
Ingestion Time
• Point in time when event is stored in Kafka
(sent in message header)
Processing Time
• Point in time when event happens to be
processed by stream processing
application
Stateful Operations - Windowing
streams are unbounded
need some meaningful time frames to do
computations (i.e. aggregations)
Computations over events done using
windows of data
Windows are tracked per unique key
[Figure: Fixed Window, Sliding Window and Session Window, each drawn as windows of data over a stream of data along the time axis]
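How an event ends up in one or several windows can be sketched in a few lines. This is an illustrative plain-Java sketch, not the implementation used by Spark or Kafka Streams: the class and method names are hypothetical, and it assumes windows aligned to the epoch with an inclusive start and an exclusive end. With size equal to advance it yields fixed (tumbling) windows; with a smaller advance it yields overlapping (hopping) windows such as the Window (1m, 30s) of the example use case.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes which epoch-aligned windows [start, start + size) contain a given timestamp. */
class HoppingWindowAssignment {

    /** Returns the start times (ms) of all windows that contain timestampMs. */
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest window that can still contain the timestamp, rounded down to the advance grid.
        long first = Math.max(0L, timestampMs - sizeMs + advanceMs) / advanceMs * advanceMs;
        for (long start = first; start <= timestampMs; start += advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hopping window: size 1 minute, advance 30 seconds.
        System.out.println(windowStartsFor(65_000L, 60_000L, 30_000L)); // [30000, 60000]
        // Fixed (tumbling) window: size and advance both 1 minute.
        System.out.println(windowStartsFor(65_000L, 60_000L, 60_000L)); // [60000]
    }
}
```

An event at 65 s therefore contributes to two overlapping counts in the hopping case, but to exactly one count in the fixed case.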
Stateful Operations - Windowing
Support for Tumbling & Hopping
(Sliding) Time Windows
Handling Late Data with
Watermarking
val c = source
.withWatermark("timestamp"
, "10 minutes")
.groupBy(window($"timestamp"
, "1 minute"
, "30 seconds")
, $"word")
.count()
Data older than watermark
not expected / get discarded
[Figure: event time plotted against processing time, showing max event time, the watermark and a trailing gap of 10 mins; time labels 12:10, 12:20, 12:25]
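The watermark rule from the figure can be written down directly: the watermark trails the maximum event time seen so far by a fixed gap, and data older than the watermark is discarded. The following plain-Java sketch is a simplification and not Spark's implementation. In particular it advances the watermark per event, whereas a micro-batch engine would typically advance it between batches; the class `WatermarkTracker` and the sample times are hypothetical.

```java
/** Tracks a watermark that trails the maximum event time seen so far by a fixed delay. */
class WatermarkTracker {

    private final long delayMs;
    private long maxEventTimeMs = Long.MIN_VALUE; // no event seen yet

    WatermarkTracker(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Current watermark, or Long.MIN_VALUE while no event has been seen. */
    long watermarkMs() {
        return maxEventTimeMs == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTimeMs - delayMs;
    }

    /** Returns true if the event is accepted, false if it is older than the watermark. */
    boolean accept(long eventTimeMs) {
        if (eventTimeMs < watermarkMs()) {
            return false; // too late: discarded
        }
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return true;
    }

    /** Helper: milliseconds since midnight for a wall-clock time. */
    static long at(int hour, int minute) {
        return (hour * 60L + minute) * 60_000L;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(10 * 60_000L); // trailing gap of 10 minutes
        System.out.println(tracker.accept(at(12, 20))); // true, watermark is now 12:10
        System.out.println(tracker.accept(at(12, 15))); // true, late but within the gap
        System.out.println(tracker.accept(at(12, 5)));  // false, older than the watermark
    }
}
```

The gap is a trade-off: a larger value tolerates more out-of-order data but keeps aggregation state around for longer.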
Stateful Operations - Windowing
Support for Tumbling & Hopping Windows
Support for Session Windows
Handling Late Data with Data
Retention (optional)
KTable<..> c = stream
.groupByKey(...)
.windowedBy(
SessionWindows
.with(5 * 60 * 1000)
).count();
KTable<..> c = stream
.groupByKey(..)
.windowedBy(
TimeWindows.of(60 * 1000)
.advanceBy(30 * 1000)
.until(10 * 60 * 1000)
).count(...);
Data older than watermark
not expected / get discarded
[Figure: event time plotted against processing time, showing max event time, Data Retention and a trailing gap of 10 mins; time labels 12:10, 12:20, 12:25]
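Session windows differ from time windows in that their boundaries come from the data: a session grows as long as events keep arriving within the inactivity gap. A minimal plain-Java sketch of that grouping for a single key follows. It is not the Kafka Streams implementation: the class name is hypothetical, the input is assumed to be sorted by timestamp (a real engine also merges sessions when out-of-order events arrive), and treating a distance exactly equal to the gap as "same session" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups the timestamps of one key into sessions separated by an inactivity gap. */
class SessionWindowSketch {

    /** Returns sessions as {startMs, endMs} pairs; input must be sorted ascending. */
    static List<long[]> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        long[] current = null;
        for (long ts : sortedTimestampsMs) {
            if (current != null && ts - current[1] <= gapMs) {
                current[1] = ts; // still active: extend the running session
            } else {
                current = new long[] {ts, ts}; // gap exceeded (or first event): open a new session
                result.add(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long gapMs = 5 * 60 * 1000L; // 5-minute inactivity gap, as in the slide's SessionWindows example
        long[] timestamps = {0L, 120_000L, 240_000L, 900_000L, 960_000L};
        for (long[] s : sessions(timestamps, gapMs)) {
            System.out.println("session from " + s[0] + " to " + s[1]);
        }
    }
}
```

With these sample timestamps the first three events form one session and the last two another, because 11 minutes of inactivity separate them.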
Stateful Operations - Joins
Challenges of joining streams
1. Data streams need to be aligned as they
come because they have different timestamps
2. since streams are never-ending, the joins
must be limited; otherwise join will never end
3. join needs to produce results continuously as
there is no end to the data
Stream to Static (Table) Join
Stream to Stream Join (one window join)
Stream to Stream Join (two window join)
[Figure: three timelines illustrating a Stream-to-Static Join and two variants of a Stream-to-Stream Join]
Stateful Operations - Joins
Stream-to-Static and Stream-to-Stream
(since 2.3) Joins on Dataset/DataFrame
Watermarking helps Spark to know for
how long to retain data
• Optional for Inner Joins
• Mandatory for Outer Joins
val jsonTruckPlusDriverDf =
jsonFilteredDf.join(driverDf
, Seq("driverId")
, "left")
Source: Spark Documentation
Supports following joins
• KStream-to-KStream
• KTable-to-KTable
• KStream-to-KTable
• KStream-to-GlobalKTable
• KTable-to-GlobalKTable
Stateful Operations - Joins
KStream<String, TruckPositionDriver> joined =
filteredRekeyed.leftJoin(driver
, (left,right) -> new TruckPositionDriver(left
, StringUtils.defaultIfEmpty(right.first_name,"")
, StringUtils.defaultIfEmpty(right.last_name,""))
, Joined.with(Serdes.String()
, truckPositionSerde
, driverSerde));
Source: Confluent Documentation
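The semantics of a stream-to-table left join can be sketched in plain Java: every stream record produces exactly one output, and the table side may be missing. This is an illustration only, not the Kafka Streams or Spark API. The class `StreamTableLeftJoinSketch`, the CSV-style output and the sample driver and event values are hypothetical. The point it makes is that the table-side value can be null in a left join, so the joiner has to guard against it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Left join of a stream of {driverId, eventType} events against a driver lookup table. */
class StreamTableLeftJoinSketch {

    /** Emits one joined line per event; the driver name is empty when the table has no match. */
    static List<String> leftJoin(List<String[]> events, Map<String, String> driversById) {
        List<String> joined = new ArrayList<>();
        for (String[] event : events) {
            String name = driversById.get(event[0]); // null if the table has no row for this key
            joined.add(event[0] + "," + event[1] + "," + (name == null ? "" : name));
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> drivers = new HashMap<>();
        drivers.put("11", "Jane Doe"); // hypothetical driver
        List<String[]> events = Arrays.asList(
            new String[] {"11", "Overspeed"},  // has a matching driver
            new String[] {"99", "Overspeed"}); // no matching driver
        System.out.println(leftJoin(events, drivers)); // [11,Overspeed,Jane Doe, 99,Overspeed,]
    }
}
```

An inner join would simply skip the second event instead of emitting it with an empty name.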
There is more ….
Spark Structured Streaming:
• Streaming Deduplication
• Run-Once Trigger / fixed Interval
Micro-Batching
• Continuous Trigger with fixed
checkpoint interval (experimental in
2.3)
• Streaming Machine Learning
• REPL
Kafka Streams:
• Queryable State
• Processor API
• Exactly Once Processing
• Microservices with Kafka Streams
• Automatic Scale-up / Scale-Down
• Stand-by replica of local state
• Streaming SQL
There is more … Streaming SQL with KSQL
• Enables stream processing with
zero coding required
• The simplest way to process
(structured) streams of data in real-
time
• Powered by Kafka Streams
• KSQL server with REST API
• Spark SQL also offers SQL on
streaming data, but not as a “first-
class citizen”
ksql> CREATE STREAM truck_position_s 
(timestamp BIGINT, 
truckId BIGINT, 
driverId BIGINT, 
routeId BIGINT, 
eventType VARCHAR, 
latitude DOUBLE, 
longitude DOUBLE) 
WITH (kafka_topic='truck_position', 
value_format='JSON');
ksql> SELECT * FROM truck_position_s;
1506922133306 | "truck/13/position0 | 2017-10-02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | -2084263951914664106
ksql> SELECT * FROM truck_position_s
WHERE eventType != 'Normal';
Summary
Spark Structured Streaming vs. Kafka Streams
Spark Structured Streaming:
• Runs on top of a Spark cluster
• Reuse your investments into Spark
(knowledge and maybe code)
• An HDFS-like file system needs to be
available
• Higher latency due to micro-batching
• Multi-language support: Java, Python,
Scala, R
• Supports ad-hoc, notebook-style
development/environment
Kafka Streams:
• Available as a Java library
• Can be the implementation choice of a
microservice
• Can only work with Kafka for both input and
output
• Low latency due to continuous processing
• Currently only supports Java; Scala support
available soon
• KSQL abstraction provides SQL on top of
Kafka Streams
Comparison
Feature | Kafka Streams | Spark Streaming | Spark Structured Streaming
--------|---------------|-----------------|---------------------------
Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL
Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching
Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset
Programming Model | Declarative/Imperative | Declarative | Declarative
Time Support | Event / Ingestion / Processing | Processing | Event / Ingestion / Processing
State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk
Time Window Support | Fixed, Sliding, Session | Fixed, Sliding | Fixed, Sliding
Join | Stream-Static, Stream-Stream | Stream-Static | Stream-Static, Stream-Stream (2.3)
Event Pattern detection | No | No | No
Query Language Support | KSQL | No | Spark SQL (limited)
Queryable State | Interactive Queries | No | No
Scalability & Reliability | Yes | Yes | Yes
Guarantees | At Least Once/Exactly Once | At Least Once/Exactly Once (partial) | At Least Once/Exactly Once (partial)
Latency | Sub-second | seconds | seconds
Deployment | Java Library | Cluster (with HDFS like FS) | Cluster (with HDFS like FS)
Technology on its own won't help you.
You need to know how to use it properly.

Spark (Structured) Streaming vs. Kafka Streams

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Spark (Structured) Streaming vs. Kafka Streams Two stream processing platforms compared Guido Schmutz 3.12.2018 @gschmutz guidoschmutz.wordpress.com
  • 2. Agenda 1. Introducing Stream Processing 2. Spark Streaming vs. Kafka Streams – Overview 3. Spark Structured Streaming vs. Kafka Streams – in Action 4. Summary
  • 3. Guido Schmutz Working at Trivadis for more than 22 years Oracle Groundbreaker Ambassador & Oracle ACE Director Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: [email protected] Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz 135th edition
  • 5. “Data at Rest” vs. “Data in Motion” Data at Rest Data in Motion Store Act Analyze StoreAct Analyze 11101 01010 10110 11101 01010 10110 Architekturen von Big Data Anwendungen
  • 6. Hadoop Clusterd Hadoop Cluster Big Data Reference Architecture for Modern Data Analytics Service BI Tools Enterprise Data Warehouse Search / Explore File Import / SQL Import Event Hub D ata Flow D ata Flow Change DataCapture Parallel Processing Storage Storage RawRefined SQL Export Microservice State { } API Event Stream Event Stream Search Service Microservices Enterprise Apps Logic { } API Edge Node Rules Event Hub Storage Bulk Source Event Source Location DB Extract File IoT Data Mobile Apps Social Event Stream Telemetry Stream Processor State { } API Stream Analytics Results DB
  • 7. Two Types of Stream Processing (from Gartner) Stream Data Integration • Primarily cover streaming ETL • Integration of data source and data sinks • Filter and transform data • (Enrich data) • Route data Stream Analytics • calculating aggregates & detecting patterns to generate higher-level, more relevant summary information (complex events => used to be CEP) • Complex events may signify threats or opportunities that require a response
  • 8. Stream Processing & Analytics Ecosystem Stream Analytics Event Hub Open Source Closed Source Stream Data Integration Source: adapted from Tibco Edge Introduction to Stream Processing
  • 9. Stream Processing & Analytics Ecosystem Stream Analytics Event Hub Open Source Closed Source Stream Data Integration Source: adapted from Tibco Edge Introduction to Stream Processing
  • 11. Spark Streaming vs. Kafka Streams - Overview
  • 12. Spark (Structured) Streaming Spark Streaming • 1st generation • one of the first APIs to enable stream processing using high-level functional operators like map and reduce • Like RDD API the DStreams API is based on relatively low-level operations on Java/Python objects • Used by many organizations in production Spark Structured Streaming • 2nd generation • Structured API through DataFrames / Datasets rather than RDDs • Easier code reuse between batch and streaming • marked production ready in Spark 2.2.0 • Support for Java, Scala, Python, R and SQL • Focus of this talk
  • 13. Apache Spark Streaming as part of Spark Stack Spark (Structured) Streaming Resilient Distributed Dataset (RDD) Spark Standalone MESOS / Kubernetes YARN HDFS S3 RDBMS & NoSQL Kafka Libraries Low Level API Cluster Resource Managers Data Sources / Data Sinks Advanced Analytics Libraries & Ecosystem Data Frame Structured API Datasets SQL Distributed Variables
  • 14. Kafka Streams – part of Kafka Core • Designed as a simple and lightweight library in Apache Kafka • no external dependencies on systems other than Apache Kafka • Part of open source Apache Kafka, introduced in 0.10+ • Leverages Kafka as its internal messaging layer • Support for Java and SQL (KSQL)
  • 15. Spark Structured Streaming vs. Kafka Streams – in Action
  • 16. Infrastructure • Runs as part of a full Spark stack • Cluster can be either Spark Standalone, YARN-based or container-based • Many cloud options • Just a Java library • Runs anyware Java runs: Web Container, Java Application, Container- based …
  • 17. Main Abstractions Dataset/Data Frame API • DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data • Use readStream() instead of read() Transformation & Actions • Almost all transformations from Spark bounded data processing (Batch) are also usable for streaming Input Sources and Sinks Triggers • triggers define when data is output • As soon as last group is finished • Fixed interval between micro-batches • One-time micro-batch Output Mode • Define how data is output • Append – only add new records to output • Update – update changed records in place • Complete – rewrite full output
  • 18. Main Abstractions Topologyval schema = new StructType() .add(...) val inputDf = spark .readStream .format(...) .option(...) .load() val filteredDf = inputDf.where(...) val query = filteredDf .writeStream .format(...) .option(...) .start() I F O
  • 19. Main Abstractions (Kafka Streams)
    Stream Processing Application
      • Program that uses the Kafka Streams library
    Topology
      • Logic that needs to be performed by stream processing
      • Functional DSL or low-level Processor API
    Stream Processor
      • A node in the processor topology
    KStream
      • Abstraction of a stream of records
      • Interpreted as events
    KTable
      • Abstraction of a change log stream
      • Interpreted as updates of the same record (by key)
    GlobalKTable
      • Like KTable, but not partitioned, so all data is available on all parallel application instances
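The difference between the KStream and KTable interpretations on this slide can be illustrated without any Kafka dependency. The sketch below is plain Java, not the Kafka Streams API; the class name `StreamVsTableSketch`, its methods and the sample truck records are hypothetical and only mimic the two readings of the same sequence of keyed records.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration (no Kafka dependency) of record-stream vs. changelog semantics. */
final class StreamVsTableSketch {

    /** KStream-like reading: every record is an independent event, so all of them are kept. */
    static List<String> asEventStream(String[][] records) {
        List<String> events = new ArrayList<>();
        for (String[] record : records) {
            events.add(record[0] + "=" + record[1]);
        }
        return events;
    }

    /** KTable-like reading: a record updates its key, so only the latest value per key survives. */
    static Map<String, String> asChangelogTable(String[][] records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : records) {
            table.put(record[0], record[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: {key, value}
        String[][] records = {
            {"truck-1", "Normal"}, {"truck-2", "Normal"}, {"truck-1", "Overspeed"}
        };
        System.out.println(asEventStream(records));    // three separate events
        System.out.println(asChangelogTable(records)); // two keys, truck-1 holds its latest value
    }
}
```

Read as a stream, the three records stay three events; read as a table, the second record for `truck-1` replaces the first.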
  • 20. Main Abstractions (Kafka Streams): Topology
      public static void main(String[] args) {
        Properties streamsConfiguration = new Properties();
        streamsConfiguration.put(...);

        final StreamsBuilder builder = new StreamsBuilder();

        KStream<..,..> stream = builder.stream(...);
        KStream<..,..> filtered = stream.filter(...);
        filtered.to(...);

        KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
        streams.start();
      }
    [Figure: topology with three nodes labelled I, F and O]
  • 21. Streaming Data Sources (Spark Structured Streaming)
      • File Source
          • Reads files as a stream of data
          • Supports text, csv, json, orc, parquet
          • Files must be atomically placed
      • Kafka Source
          • Reads from Kafka topic
          • Supports Kafka broker 0.10.0 or higher
      • Socket Source (for testing)
          • Reads UTF8 text from socket connection
      • Rate Source (for testing)
          • Generates data at a specified number of rows per second
      val rawDf = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "truck_position")
        .load()
  • 22. Streaming Data Sources (Kafka Streams)
      • "Kafka only"
      • KStream from Topic
      • KTable from Topic
      • Use Kafka Connect for reading other data sources into Kafka first
      KStream<String, TruckPosition> positions =
          builder.stream("truck_position",
              Consumed.with(Serdes.String(), truckPositionSerde));

      KTable<String, Driver> driver =
          builder.table("trucking_driver",
              Consumed.with(Serdes.String(), driverSerde),
              Materialized.as("driver-store"));
  • 23. Streaming Sinks (Spark Structured Streaming)
      • File Sink: stores output to a directory
      • Kafka Sink: publishes to Kafka
      • Foreach Sink: runs arbitrary computation on the records in the output
      • Console Sink: for debugging, prints output to console
      • Memory Sink: for debugging, stores output in an in-memory table
      val query = jsonTruckPlusDriverDf
        .selectExpr("to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("topic", "dangerous_driving")
        .option("checkpointLocation", "/tmp")
        .start()
  • 24. Streaming Sinks (Kafka Streams)
      • "Kafka only"
      • Use Kafka Connect for writing out to other targets
      KStream<String, TruckPositionDriver> posDriver = ..
      posDriver.to("dangerous_driving",
          Produced.with(Serdes.String(), truckPositionDriverSerde));
      • For testing only:
      // print to system output
      posDriver.print(Printed.toSysOut());
      // shortcut for
      posDriver.foreach((key, value) -> System.out.println(key + "=" + value));
  • 25. Processing Model: Event-at-a-time vs. Micro Batch
    Micro-Batch Processing
      • Splits incoming stream into small batches
      • Higher latency
      • Fault tolerance easier
    Event-at-a-time Processing
      • Events processed as they arrive
      • Low latency
      • Fault tolerance expensive
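The latency difference between the two models can be made concrete with a deliberately simplified sketch. It assumes that processing itself costs no time, that batches are cut at fixed multiples of the batch interval, and that timestamps are non-negative; `ProcessingModelSketch` and its methods are hypothetical names, not part of Spark or Kafka Streams.

```java
/** Simplified latency model: processing cost is ignored, batches close at fixed intervals. */
final class ProcessingModelSketch {

    /** Event-at-a-time: an event can be handled the moment it arrives. */
    static long eventAtATimeEmit(long arrivalMs) {
        return arrivalMs;
    }

    /** Micro-batch: an event waits until the batch interval it falls into has closed. */
    static long microBatchEmit(long arrivalMs, long batchIntervalMs) {
        return (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
    }

    public static void main(String[] args) {
        long batchIntervalMs = 1_000L;
        long[] arrivals = {100L, 950L, 1_001L};
        for (long arrival : arrivals) {
            long waitEvent = eventAtATimeEmit(arrival) - arrival;
            long waitBatch = microBatchEmit(arrival, batchIntervalMs) - arrival;
            System.out.println("arrival=" + arrival
                + " event-at-a-time wait=" + waitEvent
                + " micro-batch wait=" + waitBatch);
        }
    }
}
```

Under these assumptions an event arriving just after a batch boundary waits almost a full interval, which is where the higher latency of micro-batching comes from.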
  • 26. Stateless Operations – Selection & Projection
    Spark Structured Streaming
      • Most common operations on DataFrame/Dataset are supported for streaming as well
      • select, filter, map, flatMap, …
      val filteredDf = truckPosDf.where("eventType != 'Normal'")
    Kafka Streams
      • KStream and KTable interfaces support a variety of transformation operations
      • filter, filterNot, map, mapValues, flatMap, flatMapValues, branch, selectKey, groupByKey …
      KStream<String, TruckPosition> filtered =
          positions.filter((key, value) -> !value.eventType.equals("Normal"));
  • 27. Stateful Operations – Aggregations
    Spark Structured Streaming
      • State is held in distributed memory with the option to spill to disk (fault tolerant through checkpointing to a Hadoop-like FS)
      • Output modes: Complete, Append, Update
      • count, sum, mapGroupsWithState, flatMapGroupsWithState, reduce ...
      val c = source
        .withWatermark("timestamp", "10 minutes")
        .groupBy()
        .count()
    Kafka Streams
      • Requires a state store, which can be in-memory, RocksDB or a custom implementation (fault tolerant through Kafka topics)
      • Result of an aggregation is a KTable
      • count, reduce, aggregate ... (sum and avg are built with reduce / aggregate)
      KTable<..> c = stream
          .groupByKey(..)
          .count(...);
  • 28. Stateful Operations – Time Abstraction
    [Figure: events 1 to 5 on a clock timeline, showing Event Time, Ingestion Time and Processing Time; adapted from Matthias Niehoff (Codecentric)]
  • 29. Stateful Operations – Time Abstraction
    Spark Structured Streaming
      • Event Time
          • New with Spark Structured Streaming
          • Extracted from the message (payload)
      • Ingestion Time
          • For sources which capture ingestion time
          .option("includeTimestamp", true)
      • Processing Time
          • "Old" Spark Streaming only supported processing time
          • Generate the timestamp upon processing
          df.withColumn("processingTime", current_timestamp())
    Kafka Streams
      • Event Time
          • Point in time when the event occurred
          • Extracted from the message (payload or header)
      • Ingestion Time
          • Point in time when the event is stored in Kafka (sent in message header)
      • Processing Time
          • Point in time when the event happens to be processed by the stream processing application
  • 30. Stateful Operations - Windowing
      • Streams are unbounded, so computations (e.g. aggregations) need some meaningful time frames
      • Computations over events are done using windows of data
      • Windows are tracked per unique key
      • Window types: Fixed Window, Sliding Window, Session Window
    [Figure: stream of data over time, with a window of data marked for each window type]
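How an event is assigned to fixed and sliding windows can be sketched with plain arithmetic. The code below assumes windows are aligned to timestamp 0, are half-open intervals [start, start + size), and that timestamps are non-negative; `WindowAssignmentSketch` and its methods are hypothetical names, not an API of either framework.

```java
import java.util.ArrayList;
import java.util.List;

/** Window assignment by arithmetic; windows are [start, start + size) and aligned to 0. */
final class WindowAssignmentSketch {

    /** Start of the single fixed (tumbling) window that contains the timestamp. */
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Starts of all sliding (hopping) windows that contain the timestamp, in ascending order. */
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - (timestampMs % advanceMs);
        // Walk backwards while the window starting at 'start' still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs && start >= 0; start -= advanceMs) {
            starts.add(0, start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 1-minute windows advancing every 30 seconds, event at second 70
        System.out.println(tumblingWindowStart(70_000L, 60_000L));
        System.out.println(hoppingWindowStarts(70_000L, 60_000L, 30_000L));
    }
}
```

A fixed window puts each event into exactly one window, while a sliding window with an advance smaller than its size puts the same event into several overlapping windows.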
  • 31. Stateful Operations - Windowing (Spark Structured Streaming)
      • Support for Tumbling & Hopping (Sliding) Time Windows
      • Handling Late Data with Watermarking
      • Data older than the watermark is not expected and gets discarded
      val c = source
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window($"timestamp", "1 minute", "30 seconds"), $"word")
        .count()
    [Figure: event time plotted against processing time; the watermark trails the max event time by a gap of 10 mins]
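The watermark rule on this slide (max event time seen so far minus a trailing gap) can be sketched as follows. This is a simplification: Spark advances the watermark per micro-batch and applies it to window state, whereas this sketch checks every event individually. `WatermarkSketch` is a hypothetical class, not Spark code.

```java
/** Per-event watermark check; a simplification of what an engine does per micro-batch. */
final class WatermarkSketch {
    private final long delayMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    WatermarkSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Watermark = max event time seen so far minus the allowed delay. */
    long watermark() {
        return maxEventTimeMs == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTimeMs - delayMs;
    }

    /** Returns true if the event is accepted, false if it is older than the watermark. */
    boolean accept(long eventTimeMs) {
        if (eventTimeMs < watermark()) {
            return false; // too late: dropped
        }
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return true;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(10 * 60 * 1000L); // 10 minute trailing gap
        System.out.println(w.accept(20 * 60 * 1000L)); // event at minute 20
        System.out.println(w.accept(15 * 60 * 1000L)); // late, but within the gap
        System.out.println(w.accept(5 * 60 * 1000L));  // older than the watermark (minute 10)
    }
}
```

Late events inside the trailing gap are still accepted; only events older than the watermark are discarded, which bounds how much state has to be kept.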
  • 32. Stateful Operations - Windowing (Kafka Streams)
      • Support for Tumbling & Hopping Windows
      KTable<..> c = stream
          .groupByKey(..)
          .windowedBy(
              TimeWindows.of(60 * 1000)
                  .advanceBy(30 * 1000)
                  .until(10 * 60 * 1000))
          .count(...);
      • Support for Session Windows
      KTable<..> c = stream
          .groupByKey(...)
          .windowedBy(SessionWindows.with(5 * 60 * 1000))
          .count();
      • Handling Late Data with Data Retention (optional)
      • Data older than the retention period is not expected and gets discarded
    [Figure: event time plotted against processing time; the data retention period trails the max event time by a gap of 10 mins]
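Session windows differ from time windows in that their boundaries come from the data itself. A minimal sketch, assuming the timestamps of one key arrive already sorted and that a pause exactly equal to the inactivity gap still belongs to the same session (that boundary choice is an assumption of this sketch); `SessionWindowSketch` is a hypothetical name and this is not the Kafka Streams implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups the sorted event timestamps of one key into sessions separated by inactivity. */
final class SessionWindowSketch {

    /** Returns sessions as {start, end} pairs; a pause longer than gapMs starts a new session. */
    static List<long[]> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            if (!result.isEmpty() && ts - result.get(result.size() - 1)[1] <= gapMs) {
                result.get(result.size() - 1)[1] = ts; // extend the current session
            } else {
                result.add(new long[] {ts, ts});       // open a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 60_000L, 120_000L, 600_000L};
        for (long[] session : sessions(timestamps, 5 * 60 * 1000L)) {
            System.out.println("[" + session[0] + ", " + session[1] + "]");
        }
    }
}
```

With a 5 minute gap, the first three events form one session and the event at minute 10 opens a second one.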
  • 33. Stateful Operations - Joins
    Challenges of joining streams
      1. Data streams need to be aligned as they arrive, because they have different timestamps
      2. Since streams are never-ending, the joins must be limited; otherwise the join will never end
      3. The join needs to produce results continuously, as there is no end to the data
    Join types
      • Stream to Static (Table) Join
      • Stream to Stream Join (one window join)
      • Stream to Stream Join (two window join)
    [Figure: timelines for a Stream-to-Static Join and two Stream-to-Stream Joins]
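The second challenge, limiting a join in time, can be illustrated with a small sketch that pairs events of two inputs only when they share a key and their timestamps are at most one window apart. It works on finite lists rather than on unbounded streams, so it shows the matching condition only, not continuous result production. `WindowedJoinSketch`, its `Event` type and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Matching condition of a windowed stream-to-stream join, shown on finite lists. */
final class WindowedJoinSketch {

    /** Hypothetical event type: a key, an event-time timestamp and a payload. */
    static final class Event {
        final String key;
        final long timestampMs;
        final String value;

        Event(String key, long timestampMs, String value) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    /** Inner join: same key and timestamps no further apart than windowMs. */
    static List<String> join(List<Event> left, List<Event> right, long windowMs) {
        List<String> joined = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean closeInTime = Math.abs(l.timestampMs - r.timestampMs) <= windowMs;
                if (sameKey && closeInTime) {
                    joined.add(l.value + "+" + r.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Event> positions = Arrays.asList(
            new Event("d1", 1_000L, "pos-a"), new Event("d2", 2_000L, "pos-b"));
        List<Event> alerts = Arrays.asList(
            new Event("d1", 4_000L, "alert-x"), new Event("d2", 60_000L, "alert-y"));
        System.out.println(join(positions, alerts, 5_000L));
    }
}
```

Because only events within the window can match, an engine can drop older events from its join state, which is what keeps a join over never-ending streams bounded.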
  • 34. Stateful Operations - Joins (Spark Structured Streaming)
      • Stream-to-Static and Stream-to-Stream (since 2.3) joins on Dataset/DataFrame
      • Watermarking helps Spark to know for how long to retain data
          • Optional for Inner Joins
          • Mandatory for Outer Joins
      val jsonTruckPlusDriverDf =
        jsonFilteredDf.join(driverDf, Seq("driverId"), "left")
    Source: Spark Documentation
  • 35. Stateful Operations - Joins (Kafka Streams)
    Supports the following joins
      • KStream-to-KStream
      • KTable-to-KTable
      • KStream-to-KTable
      • KStream-to-GlobalKTable
      • KTable-to-GlobalKTable
      KStream<String, TruckPositionDriver> joined =
          filteredRekeyed.leftJoin(driver,
              // in a left join, right is null when no driver matches the key
              (left, right) -> new TruckPositionDriver(left,
                  right == null ? "" : StringUtils.defaultIfEmpty(right.first_name, ""),
                  right == null ? "" : StringUtils.defaultIfEmpty(right.last_name, "")),
              Joined.with(Serdes.String(), truckPositionSerde, driverSerde));
    Source: Confluent Documentation
  • 36. There is more …
    Spark Structured Streaming
      • Streaming Deduplication
      • Run-Once Trigger / fixed Interval Micro-Batching
      • Continuous Trigger with fixed checkpoint interval (experimental in 2.3)
      • Streaming Machine Learning
      • REPL
    Kafka Streams
      • Queryable State
      • Processor API
      • Exactly Once Processing
      • Microservices with Kafka Streams
      • Automatic Scale-up / Scale-Down
      • Stand-by replica of local state
      • Streaming SQL
  • 37. There is more … Streaming SQL with KSQL
      • Enables stream processing with zero coding required
      • The simplest way to process (structured) streams of data in real-time
      • Powered by Kafka Streams
      • KSQL server with REST API
      • Spark SQL also offers SQL on streaming data, but not as a “first-class citizen”
      ksql> CREATE STREAM truck_position_s
              (timestamp BIGINT, truckId BIGINT, driverId BIGINT, routeId BIGINT,
               eventType VARCHAR, latitude DOUBLE, longitude DOUBLE)
              WITH (kafka_topic='truck_position', value_format='JSON');

      ksql> SELECT * FROM truck_position_s;
      1506922133306 | "truck/13/position0 | 2017-10-02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | -2084263951914664106

      ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
  • 39. Spark Structured Streaming vs. Kafka Streams
    Spark Structured Streaming
      • Runs on top of a Spark cluster
      • Reuse your investments into Spark (knowledge and maybe code)
      • An HDFS-like file system needs to be available
      • Higher latency due to micro-batching
      • Multi-language support: Java, Python, Scala, R
      • Supports ad-hoc, notebook-style development/environment
    Kafka Streams
      • Available as a Java library
      • Can be the implementation choice of a microservice
      • Can only work with Kafka for both input and output
      • Low latency due to continuous processing
      • Currently only supports Java; Scala support available soon
      • KSQL abstraction provides SQL on top of Kafka Streams
  • 40. Comparison
    | | Kafka Streams | Spark Streaming | Spark Structured Streaming |
    | Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL |
    | Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching |
    | Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset |
    | Programming Model | Declarative / Imperative | Declarative | Declarative |
    | Time Support | Event / Ingestion / Processing | Processing | Event / Ingestion / Processing |
    | State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk |
    | Time Window Support | Fixed, Sliding, Session | Fixed, Sliding | Fixed, Sliding |
    | Join | Stream-Static, Stream-Stream | Stream-Static | Stream-Static, Stream-Stream (2.3) |
    | Event Pattern Detection | No | No | No |
    | Query Language Support | KSQL | No | Spark SQL (limited) |
    | Queryable State | Interactive Queries | No | No |
    | Scalability & Reliability | Yes | Yes | Yes |
    | Guarantees | At Least Once / Exactly Once | At Least Once / Exactly Once (partial) | At Least Once / Exactly Once (partial) |
    | Latency | Sub-second | Seconds | Seconds |
    | Deployment | Java Library | Cluster (with HDFS-like FS) | Cluster (with HDFS-like FS) |
  • 41. Technology on its own won't help you. You need to know how to use it properly.