An Introduction to Distributed Data Streaming
Elements and Systems
Paris Carbone <parisc@kth.se>
PhD Candidate
KTH Royal Institute of Technology
Motivation
[Figure: query Q; how to avoid this?]
[Figure: Standing Query Q]
Preliminaries
• Data Streaming Paradigm
• Incoming data is unbounded - continuous arrival
• Standing queries are evaluated continuously
• Queries operate on the full data stream or on the most recent views of the stream ~ windows
Data Streams Basics
• Events/Tuples : elements of computation - respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
[Figure: stream operator f with streams S1, S2, So, S’1, S’2]
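A minimal sketch of the stream operator idea, assuming Python generators stand in for streams. The event schema (sensor id, reading) and the operator names `to_fahrenheit` and `above` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, float]  # hypothetical schema: (sensor_id, reading)


def to_fahrenheit(stream: Iterable[Event]) -> Iterator[Event]:
    """Stream operator: consumes one stream and generates a new one."""
    for sensor_id, celsius in stream:
        # Each event is seen exactly once and then forgotten.
        yield sensor_id, celsius * 9 / 5 + 32


def above(threshold: float, stream: Iterable[Event]) -> Iterator[Event]:
    """Stream operator: forwards only events above the threshold."""
    for event in stream:
        if event[1] > threshold:
            yield event


source = iter([("a", 20.0), ("b", 30.0), ("a", 40.0)])
hot = list(above(80.0, to_fahrenheit(source)))
leftover = list(source)  # empty: the source was consumed once, no backtracking
```

Operators compose by wrapping one generator in another, so no intermediate stream is ever materialised.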
Streaming Pipelines
[Figure: sources (stream1, stream2) feed query Q; sinks receive approximations, predictions, alerts, …]
Core Abstractions
• Windows
• Synopses (summary state)
• Partitioning
Windows
Discussion
Why do we need windows?
Windows
• We are often interested only in fresh data
• f = “average temperature over the last minute every 20 sec”
• Range: Most data stream processing systems allow window operations on the most recent history (e.g. 1 minute, 1000 tuples)
• Slide: The frequency/granularity at which f is evaluated on a given range (e.g. every 20 sec)
[Figure: timeline in seconds (0 to 100); f produces Average #1, Average #2, Average #3 over W: 1min, 20sec]
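The query "average temperature over the last minute every 20 sec" can be sketched as follows. This is an illustration under assumptions, not any system's API: events are (timestamp in seconds, value) pairs arriving in timestamp order, a window covers [t - range, t), and the function name `sliding_average` is hypothetical.

```python
from collections import deque


def sliding_average(events, range_sec=60, slide_sec=20):
    """Yield (evaluation_time, average) every slide_sec over the last range_sec."""
    window = deque()        # events that may still fall in a future window
    next_eval = slide_sec   # next point in time at which f is evaluated
    for ts, value in events:
        # Fire every evaluation point that this event has passed.
        while ts >= next_eval:
            # Evict events that are older than the window range.
            while window and window[0][0] < next_eval - range_sec:
                window.popleft()
            if window:
                yield next_eval, sum(v for _, v in window) / len(window)
            next_eval += slide_sec
        window.append((ts, value))


readings = [(0, 10), (10, 20), (25, 30), (45, 40), (65, 50), (85, 60)]
averages = list(sliding_average(readings))
```

Only events inside the current range are buffered, so memory use depends on the window size rather than on the length of the stream.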
Window Types
• Sliding: range > slide
• Tumbling: range = slide
• Jumping: range < slide
[Figure: three timelines in seconds (0 to 120) showing the averages produced by sliding, tumbling and jumping windows]
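As a small illustration of the three window types, the sketch below lists the [start, end) interval of each window for a given range and slide. The helper names `window_type` and `windows` are hypothetical, and windows are assumed to start at time 0.

```python
def window_type(range_, slide):
    """Classify a window definition by comparing range and slide."""
    if range_ > slide:
        return "sliding"    # consecutive windows overlap
    if range_ == slide:
        return "tumbling"   # consecutive windows touch, no overlap, no gaps
    return "jumping"        # gaps between consecutive windows


def windows(range_, slide, horizon):
    """[start, end) intervals of all windows that start before horizon."""
    return [(start, start + range_) for start in range(0, horizon, slide)]


sliding = windows(60, 20, 60)     # range > slide
tumbling = windows(60, 60, 120)   # range = slide
jumping = windows(20, 60, 120)    # range < slide
```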
Synopses
We cannot store all events seen indefinitely
• Synopsis: A summary of an infinite stream
• It is in principle any streaming operator state
• Examples: samples, histograms, sketches, state machines…
[Figure: operator f with synopsis s (a summary of everything seen so far), consuming t and producing t’]
For each input event t:
1. process t, s
2. update s
3. produce t’
What about window synopses?
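The process/update/produce loop can be sketched generically. The names `run_operator` and `running_count` are hypothetical; the only assumption is that the operator logic f maps an event t and the current synopsis s to an updated synopsis and an output event t'.

```python
def run_operator(stream, f, s):
    """Drive a stateful operator: f(t, s) returns (updated_s, t_out)."""
    for t in stream:
        s, t_out = f(t, s)  # steps 1 and 2: process t with s, update s
        yield t_out         # step 3: produce t'


def running_count(t, s):
    """Example synopsis: a single integer, the number of events seen so far."""
    s = s + 1
    return s, (t, s)


counted = list(run_operator("abc", running_count, 0))
```

The synopsis here has constant size no matter how many events pass through, which is the point of keeping a summary instead of the stream itself.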
Synopses-Aggregations
• Discussion - Rolling Aggregations
• Propose a synopsis, s = ?, when
  • f = max
  • f = ArithmeticMean
  • f = stDev
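One possible answer to the discussion, sketched under assumptions: the synopsis keeps the running maximum, the count, the mean, and the sum of squared deviations. The mean and deviation updates use Welford's online algorithm, which is a choice made here and not something the slides prescribe. The class name `RollingStats` is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RollingStats:
    """Constant-size synopsis for rolling max, mean and standard deviation."""
    n: int = 0
    maximum: float = -math.inf
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        self.maximum = max(self.maximum, x)
        # Welford's update: numerically stable and needs no stored events.
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def stdev(self) -> float:
        """Population standard deviation of everything seen so far."""
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


stats = RollingStats()
for temperature in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(temperature)
```

For max the synopsis is one value, for the mean it is a (count, mean) pair, and stDev needs one extra accumulator.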
Synopses-Approximations
• Discussion - Approximate Results
• Propose a synopsis, s = ?, when
  • f = uniform random sample of k records over the whole stream
  • f = filter distinct records over windows of 1000 records with a 5% error
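For the first question, one well-known candidate is reservoir sampling (Algorithm R), sketched below. It is offered as a possible answer, not as the one intended by the slides. The function name `reservoir_sample` and the optional `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of k records using O(k) memory (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            # The first k records fill the reservoir.
            sample.append(record)
        else:
            # Record i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


picked = reservoir_sample(range(1000), k=5, rng=random.Random(42))
```

The synopsis is the reservoir itself: k records plus a counter, independent of how many events have been seen.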
Synopses-ML and Graphs
• Examples of cool synopses to check out
  • Sparsifiers/Spanners - approximating graph properties such as shortest paths
  • Change detectors - detecting concept drift
  • Incremental decision trees - continuous stream training and classification
Partitioning
• One stream operator is not enough
• Data might be too large to process
• e.g. very high input rate, too many stream sources
• State could possibly not fit in memory
[Figure: one operator f with state s is replaced by several parallel instances of f, each with its own state s]
How do we partition the input streams?
Partitioning
• Partitioning defines how we allocate events to each parallel instance. Typical partitioners are:
  • Broadcast
  • Shuffle
  • Key-based
[Figure: partitioners P routing events to parallel instances of f (each with state s), e.g. by color]
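The three partitioners can be sketched as functions that map an event to the indices of the parallel instances that should receive it. All names are hypothetical. CRC32 is used for the key hash only because Python's built-in `hash` of strings changes between processes.

```python
import random
import zlib


def broadcast(event, n):
    """Send the event to every one of the n parallel instances."""
    return list(range(n))


def shuffle(event, n, rng=random):
    """Send the event to one instance chosen uniformly at random."""
    return [rng.randrange(n)]


def key_based(event, n, key):
    """Send events with the same key to the same instance."""
    digest = zlib.crc32(str(key(event)).encode("utf-8"))
    return [digest % n]


event = {"color": "red", "value": 7}
targets = key_based(event, 3, key=lambda e: e["color"])
```

Key-based partitioning is what makes per-key state possible: every event for a given key meets the same synopsis.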
Putting Everything Together
Fire Detection Pipeline
[Figure: inputs {area,temp} (trigger periodically) and {area,smoke} (trigger on detection) enter an unknown pipeline "?" that outputs {loc,alert!}]
• operators
• synopses
• windows
• partitioning
Operators
• Src: Sensor Data Sources (trivial…)
  • Periodic Temperature Updates: {area,temp}
  • Smoke Detections: {area}
• A (with state s): Rolling Arithmetic Mean of Temperatures, {area,temp} -> {area,avgTemp}
• F (with state s): State Machine-based Fire Alarm, output {alarm}
What is the state and its transitions?
Partitioning
• We are only interested in correlating smoke and high temperature within the same area
• Events carry area information so we can partition our computation by area
[Figure: Src -> P (key:area)]
Windowing
• Individual sensor data could be faulty
• We need to gather data from all temperature sensors of an area and produce an average
• We want fresh average temperatures
[Figure: Src {area,temp} -> P (key:area) -> parallel instances of A, each with window w and state s; w = ?]
The Fire Alarm
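• Input symbols: T : avgTemp>40, T : avgTemp<40 (the two T symbols are told apart only in the original figure), S : Smoke
• Example input: …TTTSTTSTTTT….
• States: OK, HOT, SMOKE, FIRE
• synopsis = 1 state
[Figure: state machine of operator F (with state s), with T and S transitions between the four states]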
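A sketch of the alarm as a state machine whose whole synopsis is the current state. The transition table is an assumption: the text of the slide names the four states but not the edges. Because both temperature symbols appear as "T" on the slide, this sketch renames them "H" (avgTemp>40) and "C" (avgTemp<40).

```python
# Assumed transitions; any (state, symbol) pair not listed keeps the state.
TRANSITIONS = {
    ("OK", "H"): "HOT",
    ("OK", "S"): "SMOKE",
    ("HOT", "C"): "OK",
    ("HOT", "S"): "FIRE",
    ("SMOKE", "H"): "FIRE",
    ("FIRE", "C"): "OK",
}


def step(state, symbol):
    """Return the next state for one input symbol ("H", "C" or "S")."""
    return TRANSITIONS.get((state, symbol), state)


def fire_alarm(symbols, state="OK"):
    """Yield the state after each symbol; the synopsis is just `state`."""
    for symbol in symbols:
        state = step(state, symbol)
        yield state


trace = list(fire_alarm("CCHSC"))
```

With one such machine per area, the state kept by F grows with the number of areas, not with the number of events.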
Putting Everything Together
[Figure: the full pipeline. {area,temp}: Src -> P (key:area) -> parallel A (window w, state s) -> {area,avg_temp}. {area,smoke}: Src -> P (key:area). Both streams -> P (key:area) -> parallel F (state s) -> {area, alert}]
Systems: The Big Picture
• Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
• Open Source: Flink, Storm, Samza, Spark
Evolution
[Figure: timeline of concepts and systems]
• ’88 Active Databases
• ’88 HiPac
• ’95 Materialised Views
• ’00 Eddies
• ’01 Complex Event Processing
• ’02 Aurora
• ’03 TelegraphCQ
• ’03 STREAM
• ’05 Borealis
• ’05 Decentralised Stream Queries
• ’05 High Availability on Streaming
• ’12 Policy-Based Windowing
• ’12 Twitter Storm
• ’12 IBM System S
• ’13 Spark Streaming
• ’13 Parallel Recovery
• ’13 Google Millwheel
• ’13 Discretized Streams
• ’14 Apache Flink
• ’15 User-Defined Windows
Programming Models
Compositional
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation
Declarative
• Expose a high-level API
• Operators are higher order functions on abstract data stream types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Programming Model Types
Compositional
• Direct access to the execution graph / topology
• Suitable for engineers
Declarative (DStream, DataStream, PCollection…)
• Transformations abstract operator details
• Suitable for engineers and data analysts
Standing Queries with Apache Storm
• Step 1: Implement input (Spouts) and intermediate operators (Bolts)
• Step 2: Construct a Topology by combining operators
[Figure: Spout -> Bolt -> Bolt]
• Spouts are the topology sources. They listen to data feeds.
• Bolts represent all intermediate computation vertices of the topology. They do arbitrary data manipulation.
• Each operator can emit/subscribe to Streams (computation results)
Example: Topology Definition
[Figure: topology definition code with streams "numbers" and "new_numbers" and a "toFile" operator]
Standing Queries with Apache Flink
[Figure: Streaming Program, Flink Client, Flink Job Graph Builder/Optimiser, Flink Runtime]
• Operator fusion
• Window Pre-aggregates
• Deploy Long Running Tasks
• Monitor Execution
Distributed Stream Execution Paradigms
1) Real Streaming (Distributed Data Flow)
• Long-lived task execution
• State is kept inside tasks
2) Batched Execution
[Figure labels: (Hadoop, Spark), (Spark Streaming)]
Windows in Action
• DStreams are already partitioned in time windows
• Only time windows supported
• Windows decomposed into policies
• Policies can be user-defined too
[Figure: code examples with range and slide parameters]
Windows on Storm?
src: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
Partitioning in Action
• Flink: forward(), shuffle(), broadcast(), keyBy(), partitionCustom()
• Storm: shuffleGrouping(), allGrouping(), fieldsGrouping(), customGrouping()
• Spark: repartition(num), reduceByKey(), updateStateByKey()
[Figure: the systems are placed on a scale from "no fine-grained control" to "full control"]
Synopses in Action
implementing a rolling max per key
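The code on the original slide did not survive extraction. As a stand-in, here is a framework-independent sketch of a rolling max per key, assuming events are (key, value) pairs. The function name `rolling_max_per_key` is hypothetical, and this is not the slide's Flink code.

```python
def rolling_max_per_key(stream):
    """For each (key, value) event, emit (key, largest value seen for key)."""
    maxima = {}  # synopsis: one number per key
    for key, value in stream:
        if key not in maxima or value > maxima[key]:
            maxima[key] = value
        yield key, maxima[key]


events = [("a", 1), ("b", 5), ("a", 3), ("b", 2)]
result = list(rolling_max_per_key(events))
```

In a distributed setting, a key-based partitioner in front of this operator would ensure that each parallel instance holds only the maxima of its own keys.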
State in Spark?
• Streams are partitioned into small batches
• There is practically no state kept in workers (stateless)
• How do we keep state? (Spark Streaming)
• Put new states in the output RDD: dstream.updateStateByKey(…)
[Figure: batch In and state S’]
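The idea of carrying state from one small batch to the next can be simulated in plain Python. This is only an illustration of the pattern, not Spark's API: the function `update_state_by_key` and its signature are hypothetical. State is an immutable input and a new state is the output of every batch.

```python
def update_state_by_key(batch, state, update):
    """Return the new state dict after applying `update` per key.

    batch:  list of (key, value) pairs for this micro-batch
    state:  dict of key -> previous state (left unmodified)
    update: function (new_values, old_state_or_None) -> new_state
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    # Every known key is updated, also those without new values.
    return {
        key: update(grouped.get(key, []), state.get(key))
        for key in state.keys() | grouped.keys()
    }


def rolling_max(new_values, old):
    candidates = list(new_values) + ([old] if old is not None else [])
    return max(candidates)


state = {}
for batch in [[("a", 1), ("b", 5)], [("a", 3)]]:
    state = update_state_by_key(batch, state, rolling_max)
```

Because each step produces a fresh state from the previous one, the workers themselves stay stateless between batches.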
Implementing the alarm in Flink
So everything works
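[Figure: the full fire detection pipeline. {area,temp}: Src -> P (key:area) -> parallel A (window w, state s) -> {area,avg_temp}. {area,smoke}: Src -> P (key:area). Both streams -> P (key:area) -> parallel F (state s)]
or…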
Unreliable Sources
[Figure: Standing Query Q; add more sensors; recovered!]
Unreliable Processing
[Figure: Standing Query Q; lost smoke events]
Resilient Brokers
Main Features
• Topic-based partitioned queues
• Strongly consistent offset mapping to records
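A toy in-memory sketch of why a stable offset-to-record mapping helps with recovery: a consumer that remembers the last offset it processed can re-read everything after it. The class `PartitionedLog` and its methods are hypothetical and do not correspond to any broker's client API.

```python
class PartitionedLog:
    """Append-only queues; each record gets a stable offset in its partition."""

    def __init__(self, num_partitions):
        self._partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Store the record and return its offset within the partition."""
        log = self._partitions[partition]
        log.append(record)
        return len(log) - 1

    def read_from(self, partition, offset):
        """Return (offset, record) pairs starting at the given offset."""
        return list(enumerate(self._partitions[partition]))[offset:]


log = PartitionedLog(num_partitions=2)
offsets = [log.append(0, r) for r in ("smoke@A", "smoke@B", "smoke@C")]

# After a failure, a consumer that last committed offset 1 replays from there.
replayed = log.read_from(0, 1)
```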
Processing Guarantees
• Kafka solves the source consistency problem
• How about the rest of the states of the computation? (e.g. alert operator state)
• Each system offers different guarantees

System   Guarantees      Technique
Storm    at least once   event dependency tracking
Spark    exactly once    source upstream backup
Flink    exactly once    periodic snapshots
Mission Accomplished
[Figure: Standing Query Q]
Research Topics at KTH/SICS
• Exactly-Once-Output Guarantees
• State management and auto-scaling
• Streaming ML pipelines
• Streaming Graphs
