From stream data management
To distributed dataflows
And beyond...
Vasiliki (vasia) Kalavri
(vkalavri@bu.edu)
Stream processing is an established technology in the
data analytics stack of the modern business
Traffic light adjustment in real time
Alibaba City Brain analyzes
vehicle locations to:

• clear paths for emergency
response vehicles

• provide scheduling information
for public transport

• recommend alternative routes
Read more: https://edition.cnn.com/2019/01/15/tech/alibaba-city-brain-hangzhou/index.html
Fault-detection for NASA’s Deep
Space Network
NASA’s DSN Complex Event Processing
analyzes real-time network data, predicted
antenna pointing parameters, and physical
hardware logs to:

• ingest, filter, store, and visualize all of the
DSN's monitor and control data

• ensure the successful DSN tracking,
ranging, and communication integrity of
dozens of concurrent deep-space missions
Read more: https://www.confluent.io/kafka-summit-san-francisco-2019/mission-critical-real-time-fault-detection-for-nasas-deep-space-network-using-apache-kafka/
• How did we get here?
• Are we there yet?
• What lies ahead?
SIGMOD ’92
[… A new class of queries, continuous queries, are similar to
conventional database queries, except that they are issued once and
henceforth run “continually” over the database …]
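As a minimal sketch of the idea quoted above, the following Python generator treats a continuous query as a filter that is issued once and then keeps producing results as records arrive. The names `continuous_query` and `readings`, and the sensor records, are hypothetical illustrations and are not taken from the SIGMOD ’92 paper.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]


def continuous_query(
    stream: Iterable[Record], predicate: Callable[[Record], bool]
) -> Iterator[Record]:
    """Issued once; yields each matching record as soon as it arrives."""
    for record in stream:
        if predicate(record):
            yield record


def readings() -> Iterator[Record]:
    """Hypothetical, finite stand-in for an unbounded input stream."""
    yield {"sensor": "a", "temp": 20}
    yield {"sensor": "b", "temp": 35}
    yield {"sensor": "a", "temp": 41}


if __name__ == "__main__":
    # The query is registered once and evaluated lazily per arriving record.
    for match in continuous_query(readings(), lambda r: r["temp"] > 30):
        print(match)
```

Because the generator is lazy, the same function would work unchanged over an input that never ends.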
[Timeline; axis marks 1992, 2000, 2002, 2004, 2013, 2020]
Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope
DSMS architecture
Input streams S1, S2, …, Sr → Input Manager
Scheduler, QoS Monitor, Load Shedder
Query Execution Engine running ad-hoc or continuous queries Q1, Q2, …, Qm
Synopsis Maintenance: synopsis for S1, …, synopsis for Sr
Fast approximate answers
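To make the role of a synopsis concrete, here is a small sketch of one well-known synopsis structure, a Count-Min sketch, which answers frequency queries approximately from fixed memory. It is an illustrative choice, not necessarily the structure used by any system named in this talk; the class name, the default sizes, and the SHA-256-based row hashes are assumptions.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Fixed-size frequency synopsis; estimates never undercount."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _column(self, row: int, item: str) -> int:
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the row minimum is the tightest bound.
        return min(
            self.table[row][self._column(row, item)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for key in ["a", "b", "a", "a", "c"]:
        sketch.add(key)
    print(sketch.estimate("a"))  # at least 3; larger only if hashes collide
```

Memory stays at `width * depth` counters no matter how many records the stream delivers, which is the trade that makes fast approximate answers possible.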
[Timeline, continued] Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope; followed by MapReduce, Storm, S4
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
λ-architecture
Input data → Input Manager
Speed layer: “best-effort” low-latency stream processor → fast approximate results
Batch layer: persistent storage → MapReduce / batch processing system → slow exact results
Applications
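The two layers can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: the workload is a per-key count, events are plain strings, and the speed layer here happens to be exact, whereas a best-effort processor may lose or duplicate events. All names (`batch_view`, `SpeedLayer`, `serve`) are hypothetical.

```python
from collections import Counter
from typing import Iterable


def batch_view(persisted_events: Iterable[str]) -> Counter:
    """Batch layer: slow, exact recomputation over everything in storage."""
    return Counter(persisted_events)


class SpeedLayer:
    """Speed layer: incremental counts for events the batch view has not covered yet."""

    def __init__(self) -> None:
        self.recent: Counter = Counter()

    def on_event(self, key: str) -> None:
        self.recent[key] += 1

    def reset(self) -> None:
        # Called once a fresh batch view covers these events.
        self.recent.clear()


def serve(key: str, batch: Counter, speed: SpeedLayer) -> int:
    """Serving side: merge the exact batch count with the recent delta."""
    return batch[key] + speed.recent[key]


if __name__ == "__main__":
    batch = batch_view(["x", "y", "x"])   # last completed batch run
    speed = SpeedLayer()
    speed.on_event("x")                   # arrived after that run
    print(serve("x", batch, speed))
```

The application has to maintain the same logic twice, once per layer, which is the operational cost this design is usually criticized for.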
[Timeline, continued through 2015] Distributed Dataflow Systems: Spark Streaming, Naiad, Flink, MillWheel, Google Dataflow, Timely Dataflow, Samza
Stream processing doesn’t necessarily need to
be approximate and lossy
DDS architecture
Client: submits (Q, config) through the Streaming APIs
Coordinator: schedule, trigger checkpoint, status
Workers: each runs several Tasks with a state store (put/get); data is exchanged over TCP
Inputs: event logs, sockets
Distributed File System: checkpoint
Output to application and sinks
[Timeline, 1992 to 2020]
Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope
MapReduce, Storm, S4
Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, MillWheel, Google Dataflow, Timely Dataflow
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
data parallelism
exactly-once fault-tolerance
state management
general-purpose languages
iterations, UDFs
Are we there yet?
SIGMOD Record ’05
1. Process events online without storing them
   → persistently store events and state
2. Support a high-level language (SQL-like)
   → Java, Scala, Python, and SQL-like
3. Handle missing, out-of-order, delayed data
   → with tunable latency trade-offs
4. Guarantee deterministic (on replay) and correct results (on recovery)
   → and exactly-once
5. Combine batch and stream processing
   → batch is a special case of streaming
6. Ensure availability despite failures
   → and exactly-once state updates
7. Support distribution and automatic elasticity
8. Offer low latency
   → high throughput and “acceptable” latency
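Requirement 3 and its latency trade-off can be illustrated with a small event-time sketch. It assumes a watermark that trails the largest timestamp seen by a fixed `max_delay`, tumbling windows, and a policy of dropping events that arrive after their window has closed. The function name and parameters are hypothetical and do not correspond to any particular system's API.

```python
from typing import Dict, Iterable, List, Tuple


def tumbling_counts(
    event_times: Iterable[int], window: int, max_delay: int
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Count events per tumbling event-time window.

    event_times are given in arrival order. A window is emitted once the
    watermark (largest timestamp seen minus max_delay) reaches its end.
    Returns (emitted windows as (start, count), dropped late timestamps).
    """
    open_windows: Dict[int, int] = {}
    emitted: List[Tuple[int, int]] = []
    dropped: List[int] = []
    watermark = float("-inf")
    for t in event_times:
        start = (t // window) * window
        if start + window <= watermark:
            dropped.append(t)  # its window was already emitted
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        watermark = max(watermark, t - max_delay)
        # Emit every window whose end the watermark has passed.
        for s in sorted(s for s in open_windows if s + window <= watermark):
            emitted.append((s, open_windows.pop(s)))
    # End of the (finite) example input: flush what is still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped


if __name__ == "__main__":
    arrivals = [1, 4, 12, 3, 16, 2, 25]  # timestamps 3 and 2 arrive out of order
    print(tumbling_counts(arrivals, window=10, max_delay=5))
    print(tumbling_counts(arrivals, window=10, max_delay=0))
```

With `max_delay=5` the late timestamp 3 is still counted and only 2 is dropped; with `max_delay=0` windows close sooner but both 3 and 2 are dropped. A larger delay buys completeness at the price of later results.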
Some of my recent and ongoing work
Re-configurable Stream Processing
[Diagram: an instrumented stream processor emits performance metrics to a Profiler and an Analyzer, whose decision can invoke a job re-configuration: automatic scaling, adaptive scheduling, straggler mitigation, query optimization]
Automatic elasticity and reconfiguration
heuristic policies: if CPU > 80% => scale
stop-and-restart migration and reconfiguration
Automatic elasticity and reconfiguration
Accuracy: no over/under-provisioning
Stability: no oscillations
Performance: fast convergence
Safe migration: correct results
Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows (OSDI ’18).
github.com/strymon-system/ds2
Megaphone: Latency-conscious state migration for distributed streaming dataflows (VLDB’19).
github.com/strymon-system/megaphone
[Figure: dataflow src → o1 → o2 in which o1 cannot keep up, while its neighbors spend time waiting for output or waiting for input]
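In contrast to a CPU-threshold heuristic, a scaling decision can be derived from how fast each operator processes records while it is actually working rather than waiting. The sketch below is a heavily simplified illustration of that idea for a linear pipeline. It assumes throughput scales linearly with the number of instances, and every name in it (`OperatorMetrics`, `plan_parallelism`, the metric fields) is hypothetical rather than the interface of the repositories listed above.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OperatorMetrics:
    name: str
    records_in: int        # records processed in the window, all instances
    records_out: int       # records emitted in the window, all instances
    useful_time_s: float   # seconds spent processing (not waiting), summed over instances


def plan_parallelism(
    source_rate: float, pipeline: List[OperatorMetrics]
) -> Dict[str, int]:
    """Instances each operator needs to keep up with the source rate."""
    plan: Dict[str, int] = {}
    target_rate = source_rate  # records/s that must reach the next operator
    for op in pipeline:
        # Per-instance rate while busy, ignoring time spent waiting.
        true_rate = op.records_in / op.useful_time_s
        plan[op.name] = max(1, math.ceil(target_rate / true_rate))
        # Downstream load is this operator's load times its selectivity.
        target_rate *= op.records_out / op.records_in
    return plan


if __name__ == "__main__":
    o1 = OperatorMetrics("o1", records_in=2000, records_out=1000, useful_time_s=10.0)
    o2 = OperatorMetrics("o2", records_in=1000, records_out=1000, useful_time_s=2.0)
    print(plan_parallelism(source_rate=1000.0, pipeline=[o1, o2]))
```

In this made-up example o1 handles 200 records per second of useful time, so a source rate of 1000 records/s calls for 5 instances, while o2 needs only 1 even though it was mostly idle waiting for input.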
Re-configurable Stream Processing
Performance analysis of streaming dataflows is itself a challenging streaming computation with strict latency requirements
SnailTrail: Generalizing critical paths for online analysis of distributed dataflows (NSDI’18).
github.com/li1/snailtrail
SIGMOD Record ’05
1. Process events online without storing them
   → persistently store events and state
2. Support a high-level language (SQL-like)
   → Java, Scala, Python, and SQL-like
3. Handle missing, out-of-order, delayed data
   → with tunable latency trade-offs
4. Guarantee deterministic (on replay) and correct results (on recovery)
   → and exactly-once
5. Combine batch and stream processing
   → batch is a special case of streaming
6. Ensure availability despite failures
   → and exactly-once state updates
7. Support distribution and automatic elasticity
   → accurate, stable, latency-aware
8. Offer low latency
   → high throughput and “acceptable” latency
In open-source software, reliability, production readiness and community can be more important than raw performance
[Plot: Apache Flink, Nexmark Q4; CDF (0.0 to 1.0) of latency in ms (0 to 10000), in-memory state vs. RocksDB state]
serialization/deserialization (serde) at every access
write-heavy, large state
read-modify-write (RMW) of a single value
globally configured store
Type-aware, flexible state management provides up to an order of magnitude latency improvement
We need configurable streaming backends
New streaming state benchmarks
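The cost named in "serde at every access" can be seen in a toy comparison. The sketch assumes a byte-oriented store that must serialize on every write and deserialize on every read, modeled here with `pickle` and a dictionary. Both store classes are hypothetical stand-ins and not the API of Flink or RocksDB.

```python
import pickle
from typing import Any, Dict, Iterable


class InMemoryStore:
    """Keeps live objects; reads and writes involve no serialization."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value


class SerializingStore:
    """Stand-in for a byte-oriented store: every access pays for serde."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self.serde_calls = 0

    def get(self, key: str, default: Any = None) -> Any:
        raw = self._data.get(key)
        if raw is None:
            return default
        self.serde_calls += 1
        return pickle.loads(raw)

    def put(self, key: str, value: Any) -> None:
        self.serde_calls += 1
        self._data[key] = pickle.dumps(value)


def count_events(store: Any, keys: Iterable[str]) -> None:
    """Read-modify-write of a single value per incoming event."""
    for key in keys:
        store.put(key, store.get(key, 0) + 1)


if __name__ == "__main__":
    store = SerializingStore()
    count_events(store, ["a", "b", "a", "a"])
    print(store.serde_calls)  # one serialize per write plus one deserialize per hit
```

Every update of an existing key costs a deserialize and a serialize in the byte-oriented store, which is why a write-heavy, read-modify-write workload is sensitive to the choice of state backend.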
Beyond…
Model serving
1. Model stored externally: input stream → Stream Processor → RPC → Model Server → predictions
2. Model stored in managed state: input stream → Stream Processor (op) → predictions
Model management and versioning
Exactly-once guarantees?
Latency trade-offs unclear
What kind of state store to use?
Stateful serverless (FaaS)
Automatic scaling
Function orchestration
Support for transactions
[Diagram: external requests, events and function triggers → functions (f, λ) → output]
Apache Flink Stateful Functions: https://statefun.io
Stateful Functions as a Service in Action (VLDB’19)
Graph streaming & online training
[Figure: data rate (low to high) vs. analytics complexity (low to high), positioning Streaming, CEP, Relational analytics, Data Mining, Machine Learning, Graph processing, and Complex streaming data analytics]
Streaming Graph Partitioning: An Experimental Study (VLDB’18).
Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism (arxiv.org/abs/1912.12740).
Graph state management
Data-parallel graph synopses
Languages & operator semantics
Adaptive graph partitioning
[Timeline, 1992 to 2020, from Data Stream Management Systems to Distributed Dataflow Systems, with new directions added]
ML, Graphs, FaaS, Edge, Modern hardware
From data stream management to distributed dataflows and beyond

  • 1. From stream data management To distributed dataflows And beyond... Vasiliki (vasia) Kalavri ([email protected])
  • 2. Stream processing is an established technology in the data analytics stack of the modern business
  • 3. 3
  • 4. 4
  • 5. 4
  • 6. 4
  • 7. 5
  • 8. Traffic light adjustment in real time Alibaba City Brain analyzes vehicle locations to: • clear paths for emergency response vehicles • provide scheduling information for public transport • recommend alternative routes Read more: https://edition.cnn.com/2019/01/15/tech/alibaba-city-brain-hangzhou/index.html 6
  • 9. Fault-detection for NASA’s Deep Space Network NASA’s DSN Complex Event Processing analyzes real-time network data, predicted antenna pointing parameters, and physical hardware logs to: • ingest, filter, store, and visualize all of the DSN's monitor and control data • ensure the successful DSN tracking, ranging, and communication integrity of dozens of concurrent deep-space missions Read more: https://www.confluent.io/kafka-summit-san-francisco-2019/mission-critical-real-time- fault-detection-for-nasas-deep-space-network-using-apache-kafka/ 7
  • 10. • How did we get here? • Are we there yet? • What lies ahead?
  • 11. 9
  • 13. SIGMOD ’92 [… A new class of queries, continuous queries, are similar to conventional database queries, except that they are issued once and henceforth run “continually” over the database …] 9
  • 17. Synopsis Maintenance DSMS architecture Synopsis for S1 Synopsis for Sr … Fast approximate answers … S1 S2 Sr 11 InputManager Scheduler QoS Monitor Load Shedder Query Execution Engine QmQ2Q1 Ad-hoc or continuous queries Input streams …
  • 18. 12 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 GigascopeNiagaraCQ
  • 19. 12 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 GigascopeNiagaraCQ operator semantics event time & progress representations synopses & sketches
  • 20. 12 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 GigascopeNiagaraCQ operator semantics event time & progress representations synopses & sketches load management high availability scheduling
  • 21. 12 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 Gigascope MapReduce NiagaraCQ operator semantics event time & progress representations synopses & sketches load management high availability scheduling
  • 22. 12 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 Gigascope MapReduce Storm S4 NiagaraCQ operator semantics event time & progress representations synopses & sketches load management high availability scheduling
  • 23. “Best-effort” low-latency stream processor λ-architecture MapReduce / Batch processing system Fast approximate results 13 InputManager Input data Persistent storage Slow exact results Applications Speed layer Batch layer
  • 24. 14 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 Gigascope MapReduce Storm S4 2015 NiagaraCQ operator semantics event time & progress representations synopses & sketches load management high availability scheduling
  • 25. 14 Data Stream Management Systems 1992 20132004 Tapestry Aurora TelegraphCQ STREAM 20202000 2002 Gigascope MapReduce Storm S4 2015 Distributed Dataflow Systems NiagaraCQ Spark Streaming Naiad Flink Millwheel Google Dataflow Timely Dataflow Samza operator semantics event time & progress representations synopses & sketches load management high availability scheduling
  • 26. Stream processing doesn’t necessarily need to be approximate and lossy
  • 27. DDS architecture (diagram). Components: client, Coordinator, Workers (each running several Tasks, with a state store), Distributed File System. Inputs: event logs, sockets, streaming APIs. Edge labels: (Q, config), schedule, trigger checkpoint, status, put/get checkpoint, TCP. Output to application and sinks.
  • 28. Timeline (1992, 2000, 2002, 2004, 2013, 2015, 2020). Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope. Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow. Also shown: MapReduce, Storm, S4. Topics: operator semantics; event time & progress representations; synopses & sketches; load management; high availability; scheduling.
  • 29. Timeline (1992, 2000, 2002, 2004, 2013, 2015, 2020). Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope. Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow. Also shown: MapReduce, Storm, S4. Topics: operator semantics; event time & progress representations; synopses & sketches; load management; high availability; scheduling; data parallelism; exactly-once fault-tolerance; state management; general-purpose languages; iterations; UDFs.
  • 30. Are we there yet? Timeline (1992, 2000, 2002, 2004, 2013, 2015, 2020). Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope. Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow. Also shown: MapReduce, Storm, S4. Topics: operator semantics; event time & progress representations; synopses & sketches; load management; high availability; scheduling; data parallelism; exactly-once fault-tolerance; state management; general-purpose languages; iterations; UDFs.
  • 32. SIGMOD Record ’05 requirements: 1. Process events online without storing them
  • 33. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]
  • 34. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like)
  • 35. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]
  • 36. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data
  • 37. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]
  • 38. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery)
  • 39. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]
  • 40. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing
  • 41. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]
  • 42. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures
  • 43. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]
  • 44. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity
  • 46. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity; 8. Offer low-latency
  • 47. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity; 8. Offer low-latency [high throughput and “acceptable” latency]
  • 49. Some of my recent and ongoing work
  • 50. Some of my recent and ongoing work: Re-configurable Stream Processing (diagram). Components: instrumented stream processor, Profiler, Analyzer. Edge labels: performance metrics, decision, invoke, re-configure job. Applications: automatic scaling, adaptive scheduling, straggler mitigation, query optimization.
  • 51. Automatic elasticity and reconfiguration. Heuristic policies: if CPU > 80% => scale. Stop-and-restart migration and reconfiguration.
  • 52. Automatic elasticity and reconfiguration. Accuracy: no over/under-provisioning. Stability: no oscillations. Performance: fast convergence. Safe migration: correct results. Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows (OSDI ’18), github.com/strymon-system/ds2. Megaphone: Latency-conscious state migration for distributed streaming dataflows (VLDB ’19), github.com/strymon-system/megaphone. Diagram labels: src, o1, o2; o1 cannot keep up; waiting for output; waiting for input.
  • 53. Re-configurable Stream Processing (diagram). Components: instrumented stream processor, Profiler, Analyzer. Edge labels: performance metrics, decision, invoke, re-configure job. Applications: automatic scaling, adaptive scheduling, straggler mitigation, query optimization.
  • 54. Performance analysis of streaming dataflows is itself a challenging streaming computation with strict latency requirements. Re-configurable Stream Processing (diagram). Components: instrumented stream processor, Profiler, Analyzer. Edge labels: performance metrics, decision, invoke, re-configure job. Applications: automatic scaling, adaptive scheduling, straggler mitigation, query optimization.
  • 55. Performance analysis of streaming dataflows is itself a challenging streaming computation with strict latency requirements. Re-configurable Stream Processing (diagram). Components: instrumented stream processor, Profiler, Analyzer. Edge labels: performance metrics, decision, invoke, re-configure job. Applications: automatic scaling, adaptive scheduling, straggler mitigation, query optimization. Snailtrail: Generalizing critical paths for online analysis of distributed dataflows (NSDI ’18), github.com/li1/snailtrail.
  • 56. SIGMOD Record ’05 requirements (slide annotations in brackets): 1. Process events online without storing them [persistently store events and state]; 2. Support a high-level language (SQL-like) [Java, Scala, Python, and SQL-like]; 3. Handle missing, out-of-order, delayed data [with tunable latency trade-offs]; 4. Guarantee deterministic (on replay) and correct results (on recovery) [and exactly-once]; 5. Combine batch and stream processing [batch is a special case of streaming]; 6. Ensure availability despite failures [and exactly-once state updates]; 7. Support distribution and automatic elasticity [accurate, stable, latency-aware]; 8. Offer low-latency [high throughput and “acceptable” latency]
  • 57. In open-source software, reliability, production readiness and community can be more important than raw performance.
  • 58. In open-source software, reliability, production readiness and community can be more important than raw performance. Figure: Apache Flink, Nexmark Q4 latency CDF; x-axis: latency (ms), 0 to 10000; y-axis: CDF, 0.0 to 1.0; series: in-memory state, RocksDB state. Annotation: serde at every access.
  • 59. Write-heavy, large state. RMW a single value. Globally configured store.
  • 60. Write-heavy, large state. RMW a single value. Globally configured store. Type-aware, flexible state management provides up to an order of magnitude latency improvement. We need configurable streaming backends. New streaming state benchmarks.
  • 62. Model serving. 1. Model stored externally (diagram: input stream, Stream Processor, RPC, Model Server, predictions). 2. Model stored in managed state (diagram: input stream, Stream Processor with op, predictions). Open issues: model management and versioning; exactly-once guarantees?; latency trade-offs unclear; what kind of state store to use?
  • 63. Stateful serverless (FaaS). Automatic scaling; function orchestration; support for transactions. Diagram labels: external requests, events and function triggers, functions (f, λ), output. Apache Flink Stateful Functions: https://statefun.io. Stateful Functions as a Service in Action (VLDB ’19).
  • 64. Graph streaming & online training. Figure: axes are data rate (low to high) and analytics complexity (low to high); regions: Machine Learning, Data Mining, Streaming CEP, Relational analytics, Graph processing, Complex streaming data analytics. Topics: graph state management; data-parallel graph synopses; languages & operator semantics; adaptive graph partitioning. Streaming Graph Partitioning: An Experimental Study (VLDB ’18). Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism (arxiv.org/abs/1912.12740).
  • 65. Timeline (1992, 2000, 2002, 2004, 2013, 2015, 2020). Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope. Distributed Dataflow Systems: Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow. Also shown: MapReduce, Storm, S4. Topics: operator semantics; event time & progress representations; synopses & sketches; load management; high availability; scheduling; data parallelism; exactly-once fault-tolerance; state management; general-purpose languages; iterations; UDFs. Also shown: ML, Graphs, FaaS, Edge, Modern hardware.
  • 66. From stream data management To distributed dataflows And beyond... Vasiliki (vasia) Kalavri (vkalavri@bu.edu)
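
The heuristic policy quoted on slide 51 ("if CPU > 80% => scale") can be sketched in a few lines of Python. This is a minimal illustration only, not code from the talk or from the cited systems. The function names, the one-instance scaling step, and the rate-based alternative (which assumes capacity grows linearly with the number of instances) are all assumptions, added to contrast the heuristic with the accuracy and stability goals listed on slide 52.

```python
import math


def threshold_policy(cpu_utilization, parallelism, threshold=0.8):
    """Hypothetical heuristic: add one instance when CPU exceeds the threshold.

    The policy reacts to a symptom (high CPU) and does not know how many
    instances are actually needed, so it may take several rounds to converge.
    """
    if cpu_utilization > threshold:
        return parallelism + 1
    return parallelism


def rate_based_policy(target_rate, per_instance_rate):
    """Hypothetical alternative: size the operator from measured rates.

    Assumes each instance sustains `per_instance_rate` records/s and that
    capacity scales linearly with the number of instances.
    """
    if per_instance_rate <= 0:
        raise ValueError("per_instance_rate must be positive")
    return max(1, math.ceil(target_rate / per_instance_rate))


if __name__ == "__main__":
    # An operator with 4 instances running at 90% CPU grows by one instance.
    print(threshold_policy(0.9, 4))
    # 1000 records/s at 300 records/s per instance needs ceil(3.33) instances.
    print(rate_based_policy(1000, 300))
```

Under these assumptions the threshold policy moves one instance at a time no matter how far the operator is behind, while the rate-based sketch jumps straight to the smallest parallelism that covers the target rate.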