Re-introducing the Stream Processor
A Universal Tool for Continuous Data Analysis
Paris Carbone
Committer @ Apache Flink
PhD Candidate @ KTH
Data Stream Processors
A data stream processor can set up any data pipeline for you
http://edge.alluremedia.com.au/m/l/2014/10/CoolingPipes.jpg
Is this really a step forward in data processing?
A growing open-source ecosystem, e.g., Kafka, Flink, Beam, Apex
General idea of the tech:
• Runs pipelined computations on a cluster
• Computation is continuous and parallel (like data)
• Event-processing logic <-> Application state
• It’s production-ready and aims to simplify analytics
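The link between event-processing logic and application state can be illustrated with a minimal sketch. It is plain Python rather than any real stream processor's API, and the function name `count_per_key` is hypothetical: each incoming event updates per-key state, and an updated result is emitted immediately.

```python
from collections import defaultdict

def count_per_key(events):
    """Update per-key state for every event and emit the new count right away."""
    state = defaultdict(int)      # application state, one counter per key
    for key in events:            # events are consumed one at a time
        state[key] += 1           # event-processing logic mutates the state
        yield key, state[key]     # each output reflects the state so far

updates = list(count_per_key(["a", "b", "a"]))
```

In a real system the state would be partitioned across the cluster by key; here it lives in a single dictionary.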
4 Aspects of Data Processing
[Diagram labels: complex event processing, fast approximate, streaming, ETL, event logs, production database, rules, data warehouses + historical data, application state + failover, “microservices”, complex analytics, large-scale processing systems, interactive queries, data science, reports. Roles: dev, user, analyst, data engineer]
1. Speed
Low-Latency Data Processing
Traditionally the sole reason stream processing was used
How do stream processors achieve low latency?
• No intermediate scheduling (you let it run)
• No physical blocking (pre-compute on the go)
• Copy-on-write for state and output
CEP semantics etc. are nowadays provided as additional libraries for stream processors
But is this only relevant for live data?
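The "no physical blocking" idea can be sketched with chained Python generators. This is only an analogy (generators pull records, while stream processors typically push them), and the operator names are made up for illustration. The trace shows that every record passes through all operators before the next record is read, so no stage waits for a complete intermediate result.

```python
trace = []

def source():
    for line in ["1", "2", "3"]:
        trace.append(f"read {line}")   # record when each input is consumed
        yield line

def parse(lines):
    for line in lines:
        yield int(line)

def square(numbers):
    for n in numbers:
        yield n * n

def running_sum(numbers):
    total = 0
    for n in numbers:
        total += n                     # pre-compute on the go
        yield total

results = []
for out in running_sum(square(parse(source()))):
    trace.append(f"emit {out}")        # an output appears after every input
    results.append(out)
```

After the loop, `trace` alternates between reads and emits instead of listing all reads first.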
2. History
Offline Data Processing
Possible, and better, for bulk historical data analysis
What can stream processors do for historical data?
• Ability to define custom state to build up models
• Large-scale support is a given (inherits cluster computing benefits)
• Separation of notions of time and out-of-order processing, e.g., session windows, event-time windows
But isn’t it hard to deal with failures in streaming?
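A minimal sketch of event-time session windows over historical data, assuming each record carries its own timestamp and that a session closes after `gap` time units of inactivity. The function name and the sample records are hypothetical. Because the input here is bounded, sorting by event time stands in for the watermark mechanisms real stream processors use on unbounded input.

```python
def session_windows(events, gap):
    """Group (event_time, value) pairs into sessions separated by more than `gap`."""
    sessions = []
    for ts, value in sorted(events):   # order by event time, not arrival order
        if sessions and ts - sessions[-1]["end"] <= gap:
            sessions[-1]["end"] = ts   # extend the currently open session
            sessions[-1]["values"].append(value)
        else:
            sessions.append({"start": ts, "end": ts, "values": [value]})
    return sessions

# Records arrive out of order with respect to their event time.
arrivals = [(12, "c"), (1, "a"), (3, "b"), (30, "e"), (14, "d")]
sessions = session_windows(arrivals, gap=5)
```

The result depends only on the timestamps inside the records, so replaying the same history in a different arrival order yields the same sessions.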
3. Durability
Exactly-Once Data Processing
Traditionally, streaming ~ lossy, approximate processing
This is no longer true. Forget the ‘lambda architecture’.
• Input records are durably stored and indexed in logs (e.g., Kafka)
• Systems handle state snapshotting & transactions with external stores transparently
• Idempotent and transactional writes to external stores
[Diagram: the stream divided into part 1, part 2, part 3, part 4]
E.g., on Flink each stream computation either completes or repeats
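Why idempotent writes matter when a computation repeats can be shown with a small sketch. The `IdempotentSink` class is hypothetical and stands in for any external store that supports keyed upserts: writing the same records twice leaves the store unchanged, whereas an append-only output would contain duplicates.

```python
class IdempotentSink:
    """External store stand-in: writes are upserts keyed by a record id."""
    def __init__(self):
        self.store = {}

    def write(self, record_id, value):
        self.store[record_id] = value   # rewriting the same id has no extra effect

batch = [(1, "x"), (2, "y")]
sink = IdempotentSink()
appended = []                           # non-idempotent output, for contrast

for attempt in range(2):                # second pass simulates a repeat after failure
    for record_id, value in batch:
        sink.write(record_id, value)
        appended.append((record_id, value))
```

The upsert store ends up with two entries while the appended list holds four.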
[Diagram labels: input streams, application states, stream processor, rollback]
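Rollback can be sketched as follows, assuming a replayable input log addressed by offset (as with Kafka) and periodic snapshots that pair an input offset with a copy of the state. The `run` function and its parameters are illustrative and do not reflect Flink's actual snapshotting protocol.

```python
import copy

def run(log, snapshot_every=2, fail_at=None):
    """Aggregate a replayable log, optionally crashing once before offset `fail_at`."""
    state = {"count": 0, "sum": 0}
    snapshot = (0, copy.deepcopy(state))      # (offset to resume from, state copy)
    offset, crashed = 0, False
    while offset < len(log):
        if offset == fail_at and not crashed:
            crashed = True
            offset, saved = snapshot          # rollback: rewind the input...
            state = copy.deepcopy(saved)      # ...and restore the matching state
            continue
        state["count"] += 1
        state["sum"] += log[offset]
        offset += 1
        if offset % snapshot_every == 0:      # take a consistent snapshot
            snapshot = (offset, copy.deepcopy(state))
    return state

log = [5, 1, 4, 2, 3]
clean = run(log)
recovered = run(log, fail_at=3)
```

Because the input offset and the state are restored together, the record processed after the last snapshot is applied again exactly once, and both runs end in the same state.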
4. Interactivity
Querying Data Processing State
Stream Processor ~ Inverse DBMS
Application state holds fresh knowledge we want to query:
• In some systems (e.g., Kafka Streams) we can use the changelog
• In other systems (e.g., Flink) we can query the state externally… or stream queries to a custom query processor built on top of them*
[Diagram: Alice queries “Bob?” and receives “Bob=…”]
*https://techblog.king.com/rbea-scalable-real-time-analytics-king/
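The changelog approach can be sketched as replaying a stream of keyed updates into a queryable view. This is plain Python, not the Kafka Streams API. The `materialize` function, the sample keys and the convention that a `None` value deletes a key are assumptions of the sketch.

```python
def materialize(changelog):
    """Replay (key, value) updates in order; the latest value per key wins."""
    view = {}
    for key, value in changelog:
        if value is None:
            view.pop(key, None)   # assumed convention: None removes the key
        else:
            view[key] = value
    return view

changelog = [("alice", 3), ("bob", 1), ("bob", 2), ("alice", None)]
view = materialize(changelog)
answer = view.get("bob")          # the query "Bob?" is served from the view
```

Any reader that replays the same changelog arrives at the same view, which is what lets state be queried outside the processor.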
4 Aspects of Data Processing
1. Speed
• no physical blocking/staging
• no rescheduling
• efficient pipelining
• copy-on-write data structures
2. History
• different notions of time
• flexible stateful processing
• high throughput
3. Durability
• durable input logging is a standard
• automated state management
• exactly-once processing
• output commit & idempotency
4. Interactivity
• external access to state/changelogs
• ability to ‘stream queries’ over state
@SenorCarbone
Try out Stream Processing
https://flink.apache.org/
https://kafka.apache.org/
https://beam.apache.org/
