Stream processing: choosing the right tool for the job
Giselle van Dongen
#UnifiedDataAnalytics #SparkAISummit
Context
Disclaimer
a.k.a. I will not pick a stream processing framework for you
Commonalities
Imagine...
Do we need stream processing?
How much data?
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
Does it need to be fast?
Event-driven
https://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
Micro-batching
https://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
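The difference between the two models can be sketched in a few lines of Python. This is a toy illustration, not how any of the frameworks is implemented: the `Event` shape, the function names and the 1-second batch interval are assumptions made for the example, and processing time itself is ignored so that only the waiting time caused by batching shows up.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical event shape: (arrival time in seconds, payload).
Event = Tuple[float, str]


def event_driven(events: Iterable[Event],
                 handle: Callable[[str], str]) -> List[Tuple[float, str]]:
    """Handle each event as soon as it arrives: output time == arrival time."""
    return [(arrival, handle(payload)) for arrival, payload in events]


def micro_batched(events: Iterable[Event],
                  handle: Callable[[str], str],
                  interval: float) -> List[Tuple[float, str]]:
    """Buffer events and handle them when their batch interval closes."""
    out = []
    for arrival, payload in events:
        # An event arriving at time t is only emitted at the end of its batch.
        batch_end = (int(arrival // interval) + 1) * interval
        out.append((batch_end, handle(payload)))
    return out


events = [(0.2, "a"), (0.9, "b"), (1.1, "c")]
per_event = event_driven(events, str.upper)
per_batch = micro_batched(events, str.upper, interval=1.0)
print(per_event)  # [(0.2, 'A'), (0.9, 'B'), (1.1, 'C')]
print(per_batch)  # [(1.0, 'A'), (1.0, 'B'), (2.0, 'C')]
```

In this sketch the event that arrives at 0.2 s waits 0.8 s for its batch to close, which is the latency cost that micro-batching trades for per-batch processing.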
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
Performance
Advanced features
Deployment & Internals
Who will build it?
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
Is accurate ordering important?
[Figure: timeline with events 4, 7, 10 and 8, divided into window 1, window 2, window 3]
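One way to read the timeline is that event 8 arrives after event 10, although by its own timestamp it belongs to an earlier window. The sketch below walks through that case with tumbling event-time windows and a watermark. Everything numeric here is an assumption for illustration: a window size of 3 (so that 4, 7 and 10 land in three consecutive windows), the arrival order 4, 7, 10, 8, and the `allowed_lateness` parameter. The function names are hypothetical and do not correspond to any framework's API.

```python
from collections import defaultdict


def window_start(ts, size):
    """Start of the tumbling window [start, start + size) that contains ts."""
    return ts - ts % size


def run(arrivals, size, allowed_lateness):
    """Assign events to event-time windows in arrival order.

    The watermark trails the highest timestamp seen by `allowed_lateness`.
    A window is closed once the watermark passes its end; events for an
    already closed window are dropped.
    """
    open_windows = defaultdict(list)
    closed, dropped = {}, []
    watermark = float("-inf")
    for ts in arrivals:
        start = window_start(ts, size)
        if start + size <= watermark:
            dropped.append(ts)            # too late: window already closed
        else:
            open_windows[start].append(ts)
        watermark = max(watermark, ts - allowed_lateness)
        for s in sorted(open_windows):    # close every window the watermark passed
            if s + size <= watermark:
                closed[s] = open_windows.pop(s)
    return closed, dropped, dict(open_windows)


arrivals = [4, 7, 10, 8]                  # assumed arrival order: 8 is late
strict = run(arrivals, size=3, allowed_lateness=0)
lenient = run(arrivals, size=3, allowed_lateness=3)
print(strict)   # ({3: [4], 6: [7]}, [8], {9: [10]})
print(lenient)  # ({3: [4]}, [], {6: [7, 8], 9: [10]})
```

With no allowed lateness the window covering 6 to 9 is already closed when 8 shows up, so the event is lost to that window. Allowing 3 units of lateness keeps the window open long enough to include it, at the price of emitting results later.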
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
What is the ecosystem like?
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
How do we want it to run?
[Figure: diagram with one M, five W and five T boxes]
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
What if a message gets lost?
[Figure: diagram with five State boxes and HDFS]
ref. Flink Forward 2018 Best practices for state and time, Tzu-Li Tai
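A common way to survive a failure without losing or double-counting messages is to snapshot operator state together with the input position to durable storage, then restore both and replay from that position. The sketch below illustrates the idea only, under loud assumptions: a local JSON file stands in for HDFS, a Python list stands in for a replayable source, and `checkpoint`, `restore`, `run` and the `fail_at` parameter are hypothetical names invented for this example.

```python
import json
import os
import tempfile


def checkpoint(path, offset, state):
    """Persist state and input offset together, atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: a snapshot is never half-written


def restore(path):
    """Return (offset, state) from the last snapshot, or a fresh start."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        snap = json.load(f)
    return snap["offset"], snap["state"]


def run(log, path, every, fail_at=None):
    """Count keys in a replayable log, checkpointing every `every` events."""
    offset, counts = restore(path)
    while offset < len(log):
        if fail_at is not None and offset == fail_at:
            raise RuntimeError("simulated worker failure")
        key = log[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % every == 0:
            checkpoint(path, offset, counts)
    return counts


log = ["a", "b", "a", "c", "a", "b"]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run(log, path, every=2, fail_at=5)   # crashes after the snapshot at offset 4
except RuntimeError:
    pass
result = run(log, path, every=2)         # restores offset 4 and replays from there
print(result)  # {'a': 3, 'b': 2, 'c': 1}
```

The event at offset 4 is processed twice overall, once before the crash and once after the restore, but because state and offset are rolled back together it is counted only once in the final result.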
[Figure: framework comparison (Spark, Flink, Kafka, Struct)]
Want to know more?
THANK YOU!
Do you want to work with these tools?
We are hiring!
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
