Stephan Ewen
@stephanewen
The Stream Processor as a Database
Apache Flink
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
Apache Flink Stack
• DataStream API: stream processing
• DataSet API: batch processing
• Runtime: distributed streaming data flow
• Libraries
Streaming and batch as first class citizens.
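Programs and Dataflows
Source → Transformation → Transformation → Sink

// Source: read raw lines from Kafka
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))
// Transformation: parse each line into an event
val events: DataStream[Event] = lines.map((line) => parse(line))
// Transformation: aggregate per sensor over 5-second windows
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())
// Sink: write the statistics to rolling files
stats.addSink(new RollingSink(path))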
[Figure: streaming dataflow. Two parallel instances each of Source, map(), and keyBy()/window()/apply(), feeding one Sink instance.]
What makes Flink flink?
True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)
Event Time
• Make more sense of data
• Works on real-time and historic data
Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state
APIs & Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing
The (Classic) Use Case
Realtime Counts and Aggregates
(Real)Time Series Statistics
stream of events → realtime statistics
The Architecture
collect → log → analyze → serve & store
The Flink Job
case class Impressions(id: String, impressions: Long)

// Source: read events from Kafka
val events: DataStream[Event] =
  env.addSource(new FlinkKafkaConsumer09(…))

// Keep only impression events and project them to (id, impressions)
val impressions: DataStream[Impressions] = events
  .filter(evt => evt.isImpression)
  .map(evt => Impressions(evt.id, evt.numImpressions))

// Sum impressions per id over one-hour windows
val counts: DataStream[Impressions] = impressions
  .keyBy("id")
  .timeWindow(Time.hours(1))
  .sum("impressions")
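To make the semantics of the job concrete, here is a minimal plain-Scala model of what keyBy("id"), timeWindow(Time.hours(1)), and sum("impressions") compute together. It is not Flink code: the Event fields (in particular timestampMs) and the helper names are assumptions made for this sketch, and it works on a finite batch rather than an unbounded stream.

```scala
// Plain-Scala model of filter + keyBy("id") + one-hour tumbling window + sum.
// Not Flink code: the Event shape and helper names are illustrative assumptions.
object HourlyImpressions {
  final case class Event(id: String, timestampMs: Long, isImpression: Boolean, numImpressions: Long)
  final case class Impressions(id: String, impressions: Long)

  val HourMs: Long = 60L * 60L * 1000L

  // Start of the tumbling one-hour window that a (non-negative) timestamp falls into.
  def windowStart(timestampMs: Long): Long = timestampMs - (timestampMs % HourMs)

  // One aggregate per (window start, id), mirroring the keyed window sum.
  def aggregate(events: Seq[Event]): Map[(Long, String), Impressions] =
    events
      .filter(_.isImpression)
      .groupBy(e => (windowStart(e.timestampMs), e.id))
      .map { case ((start, id), es) =>
        (start, id) -> Impressions(id, es.map(_.numImpressions).sum)
      }
}
```

Each (window, id) pair produces exactly one aggregate, which is why the number of distinct keys drives the write load on whatever store serves the results.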
[Figure: the job as a parallel dataflow. Two parallel pipelines, each Kafka Source → filter() → map() → keyBy() → window()/sum() → Sink.]
Putting it all together
Periodically (every second), flush new aggregates to Redis
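A minimal sketch of the periodic flush, assuming a hypothetical KeyValueStore trait as a stand-in for Redis (no real Redis client is used, and none of these names come from the talk). Updates between flushes are coalesced per key, so each flush writes every changed key once.

```scala
import scala.collection.mutable

object AggregateFlusher {
  // Hypothetical stand-in for an external key/value store such as Redis.
  trait KeyValueStore { def put(key: String, value: Long): Unit }

  // In-memory implementation that also counts writes, for illustration only.
  final class InMemoryStore extends KeyValueStore {
    val data: mutable.Map[String, Long] = mutable.Map.empty
    var writes: Int = 0
    def put(key: String, value: Long): Unit = { data(key) = value; writes += 1 }
  }

  // Buffers the latest aggregate per key and writes only changed keys on flush.
  final class Flusher(store: KeyValueStore) {
    private val dirty = mutable.Map.empty[String, Long]

    def update(key: String, total: Long): Unit = { dirty(key) = total }

    // Intended to be called by a timer (e.g. once per second).
    // Returns the number of keys written in this flush.
    def flush(): Int = {
      val n = dirty.size
      dirty.foreach { case (k, v) => store.put(k, v) }
      dirty.clear()
      n
    }
  }
}
```

Even with coalescing, the number of writes per flush grows with the number of keys that changed in the interval.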
How does it perform?
• Latency
• Throughput
• Number of keys
Yahoo! Streaming Benchmark
[Figure: 99th percentile latency (sec, lower is better) versus throughput (1000 events/sec) for Storm 0.10, Flink 0.10, and Spark Streaming 1.5.]
Extended Benchmark: Throughput
[Figure: throughput results for the following setup.]
• 10 Kafka brokers with 2 partitions each
• 10 compute machines (Flink / Storm)
• Xeon E3-1230-V2@3.30GHz CPU (4 cores HT)
• 32 GB RAM (only 8GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and Flink nodes
Scaling Number of Users
▪ Yahoo! Streaming Benchmark has only 100 keys
  • Every second, only 100 keys are written to the key/value store
  • Quite few, compared to many real-world use cases
▪ Tweet impressions: millions of keys per hour
  • Up to millions of keys updated per second
Performance
[Figure: performance as the number of keys grows.]
The Bottleneck
Writes to the key/value store take too long
Queryable State
Optional, and only at the end of windows
Queryable State Enablers
▪ Flink has state as a first class citizen
▪ State is fault tolerant (exactly-once semantics)
▪ State is partitioned (sharded) together with the operators that create/update it
▪ State is continuous (not mini-batched)
▪ State is scalable (e.g., embedded RocksDB state backend)
Queryable State Status
▪ [FLINK-3779] / Pull Request #2051: Queryable State Prototype
▪ Design and implementation are still evolving
▪ Some experiments used earlier versions of the implementation
▪ Exact numbers may differ in the final implementation, but the order of magnitude is comparable
Queryable State Performance
Queryable State: Application View
• Application only interested in latest realtime results
• Application requires both latest realtime and older results
[Figure labels: Application, Query Service, Database; realtime results (current time windows), older results (past time windows).]
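One way to read the second application view is as a query service that routes each request by window time. The sketch below is an assumption-laden illustration, not the talk's implementation: Lookup, Service, and the routing rule (open windows from realtime state, closed windows from the database) are hypothetical names and choices.

```scala
object QueryService {
  // Hypothetical lookup functions: (key, windowStartMs) => aggregate, if present.
  // One would be backed by the stream processor's queryable state,
  // the other by an external database holding results of closed windows.
  type Lookup = (String, Long) => Option[Long]

  final class Service(realtime: Lookup, historic: Lookup, windowMs: Long) {
    // A window that is still open at `nowMs` is answered from realtime state;
    // every earlier window is answered from the database.
    def query(key: String, windowStartMs: Long, nowMs: Long): Option[Long] = {
      val currentWindowStart = nowMs - (nowMs % windowMs)
      if (windowStartMs >= currentWindowStart) realtime(key, windowStartMs)
      else historic(key, windowStartMs)
    }
  }
}
```

With this split, the database only needs to receive results once per window, when the window closes.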
Apache Flink Architecture Review
Queryable State: Implementation
[Figure: a Query Client, the Job Manager (ExecutionGraph, State Location Server), and two Task Managers, each running window()/sum() with local state and a State Registry. Arrow labels: deploy, status, register.]
Query: /job/operation/state-name/key
(1) Get location of "key-partition" for "operator" of "job"
(2) Look up location
(3) Respond location
(4) Query state-name and key
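The four steps can be sketched as a small in-process model. All class and method names here are illustrative and are not Flink's API; in particular, the key-to-partition mapping (hashCode modulo parallelism) is a simplifying assumption for this sketch.

```scala
object QueryableStateLookup {
  // Illustrative names only; none of this is Flink's actual API.
  final case class StateQuery(job: String, operator: String, stateName: String, key: String)

  // A task manager holds the local state of the key partitions it runs.
  final class TaskManager(val address: String) {
    private var state = Map.empty[(String, String), Long] // (stateName, key) -> value
    def register(stateName: String, key: String, value: Long): Unit =
      state = state.updated((stateName, key), value)
    def query(stateName: String, key: String): Option[Long] =
      state.get((stateName, key))
  }

  // Job-manager side: resolves which task manager owns a key's partition.
  // A real registry would also key on job and operator; here there is one of each.
  final class StateLocationServer(parallelism: Int, managers: Vector[TaskManager]) {
    def partitionOf(key: String): Int = java.lang.Math.floorMod(key.hashCode, parallelism)
    def locate(job: String, operator: String, key: String): TaskManager =
      managers(partitionOf(key) % managers.size)
  }

  final class QueryClient(server: StateLocationServer) {
    def query(q: StateQuery): Option[Long] = {
      val tm = server.locate(q.job, q.operator, q.key) // steps (1)-(3): resolve location
      tm.query(q.stateName, q.key)                     // step (4): query the local state
    }
  }
}
```

The point of the split is that the location lookup touches only metadata, while the value lookup goes straight to the task manager that owns the key.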
Contrasting with key/value stores
Turning the Database Inside Out
▪ Cf. Martin Kleppmann's talks on re-designing data warehousing based on log-centric processing
▪ This perspective picks up some of these concepts
▪ Queryable State in Apache Flink = (Turning DB inside out)++
Write Path in Cassandra (simplified)
[Figure from the Apache Cassandra docs]
• First step is a durable write to the commit log (in all databases that offer strong durability)
• The memtable is a re-computable view of the commit log actions and the persistent SSTables
• Replication to quorum before the write is acknowledged
Durability of Queryable state
• Event sequence is the ground truth and is already durably stored in the log
• Queryable state is re-computable from checkpoint and log
• Snapshot replication can happen in the background
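The claim that state is re-computable from checkpoint and log can be illustrated with a small replay model. This is a sketch under stated assumptions, not Flink's recovery code: Event, Checkpoint, and the per-key running sum are hypothetical, and the log is an in-memory sequence with explicit offsets.

```scala
object Recovery {
  final case class Event(offset: Long, key: String, count: Long)

  // A checkpoint pairs a state snapshot with the first log offset NOT yet reflected in it.
  final case class Checkpoint(nextOffset: Long, state: Map[String, Long])

  // The state transition: a running sum per key.
  def applyEvent(state: Map[String, Long], e: Event): Map[String, Long] =
    state.updated(e.key, state.getOrElse(e.key, 0L) + e.count)

  // Rebuilds the current state by replaying only the log suffix after the checkpoint.
  def recover(cp: Checkpoint, log: Seq[Event]): Map[String, Long] =
    log.filter(_.offset >= cp.nextOffset).foldLeft(cp.state)(applyEvent)
}
```

Because the log already holds every event durably, the snapshot only shortens replay; losing the in-memory state loses no data.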
Performance of Flink's State
[Figure, repeated across four slides: Source / filter() / map() feeding window()/sum(), which keeps a state index (e.g., RocksDB).]
• Events are persistent and ordered (per partition / key) in the log (e.g., Apache Kafka)
• Events flow without replication or synchronous writes
• Trigger checkpoint: inject checkpoint barrier
• Take state snapshot: RocksDB: trigger state copy-on-write
• Persist state snapshots: durably persist snapshots asynchronously; the processing pipeline continues
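The checkpoint sequence above (barrier, snapshot, asynchronous persist) can be sketched in plain Scala. This is an illustration, not Flink's mechanism: an immutable Map stands in for the copy-on-write snapshot, a Future stands in for the background writer, and SumOperator and the persist callback are hypothetical names.

```scala
import scala.concurrent.{ExecutionContext, Future}

object AsyncSnapshot {
  sealed trait Element
  final case class Record(key: String, count: Long) extends Element
  final case class Barrier(checkpointId: Long) extends Element

  // `persist` is a hypothetical callback that durably stores a snapshot.
  final class SumOperator(persist: (Long, Map[String, Long]) => Unit)(implicit ec: ExecutionContext) {
    // With an immutable map, a snapshot is just a reference to the current version;
    // later updates build new versions and never modify the snapshot.
    private var state = Map.empty[String, Long]
    private var pending = List.empty[Future[Unit]]

    def process(e: Element): Unit = e match {
      case Record(k, c) =>
        state = state.updated(k, state.getOrElse(k, 0L) + c)
      case Barrier(id) =>
        val snapshot = state                                // taken synchronously, O(1)
        pending = Future(persist(id, snapshot)) :: pending  // persisted in the background
    }

    def current: Map[String, Long] = state
    def pendingWrites: List[Future[Unit]] = pending
  }
}
```

Records that arrive after the barrier are processed immediately and do not affect the snapshot being persisted.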
Conclusion
Takeaways
▪ Streaming applications are often not bound by the stream processor itself. Cross-system interaction is frequently the biggest bottleneck
▪ Queryable state mitigates a big bottleneck: communication with external key/value stores to publish realtime results
▪ Apache Flink's sophisticated support for state makes this possible
Takeaways
Performance of Queryable State
▪ Data persistence is fast with logs (Apache Kafka)
  • Append-only, with streaming replication
▪ Computed state is fast with local data structures and no synchronous replication (Apache Flink)
▪ Flink's checkpoint method makes computed state persistent with low overhead
Go Flink!
True Streaming, Event Time, Stateful Streaming, APIs & Libraries
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
