Stephan Ewen
@stephanewen
Streaming Analytics
with Apache Flink 1.0
Apache Flink Stack
2
[Diagram: Libraries on top; the DataStream API (Stream Processing) and the DataSet API (Batch Processing) side by side; underneath, the Runtime, a distributed streaming dataflow engine]
Streaming and batch as first class citizens.
Today
3
Streaming and batch as first class citizens.
[Same stack diagram; this talk focuses on the DataStream API (stream processing) side]
4
Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams.
5
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
Continuous Processing with Batch
 Continuous ingestion
 Periodic (e.g., hourly) files
 Periodic batch jobs
6
λ Architecture
 "Batch layer": what
we had before
 "Stream layer":
approximate early
results
7
A Stream Processing Pipeline
8
collect → log → analyze → serve & store
A brief History of Flink
9
[Timeline: January ‘10, Project Stratosphere (Flink precursor) begins; April ‘14, Flink project enters incubation; December ‘14, top-level project; releases v0.5 through v0.10 along the way; March ‘16, Release 1.0]
A brief History of Flink
10
[Same timeline, annotated: the Stratosphere years were "the academia gap: reading/writing papers, teaching, worrying about thesis", followed by "realizing this might be interesting to people beyond academia (even more so, actually)"]
Programs and Dataflows
11
[Figure: a program is a dataflow: Source → Transformation → Transformation → Sink]
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))
val events: DataStream[Event] = lines.map((line) => parse(line))
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())
stats.addSink(new RollingSink(path))
[Figure: the corresponding parallel streaming dataflow: Source [1],[2] → map() [1],[2] → keyBy()/window()/apply() [1],[2] → Sink [1]]
What makes Flink flink?
12
 Low latency & high throughput, with well-behaved flow control (back pressure)
 True streaming that works on real-time and historic data
 Event time support
 Stateful streaming: windows & user-defined state, globally consistent savepoints, exactly-once semantics for fault tolerance
 Flexible windows (time, count, session, roll-your-own)
 APIs & libraries, e.g. Complex Event Processing, to make more sense of data
Streaming Analytics by Example
13
Time-Windowed Aggregations
14
case class Event(sensor: String, measure: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("sensor")
.timeWindow(Time.seconds(5)) // tumbling window: every 5 seconds, over the last 5 seconds
.sum("measure")
Time-Windowed Aggregations
15
case class Event(sensor: String, measure: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("sensor")
.timeWindow(Time.seconds(60), Time.seconds(5)) // sliding window: every 5 seconds, over the last 60 seconds
.sum("measure")
Session-Windowed Aggregations
16
case class Event(sensor: String, measure: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("sensor")
.window(EventTimeSessionWindows.withGap(Time.seconds(60))) // a session closes after 60 s without events
.max("measure")
(The session-window assigner used above is Flink 1.1 syntax; session windows are not yet part of the 1.0 release.)
Pattern Detection
18
case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String)
val stream: DataStream[Event] = env.addSource(…)
stream
  .keyBy("producer")
  .flatMap(new RichFlatMapFunction[Event, Alert]() {
    // keyed state: current position in a simple two-state machine
    lazy val state: ValueState[Int] = getRuntimeContext.getState(…)
    def flatMap(event: Event, out: Collector[Alert]) = {
      val newState = state.value() match {
        case 0 if (event.evtType == 0) => 1
        case 1 if (event.evtType == 1) => 0
        case x => out.collect(Alert(event.msg)); 0
      }
      state.update(newState)
    }
  })
Pattern Detection
19
The same program, annotated: the ValueState declared inside the flatMap function lives in Flink's embedded key/value state store, scoped to the current key. A hypothetical expansion of the elided state descriptor follows.
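The getState(…) call is elided on the slide. A hypothetical sketch of the descriptor it might take; the state name and default value are illustrative, not from the talk:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}

// A descriptor names the state, declares its type, and sets the default
// value returned before the first update for a given key.
lazy val state: ValueState[Int] = getRuntimeContext.getState(
  new ValueStateDescriptor[Int]("pattern-fsm", classOf[Int], 0))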
Many more
 Joining streams (e.g. combining readings from sensors; see the sketch after this list)
 Detecting patterns (CEP)
 Applying (changing) rules or models to events
 Training and applying online machine learning models
 …
20
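To make the first bullet concrete, here is a hedged sketch of a windowed stream join; the event types, sources, and window size are illustrative, not from the talk:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Temperature(sensor: String, celsius: Double)
case class Power(sensor: String, watts: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val temps: DataStream[Temperature] = env.addSource(temperatureSource) // hypothetical source
val power: DataStream[Power] = env.addSource(powerSource) // hypothetical source

// Combine readings from the same sensor that fall into the same 10 s window.
val joined: DataStream[(String, Double, Double)] = temps
  .join(power)
  .where(_.sensor)
  .equalTo(_.sensor)
  .window(TumblingEventTimeWindows.of(Time.seconds(10)))
  .apply { (t, p) => (t.sensor, t.celsius, p.watts) }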
(It's) About Time
21
22
The biggest change in moving from batch to streaming is handling time explicitly.
Example: Windowing by Time
23
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Different Notions of Time
25
[Figure: an event is created at the Event Producer (event time), travels through a partitioned Message Queue, enters Flink at the Data Source (ingestion time), and is processed by the Window Operator (window processing time)]
Event Time vs. Processing Time
26
[Figure: the Star Wars saga; in processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015) the order is Episodes IV, V, VI, I, II, III, VII, while in event time (story order) it is Episodes I through VII]
Out of order Streams
27
[Figure: events occur on devices, are stored in a queue/log, and are later analyzed in a data streaming system, so the order of analysis need not match the order of occurrence]
Out of order Streams
28–31
[Figure sequence: a first burst of events and a second burst of events arrive interleaved in the stream; out of order!]
Out of order Streams
32
[Figure: event time windows group each burst correctly; arrival time windows split the bursts by when they arrive; instant event-at-a-time processing reacts to each event immediately]
Processing Time
33
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(ProcessingTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Window by operator's processing time
Ingestion Time
34
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(IngestionTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Event Time
35
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Event Time
36
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignAscendingTimestamps(_.timestamp)
tsStream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Event Time
37
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignTimestampsAndWatermarks(
new MyTimestampsAndWatermarkGenerator())
tsStream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Watermarks
38
[Figure: a stream in order, with events and watermarks W(11), W(17), and a stream out of order, with watermarks W(11), W(20); a watermark W(n) declares that no further events with a timestamp below n are expected]
Watermarks in Parallel
39
[Figure: two parallel sources, each generating watermarks, feeding map and window operators; each operator tracks the event time of its input streams, and its own event-time clock advances to the minimum of the watermarks across all inputs (e.g. inputs at 33 and 17 give the operator an event time of 17)]
Mixing Event Time & Processing Time
40
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignAscendingTimestamps(_.timestamp)
tsStream
.keyBy("id")
.window(SlidingEventTimeWindows.of(seconds(15), seconds(5)))
.trigger(new MyTrigger())
.sum("measure")
Window Triggers
 React to any combination of:
• Event time
• Processing time
• Event data
 Example of a mixed event-time / processing-time trigger:
• Trigger when event time reaches the window end, OR
• when processing time reaches the window end plus 30 secs.
41
Trigger example
42
.sum("measure")
public class EventTimeTrigger extends Trigger<Object, TimeWindow> {
public TriggerResult onElement(Object evt, long time,
TimeWindow window, TriggerContext ctx) {
ctx.registerEventTimeTimer(window.maxTimestamp());
ctx.registerProcessingTimeTimer(window.maxTimestamp() + 30000);
return TriggerResult.CONTINUE;
}
public TriggerResult onEventTime(long time, TimeWindow w, TriggerContext ctx) {
return TriggerResult.FIRE_AND_PURGE;
}
public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
return TriggerResult.FIRE_AND_PURGE;
}
Trigger example
43
The same trigger, changed so that an early processing-time firing emits the window but keeps its contents for the final event-time firing:

  public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
    return TriggerResult.FIRE; // the slide's FIRE_AND_CONTINUE; FIRE fires without purging
  }
Per Kafka Partition Watermarks
44
[Figure: the same parallel setup as before, but the watermarks are generated inside the Kafka source, one per Kafka partition; the per-partition watermarks are merged, like parallel input streams, as the source emits its stream]
Per Kafka Partition Watermarks
45
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val kafka = new FlinkKafkaConsumer09(topic, schema, props)
kafka.assignTimestampsAndWatermarks(
new MyTimestampsAndWatermarkGenerator())
val stream: DataStream[Event] = env.addSource(kafka)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Matters of State
(Fault Tolerance, Reinstatements, etc)
46
Back to the Aggregation Example
47
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(
new FlinkKafkaConsumer09(topic, schema, properties))
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Stateful: the pending windows are state that must survive failures and restarts.
Fault Tolerance
 Prevent data loss (reprocess lost in-flight events)
 Recover state consistency (exactly-once semantics)
• Pending windows & user-defined (key/value) state
 Checkpoint-based fault tolerance
• Periodically create checkpoints
• Recovery: resume from the last completed checkpoint
• Asynchronous Barrier Snapshots (ABS) algorithm (a configuration sketch follows)
48
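A minimal configuration sketch, assuming a 10-second checkpoint interval (exactly-once is the default mode):

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000) // draw a distributed snapshot every 10 s
// On failure, Flink rolls back to the last completed checkpoint and
// replays the sources from the positions stored in it.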
Checkpoints
49
[Figure: a data stream with newer records on the left and older records on the right; checkpoints capture the state of the whole dataflow at point X and, later, at point Y]
Checkpoint Barriers
 Markers, injected into the streams
50
Checkpoint Procedure
51–52
[Figure sequence: checkpoint barriers flow with the events; when an operator has received the barrier on all of its inputs, it snapshots its state and forwards the barrier downstream; the checkpoint completes once all operators have acknowledged their snapshots]
Savepoints
 A "Checkpoint" is a globally consistent point-in-time snapshot
of the streaming program (point in stream, state)
 A "Savepoint" is a user-triggered retained checkpoint
 Streaming programs can start from a savepoint
53
Savepoint B Savepoint A
(Re)processing data (in batch)
 Re-processing data (what-if exploration, to correct bugs, etc.)
 Usually done by running a batch job over a set of old files
 Needs tools that map files to times
54
[Figure: a collection of files by ingestion time, hourly from 2016-3-1 12:00 am through 2016-3-12 1:00 am, sent to the batch processor]
Unclear Batch Boundaries
55
[Same figure: at the boundaries, e.g. around 2016-3-11 11:00pm, it is unclear which files belong to which batch; and what about sessions that span across batches?]
(Re)processing data (streaming)
 Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
 Reprocess by starting a new job from a savepoint
• Defines the start position in the stream (for example Kafka offsets)
• Initializes pending state (like partial sessions)
56
[Figure: a savepoint in the stream, with a new streaming program running from that savepoint]
Continuous Data Sources
57
[Figure: two continuous sources with savepoints drawn directly in them:
a stream of Kafka partitions, where a savepoint stores Kafka offsets + operator state, and
a stream view over a sequence of files, where a savepoint stores file modification timestamp + file position + operator state]
WIP (target: Flink 1.1)
Upgrading Programs
 A program starting from a savepoint can differ from the program that created the savepoint
• Unique operator names match state to operators (a naming sketch follows)
 The mechanism can be used to fix bugs in programs, to evolve programs, parameters, libraries, …
58
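A hedged sketch of the naming idea, assuming stable operator names are what matches savepoint state to operators; the name itself is illustrative:

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
  .name("measure-aggregation") // keep this name stable across program versions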
State Backends
 Large state is a collection of key/value pairs
 The state backend defines which data structure holds the state, plus how it is snapshotted
 Most common choices (a configuration sketch follows):
• Main memory – snapshots to the master
• Main memory – snapshots to a distributed filesystem
• RocksDB – snapshots to a distributed filesystem
59
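A configuration sketch for the second choice, main memory with snapshots to a distributed filesystem; the checkpoint path is an assumption:

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Keep working state on the JVM heap; write snapshots to HDFS.
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))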
Complex Event Processing Primer
60
Example: Temperature Monitoring
 Receiving temperature and power events from sensors
 Looking for temperatures repeatedly exceeding thresholds within a short time period (10 secs)
61
Event Types
62
Defining Patterns
63
Generating Alerts
64
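The three slides above were code slides whose content did not survive extraction. A minimal sketch of the three steps, using Flink's Java CEP API from Scala; all names are illustrative and the exact 1.0-era API may differ in detail:

import java.util.{Map => JMap}
import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.cep.{CEP, PatternSelectFunction}
import org.apache.flink.cep.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Event types (slide 62); hypothetical shapes
case class TemperatureEvent(rackId: Int, temperature: Double)
case class TemperatureAlert(rackId: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tempStream: DataStream[TemperatureEvent] = env.addSource(…)

// Defining patterns (slide 63): two readings above 100.0 within 10 seconds
val warningPattern: Pattern[TemperatureEvent, _] =
  Pattern.begin[TemperatureEvent]("first")
    .where(new FilterFunction[TemperatureEvent] {
      override def filter(e: TemperatureEvent): Boolean = e.temperature > 100.0
    })
    .next("second")
    .where(new FilterFunction[TemperatureEvent] {
      override def filter(e: TemperatureEvent): Boolean = e.temperature > 100.0
    })
    .within(Time.seconds(10))

// Generating alerts (slide 64): turn each matched pair into an alert
val alerts = CEP.pattern(tempStream.keyBy("rackId").javaStream, warningPattern)
  .select(new PatternSelectFunction[TemperatureEvent, TemperatureAlert] {
    override def select(pattern: JMap[String, TemperatureEvent]): TemperatureAlert =
      TemperatureAlert(pattern.get("first").rackId)
  })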
An Outlook on Things to Come
65
Flink in the wild
66
[Figure: logos of production users, with headline numbers: 30 billion events daily; 2 billion events on 10 machines (1 GbE); a data integration & distribution platform; the slide points to talks by these users]
Roadmap
 Dynamic Scaling, Resource Elasticity
 Stream SQL
 CEP enhancements
 Incremental & asynchronous state snapshotting
 Mesos support
 More connectors, end-to-end exactly once
 API enhancements (e.g., joins, slowly changing inputs)
 Security (data encryption, Kerberos with Kafka)
67
68
I stream, do you?