Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Robert Metzger
Flink committer
@rmetzger_
Apache Flink

1 year of Flink - code
[Figure: the Flink code base in April 2014 and in April 2015]

Community growth
[Figure: number of unique contributors by git commits, Aug 2010 to Jul 2015 (y-axis: 0 to 120)]

What is Flink?
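[Figure: the Flink stack]
• Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP)
• APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
• Runtime: streaming dataflow runtime
• Execution environments: Local, Remote, Yarn, Tez, Embedded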

Program compilation
    case class Path (from: Long, to: Long)

    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") {
            (path, edge) =>
              Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }
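To make the semantics of the program above concrete, here is a minimal sketch of the same bounded transitive-closure loop on plain Scala collections instead of Flink DataSets. The object name TransitiveClosureSketch, the helpers step and closure, and the sample edges are illustrative assumptions and not part of the Flink API.

```scala
object TransitiveClosureSketch {
  case class Path(from: Long, to: Long)

  // One step: extend every known path by one edge (join on path.to == edge.from),
  // then keep the old paths as well. Using a Set plays the role of distinct().
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val extended = for {
      path <- paths
      edge <- edges
      if path.to == edge.from
    } yield Path(path.from, edge.to)
    extended ++ paths
  }

  // Apply the step function a fixed number of times, like iterate(10) on the slide.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample graph: 1 -> 2 -> 3 -> 4
    val edges = Set(Path(1, 2), Path(2, 3), Path(3, 4))
    closure(edges, 10).toSeq.sortBy(p => (p.from, p.to)).foreach(println)
  }
}
```

The sketch runs entirely in one process. In Flink the same step function would be compiled into a dataflow and executed by the runtime rather than by a client-side loop.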
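[Figure: program compilation]
• Pre-flight (Client): the program passes through the type extraction stack and the optimizer, yielding a dataflow graph.
• Master: task scheduling and dataflow metadata; deploys operators and tracks intermediate results.
• Workers: run the operators.
• Example plan: DataSource (orders.tbl) -> Filter -> Map and DataSource (lineitem.tbl) feed a hybrid-hash Join (build HT / probe, hash-part [0] on both inputs), followed by a sort-based GroupRed (forward).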

Native workload support
Workloads Flink targets:
• Streaming topologies
• Long batch pipelines
• Machine Learning at scale
• Graph Analysis
How can an engine natively support all these workloads?
And what does "native" mean?

E.g.: Non-native iterations
[Figure: a client submits Step, Step, Step, Step, Step as separate jobs]
    for (int i = 0; i < maxIterations; i++) {
        // Execute MapReduce job
    }

E.g.: Non-native streaming
[Figure: a stream discretizer cuts the stream into a sequence of jobs: Job, Job, Job, Job]
    while (true) {
        // get next few records
        // issue batch job
    }

Native workload support
Workloads Flink targets:
• Streaming topologies
• Heavy batch jobs
• Machine Learning at scale
How can an engine natively support all these workloads?
And what does "native" mean?

Flink Engine
1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows
3. Allow some mutable state
4. Operate on managed memory

Data Streaming Analysis
streaming dataflows

3 Parts of a Streaming Infrastructure
Gathering -> Broker -> Analysis
Gathered sources: sensors, transaction logs, server logs, …

Result may be fed back to the broker

Cornerstones of Flink Streaming
• Pipelined stream processor (low latency)
• Expressive APIs
• Flexible operator state, streaming windows
• Efficient fault tolerance for streams and state

Pipelined stream processor
Streaming shuffle!

Expressive APIs
    case class Word (word: String, frequency: Int)

DataStream API (streaming):

    val lines: DataStream[String] = env.socketTextStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
      .groupBy("word").sum("frequency")
      .print()

DataSet API (batch):

    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()
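The streaming variant differs from the batch one only in the window clause: counts are computed over the last 5 seconds and re-evaluated every second. The following is a minimal sketch of that sliding-window count on plain Scala collections. It assumes that every event carries a timestamp in seconds and that a window evaluated at time t covers the interval (t - size, t]; the exact boundary semantics in Flink may differ. Event, windowedCounts and the sample data are hypothetical names, not Flink API.

```scala
object SlidingWindowSketch {
  case class Event(timestampSec: Long, word: String)

  // Count words per window. A window evaluated at time t is assumed to cover
  // the interval (t - sizeSec, t]; one window is evaluated every slideSec seconds
  // between the first and the last event timestamp.
  def windowedCounts(events: Seq[Event], sizeSec: Long, slideSec: Long): Seq[(Long, Map[String, Int])] =
    if (events.isEmpty) Seq.empty
    else {
      val first = events.map(_.timestampSec).min
      val last = events.map(_.timestampSec).max
      (first to last by slideSec).map { t =>
        val inWindow = events.filter(e => e.timestampSec > t - sizeSec && e.timestampSec <= t)
        val counts = inWindow.groupBy(_.word).map { case (word, es) => word -> es.size }
        t -> counts
      }
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: (timestamp in seconds, word)
    val events = Seq(Event(0, "a"), Event(1, "b"), Event(2, "a"), Event(6, "a"))
    windowedCounts(events, sizeSec = 5, slideSec = 1).foreach {
      case (t, counts) => println(s"t=$t $counts")
    }
  }
}
```

Because the windows overlap, one event contributes to several consecutive results, which is why the slide's program emits a fresh count every second rather than every five seconds.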

Checkpointing / Recovery
Chandy-Lamport algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriers through the data flow
[Figure: a barrier travelling through the data stream]
• Records before the barrier are part of the snapshot
• Records after the barrier are not in the snapshot
• Operator states shown: checkpoint starting, checkpoint in progress (backup till next snapshot), checkpoint done
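The barrier rule on this slide can be sketched for a single operator with a single input. The code below is a simplified illustration, not Flink's implementation: it leaves out multiple inputs, barrier forwarding and recovery, and the names StreamElement, Record, Barrier and run are hypothetical. It shows only that the state captured at a barrier reflects exactly the records that arrived before it.

```scala
object BarrierSnapshotSketch {
  sealed trait StreamElement
  case class Record(word: String) extends StreamElement
  case class Barrier(checkpointId: Long) extends StreamElement

  // Feed the stream through a word-counting operator. Whenever a barrier arrives,
  // copy the current state into the snapshot for that checkpoint id.
  // Returns (final state, snapshots keyed by checkpoint id).
  def run(stream: Seq[StreamElement]): (Map[String, Int], Map[Long, Map[String, Int]]) =
    stream.foldLeft((Map.empty[String, Int], Map.empty[Long, Map[String, Int]])) {
      case ((state, snapshots), Record(word)) =>
        (state.updated(word, state.getOrElse(word, 0) + 1), snapshots)
      case ((state, snapshots), Barrier(id)) =>
        (state, snapshots.updated(id, state))
    }

  def main(args: Array[String]): Unit = {
    // The last record arrives after barrier 1 and is therefore not in snapshot 1.
    val stream = Seq(Record("a"), Record("b"), Barrier(1), Record("a"))
    val (state, snapshots) = run(stream)
    println(s"final state: $state")
    println(s"snapshots:   $snapshots")
  }
}
```

Since the state is an immutable map, taking the snapshot is just keeping a reference to it, so processing can continue right after the barrier.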

Long batch pipelines
Batch on Streaming

Batch on Streaming
• Batch programs are a special kind of streaming program

Streaming Programs          Batch Programs
------------------          --------------
Infinite streams            Finite streams
Stream windows              Global view
Pipelined data exchange     Pipelined or blocking exchange

Batch Pipelines
Data exchange (shuffle / broadcast) is mostly streamed
Some operators block (e.g. sorts / hash tables)

Operators Execution Overlaps

Smooth out-of-core performance
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core

Table API
    val customers = env.readCsvFile(…).as('id, 'mktSegment)
      .filter("mktSegment = AUTOMOBILE")

    val orders = env.readCsvFile(…)
      .filter( o => dateFormat.parse(o.orderDate).before(date) )
      .as("orderId, custId, orderDate, shipPrio")

    val items = orders
      .join(customers).where("custId = id")
      .join(lineitems).where("orderId = id")
      .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) - discount) as revenue")

    val result = items
      .groupBy("orderId, orderDate, shipPrio")
      .select("orderId, revenue.sum, orderDate, shipPrio")

Machine Learning Algorithms
Iterative data flows

Iterate by looping
• for/while loop in client submits one job per iteration step
• Data reuse by caching in memory and/or disk
[Figure: a client submits Step, Step, Step, Step, Step as separate jobs]

Example: Matrix Factorization
Factorizing a matrix with 28 billion ratings for recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html

Graph Analysis
Stateful Iterations

Iterate natively with state/deltas

Effect of delta iterations…
[Figure: number of elements updated (y-axis, 0 to 45,000,000) per iteration (x-axis, 1 to 61)]
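The chart plots how many elements are updated in each iteration. Here is a minimal sketch of why that number shrinks when only changed elements are revisited. It assumes connected components by label propagation as the workload (the slide does not say which algorithm produced the chart) and runs on plain Scala collections. The names connectedComponents, solution and workset are illustrative, and every edge endpoint is assumed to be listed in vertices.

```scala
object DeltaIterationSketch {
  // Connected components by label propagation with a workset: only vertices whose
  // label changed in the previous step are processed again.
  // Returns the final labels and the number of updated vertices per iteration.
  def connectedComponents(vertices: Set[Int], edges: Set[(Int, Int)]): (Map[Int, Int], List[Int]) = {
    // Undirected adjacency lists built from the edge set.
    val neighbours: Map[Int, Set[Int]] =
      edges.toSeq.flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1).map { case (v, ns) => v -> ns.map(_._2).toSet }

    @annotation.tailrec
    def loop(solution: Map[Int, Int], workset: Set[Int], updates: List[Int]): (Map[Int, Int], List[Int]) =
      if (workset.isEmpty) (solution, updates.reverse)
      else {
        // Candidate labels sent from changed vertices to their neighbours.
        val candidates = for {
          v <- workset.toSeq
          n <- neighbours.getOrElse(v, Set.empty[Int]).toSeq
        } yield n -> solution(v)
        // Keep the smallest candidate per vertex, and only if it improves the current label.
        val delta = candidates.groupBy(_._1)
          .map { case (n, cs) => n -> cs.map(_._2).min }
          .filter { case (n, label) => label < solution(n) }
        // The delta updates the solution set and becomes the next workset.
        loop(solution ++ delta, delta.keySet, delta.size :: updates)
      }

    // Every vertex starts in its own component and in the workset.
    loop(vertices.map(v => v -> v).toMap, vertices, Nil)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: a chain 1-2-3-4-5 plus the isolated vertex 6.
    val (labels, updates) = connectedComponents(Set(1, 2, 3, 4, 5, 6), Set((1, 2), (2, 3), (3, 4), (4, 5)))
    println(s"labels:  $labels")
    println(s"updates: $updates")
  }
}
```

A bulk iteration would recompute all labels in every step. Here the work per step is proportional to the size of the workset, which mirrors the falling curve in the figure.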

… fast graph analysis
More at: http://data-artisans.com/data-analysis-with-flink.html

Flink Roadmap for 2015
Some examples:
• More flexible state and state backends in streaming
• Master failover
• Improved monitoring
• Integration with other Apache projects
    • SAMOA, Zeppelin, Ignite
• More additions to the libraries

Flink Forward registration and call for abstracts are open now
flink.apache.org
• 12 and 13 October 2015
• Kulturbrauerei Berlin
• With Flink workshops/training!

Examples of optimization
• Task chaining
    • Coalesce map/filter/etc. tasks
• Join optimizations
    • Broadcast/partition, build/probe side, hash or sort-merge
• Interesting properties
    • Re-use partitioning and sorting for later operations
• Automatic caching
    • E.g., for iterations
