Robert Metzger
Flink committer
@rmetzger_
Apache Flink
1 year of Flink - code
April 2014 April 2015
Community growth
3
Chart: #unique contributors by git commits (x-axis: Aug-10 to Jul-15; y-axis: 0 to 120)
What is Flink?
4
Gelly
Table
ML
SAMOA
DataSet (Java/Scala/Python) DataStream (Java/Scala)
Hadoop M/R
Local Remote Yarn Tez Embedded
Dataflow
Dataflow (WiP)
MRQL
Table
Cascading (WiP)
Streaming dataflow runtime
Program compilation
5
case class Path (from: Long, to: Long)
val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) =>
          Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}
Diagram: Program -> Dataflow Graph -> deployed operators
Pre-flight (Client): Type extraction stack, Optimizer
Master: Task scheduling, Dataflow metadata; deploys operators and tracks intermediate results
Workers: process the data
Example plan: DataSource orders.tbl -> Filter -> Map, and DataSource lineitem.tbl, feed a Join (Hybrid Hash; build HT / probe; hash-part [0] on both inputs), followed by GroupRed (sort; forward)
Native workload support
6
Flink: Streaming topologies, Long batch pipelines, Machine Learning at scale, Graph Analysis
How can an engine natively support all these workloads?
And what does "native" mean?
E.g.: Non-native iterations
7
Step Step Step Step Step
Client
for (int i = 0; i < maxIterations; i++) {
// Execute MapReduce job
}
E.g.: Non-native streaming
8
stream
discretizer
Job Job Job Job
while (true) {
// get next few records
// issue batch job
}
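To make the latency cost of the discretizer concrete, here is a minimal sketch in plain Scala (no Flink or Spark APIs involved). The names `Arrival`, `waitingTimes` and `batchSize` are hypothetical and only serve this illustration; the sketch assumes a batch is submitted as soon as its last record has arrived and ignores job start-up time, which would add further delay.

```scala
object MicroBatchSketch {
  /** A record together with the time (in ms) at which it arrived. */
  final case class Arrival(timeMs: Long, payload: String)

  /**
   * Discretizes arrivals into fixed-size batches and reports, per record,
   * how long it waited until its batch was complete and could be submitted.
   */
  def waitingTimes(arrivals: Seq[Arrival], batchSize: Int): Seq[Long] =
    arrivals.grouped(batchSize).toList.flatMap { batch =>
      // The batch can only be issued once its last record is there.
      val batchClosedAt = batch.map(_.timeMs).max
      batch.map(a => batchClosedAt - a.timeMs)
    }
}
```

With `batchSize = 1` every record is handled on arrival and waits 0 ms; larger batches make early records wait for later ones.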
Native workload support
9
Flink: Streaming topologies, Heavy batch jobs, Machine Learning at scale
How can an engine natively support all these workloads?
And what does native mean?
Flink Engine
1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows
3. Allow some mutable state
4. Operate on managed memory
10
Flink by Use Case
11
Data Streaming Analysis
streaming dataflows
12
3 Parts of a Streaming Infrastructure
13
Gathering Broker Analysis
Sensors
Transaction
logs …
Server Logs
3 Parts of a Streaming Infrastructure
14
Gathering Broker Analysis
Sensors
Transaction
logs …
Server Logs
Result may be fed back to the broker
Cornerstones of Flink Streaming
• Pipelined stream processor (low latency)
• Expressive APIs
• Flexible operator state, streaming windows
• Efficient fault tolerance for streams and state
15
Pipelined stream processor
16
Streaming
Shuffle!
Expressive APIs
17
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
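The only difference between the two programs is the `window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))` call, i.e. a 5-second window evaluated every second. The following plain-Scala sketch shows what such a sliding-window word count computes over a small in-memory list. It is an illustration under assumptions, not Flink's implementation: `Event`, `slidingCounts`, the half-open window `[end - size, end)`, and the choice to align windows to the first timestamp are all hypothetical.

```scala
object SlidingWindowSketch {
  final case class Event(timestampMs: Long, word: String)

  /**
   * For every window end (one per slide), counts the words whose timestamps
   * fall into [end - sizeMs, end). Windows are aligned to the first event.
   */
  def slidingCounts(events: Seq[Event], sizeMs: Long, slideMs: Long): List[(Long, Map[String, Int])] =
    if (events.isEmpty) Nil
    else {
      val first = events.map(_.timestampMs).min
      val last = events.map(_.timestampMs).max
      // Emit window ends until the latest event has appeared in at least one window.
      val ends = Iterator.iterate(first + slideMs)(_ + slideMs)
        .takeWhile(end => end - slideMs <= last)
        .toList
      ends.map { end =>
        val inWindow = events.filter(e => e.timestampMs >= end - sizeMs && e.timestampMs < end)
        end -> inWindow.groupBy(_.word).map { case (w, es) => w -> es.size }
      }
    }
}
```

Because the window (5 s) is longer than the slide (1 s), each event contributes to several consecutive results.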
Checkpointing / Recovery
18
Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriers through the data flow
Diagram labels: Data Stream; barrier; Before barrier = part of the snapshot; After barrier = not in snapshot; Operator checkpoint starting; checkpoint in progress (backup till next snapshot); Checkpoint done
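The rule "before the barrier = part of the snapshot, after the barrier = not in the snapshot" can be sketched for a single operator in plain Scala. This is a simplified illustration, not Flink's implementation: it assumes one input channel (so no barrier alignment across inputs), keeps a running sum as the operator state, and stores snapshots in memory. The names `Record`, `Barrier`, `Result` and `run` are hypothetical.

```scala
object BarrierSnapshotSketch {
  sealed trait Element
  final case class Record(value: Long) extends Element
  final case class Barrier(checkpointId: Long) extends Element

  /** Final operator state plus the state captured at each barrier. */
  final case class Result(finalSum: Long, snapshots: Map[Long, Long])

  def run(stream: Seq[Element]): Result =
    stream.foldLeft(Result(0L, Map.empty[Long, Long])) {
      // A record updates the operator state (here: a running sum).
      case (acc, Record(v)) => acc.copy(finalSum = acc.finalSum + v)
      // A barrier captures the state as of all records seen before it.
      case (acc, Barrier(id)) => acc.copy(snapshots = acc.snapshots + (id -> acc.finalSum))
    }
}
```

On recovery, an operator would restore the snapshot of the last completed checkpoint and the source would replay the records that came after the corresponding barrier.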
Long batch pipelines
Batch on Streaming
19
Batch Pipelines
20
Batch on Streaming
• Batch programs are a special kind of streaming program
21
Streaming Programs        | Batch Programs
Infinite Streams          | Finite Streams
Stream Windows            | Global View
Pipelined Data Exchange   | Pipelined or Blocking Exchange
Batch Pipelines
22
Data exchange (shuffle / broadcast) is mostly streamed
Some operators block (e.g. sorts / hash tables)
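The difference between a streamed and a blocking operator can be observed with plain Scala iterators, used here as a stand-in for a pipelined data exchange. This is only an analogy under assumptions, not Flink code; `CountingSource` and `recordsPulledForFirstResult` are hypothetical helper names. The sketch counts how many source records must be read before the first output record is available.

```scala
object PipelinedVsBlockingSketch {
  /** A source that counts how many records downstream operators have pulled. */
  final class CountingSource[A](data: Seq[A]) extends Iterator[A] {
    private val underlying = data.iterator
    var pulled = 0
    def hasNext: Boolean = underlying.hasNext
    def next(): A = { pulled += 1; underlying.next() }
  }

  def recordsPulledForFirstResult(blocking: Boolean): Int = {
    val source = new CountingSource(Seq(5, 3, 8, 1, 9, 2))
    val mapped = source.map(_ * 10)
    val out: Iterator[Int] =
      if (blocking) mapped.toVector.sorted.iterator // a sort must see all input first
      else mapped.filter(_ > 0)                     // map/filter pass records through one by one
    out.next()
    source.pulled
  }
}
```

In the pipelined case the first result needs a single input record, so downstream work overlaps with upstream work; the sort has to consume the whole input before emitting anything.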
Operators Execution Overlaps
23
Memory Management
24
Memory Management
25
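As a rough picture of what operating on managed memory can mean, the sketch below stores records in serialized form inside a pre-allocated byte buffer and compares them on their binary representation. It is a simplified illustration, not Flink's memory manager: the fixed 12-byte layout and the names `Rating`, `RecordBytes`, `write`, `read` and `compareByUser` are assumptions made for this example; only `java.nio.ByteBuffer` is a real API.

```scala
import java.nio.ByteBuffer

object ManagedMemorySketch {
  final case class Rating(user: Int, item: Int, score: Float)

  // Fixed record layout: 4 bytes user, 4 bytes item, 4 bytes score.
  val RecordBytes: Int = 12

  def write(segment: ByteBuffer, slot: Int, r: Rating): Unit = {
    val offset = slot * RecordBytes
    segment.putInt(offset, r.user)
    segment.putInt(offset + 4, r.item)
    segment.putFloat(offset + 8, r.score)
  }

  def read(segment: ByteBuffer, slot: Int): Rating = {
    val offset = slot * RecordBytes
    Rating(segment.getInt(offset), segment.getInt(offset + 4), segment.getFloat(offset + 8))
  }

  /** Compares two records by `user` directly on the bytes, without creating objects. */
  def compareByUser(segment: ByteBuffer, slotA: Int, slotB: Int): Int =
    Integer.compare(segment.getInt(slotA * RecordBytes), segment.getInt(slotB * RecordBytes))
}
```

Since the buffer has a known size, an engine working this way always knows how much memory it uses and can decide when to spill to disk instead of relying on the garbage collector.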
Smooth out-of-core performance
26
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core
Table API
27
val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = AUTOMOBILE")
val orders = env.readCsvFile(…)
  .filter( o => dateFormat.parse(o.orderDate).before(date) )
  .as("orderId, custId, orderDate, shipPrio")
val items = orders
  .join(customers).where("custId = id")
  .join(lineitems).where("orderId = id")
  .select("orderId, orderDate, shipPrio, " +
    "extdPrice * (Literal(1.0f) - discount) as revenue")
val result = items
  .groupBy("orderId, orderDate, shipPrio")
  .select("orderId, revenue.sum, orderDate, shipPrio")
Machine Learning Algorithms
Iterative data flows
28
Iterate by looping
• for/while loop in client submits one job per iteration step
• Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
29
Iterate in the Dataflow
30
Example: Matrix Factorization
31
Factorizing a matrix with 28 billion ratings for recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html
Graph Analysis
Stateful Iterations
32
Iterate natively with state/deltas
33
Effect of delta iterations…
Chart: # of elements updated per iteration (x-axis: iteration 1 to 61; y-axis: 0 to 45,000,000)
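Why the number of updated elements shrinks from one iteration to the next can be seen in a small plain-Scala sketch of a delta iteration: connected components computed with a solution set and a workset. It is an in-memory illustration under assumptions, not Flink's delta iteration API. The names `connectedComponents`, `solution` and `workset` are hypothetical, every edge endpoint is assumed to be contained in `vertices`, and the component id of a vertex is the smallest vertex id reachable from it.

```scala
object DeltaIterationSketch {
  /** Returns (component id per vertex, number of vertices updated in each step). */
  def connectedComponents(vertices: Set[Int], edges: Seq[(Int, Int)]): (Map[Int, Int], List[Int]) = {
    // Undirected adjacency lists.
    val neighbours: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2).toSet }

    @annotation.tailrec
    def loop(solution: Map[Int, Int], workset: Map[Int, Int], updates: List[Int]): (Map[Int, Int], List[Int]) =
      if (workset.isEmpty) (solution, updates.reverse)
      else {
        // Only vertices that changed in the last step propose their id to their neighbours.
        val candidates: Seq[(Int, Int)] = workset.toSeq.flatMap { case (v, c) =>
          neighbours.getOrElse(v, Set.empty[Int]).toSeq.map(n => n -> c)
        }
        val best: Map[Int, Int] = candidates.groupBy(_._1).map { case (n, cs) => n -> cs.map(_._2).min }
        // The delta holds only real improvements; it becomes the next workset.
        val delta = best.filter { case (n, c) => c < solution(n) }
        loop(solution ++ delta, delta, delta.size :: updates)
      }

    val initial = vertices.map(v => v -> v).toMap
    loop(initial, initial, Nil)
  }
}
```

Each step touches only the neighbours of vertices that changed in the previous step, so the work per iteration follows the shrinking delta rather than the full data set.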
… fast graph analysis
35
More at: http://data-artisans.com/data-analysis-with-flink.html
Closing
36
Flink Roadmap for 2015
Some examples:
• More flexible state and state backends in streaming
• Master Failover
• Improved monitoring
• Integration with other Apache projects
  - SAMOA, Zeppelin, Ignite
• More additions to the libraries
37
Flink Forward registration & call for abstracts are open now
flink.apache.org
38
• 12 and 13 October 2015
• Kulturbrauerei Berlin
• With Flink Workshops/Training!
39
flink.apache.org
@ApacheFlink
41
42
Examples of optimization
• Task chaining
  - Coalesce map/filter/etc tasks
• Join optimizations
  - Broadcast/partition, build/probe side, hash or sort-merge
• Interesting properties
  - Re-use partitioning and sorting for later operations
• Automatic caching
  - E.g., for iterations
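Task chaining, the first item above, can be pictured as fusing consecutive record-at-a-time operators into one function, so that records are handed over by a method call instead of going through a data exchange between tasks. The plain-Scala sketch below illustrates the idea only; it is not how Flink's optimizer is implemented, and `Op`, `mapOp`, `filterOp` and `chain` are hypothetical names.

```scala
object TaskChainingSketch {
  /** An operator turns one input record into zero or more output records. */
  type Op[A] = A => Seq[A]

  def mapOp[A](f: A => A): Op[A] = a => Seq(f(a))
  def filterOp[A](p: A => Boolean): Op[A] = a => if (p(a)) Seq(a) else Seq.empty

  /** Coalesces a sequence of operators into one operator that runs them back to back. */
  def chain[A](ops: Seq[Op[A]]): Op[A] =
    ops.foldLeft[Op[A]](a => Seq(a)) { (acc, op) => a => acc(a).flatMap(op) }
}
```

A chained pipeline such as `chain(Seq(mapOp[Int](_ + 1), filterOp[Int](_ % 2 == 0)))` can then be applied to each record in a single task.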
43
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Editor's Notes

  • #2: Working on Flink since 2012. Implemented YARN support
  • #3: Taking a look back: in only one year, a lot has happened. We were accepted into the ASF incubator, graduated quite fast … Code-wise, we are quickly adding new features and functionality (while not forgetting to keep existing users happy with fixes ;) ). I checked a few days ago and found that we’ve doubled the lines of code in one year.
  • #4: We could never have done this alone, without a very strong and amazing community.
  • #5: At the very heart, Flink is a streaming dataflow runtime. This means operators run at the same time, sending data to each other. This allows exploiting parallelism, utilizing the hardware, etc. To get something out of that runtime, we offer programming abstractions: DataSet and DataStream, for batch and stream processing. On top of these APIs, users have built more: ….
  • #6: So how do we turn a simple Java / Scala program into a robust distributed program? a) Type analysis / extraction (think of it as “schema creation”) … creation of serializers. b) Optimization: data partitioning (global strategy), execution strategy (local strategy). c) The result is represented as a dataflow graph (with all the strategies set). -------- local / remote border -------- d) Scheduling & job metadata at the master. e) Workers process data.
  • #7: What makes Flink special? It natively supports a very broad range of use cases. Common use cases are: real-time stream processing (you want to process your data as it comes in); large batch pipelines that read data from many sources and join, clean and analyze it; and not only data-intensive but also work-intensive use cases (machine learning, graph analysis) … how to intelligently distribute work through the cluster?
  • #8: Iterations through loop unrolling: needed for many use cases, for example graph and machine learning. (Explain approach.) -> Slow because rescheduling & state recreation are necessary.
  • #9: Streaming through mini-batches: discretize your stream into “small” sets and process them with your batch system. -> High latency because you need to collect & start the batches.
  • #10: How do we achieve this?
  • #11: Everything is treated as data streams. Multiple processing steps happen at the same time. There is no materialization (= storing the result on disk) between processing steps. We allow streams to have loops (feeding in the result of an earlier computation) -> Flink is aware of iterative processing, needs no redeployment, and can optimize automatically. Users can keep state between iterations (for example, a model you are training). In streaming, we back up your state for you. Flink always knows what's going on with its memory (instead of dealing with the “black box” GC).
  • #25: For batch processing (which is often very data-intensive) we need … (explain) … So this is nice, but now all the user data is just a bunch of bytes in an array?
  • #27: the fruits of our hard work
  • #28: The last highlight of the batch system, the best of both worlds: SQL-style for the simple data lifting, custom functions for the complex / heavy stuff.
  • #40: Shameless plug