Apache Flink
Fast and Reliable Large-Scale Data Processing
Fabian Hueske
fhueske@apache.org @fhueske
About me
Fabian Hueske
fhueske@apache.org
• PMC member of Apache Flink
• With Flink since its beginnings in 2009
– back then an academic research project called Stratosphere
What is Apache Flink?
Distributed Data Flow Processing System
• Focused on large-scale data analytics
• Real-time stream and batch processing
• Easy and powerful APIs (Java / Scala)
• Robust execution backend
Flink in the Hadoop Ecosystem
• Libraries: Table API, Gelly library, ML library, Apache SAMOA, Apache MRQL, Dataflow
• Flink Core: DataSet API (Java/Scala), DataStream API (Java/Scala), Optimizer, Stream Builder, Runtime
• Execution Environments: Local, Cluster, YARN, Apache Tez, Embedded
• Data Sources: HDFS, S3, JDBC, HCatalog, Apache HBase, Apache Kafka, Apache Flume, RabbitMQ, Hadoop IO, ...
What is Flink good at?
It's a general-purpose data analytics system
• Complex and heavy ETL jobs
• Real-time stream processing with flexible windows
• Analyzing huge graphs
• Machine-learning on large data sets
• ...
Flink in the ASF
• Flink entered the ASF about one year ago
– 04/2014: Incubation
– 12/2014: Graduation
• Strongly growing community
– 17 committers
– 15 PMC members
[Chart: # of unique git committers (w/o manual de-dup) over time, Nov-10 to Dec-14; y-axis from 0 to 120]
Where is Flink moving?
A "use-case complete" framework to unify
batch & stream processing
Data Streams
• Kafka
• RabbitMQ
• ...
“Historic” data
• HDFS
• JDBC
• ...
Analytical Workloads
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis
Goal: Treat batch as finite stream
HOW TO USE FLINK?
Programming Model & APIs
DataSets and Transformations
ExecutionEnvironment env =
    ExecutionEnvironment.getExecutionEnvironment();
// inputPath: path of the text file to read
DataSet<String> input = env.readTextFile(inputPath);
// keep only the lines that mention Apache Flink
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
// convert the remaining lines to lower case
DataSet<String> second = first
    .map(str -> str.toLowerCase());
second.print();
env.execute();
Data flow: Input → filter → First → map → Second
Expressive Transformations
• Element-wise
– map, flatMap, filter, project
• Group-wise
– groupBy, reduce, reduceGroup, combineGroup,
mapPartition, aggregate, distinct
• Binary
– join, coGroup, union, cross
• Iterations
– iterate, iterateDelta
• Physical re-organization
– rebalance, partitionByHash, sortPartition
• Streaming
– window, windowMap, coMap, ...
Rich Type System
• Use any Java/Scala classes as a data type
– Special support for tuples, POJOs, and case classes
– Not restricted to key-value pairs
• Define (composite) keys directly on data types
– Expression
– Tuple position
– Selector function
Counting Words in Batch and Stream
case class WordCount (word: String, count: Int)
DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => WordCount(word,1))}
  .window(Count.of(1000)).every(Count.of(100))
  .groupBy("word").sum("count")
  .print()
DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => WordCount(word,1))}
  .groupBy("word").sum("count")
  .print()
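The only difference between the two programs is the window: the last 1000 elements, re-evaluated every 100 elements. A minimal plain-Java sketch of such a sliding count window follows. It does not use Flink; the class and method names are hypothetical, and evaluating partially filled windows is an assumption of this sketch that may differ from Flink's behaviour.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Flink code.
final class SlidingCountWindow {

    /**
     * Keeps the last `size` words and emits one word-count map
     * after every `slide` arriving words.
     */
    static List<Map<String, Integer>> wordCounts(List<String> words, int size, int slide) {
        Deque<String> window = new ArrayDeque<>();
        List<Map<String, Integer>> results = new ArrayList<>();
        int sinceLastEmit = 0;
        for (String word : words) {
            window.addLast(word);
            if (window.size() > size) {
                window.removeFirst(); // evict the oldest element
            }
            sinceLastEmit++;
            if (sinceLastEmit == slide) {
                sinceLastEmit = 0;
                // group by word and sum the counts of the current window
                Map<String, Integer> counts = new TreeMap<>();
                for (String w : window) {
                    counts.merge(w, 1, Integer::sum);
                }
                results.add(counts);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a", "b");
        // Window of the last 4 words, evaluated every 2 words.
        for (Map<String, Integer> counts : wordCounts(words, 4, 2)) {
            System.out.println(counts);
        }
    }
}
```

The batch program corresponds to a single evaluation over the complete input, which is one way to read the goal of treating batch as a finite stream.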
Table API
• Execute SQL-like expressions on table data
– Tight integration with Java and Scala APIs
– Available for batch and streaming programs
val orders = env.readCsvFile(…)
.as('oId, 'oDate, 'shipPrio)
.filter('shipPrio === 5)
val items = orders
.join(lineitems).where('oId === 'id)
.select('oId, 'oDate, 'shipPrio,
'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)
val result = items
.groupBy('oId, 'oDate, 'shipPrio)
.select('oId, 'revenue.sum, 'oDate, 'shipPrio)
WHAT IS HAPPENING INSIDE?
Processing Engine
System Architecture
• Client (pre-flight): optimization of the Flink program
• Master: coordination
• Workers: processing
Lots of cool technology inside Flink
• Batch and Streaming in one system
• Memory-safe execution
• Native data flow iterations
• Cost-based data flow optimizer
• Flexible windows on data streams
• Type extraction and custom serialization utilities
• Static code analysis on user functions
• and much more...
STREAM AND BATCH IN ONE SYSTEM
Pipelined Data Transfer
Stream and Batch in one System
• Most systems do either stream or batch
• In the past, Flink focused on batch processing
– Flink's runtime has always done stream processing
– Operators pipeline data forward as soon as it is processed
– Some operators are blocking (such as sort)
• Pipelining gives better performance for many
batch workloads
– Avoids materialization of large intermediate results
Pipelined Data Transfer
[Figure: a program maps a large input and joins the result with a small input. In pipelined execution, pipeline 1 builds the hash table (HT) from the small input; pipeline 2 sends the large input through map and probes the HT in the join to produce the result.]
No intermediate materialization!
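The build/probe split can be sketched in plain Java. This is a rough illustration and not Flink's runtime code: the class name, the record layout ("key,payload" strings) and the assumption that keys in the small input are unique are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration, not Flink code.
final class PipelinedHashJoin {

    /** Pipeline 1: read the small input completely and build the hash table. */
    static Map<Integer, String> build(List<String[]> smallInput) {
        Map<Integer, String> table = new HashMap<>();
        for (String[] row : smallInput) {
            table.put(Integer.parseInt(row[0]), row[1]); // assumes unique keys
        }
        return table;
    }

    /**
     * Pipeline 2: every large-input record is mapped and probed right away,
     * so the mapped records are never collected into an intermediate data set.
     */
    static void mapAndProbe(Iterator<String> largeInput,
                            Map<Integer, String> table,
                            Consumer<String> output) {
        while (largeInput.hasNext()) {
            String[] fields = largeInput.next().split(","); // map step
            int key = Integer.parseInt(fields[0]);
            String match = table.get(key);                  // probe step
            if (match != null) {
                output.accept(fields[1] + " -> " + match);  // emit join result
            }
        }
    }

    public static void main(String[] args) {
        List<String[]> small = Arrays.asList(
                new String[] {"1", "books"}, new String[] {"2", "music"});
        Iterator<String> large =
                Arrays.asList("1,order-a", "3,order-b", "2,order-c").iterator();
        mapAndProbe(large, build(small), System.out::println);
    }
}
```

Only the small input is held in memory; the large input is consumed one record at a time, which is what makes the materialization of the mapped data unnecessary.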
Flink Data Stream Processing
• Pipelining enables Flink's runtime to process streams
– API to define stream programs
– Operators for stream processing
• Stream API and operators are recent contributions
– Evolving very quickly under heavy development
– Integration with batch mode
MEMORY SAFE EXECUTION
Memory Management and Out-of-Core Algorithms
Memory-safe Execution
• Challenge of JVM-based data processing systems
– OutOfMemoryErrors due to data objects on the heap
• Flink runs complex data flows without memory tuning
– C++-style memory management
– Robust out-of-core algorithms
Managed Memory
• Active memory management
– Workers allocate 70% of JVM memory as byte arrays
– Algorithms serialize data objects into byte arrays
– In-memory processing as long as data is small enough
– Otherwise partial destaging to disk
• Benefits
– Safe memory bounds (no OutOfMemoryError)
– Scales to very large JVMs
– Reduced GC pressure
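The idea of serializing records into pre-allocated byte arrays can be sketched as follows. This is a simplified illustration under stated assumptions, not Flink's memory manager: the class name, the fixed (int key, long value) record layout and the "return false when full" convention are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration, not Flink code.
final class ManagedSegment {
    private static final int RECORD_BYTES = Integer.BYTES + Long.BYTES;

    private final ByteBuffer buffer;

    ManagedSegment(int sizeInBytes) {
        // One byte array allocated up front and reused for all records.
        this.buffer = ByteBuffer.wrap(new byte[sizeInBytes]);
    }

    /** Serializes a record into the segment; returns false if it does not fit. */
    boolean write(int key, long value) {
        if (buffer.remaining() < RECORD_BYTES) {
            return false; // the caller would destage data to disk at this point
        }
        buffer.putInt(key).putLong(value);
        return true;
    }

    /** Number of records currently stored in the segment. */
    int recordCount() {
        return buffer.position() / RECORD_BYTES;
    }

    /** Reads the key of the record at the given index directly from the bytes. */
    int keyAt(int index) {
        return buffer.getInt(index * RECORD_BYTES);
    }

    /** Reads the value of the record at the given index directly from the bytes. */
    long valueAt(int index) {
        return buffer.getLong(index * RECORD_BYTES + Integer.BYTES);
    }

    public static void main(String[] args) {
        ManagedSegment segment = new ManagedSegment(24); // room for two records
        System.out.println(segment.write(1, 10L));
        System.out.println(segment.write(2, 20L));
        System.out.println(segment.write(3, 30L)); // no space left
    }
}
```

Because a full segment is reported to the caller instead of allocating more heap objects, the memory bound is explicit, and the garbage collector sees one long-lived array rather than many short-lived record objects.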
Going out-of-core
Single-core join of 1KB Java objects beyond memory (4 GB)
Blue bars are in-memory, orange bars (partially) out-of-core
GRAPH ANALYSIS
Native Data Flow Iterations
Native Data Flow Iterations
• Many graph and ML algorithms require iterations
• Flink features two types of iterations
– Bulk iterations
– Delta iterations
Iterative Data Flows
• Flink runs iterations "natively" as cyclic data flows
– No loop unrolling
– Operators are scheduled once
– Data is fed back through backflow channel
– Loop-invariant data is cached
• Operator state is preserved across iterations!
[Figure: cyclic data flow with initial input, iteration head, join, reduce, iteration tail, other datasets, and result]
Delta Iterations
• Delta iteration computes
– Delta update of solution set
– Work set for next iteration
• Work set drives computations of next iteration
– Workload of later iterations significantly reduced
– Fast convergence
• Applicable to certain problem domains
– Graph processing
[Chart: # of elements updated (y-axis, 0 to 45,000,000) over # of iterations (x-axis)]
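The solution set / work set pattern can be sketched on a small example. The following plain-Java code computes connected components by propagating the smallest vertex id; it does not use Flink's iterateDelta, and the class name, method name and graph representation are hypothetical choices for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not Flink code.
final class DeltaIterationSketch {

    /**
     * Returns the component id of every vertex and appends the size of the
     * work set of each iteration to workSetSizes.
     */
    static int[] connectedComponents(int numVertices, int[][] edges, List<Integer> workSetSizes) {
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < numVertices; v++) {
            neighbors.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }

        int[] solution = new int[numVertices];  // solution set: vertex -> component id
        Set<Integer> workSet = new TreeSet<>(); // vertices changed in the last iteration
        for (int v = 0; v < numVertices; v++) {
            solution[v] = v;
            workSet.add(v);
        }

        while (!workSet.isEmpty()) {
            workSetSizes.add(workSet.size());
            // Only vertices in the work set send their component id to neighbors.
            Map<Integer, Integer> candidates = new HashMap<>();
            for (int v : workSet) {
                for (int n : neighbors.get(v)) {
                    if (solution[v] < solution[n]) {
                        candidates.merge(n, solution[v], Integer::min);
                    }
                }
            }
            // Delta update of the solution set; changed vertices form the next work set.
            Set<Integer> next = new TreeSet<>();
            for (Map.Entry<Integer, Integer> update : candidates.entrySet()) {
                if (update.getValue() < solution[update.getKey()]) {
                    solution[update.getKey()] = update.getValue();
                    next.add(update.getKey());
                }
            }
            workSet = next;
        }
        return solution;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        int[] components = connectedComponents(5, edges, sizes);
        System.out.println(java.util.Arrays.toString(components));
        System.out.println("work set sizes per iteration: " + sizes);
    }
}
```

The work set shrinks from iteration to iteration, so later iterations touch only the vertices that can still change.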
Iteration Performance
PageRank on Twitter Follower Graph
30 Iterations
61 Iterations (Convergence)
WHAT IS COMING NEXT?
Roadmap
Just released a new version
• 0.9.0-milestone1 preview release just got out!
• Features
– Table API
– Flink ML Library
– Gelly Library
– Flink on Tez
– …
Flink’s Roadmap
Mission: Unified stream and batch processing
• Exactly-once streaming semantics with
flexible state checkpointing
• Extending the ML library
• Extending graph library
• Integration with Apache Zeppelin (incubating)
• SQL on top of the Table API
• And much more…
tl;dr – What's worth remembering?
• Flink is a general-purpose data analytics system
• Unifies streaming and batch processing
• Expressive high-level APIs
• Robust and fast execution engine
I Flink, do you? ;-)
If you find Flink exciting,
get involved and start a discussion on Flink's mailing list
or stay tuned by
subscribing to news@flink.apache.org or
following @ApacheFlink on Twitter
BACKUP
Data Flow Optimizer
• Database-style optimizations for parallel data flows
• Optimizes all batch programs
• Optimizations
– Task chaining
– Join algorithms
– Re-use partitioning and sorting for later operations
– Caching for iterations
Data Flow Optimizer
val orders = …
val lineitems = …
val filteredOrders = orders
  .filter(o => dateFormat.parse(o.shipDate).after(date))
  .filter(o => o.shipPrio > 2)
val lineitemsOfOrders = filteredOrders
  .join(lineitems)
  .where("orderId").equalTo("orderId")
  .apply((o,l) => new SelectedItem(o.orderDate, l.extdPrice))
val priceSums = lineitemsOfOrders
  .groupBy("orderDate")
  .sum("l.extdPrice");
Data Flow Optimizer
[Figure: two alternative execution plans for the program above. Both read orders.tbl (followed by a Filter) and lineitem.tbl, join them with a hybrid hash join (buildHT / probe), and end in a Reduce with sort[0,1]. One plan ships the join inputs via broadcast and forward and contains a Combine step; the other hash-partitions both join inputs on [0]. The figure also labels a hash-part [0,1] shipping step and a partial sort[0,1].]
Best plan depends on relative sizes of input files

ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
