Aljoscha Krettek, engineering manager & co-founder
Till Rohrmann, engineering manager & co-founder
The Past, Present, and Future of Apache Flink®
© 2018 data Artisans2
Past
It all started in 2014
Stratosphere research project: 2009 - 2014. Apache Flink: since 2014
● Batch processor on top of streaming runtime
● First Apache Flink release (0.6.0) in August 2014
August 2014
Batch processing
Flink learns to stream in real time
● DataStream API: stream processing
● DataSet API: batch processing
● Runtime: distributed streaming data flow
● Continuous & real-time
November 2014
Batch processing Stream processing
Flink learns to remember
Remember where we left off
● Continuous & real-time
● Stateful & exactly once
June 2015
Batch processing Stream processing
Latency vs. Throughput?
Prevailing belief: low latency ≠ high throughput
● Tens of millions of events/s
● Latency down to 1 ms
Flink becomes event-time aware
Episodes in processing-time order (release year): IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015), VIII (2017)
Episodes in event-time order (story order, release year in parentheses): I (1999), II (2002), III (2005), IV (1977), V (1980), VI (1983), VII (2015), VIII (2017)
Processing time vs. event time
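The difference between the two orderings can be sketched in a few lines of plain Scala. This is only an illustration of the idea and uses no Flink API; `Episode` and `EpisodeOrdering` are names made up for this example, and the data is the episode list from the slides.

```scala
// Toy model: each film has an event time (its place in the story)
// and a processing time (the year it reached the audience).
final case class Episode(number: Int, releaseYear: Int)

object EpisodeOrdering {
  // Episodes and release years as listed on the slides.
  val episodes: List[Episode] = List(
    Episode(4, 1977), Episode(5, 1980), Episode(6, 1983),
    Episode(1, 1999), Episode(2, 2002), Episode(3, 2005),
    Episode(7, 2015), Episode(8, 2017)
  )

  // Processing-time order: the order in which the films arrived.
  def byProcessingTime(es: List[Episode]): List[Int] =
    es.sortBy(_.releaseYear).map(_.number)

  // Event-time order: the order in which the story happened.
  def byEventTime(es: List[Episode]): List[Int] =
    es.sortBy(_.number).map(_.number)
}
```

An event-time-aware system produces results that follow the second ordering even though the records arrive in the first.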
● Continuous & real-time
● Stateful & exactly once
● High throughput & low latency
● Event time
November 2015
Batch processing Stream processing
More than just analytics: ProcessFunction
import org.apache.flink.api.common.state.ValueState
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

class MyFunction extends ProcessFunction[MyEvent, Result] {
  // declare state to use in the program
  lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

  override def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
    // work with event and state and schedule timers
  }

  override def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
    // handle callback when the event-time or processing-time instant is reached
  }
}
● ProcessFunction gives access to state, time, and events
● Low-level API
● Enables data-driven applications
● Continuous & real-time
● Stateful & exactly once
● High throughput & low latency
● Event time
February 2017
Batch processing Stream processing
Data-driven applications
Present & Future
Present in a nutshell
Hardening
● Faster network stack
● Application-level flow control
● Resolving dependency hell
Scaling
● Incremental snapshots
● Local recovery
● Scalable timers
Interoperability
● Resource elasticity
● REST client-server interface
● Container entrypoint
Stream SQL
● SQL client
● User-defined functions
● More powerful joins
Misc
● State TTL
● Broadcast state
● Kafka exactly-once producer
Large, larger, Flink
[Figure: state size over time, with incremental snapshots]
● Snapshot only the state diff
● Incremental snapshots make it possible to handle very large state
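The idea of snapshotting only the state diff can be illustrated with a toy key/value model in plain Scala. This is a sketch only and does not model how Flink implements incremental snapshots internally; `IncrementalSnapshot`, `Delta`, `diff`, and `restore` are hypothetical names for this example.

```scala
object IncrementalSnapshot {
  type State = Map[String, Long]

  // A delta records only what changed since the previous snapshot.
  final case class Delta(upserts: State, deletes: Set[String])

  // Compute the difference between two versions of the state.
  def diff(previous: State, current: State): Delta = {
    // Keys that are new or whose value changed.
    val upserts = current.filter { case (k, v) => !previous.get(k).contains(v) }
    // Keys that disappeared.
    val deletes = previous.keySet -- current.keySet
    Delta(upserts, deletes)
  }

  // Rebuild the current state from the previous snapshot plus the delta.
  def restore(previous: State, delta: Delta): State =
    (previous -- delta.deletes) ++ delta.upserts
}
```

When only a small fraction of a very large state changes between two snapshots, the delta is correspondingly small, which is what makes this approach attractive for large state.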
Faster failover is always better
Varying workloads
• Violating SLAs vs. wasting money
• Varying workloads require adapting resources
Revamped distributed architecture
Components: Client, Dispatcher, JobManager, ResourceManager, TaskManager, ClusterManager
1. Submit job
2. Start job
3. Request slots
4. Allocate resources
5. Start TaskManager
6. Execute job
● Support for full resource elasticity
● Application parallelism can be dynamically changed
● Continuous & real-time
● Stateful & exactly once
● High throughput & low latency
● Event time
● Applications as first-class citizens
Present & Future
Batch processing Stream processing
Data-driven applications
Flink as a library (and still as a framework)
• Deploying Flink applications should be as easy as starting a process
• Bundle application code and Flink into a single image
• Process connects to other application processes and figures out its role
• Taking the cluster out of the equation
[Figure: processes P1 to P4, joined by a new process]
How much control do I need?
Batch processing
● Multiple short-lived stages
● Different resource requirements per stage
● Efficient execution requires control over resources
● Flink actively allocates resources
Continuous processing, real-time & data-driven applications
● Continuously processing operators
● Constrained by external systems, SLAs, and application logic
● External system can assign resources
● Flink reacts to available resources
Active vs. reactive mode
• Active mode
‒ Flink is aware of underlying cluster framework
‒ Flink allocates resources
‒ E.g. existing YARN and Mesos integration
• Reactive mode
‒ Flink is oblivious to its runtime environment
‒ External system allocates and releases resources
‒ Flink scales with respect to available resources
‒ Relevant for environments such as Kubernetes, Docker, or Flink as a library
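The reactive behaviour described above, where an external system decides how many resources exist and the job simply follows, might be sketched as below. This is a toy model in plain Scala, not Flink code; `ReactiveScaling`, `parallelism`, `react`, and the `maxParallelism` cap are assumptions made for this example.

```scala
object ReactiveScaling {
  // Assumption: the job runs with as many parallel instances as there are
  // slots, capped by a maximum parallelism the application supports.
  def parallelism(availableSlots: Int, maxParallelism: Int): Int =
    math.max(0, math.min(availableSlots, maxParallelism))

  // Replay a sequence of slot counts reported by the external system and
  // return the parallelism the job would settle on after each change.
  def react(slotChanges: List[Int], maxParallelism: Int): List[Int] =
    slotChanges.map(parallelism(_, maxParallelism))
}
```

In active mode the direction is reversed: the job would decide on a target parallelism first and then request the matching resources from the cluster framework.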
Scaling automatically
• Latency
• Throughput
• Resource utilization
• Connector signals
How we create Flink Jobs
● Flink APIs (Java/Scala): stream/batch processing
● Runtime: distributed streaming data flow
Flink SQL
● Flink APIs (Java/Scala, SQL): stream/batch processing
● Runtime: distributed streaming data flow
“NO CODING REQUIRED”
● Source/sink definition in YAML
● Configuration in YAML
● SQL command line
● User-defined functions
● Streaming and batch
● Event time and processing time
*since Flink 0.9.0 (June 2015)
“Join” me for some trading
[Figure: a stream of buy and sell orders joined with currency rates ($ 17, £ 42, ₪ 12.5)]
Introducing Time-versioned Table Joins
[Figure: buy and sell orders joined, along the event-time axis, with a time-versioned table of currency rates]
curr  rate  time
£     42    3
£     12    17
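A time-versioned join picks, for every order, the rate that was valid at the order's event time rather than the most recent rate. The sketch below shows that lookup in plain Scala using the two £ versions from the slide; it is not Flink's implementation, and `TimeVersionedJoin`, `Rate`, `Order`, `rateAt`, and `join` are hypothetical names.

```scala
object TimeVersionedJoin {
  final case class Rate(currency: String, rate: Double, validFrom: Long)
  final case class Order(currency: String, amount: Double, eventTime: Long)

  // Find the rate version valid at the given event time: the latest
  // version for that currency whose validFrom is not after the event time.
  def rateAt(rates: List[Rate], currency: String, eventTime: Long): Option[Double] =
    rates
      .filter(r => r.currency == currency && r.validFrom <= eventTime)
      .sortBy(_.validFrom)
      .lastOption
      .map(_.rate)

  // Convert each order with the rate valid at its own event time.
  // Orders without a matching rate version are dropped.
  def join(orders: List[Order], rates: List[Rate]): List[(Order, Double)] =
    orders.flatMap { o =>
      rateAt(rates, o.currency, o.eventTime).map(r => (o, o.amount * r))
    }
}
```

With the versions (£, 42, 3) and (£, 12, 17), an order at event time 5 is converted at 42 and an order at event time 20 at 12, regardless of when either order is actually processed.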
SQL for pattern analysis?
SELECT * from ?
Introducing MATCH_RECOGNIZE
SELECT *
FROM TaxiRides
MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rideTime
  MEASURES
    S.rideId AS sRideId
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S M{2,} E)
  DEFINE
    S AS S.isStart = true,
    M AS M.rideId <> S.rideId,
    E AS E.isStart = false
      AND E.rideId = S.rideId
)
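To make the pattern concrete, the following plain Scala sketch applies the same S M{2,} E logic to the rows of a single driver that are already ordered by ride time. It is only an illustration of what the query expresses, not how Flink evaluates MATCH_RECOGNIZE; `RidePattern`, `Ride`, and `matches` are hypothetical names.

```scala
object RidePattern {
  final case class Ride(rideId: Long, isStart: Boolean)

  // Returns S.rideId for every match of S M{2,} E, resuming the search
  // after the last row of each match (AFTER MATCH SKIP PAST LAST ROW).
  def matches(rows: Vector[Ride]): List[Long] = {
    @annotation.tailrec
    def loop(i: Int, acc: List[Long]): List[Long] =
      if (i >= rows.length) acc.reverse
      else {
        val s = rows(i)
        if (!s.isStart) loop(i + 1, acc) // S must be a start event
        else {
          // M: the run of following rows that belong to other rides.
          val mCount = rows.drop(i + 1).takeWhile(_.rideId != s.rideId).length
          val eIdx = i + 1 + mCount
          // E: the end event of the same ride, after at least two M rows.
          val matched = mCount >= 2 && eIdx < rows.length &&
            !rows(eIdx).isStart && rows(eIdx).rideId == s.rideId
          if (matched) loop(eIdx + 1, s.rideId :: acc)
          else loop(i + 1, acc)
        }
      }
    loop(0, Nil)
  }
}
```

Because an M row must belong to a different ride and the E row to the same ride, taking the longest possible run of M rows never hides a shorter match, so the sketch needs no backtracking.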
Today's processing landscape
Streaming Batch
Batch/streaming unification
Into the Future
Big state
SQL
Thanks!
