© 2019 Ververica 1
Apache Flink®
An Introduction and Outlook into the Future
Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.
Timo Walther
Follow me: @twalthr (yes, without the e)
About me
Timo Walther
● Apache Flink Committer and PMC Member
● Part of Flink since 2013 (before it was actually called "Flink")
● Software Engineer @ Ververica (formerly data Artisans, now part of Alibaba Group)
About Ververica
Original creators of Apache Flink®
Complete Stream Processing Infrastructure
Ververica Platform
This talk is about Apache Flink
● What is Flink?
● Use Cases & Users
● Stateful Stream Processing
● Event-Time Processing
● APIs
● Ecosystem
● Community
● Roadmap & Future
What is Flink?
Core Building Blocks for Stream Processing
● Event Streams: real-time and replay
● State: complex business logic
● (Event) Time: consistency with out-of-order data and late data
● Snapshots: forking / versioning / time-travel
What is Apache Flink?
Scalable embedded state
Access at memory speed &
scales with parallel operators.
What is Apache Flink?
Stateful computations over streams
real-time and historic:
fast, scalable, fault tolerant,
event time, large state, exactly-once
Flink Unifies Stream and Batch Processing
● Processes unbounded (stream) and bounded (batch) data
● Processes recorded (offline) and live (real-time) data
● Serves most streaming & batch use cases
○ Data Pipelines, Analytics, CEP, Event-driven Applications
Consistency, Scale, Ecosystem
● Flexible and expressive APIs
● Guaranteed correctness
○ Exactly-once state consistency
○ Event-time semantics
● In-memory processing at massive scale
○ Runs on 10,000s of cores
○ Manages 10s of TBs of state
● Flexible deployments and large ecosystem
○ Kubernetes, YARN, Mesos, Docker, S3, HDFS, Kafka, Kinesis, …
Use Cases & Users
Use Case: ETL and Data Pipelining
● Periodic ETL is the traditional approach
○ External tool periodically triggers ETL batch job
○ Also supported by Flink
● Data pipelines continuously move data
○ Ingestion with low latency
○ No external tool
○ No artificial data boundaries
Use Case: Batch & Stream Analytics
● Batch analytics is great for ad-hoc queries
○ Queries change faster than data
○ Interactive analytics / prototyping
● Stream analytics continuously processes data
○ Data changes faster than queries
○ Live / low latency results
○ No Lambda architecture required!
Use Case: Event-Driven Applications
● Traditional application design
○ Compute & data tier architecture
○ React to and process events
○ State is stored in (remote) database
● Event-driven application
○ State is maintained locally
○ Guaranteed consistency by periodic state checkpoints
○ Tight coupling of logic and data (microservice architecture)
○ Highly scalable design
Powered By Apache Flink
Details about their use cases and more users are listed on Flink’s website at https://flink.apache.org/poweredby.html
Rapidly Growing Adoption
Source: Qubole “2018 Survey of Big Data Trends and Challenges.”
A survey among 400+ technology decision makers about their big data projects.
125%
Stateful Stream Processing
Designing Applications as Data Flows
● Data Flows are a common programming abstraction.
● Events flow from operator to operator.
● Data Flows can be executed in parallel.
[Figure: example data flow with the operators Src, Map (user function), keyBy, Window (user function), and Snk]
What is State in a Streaming Application?
● Many functions are stateful
○ Streaming data arrives over time
○ Functions need to remember records or temporary results
● Any variable that lives across function invocations is state
● State must not be lost in case of a failure
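As a minimal sketch of "any variable that lives across function invocations is state", the plain-Java class below keeps a per-user counter in a field. It is not Flink API: the class name `RunningCount` and the method `process` are hypothetical names chosen for illustration, and nothing here is checkpointed, so the state would be lost on failure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a stateful function; not part of Flink's API.
class RunningCount {
    // This map outlives each call to process(): it is the function's state.
    private final Map<String, Long> countPerUser = new HashMap<>();

    // Called once per incoming record; returns the updated count for that user.
    long process(String userId) {
        return countPerUser.merge(userId, 1L, Long::sum);
    }

    public static void main(String[] args) {
        RunningCount fn = new RunningCount();
        fn.process("alice");
        fn.process("bob");
        System.out.println(fn.process("alice")); // prints 2
    }
}
```

The result of the third call depends on records seen earlier, which is exactly what makes the function stateful and why its state has to survive failures.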
Maintaining and Checkpointing State
● Flink maintains state locally per task (in-mem / on-disk)
○ Fast access!
● State is periodically checkpointed to durable storage
○ A checkpoint is a consistent snapshot of the state of all tasks
Checkpoint Consistency
● All tasks copy their state exactly when they have processed all events up to the same position in the input
○ State of source tasks includes current read position in input (e.g., Kafka offset)
[Figure: checkpointed tasks: task state (read position), stateless task, task state (partial aggregate)]
Recovery and Guaranteed Consistency
● Recovery is like loading a saved computer game.
● Flink recovers state with exactly-once consistency.
○ After a failure, the application is restarted.
○ All tasks load their state from the latest checkpoint.
○ The application continues as if the failure never happened.
[Figure: computer game analogy: "Game saved!", "GAME OVER!", "Loading Game..."]
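The checkpoint and recovery steps above can be sketched with a single simulated task. This is a plain-Java illustration under simplifying assumptions, not Flink's implementation: `CheckpointDemo`, `Checkpoint`, `snapshot`, and `restore` are hypothetical names, the "durable storage" is just a local variable, and there is only one task, so no coordination between tasks is shown.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical single-task illustration; not Flink API.
class CheckpointDemo {
    // A checkpoint pairs the source read position with the operator state.
    static final class Checkpoint {
        final int offset;
        final long sum;
        Checkpoint(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    int offset = 0; // source state: next input position to read
    long sum = 0;   // operator state: partial aggregate

    // Processes input positions [offset, until).
    void run(List<Integer> input, int until) {
        while (offset < until) {
            sum += input.get(offset);
            offset++;
        }
    }

    // Both values are copied at the same input position, so they are consistent.
    Checkpoint snapshot() { return new Checkpoint(offset, sum); }

    void restore(Checkpoint c) { offset = c.offset; sum = c.sum; }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 1, 7, 3, 4);
        CheckpointDemo job = new CheckpointDemo();
        job.run(input, 2);
        Checkpoint cp = job.snapshot();  // offset 2, sum 6
        job.run(input, 4);               // progress that the failure will wipe out
        job = new CheckpointDemo();      // failure: in-memory state is gone
        job.restore(cp);                 // load state from the latest checkpoint
        job.run(input, input.size());    // resume reading at position 2
        System.out.println(job.sum);     // prints 20, same as a failure-free run
    }
}
```

Because the read position and the partial sum are restored together, every input value contributes to the final sum exactly once, even though positions 2 and 3 were processed twice.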
Much More Than Just Exactly-Once Recovery!
● Suspend and resume applications
● Fix and upgrade applications
● Migrate applications to a different / upgraded cluster
● Scale applications in and out
● A/B test applications
● ...
Event-Time Processing
What is Time in a Streaming Application?
● Streaming data arrives over time.
● Many streaming computations are defined based on time.
○ “Count the number of records every 10 minutes.”
○ “Run some logic 1 hour after you saw this record.”
○ “Wait for 30 more seconds for data to arrive.”
● This raises some questions.
○ How does Flink measure time?
○ How does time relate to data?
Event-Time and Processing-Time
[Figure: an event generator (mobile app, webserver, sensor, ...) emits events with timestamps (12:00:01, 11:59:56, 11:58:37). In the event-time job, application time is driven by data; in the processing-time job, application time is driven by the machine clock.]
What is Processing-Time?
● A record is processed based on the wall-clock time when it arrives.
● Results are inherently non-deterministic and depend on
○ Clocks, load, and processing speed of machines
○ Arrival / ingestion rate of data and possibly backpressure
○ ...
● Applications of processing-time
○ Does not work for recorded data
○ Does not work for data that arrives out of order
○ Might be sufficient for approximate, low-latency results
What is Event-Time?
● A record is processed based on an embedded timestamp.
○ Timestamp typically denotes time when record was created.
● The “current” time is determined by watermarks
○ A watermark is a special record with a timestamp w
○ Denotes that no more records with a time t <= w will arrive
● Properties of event-time processing
○ Results are deterministic
○ Same semantics when processing recorded and live data
○ Can trade result latency for result completeness
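To make the watermark rule concrete, here is a small plain-Java sketch that counts records in fixed 10-unit event-time windows and emits a window only once a watermark says it is complete. It is an illustration under assumptions, not Flink's implementation: the names `EventTimeWindows`, `onRecord`, and `onWatermark` are hypothetical, and records whose window has already been emitted are simply dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of watermark-driven windows; not Flink API.
class EventTimeWindows {
    static final long SIZE = 10; // window length in event-time units

    // Open windows: window start -> number of records seen so far.
    private final TreeMap<Long, Integer> open = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;
    final List<String> results = new ArrayList<>();

    void onRecord(long timestamp) {
        long start = Math.floorDiv(timestamp, SIZE) * SIZE;
        // The window covers [start, start + SIZE). If the watermark already
        // reached its last timestamp, the result is out and the record is late.
        if (start + SIZE - 1 <= watermark) {
            return; // this sketch drops late records
        }
        open.merge(start, 1, Integer::sum);
    }

    void onWatermark(long w) {
        watermark = Math.max(watermark, w);
        // No record with t <= watermark will arrive, so these windows are complete.
        while (!open.isEmpty() && open.firstKey() + SIZE - 1 <= watermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            results.add("[" + e.getKey() + "," + (e.getKey() + SIZE) + ") -> " + e.getValue());
        }
    }

    public static void main(String[] args) {
        EventTimeWindows w = new EventTimeWindows();
        w.onRecord(3);
        w.onRecord(12);
        w.onRecord(7);     // out of order, but still ahead of the watermark
        w.onWatermark(9);  // window [0,10) is complete and emitted
        w.onRecord(5);     // late: its window was already emitted
        w.onRecord(15);
        w.onWatermark(19); // window [10,20) is complete and emitted
        w.results.forEach(System.out::println);
    }
}
```

The result depends only on the timestamps and watermarks, not on when the records happen to arrive, which is why replaying recorded data gives the same output. Emitting watermarks later reduces the number of late records at the cost of higher result latency.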
APIs
Layered APIs
SQL & Table API
● Unified APIs for streaming data and data at rest
○ Run the same query on batch and streaming data
○ ANSI SQL: No stream-specific syntax or semantics!
○ Many common stream analytics use cases supported
SELECT
  userId,
  COUNT(*) AS cnt,
  SESSION_START(clicktime, INTERVAL '30' MINUTE)
FROM clicks
GROUP BY
  SESSION(clicktime, INTERVAL '30' MINUTE),
  userId
Count clicks per user and session (defined by 30 min. gap of inactivity).
DataStream API
● Programs are composed as data flows
● Logic is implemented as custom user functions
○ map, flatMap, reduce, window aggregation, window join, asynchronous request function, …
● Data is processed as arbitrary Java/Scala objects
○ (Avro) POJOs, Tuple, Row
DataStream API Example
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// a stream of website clicks
DataStream<Click> clicks = ...
DataStream<Tuple2<String, Long>> result = clicks
    // project clicks to userId and add a 1 for counting
    .map(
        // define function by implementing the MapFunction interface
        new MapFunction<Click, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(Click click) {
                return Tuple2.of(click.userId, 1L);
            }
        })
    // key by userId (field 0)
    .keyBy(0)
    // define session window with 30 minute gap
    .window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
    // count clicks per session; define function as lambda function
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
Count clicks per user and session (defined by 30 min. gap of inactivity).
Same use case as previous SQL query.
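Both the SQL query and the DataStream program group clicks into sessions separated by a gap of inactivity. The plain-Java sketch below shows that grouping for a single user's click times. It is only an illustration: `SessionCounter` and `countPerSession` are hypothetical names, times are plain minute values, and treating an interval of exactly the gap length as part of the same session is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of session grouping; not Flink API.
class SessionCounter {
    // Returns the number of clicks in each session of one user, in time order.
    // A new session starts when two consecutive clicks are more than
    // gapMinutes apart.
    static List<Integer> countPerSession(long[] clickMinutes, long gapMinutes) {
        long[] sorted = clickMinutes.clone();
        Arrays.sort(sorted); // clicks may arrive out of order
        List<Integer> counts = new ArrayList<>();
        int current = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0 && sorted[i] - sorted[i - 1] > gapMinutes) {
                counts.add(current); // gap exceeded: close the session
                current = 0;
            }
            current++;
        }
        if (current > 0) {
            counts.add(current); // close the last session
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] clicks = {0, 10, 25, 70, 80};
        System.out.println(countPerSession(clicks, 30)); // prints [3, 2]
    }
}
```

The 45-minute pause between minute 25 and minute 70 exceeds the 30-minute gap, so the five clicks fall into two sessions with three and two clicks.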
ProcessFunctions
● Flink’s most expressive function interfaces
○ Expose access to State and Time
○ Are embedded in DataStream programs
● Enable powerful applications
○ Put events or intermediate results into state for future computations
○ Register timers to be called back once “time is up”
● A collection of multiple function interfaces
○ 1 input, 1 windowed input, 2 key-partitioned inputs, 2 broadcasted/forwarded inputs, ...
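The combination of state and timers can be sketched in plain Java as an inactivity detector: each event is stored in keyed state and registers a timer, and the timer callback checks the state when "time is up". This is a simulation under assumptions, not the ProcessFunction interface itself: `InactivityDetector`, `onEvent`, and `advanceTimeTo` are hypothetical names, time is advanced by hand, and the timeout of 60 time units is arbitrary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical simulation of state plus timers; not Flink's ProcessFunction API.
class InactivityDetector {
    static final long TIMEOUT = 60;

    private static final class Timer {
        final long time;
        final String key;
        Timer(long time, String key) { this.time = time; this.key = key; }
    }

    // Keyed state: timestamp of the last event per key.
    private final Map<String, Long> lastSeen = new HashMap<>();
    // Registered timers, earliest first.
    private final PriorityQueue<Timer> timers =
        new PriorityQueue<>((a, b) -> Long.compare(a.time, b.time));
    final List<String> alerts = new ArrayList<>();

    // Put the event into state and register a timer for "TIMEOUT later".
    void onEvent(String key, long timestamp) {
        lastSeen.put(key, timestamp);
        timers.add(new Timer(timestamp + TIMEOUT, key));
    }

    // Fire all timers that are due at the given time.
    void advanceTimeTo(long now) {
        while (!timers.isEmpty() && timers.peek().time <= now) {
            Timer t = timers.poll();
            Long seen = lastSeen.get(t.key);
            // Only the timer of the most recent event counts; older ones are stale.
            if (seen != null && seen + TIMEOUT == t.time) {
                alerts.add(t.key + " inactive since " + seen);
                lastSeen.remove(t.key);
            }
        }
    }

    public static void main(String[] args) {
        InactivityDetector d = new InactivityDetector();
        d.onEvent("alice", 0);
        d.onEvent("bob", 10);
        d.onEvent("alice", 30); // alice is active again, her first timer is stale
        d.advanceTimeTo(75);
        System.out.println(d.alerts); // prints [bob inactive since 10]
    }
}
```

Checking the stored timestamp inside the callback avoids having to cancel earlier timers when a key becomes active again.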
DSL & Libraries
● Stateful Functions
○ API to build lightweight, stateful, and strongly consistent applications.
○ Apps are composed of stateful functions that can arbitrarily message each other.
○ Contribution in progress
● DataSet API for batch processing
○ Flink is a great batch processing engine!
○ Process data in binary representation in managed memory.
● CEP Library for complex event processing
○ Detect patterns in event streams.
Ecosystem
Framework & Library Deployments
[Figure: framework deployment vs. library deployment]
Selected Connectors
● Event logs:
○ Kafka, Kinesis, Pulsar*
● File systems:
○ S3, HDFS, NFS, MapR FS, …
● Encodings:
○ Avro, JSON, CSV, ORC, Parquet
● Databases:
○ JDBC, Hive
● Key-Value Stores
○ Cassandra, Elasticsearch, Redis*
* Connectors available as part of other projects.
Community
Development & Releases
● Apache Flink is developed by an open source community
○ Everybody is welcome to contribute.
● Fast development pace
○ Feature releases every 3-4 months
○ Bugfix releases more frequently as needed
[Figure: release timeline]
1.5.0: 05/2018 (bugfix releases 1.5.1: 07/2018, 1.5.2: 07/2018, 1.5.3: 08/2018, 1.5.4: 09/2018, 1.5.5: 10/2018, 1.5.6: 12/2018)
1.6.0: 08/2018 (bugfix releases 1.6.1: 09/2018, 1.6.2: 10/2018, 1.6.3: 12/2018, 1.6.4: 02/2019)
1.7.0: 11/2018 (bugfix releases 1.7.1: 12/2018, 1.7.2: 02/2019)
1.8.0: 04/2019 (bugfix releases 1.8.1: 07/2019, 1.8.2: 09/2019)
1.9.0: 08/2019 (bugfix release 1.9.1: 10/2019)
Growing & Active Community
● Flink’s community is very active and growing
● The community is answering many questions every day
○ In 2018, we had the most active user mailing lists of all 200+ ASF projects
○ ~4000 questions on Stack Overflow: [apache-flink], [flink-streaming], [flink-sql]
Roadmap & Future
Unified Batch and Stream Processing
● First open source system with a unified batch and stream processing engine
○ Based on a “true” streaming engine
● Porting DataSet API into DataStream API as “Bounded Streams”
● Why?
○ One engine to maintain and improve
○ One API for all use cases (incl. backfilling and state bootstrapping)
○ Competitive performance compared to best systems of each category
○ (Proving it’s possible)
SQL, Machine Learning & Notebooks
● Full-fledged Batch and Stream SQL engine
○ Full TPC-DS support
○ Batch queries with competitive performance
○ Continuous SQL queries over streaming data
● Python Table API
● Machine Learning, Data Exploration, and Notebook Support
● Integration with Hive ecosystem
API + Runtime for Stateful Applications
● Contribution of Stateful Functions API
○ Strongly consistent, stateful applications without transactional DBMS
○ Like Functions-as-a-Service + State
○ Arbitrary and reliable messaging between functions
● Unaligned Checkpoints to enable more fine-grained checkpoints
○ Faster checkpoints yield faster recovery and tighter SLAs
Summary
● Flink powers the world’s most demanding stateful streaming applications
● Scope of applications expands quickly beyond “classical streaming”
○ Batch SQL, ML, Python, interactive notebooks
○ Event-driven, stateful applications
● Large and helpful community
@VervericaData www.ververica.com
Follow me @twalthr (yes, without the e) and grab a Flink sticker!

Stream processing with Apache Flink (Timo Walther - Ververica)
