Stream processing with Apache Flink (Timo Walther - Ververica)

© 2019 Ververica 1
Apache Flink®
An Introduction and Outlook into the Future
Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.
Timo Walther
Follow me: @twalthr (yes, without the e)

About me
Timo Walther
● Apache Flink Committer and PMC Member
● Part of Flink since 2013 (before it was actually called "Flink")
● Software Engineer @ Ververica
(formerly dataArtisans, now part of Alibaba Group)

About Ververica
Original creators of Apache Flink®
Complete Stream Processing Infrastructure

Ververica Platform

This talk is about Apache Flink
● What is Flink?
● Use Cases & Users
● Stateful Stream Processing
● Event-Time Processing
● APIs
● Ecosystem
● Community
● Roadmap & Future

What is Flink?

Core Building Blocks for Stream Processing
● Event Streams: real-time and replay
● State: complex business logic
● (Event) Time: consistency with out-of-order data and late data
● Snapshots: forking / versioning / time-travel

What is Apache Flink?
Scalable embedded state
Access at memory speed &
scales with parallel operators.

What is Apache Flink?
Stateful computations over streams
real-time and historic:
fast, scalable, fault tolerant,
event time, large state, exactly-once

Flink Unifies Stream and Batch Processing
● Processes unbounded (stream) and bounded (batch) data
● Processes recorded (offline) and live (real-time) data
● Serves most streaming & batch use cases
– Data Pipelines, Analytics, CEP, Event-driven Applications

Consistency, Scale, Ecosystem
● Flexible and expressive APIs
● Guaranteed correctness
○ Exactly-once state consistency
○ Event-time semantics
● In-memory processing at massive scale
○ Runs on 10000s of cores
○ Manages 10s TBs of state
● Flexible deployments and large ecosystem
○ Kubernetes, YARN, Mesos, Docker, S3, HDFS, Kafka, Kinesis, …

Use Cases & Users

Use Case: ETL and Data Pipelining
● Periodic ETL is the traditional approach
○ External tool periodically triggers ETL batch job
○ Also supported by Flink
● Data pipelines continuously move data
○ Ingestion with low latency
○ No external tool
○ No artificial data boundaries

Use Case: Batch & Stream Analytics
● Batch analytics is great for ad-hoc queries
○ Queries change faster than data
○ Interactive analytics / prototyping
● Stream analytics continuously processes data
○ Data changes faster than queries
○ Live / low latency results
○ No Lambda architecture required!

Use Case: Event-Driven Applications
● Traditional application design
○ Compute & data tier architecture
○ React to and process events
○ State is stored in (remote) database
● Event-driven application
○ State is maintained locally
○ Guaranteed consistency by periodic state checkpoints
○ Tight coupling of logic and data (microservice architecture)
○ Highly scalable design

Powered By Apache Flink
Details about their use cases and more users are listed on Flink’s website at https://flink.apache.org/poweredby.html

Rapidly Growing Adoption
Source: Qubole “2018 Survey of Big Data Trends and Challenges.”
A survey among 400+ technology decision makers about their big data projects.
125%

Stateful Stream Processing

Designing Applications as Data Flows
● Data Flows are a common programming abstraction.
● Events flow from operator to operator.
● Data Flows can be executed in parallel.
[Figure: data flow with the operators Src, Map (user function), keyBy, Window (user function), and Snk]
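
The operator roles in this data flow can be imitated in plain Java. This is only a sketch and not Flink code: the class `DataFlowSketch`, the method `countPerKey`, and the sample input are made up for illustration, and `java.util.stream` stands in for Flink's operators.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java imitation of Src -> Map -> keyBy -> aggregate -> Snk (not Flink API).
class DataFlowSketch {
    static Map<String, Long> countPerKey(List<String> source) {
        return source.stream()                   // Src: emit events one by one
                .map(String::toLowerCase)        // Map: user function applied per event
                .collect(Collectors.groupingBy(  // keyBy: group events by key
                        word -> word,
                        TreeMap::new,
                        Collectors.counting())); // aggregate: count events per key
    }

    public static void main(String[] args) {
        // Snk: print the result, here {a=3, b=2}
        System.out.println(countPerKey(List.of("a", "B", "b", "A", "a")));
    }
}
```

Because events with the same key always end up in the same group, the per-key work could be spread over parallel instances without changing the result.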

What is State in a Streaming Application?
● Many functions are stateful
○ Streaming data arrives over time
○ Functions need to remember records or temporary results
● Any variable that lives across function invocations is state
● State must not be lost in case of a failure
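
As a minimal illustration of "a variable that lives across function invocations", the sketch below keeps a per-user counter between calls. It is plain Java, not Flink's state API; the names `RunningCount` and `onEvent` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// A function with state: the map outlives each single invocation of onEvent.
class RunningCount {
    // State: number of events seen so far, per user.
    private final Map<String, Long> countPerUser = new HashMap<>();

    // Called once per event; returns the updated count for that user.
    long onEvent(String userId) {
        return countPerUser.merge(userId, 1L, Long::sum);
    }
}
```

If the process holding `countPerUser` crashes, the counts are gone, which is exactly why state has to be protected against failures.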

Maintaining and Checkpointing State
● Flink maintains state locally per task (in-mem / on-disk)
○ Fast access!
● State is periodically checkpointed to durable storage
○ A checkpoint is a consistent snapshot of the state of all tasks

Checkpoint Consistency
● All tasks copy their state exactly when they have processed all events up to the same position in the input
○ State of source tasks includes current read position in input (e.g., Kafka offset)
[Figure: a source task with task state (read position), a stateless task, and a task with task state (partial aggregate)]

Recovery and Guaranteed Consistency
● Recovery is like loading a saved computer game.
● Flink recovers state with exactly-once consistency.
○ After a failure, the application is restarted.
○ All tasks load their state from the latest checkpoint.
○ The application continues as if the failure never happened.
[Figure: computer game screens "Game saved!", "GAME OVER!", and "Loading Game..."]
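
The idea of a consistent snapshot and recovery can be sketched with a toy, single-threaded model. It is not Flink's checkpointing algorithm: `CheckpointSketch`, `runWithFailure`, and the list-based input are assumptions for illustration. The point it shows is that the read position and the operator state are saved together, so replaying from the saved position neither loses nor double-counts events.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of checkpoint and recovery (not Flink's implementation).
class CheckpointSketch {
    // A checkpoint pairs the source read position with the state at that position.
    static final class Checkpoint {
        final int offset;
        final Map<String, Long> counts;

        Checkpoint(int offset, Map<String, Long> counts) {
            this.offset = offset;
            this.counts = new HashMap<>(counts); // copy, so later updates do not leak in
        }
    }

    int offset = 0;                              // source state: read position
    Map<String, Long> counts = new HashMap<>();  // operator state: partial aggregate

    // Process input[offset, until) and update the per-key counts.
    void process(List<String> input, int until) {
        while (offset < until) {
            counts.merge(input.get(offset), 1L, Long::sum);
            offset++;
        }
    }

    Checkpoint snapshot() {
        return new Checkpoint(offset, counts);
    }

    void restore(Checkpoint cp) {
        offset = cp.offset;
        counts = new HashMap<>(cp.counts);
    }

    // Take a checkpoint, make further progress, "fail", recover, and finish.
    static Map<String, Long> runWithFailure(List<String> input, int checkpointAt, int failAt) {
        CheckpointSketch job = new CheckpointSketch();
        job.process(input, checkpointAt);
        Checkpoint cp = job.snapshot();
        job.process(input, failAt);       // progress that is lost in the failure
        job.restore(cp);                  // "load the saved game"
        job.process(input, input.size()); // replay from the checkpointed position
        return job.counts;
    }
}
```

In this model the result after recovery equals the result of a run without any failure, which is the meaning of exactly-once state consistency on the slide.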

Much More Than Just Exactly-Once Recovery!
● Suspend and resume applications
● Fix and upgrade applications
● Migrate applications to a different / upgraded cluster
● Scale applications in and out
● A/B test applications
● ...

Event-Time Processing

What is Time in a Streaming Application?
● Streaming data arrives over time.
● Many streaming computations are defined based on time.
○ “Count the number of records every 10 minutes.”
○ “Run some logic 1 hour after you saw this record.”
○ “Wait for 30 more seconds for data to arrive.”
● This raises some questions.
○ How does Flink measure time?
○ How does time relate to data?

Event-Time and Processing-Time
[Figure: an event generator (mobile app, webserver, sensor, ...) emits events with timestamps (12:00:01, 11:59:56, 11:58:37). In the event-time job, application time is driven by data; in the processing-time job, application time is driven by the machine clock.]
What is Processing-Time?
● A record is processed based on the wall-clock time when it arrives.
● Results are inherently non-deterministic and depend on
○ Clocks, load, and processing speed of machines
○ Arrival / ingestion rate of data and possibly backpressure
○ ...
● Applicability of processing-time
○ Does not work for recorded data.
○ Does not work for data that arrives out-of-order
○ Might be sufficient for approximate, low-latency results

What is Event-Time?
● A record is processed based on an embedded timestamp.
○ Timestamp typically denotes time when record was created.
● The “current” time is determined by watermarks
○ A watermark is a special record with a timestamp w
○ Denotes that no more records with a time t <= w will arrive
● Properties of event-time processing
○ Results are deterministic
○ Same semantics when processing recorded and live data
○ Can trade result latency for result completeness
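
How watermarks drive event time can be sketched with a toy window counter in plain Java. Everything here is an assumption for illustration and not Flink API: the class `EventTimeSketch`, fixed-size windows, and a watermark derived as "largest timestamp seen minus `maxDelay`". A window is emitted once the watermark says that no more records for it will arrive; records for an already emitted window count as late.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy event-time window counter driven by watermarks (not Flink API).
class EventTimeSketch {
    final long windowSize;
    final long maxDelay;                     // tolerated out-of-orderness
    long maxTimestamp = Long.MIN_VALUE;
    long watermark = Long.MIN_VALUE;
    final TreeMap<Long, Integer> openWindows = new TreeMap<>(); // window start -> count
    final List<String> results = new ArrayList<>();
    int lateRecords = 0;

    EventTimeSketch(long windowSize, long maxDelay) {
        this.windowSize = windowSize;
        this.maxDelay = maxDelay;
    }

    void onRecord(long timestamp) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        // The window [start, start + size) was already emitted: the record is late.
        if (start + windowSize - 1 <= watermark) {
            lateRecords++;
            return;
        }
        openWindows.merge(start, 1, Integer::sum);
        maxTimestamp = Math.max(maxTimestamp, timestamp);
        advanceWatermark(maxTimestamp - maxDelay);
    }

    void advanceWatermark(long w) {
        if (w <= watermark) {
            return;                          // watermarks never move backwards
        }
        watermark = w;
        // No record with t <= w will arrive, so windows ending at or before w + 1 are complete.
        while (!openWindows.isEmpty() && openWindows.firstKey() + windowSize - 1 <= watermark) {
            Map.Entry<Long, Integer> e = openWindows.pollFirstEntry();
            results.add("[" + e.getKey() + "," + (e.getKey() + windowSize) + ")=" + e.getValue());
        }
    }
}
```

A larger `maxDelay` makes fewer records late but delays every result, which is the latency versus completeness trade-off named above.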

Layered APIs

SQL & Table API
● Unified APIs for streaming data and data at rest
○ Run the same query on batch and streaming data
○ ANSI SQL: No stream-specific syntax or semantics!
○ Many common stream analytics use cases supported
SELECT
  userId,
  COUNT(*) AS cnt,
  SESSION_START(clicktime, INTERVAL '30' MINUTE)
FROM clicks
GROUP BY
  SESSION(clicktime, INTERVAL '30' MINUTE),
  userId
Count clicks per user and session
(defined by 30 min. gap of inactivity).
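
What "session defined by a gap of inactivity" means can be shown for a single user in plain Java. This is a sketch and not how Flink evaluates the query: `SessionSketch` and `sessionCounts` are hypothetical names, the input is assumed to be complete, and the choice that a pause strictly longer than the gap starts a new session is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Splits one user's click times into sessions and counts the clicks per session.
class SessionSketch {
    static List<Integer> sessionCounts(long[] clickTimes, long gap) {
        long[] sorted = clickTimes.clone();
        Arrays.sort(sorted);                 // sessions are defined on event time order
        List<Integer> counts = new ArrayList<>();
        int current = 0;
        for (int i = 0; i < sorted.length; i++) {
            // A pause longer than the gap closes the current session.
            if (i > 0 && sorted[i] - sorted[i - 1] > gap) {
                counts.add(current);
                current = 0;
            }
            current++;
        }
        if (current > 0) {
            counts.add(current);             // close the last session
        }
        return counts;
    }
}
```

With click times in minutes and a gap of 30, the clicks at 0, 5, 20, 60, 65, and 200 fall into three sessions with 3, 2, and 1 clicks.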

DataStream API
● Programs are composed as data flows
● Logic is implemented as custom user functions
○ map, flatMap, reduce, window aggregation, window join, asynchronous request function, …
● Data is processed as arbitrary Java/Scala objects
○ (Avro) POJOs, Tuple, Row

DataStream API Example
// a stream of website clicks
DataStream<Click> clicks = ...
DataStream<Tuple2<String, Long>> result = clicks
    // project clicks to userId and add a 1 for counting
    .map(
        // define function by implementing the MapFunction interface.
        new MapFunction<Click, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(Click click) {
                return Tuple2.of(click.userId, 1L);
            }
        })
    // key by userId (field 0)
    .keyBy(0)
    // define session window with 30 minute gap
    .window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
    // count clicks per session. Define function as lambda function.
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
Count clicks per user and session
(defined by 30 min. gap of inactivity).
Same use case as previous SQL query.

ProcessFunctions
● Flink’s most expressive function interfaces
○ Expose access to State and Time
○ Are embedded in DataStream programs
● Enable powerful applications
○ Put events or intermediate results into state for future computations
○ Register timers to be called back once “time is up”
● A collection of multiple function interfaces
○ 1 input, 1 windowed input, 2 key-partitioned inputs, 2 broadcast/forwarded inputs, ...
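
The combination of state and timers can be sketched with a toy inactivity detector in plain Java. It deliberately does not use Flink's ProcessFunction interfaces: `InactivityAlert`, `processElement`, `onWatermark`, and `onTimer` are names chosen for this sketch, and calling `onWatermark` by hand stands in for the runtime advancing event time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of keyed state plus event-time timers (not Flink's ProcessFunction API).
class InactivityAlert {
    final long timeout;
    final Map<String, Long> lastSeen = new HashMap<>(); // keyed state: last event time per key
    final Map<String, Long> timers = new HashMap<>();   // one pending timer per key
    final List<String> alerts = new ArrayList<>();

    InactivityAlert(long timeout) {
        this.timeout = timeout;
    }

    // Called for every event: remember it and (re)register the timer for its key.
    void processElement(String key, long timestamp) {
        lastSeen.put(key, timestamp);
        timers.put(key, timestamp + timeout);
    }

    // Called when event time advances: fire all timers that are due.
    void onWatermark(long watermark) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> timer : timers.entrySet()) {
            if (timer.getValue() <= watermark) {
                due.add(timer.getKey());
            }
        }
        Collections.sort(due);               // deterministic firing order for the sketch
        for (String key : due) {
            timers.remove(key);
            onTimer(key);
        }
    }

    // Callback once "time is up": emit an alert and clear the state for the key.
    void onTimer(String key) {
        alerts.add(key + " inactive since " + lastSeen.remove(key));
    }
}
```

Each new event for a key pushes its timer further into the future, so an alert only fires for keys that stayed silent for the whole timeout.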

DSL & Libraries
● Stateful Functions
○ API to build lightweight, stateful, and strongly consistent applications.
○ Apps are composed of stateful functions that can arbitrarily message each other.
○ Contribution in progress
● DataSet API for batch processing
○ Flink is a great batch processing engine!
○ Process data in binary representation in managed memory.
● CEP Library for complex event processing
○ Detect patterns in event streams.

Ecosystem

Framework & Library Deployments
[Figure: Framework Deployment vs. Library Deployment]

Selected Connectors
● Event logs:
○ Kafka, Kinesis, Pulsar*
● File systems:
○ S3, HDFS, NFS, MapR FS, …
● Encodings:
○ Avro, JSON, CSV, ORC, Parquet
● Databases:
○ JDBC, Hive
● Key-Value Stores
○ Cassandra, Elasticsearch, Redis*
* Connectors available as part of other projects.

Community

Development & Releases
● Apache Flink is developed by an open source community
○ Everybody is welcome to contribute.
● Fast development pace
○ Feature releases every 3-4 months
○ Bugfix releases more frequently as needed
● 1.5.0: 05/2018
○ 1.5.1: 07/2018, 1.5.2: 07/2018, 1.5.3: 08/2018, 1.5.4: 09/2018, 1.5.5: 10/2018, 1.5.6: 12/2018
● 1.6.0: 08/2018
○ 1.6.1: 09/2018, 1.6.2: 10/2018, 1.6.3: 12/2018, 1.6.4: 02/2019
● 1.7.0: 11/2018
○ 1.7.1: 12/2018, 1.7.2: 02/2019
● 1.8.0: 04/2019
○ 1.8.1: 07/2019, 1.8.2: 09/2019
● 1.9.0: 08/2019
○ 1.9.1: 10/2019

Growing & Active Community
● Flink’s community is very active and growing
● The community is answering many questions every day
○ In 2018, we had the most active user mailing lists of all 200+ ASF projects
○ ~4000 questions on Stack Overflow: [apache-flink], [flink-streaming], [flink-sql]

Roadmap & Future

Unified Batch and Stream Processing
● First open source system with a unified batch and stream processing engine
○ Based on a “true” streaming engine
● Porting DataSet API into DataStream API as “Bounded Streams”
● Why?
○ One engine to maintain and improve
○ One API for all use cases (incl. backfilling and state bootstrapping)
○ Competitive performance compared to best systems of each category
○ (Proving it’s possible)

SQL, Machine Learning & Notebooks
● Full-fledged Batch and Stream SQL engine
○ Full TPC-DS support
○ Batch queries with competitive performance
○ Continuous SQL queries over streaming data
● Python Table API
● Machine Learning, Data Exploration, and Notebook Support
● Integration with Hive ecosystem

API + Runtime for Stateful Applications
● Contribution of Stateful Functions API
○ Strongly consistent, stateful applications without transactional DBMS
○ Like Functions-as-a-Service + State
○ Arbitrary and reliable messaging between functions
● Unaligned Checkpoints to enable more fine-grained checkpoints
○ Faster checkpoints yield faster recovery and tighter SLAs

Summary
● Flink powers the world’s most demanding stateful streaming applications
● Scope of applications expands quickly beyond “classical streaming”
○ Batch SQL, ML, Python, interactive notebooks
○ Event-driven, stateful applications
● Large and helpful community

@VervericaData www.ververica.com
Follow me @twalthr (yes, without the e) and grab a Flink sticker!
