Apache Flink® Training
DataStream API Basic
August 26, 2015
DataStream API
 Stream Processing
 Java and Scala
 All examples here in Java
 Documentation available at flink.apache.org
 Currently labeled as beta – some API changes are pending
• Noted in the slides with a warning
DataStream API by Example
Window WordCount: main Method
public static void main(String[] args) throws Exception {
  // set up the execution environment
  final StreamExecutionEnvironment env =
      StreamExecutionEnvironment.getExecutionEnvironment();
  DataStream<Tuple2<String, Integer>> counts = env
    // read stream of words from socket
    .socketTextStream("localhost", 9999)
    // split up the lines into tuples containing: (word, 1)
    .flatMap(new Splitter())
    // group by the tuple field "0"
    .groupBy(0)
    // keep the last 5 minutes of data
    .window(Time.of(5, TimeUnit.MINUTES))
    // sum up tuple field "1"
    .sum(1);
  // print the result to the command line
  counts.print();
  // execute program
  env.execute("Socket Incremental WordCount Example");
}
The following slides walk through this program one piece at a time.
Stream Execution Environment
// set up the execution environment
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
Data Sources
// read stream of words from socket
.socketTextStream("localhost", 9999)
Data types
DataStream<Tuple2<String, Integer>> counts = …
Transformations
.flatMap(new Splitter())
.groupBy(0)
.window(Time.of(5, TimeUnit.MINUTES))
.sum(1)
User functions
// Splitter is a user-defined FlatMapFunction (defined below)
.flatMap(new Splitter())
Data Sinks
// print the result to the command line
counts.print();
Execute!
// execute program
env.execute("Socket Incremental WordCount Example");
Window WordCount: FlatMap
public static class Splitter
    implements FlatMapFunction<String, Tuple2<String, Integer>> {
  @Override
  public void flatMap(String value,
      Collector<Tuple2<String, Integer>> out)
      throws Exception {
    // normalize and split the line
    String[] tokens = value.toLowerCase().split("\\W+");
    // emit the pairs
    for (String token : tokens) {
      if (token.length() > 0) {
        out.collect(
            new Tuple2<String, Integer>(token, 1));
      }
    }
  }
}
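For example, the input line "Hello, World!" is lower-cased and split on non-word characters, so the Splitter emits ("hello", 1) and ("world", 1).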
WordCount: Map: Interface
The Splitter implements the FlatMapFunction interface and overrides its flatMap method:
public static class Splitter
    implements FlatMapFunction<String, Tuple2<String, Integer>> {
  @Override
  public void flatMap(String value,
      Collector<Tuple2<String, Integer>> out)
      throws Exception { … }
}
WordCount: Map: Types
The input type is String (one line of text) and the output type is Tuple2<String, Integer>:
implements FlatMapFunction<String, Tuple2<String, Integer>>
WordCount: Map: Collector
Instead of returning a value, flatMap emits zero or more records through the Collector:
out.collect(new Tuple2<String, Integer>(token, 1));
DataStream API Concepts
(Selected) Data Types
 Basic Java Types
• String, Long, Integer, Boolean, …
• Arrays
 Composite Types
• Tuples
• Many more (covered in the advanced slides)
Tuples
 The easiest and most lightweight way of encapsulating data in Flink
 Tuple1 up to Tuple25
Tuple2<String, String> name = new Tuple2<>("Max", "Mustermann");
Tuple3<String, String, Integer> nameAndAge =
    new Tuple3<>("Max", "Mustermann", 42);
Tuple4<String, String, Integer, Boolean> person =
    new Tuple4<>("Max", "Mustermann", 42, true);
// zero-based index!
String firstName = person.f0;
String secondName = person.f1;
Integer age = person.f2;
Boolean fired = person.f3;
Transformations: Map
DataStream<Integer> integers = env.fromElements(1, 2, 3, 4);
// Regular map - takes one element and produces one element
DataStream<Integer> doubleIntegers =
    integers.map(new MapFunction<Integer, Integer>() {
      @Override
      public Integer map(Integer value) {
        return value * 2;
      }
    });
doubleIntegers.print();
> 2, 4, 6, 8
// Flat map - takes one element and produces zero, one, or more elements
DataStream<Integer> doubleIntegers2 =
    integers.flatMap(new FlatMapFunction<Integer, Integer>() {
      @Override
      public void flatMap(Integer value, Collector<Integer> out) {
        out.collect(value * 2);
      }
    });
doubleIntegers2.print();
> 2, 4, 6, 8
Transformations: Filter
// The DataStream
DataStream<Integer> integers = env.fromElements(1, 2, 3, 4);
DataStream<Integer> filtered =
    integers.filter(new FilterFunction<Integer>() {
      @Override
      public boolean filter(Integer value) {
        return value != 3;
      }
    });
filtered.print();
> 1, 2, 4
Transformations: Partitioning
 DataStreams can be partitioned by a key
// (name, age) of passengers
DataStream<Tuple2<String, Integer>> passengers = …
// group by the second field (age)
DataStream<Tuple2<String, Integer>> grouped = passengers.groupBy(1);
[Figure: records such as (Stephan, 18), (Fabian, 23), (Julia, 27), (Ben, 25), (Anna, 18), (Romeo, 27) are redistributed so that records with the same age, e.g. (Anna, 18) and (Stephan, 18), end up in the same partition.]
Warning: possible renaming in the next releases (see the update guide at the end of these slides).
Data Shipping Strategies
 Optionally, you can specify how data is shipped between two transformations
 Forward: stream.forward()
• Only local communication
 Rebalance: stream.rebalance()
• Round-robin partitioning
 Partition by hash: stream.partitionByHash(...)
 Custom partitioning: stream.partitionCustom(...)
 Broadcast: stream.broadcast()
• Broadcast to all nodes
A minimal sketch of one of these strategies is shown below.
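For illustration, a minimal sketch of placing a shipping strategy between two transformations. Only the rebalance() call is taken from the list above; the surrounding stream and mapper are made-up placeholders, assuming a StreamExecutionEnvironment env as in the earlier examples:

DataStream<String> lines = env.socketTextStream("localhost", 9999);
// round-robin the lines across all parallel mapper instances,
// e.g. to even out load from a single-threaded source
DataStream<Integer> lengths = lines
    .rebalance()
    .map(new MapFunction<String, Integer>() {
      @Override
      public Integer map(String value) {
        return value.length();
      }
    });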
Data Sources
Collection
 fromCollection(collection)
 fromElements(1, 2, 3, 4, 5)
Data Sources (2)
Text socket
 socketTextStream("hostname", port)
Text file
 readFileStream("/path/to/file", 1000, WatchType.PROCESS_ONLY_APPENDED)
Connectors
 E.g., Apache Kafka, RabbitMQ, …
Data Sources: Collections
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// read from elements
DataStream<String> names = env.fromElements("Some", "Example", "Strings");
// read from a Java collection
List<String> list = new ArrayList<String>();
list.add("Some");
list.add("Example");
list.add("Strings");
DataStream<String> namesFromList = env.fromCollection(list);
Data Sources: Files, Sockets, Connectors
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// read text from a socket
DataStream<String> socketLines = env
    .socketTextStream("localhost", 9999);
// read a text file, checking for new elements every 1000 milliseconds
DataStream<String> localLines = env
    .readFileStream("/path/to/file", 1000,
        WatchType.PROCESS_ONLY_APPENDED);
Data Sinks
Text
 writeAsText("/path/to/file")
CSV
 writeAsCsv("/path/to/file")
Return data to the client
 print()
Note: identical to the DataSet API
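A minimal usage sketch with the (word, count) stream from the earlier WordCount example; the output paths are placeholders:

// write the tuple stream to files
counts.writeAsText("/path/to/output.txt");
counts.writeAsCsv("/path/to/output.csv");
// or print it on the client
counts.print();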
Data Sinks (2)
Socket
 writeToSocket(hostname, port, SerializationSchema)
Connectors
 E.g., Apache Kafka, Elasticsearch, rolling HDFS files
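A minimal sketch of the socket sink, assuming a DataStream<String> named result and the SimpleStringSchema shipped with Flink's streaming connectors:

// send each record, serialized as a string, to a socket
result.writeToSocket("localhost", 9999, new SimpleStringSchema());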
Data Sinks
 Lazily executed when env.execute() is called
DataStream<…> result;
// nothing happens
result.writeToSocket(...);
// nothing happens
result.writeAsText("/path/to/file", "\n", "|");
// execution really starts here
env.execute();
Fault Tolerance
Fault Tolerance in Flink
 Flink provides recovery by taking a consistent checkpoint every N milliseconds and rolling back to the checkpointed state on failure
• https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
 Exactly once (default)
• // take a checkpoint every 5000 milliseconds
env.enableCheckpointing(5000)
 At least once (for lower latency)
• // take a checkpoint every 5000 milliseconds
env.enableCheckpointing(5000, CheckpointingMode.AT_LEAST_ONCE)
 Setting the interval to a few seconds should be good for most applications
 If checkpointing is not enabled, no recovery guarantees are provided
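Putting it together, a minimal sketch of enabling checkpointing in a program; the interval and mode values are taken from the slide above:

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// take a consistent checkpoint every 5 seconds; exactly-once is the default
env.enableCheckpointing(5000);
// or trade guarantees for lower latency:
// env.enableCheckpointing(5000, CheckpointingMode.AT_LEAST_ONCE);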
Best Practices
Some advice
 Use env.fromElements(..) or env.fromCollection(..) to quickly get a DataStream to experiment with
 Use print() to quickly print a DataStream
A minimal sketch combining both tips is shown below.
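A minimal sketch combining both tips; the element values are arbitrary:

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// build a small in-memory stream to experiment with
DataStream<Integer> experiment = env.fromElements(1, 2, 3, 4);
// print it and run
experiment.print();
env.execute("Experiment");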
Update Guide
From 0.9 to 0.10
 groupBy(…) -> keyBy(…)
 DataStream renames:
• KeyedDataStream -> KeyedStream
• WindowedDataStream -> WindowedStream
• ConnectedDataStream -> ConnectedStreams
• JoinOperator -> JoinedStreams
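A before/after sketch of the grouping rename, assuming a stream of (word, 1) tuples named words:

// Flink 0.9
DataStream<Tuple2<String, Integer>> counts = words.groupBy(0).sum(1);
// Flink 0.10
DataStream<Tuple2<String, Integer>> counts = words.keyBy(0).sum(1);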