Real-time Stream Processing with Apache Flink
Marton Balassi – data Artisans
Gyula Fora – SICS
Flink committers
mbalassi@apache.org / gyfora@apache.org
Stream Processing
 Data stream: an infinite sequence of data arriving in a continuous fashion.
 Stream processing: analyzing and acting on real-time streaming data, using continuous queries.
Streaming landscape
Apache Storm
•True streaming, low latency - lower throughput
•Low level API (Bolts, Spouts) + Trident
Spark Streaming
•Stream processing on top of batch system, high throughput - higher latency
•Functional API (DStreams), restricted by batch runtime
Apache Samza
•True streaming built on top of Apache Kafka, state is first class citizen
•Slightly different stream notion, low level API
Apache Flink
•True streaming with adjustable latency-throughput trade-off
•Rich functional API exploiting streaming runtime; e.g. rich windowing semantics
Apache Storm
 True streaming, low latency - lower throughput
 Low level API (Bolts, Spouts) + Trident
 At-least-once processing guarantees
Issues
 Costly fault tolerance
 Serialization
 Low level API
Spark Streaming
 Stream processing emulated on a batch system
 High throughput - higher latency
 Functional API (DStreams)
 Exactly-once processing guarantees
Issues
 Restricted streaming semantics
 Windowing
 High latency
Apache Samza
 True streaming built on top of Apache Kafka
 Slightly different stream notion, low level API
 At-least-once processing guarantees with state
Issues
 High disk IO
 Low level API
Apache Flink
 True streaming with adjustable latency and throughput
 Rich functional API exploiting streaming runtime
 Flexible windowing semantics
 Exactly-once processing guarantees with (small) state
Issues
 Limited state size
 No job manager high availability (HA) yet
What is Flink
A "use-case complete" framework to unify batch and stream processing
[Diagram: event logs and historic data feed Flink use cases: ETL, relational queries, graph analysis, machine learning, and streaming analysis.]
What is Flink
An engine that puts equal emphasis on streaming and batch
[Diagram: historic data (HDFS, JDBC, ...) and real-time event streams (Kafka, RabbitMQ, ...) flow into Flink, which serves ETL, graphs, machine learning, and relational workloads as well as low-latency windowing and aggregations.]
Flink stack
[Diagram: libraries (Python, Gelly, Table, FlinkML, SAMOA, Hadoop M/R) sit on the DataSet (Java/Scala) and DataStream (Java/Scala) APIs; a Batch Optimizer and a Streaming Optimizer compile both into dataflows executed by the Flink Runtime, deployable as Local, Remote, YARN, Tez, or Embedded.]
*current Flink master + few PRs
Flink Streaming
Overview of the API
 Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality
 Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: Policy based flexible windowing (Time, Count, Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
 Data stream outputs
 For the details please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html
[Diagram: a sample topology in which two sources (Src) are merged, then flow through Filter, Map, Reduce, and Sum operators into a Sink.]
Use-case: Financial analytics
 Reading from multiple inputs
• Merge stock data from various sources
 Window aggregations
• Compute simple statistics over windows of data
 Data-driven windows
• Define arbitrary windowing semantics
 Combine with sentiment analysis
• Enrich your analytics with social media feeds (Twitter)
 Streaming joins
• Join multiple data streams
 Detailed explanation and source code on our blog
• http://flink.apache.org/news/2015/02/09/streaming-example.html
Reading from multiple inputs

case class StockPrice(symbol : String, price : Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

val socketStockStream = env.socketTextStream("localhost", 9999)
  .map(x => { val split = x.split(",")
    StockPrice(split(0), split(1).toDouble) })

val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)
[Diagram: four parallel pipeline instances. Socket lines "HDP, 23.8" and "HDP, 26.6" are parsed and merged with the generated StockPrice(SPX, 2113.9) and StockPrice(FTSE, 6931.7) records into a single stockStream.]
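The parse-and-wrap step inside the map above can be exercised as plain Scala, independent of any Flink runtime. The `parse` helper name and the `.trim` calls are our additions for illustration:

```scala
// Plain-Scala sketch of the map step: turn a "SYMBOL, price" line into a StockPrice.
// Mirrors the slide's split-and-convert logic; `parse` is an illustrative helper.
case class StockPrice(symbol: String, price: Double)

def parse(line: String): StockPrice = {
  val split = line.split(",")
  StockPrice(split(0).trim, split(1).trim.toDouble)
}
```

For the socket input shown in the example, parse("HDP, 23.8") yields StockPrice("HDP", 23.8).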
Window aggregations
val windowedStream = stockStream
  .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

val lowest = windowedStream.minBy("price")
val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
[Diagram: over one 10-second window containing SPX, FTSE, and two HDP prices, lowest emits the single cheapest record, maxByStock the per-symbol maxima, and rollingMean the per-symbol means, e.g. StockPrice(HDP, 25.2).]
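To build intuition for what window(...).every(...) computes, here is a plain-Scala emulation of a size-10 window sliding by 5 over timestamped events. This is not the Flink API; the function name and shape are ours:

```scala
// Emulates a time window of size `size` sliding by `slide` over (timestamp, value)
// pairs: each window collects the values whose timestamps fall in [start, start + size).
def slidingWindows[A](events: Seq[(Long, A)], size: Long, slide: Long): Seq[Seq[A]] =
  if (events.isEmpty) Seq.empty
  else (0L to events.map(_._1).max by slide).map { start =>
    events.collect { case (ts, v) if ts >= start && ts < start + size => v }
  }
```

With size 10 and slide 5, an event at timestamp 6 appears in both the window starting at 0 and the one starting at 5, which is exactly the overlap the slide's aggregations run over.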
Data-driven windows
case class Count(symbol : String, count : Int)

val priceWarnings = stockStream.groupBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

val warningsPerStock = priceWarnings.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
[Diagram: the HDP price jump from 23.8 to 26.6 exceeds the 5% delta and triggers a warning, which the 30-second window sums into Count(HDP, 1).]
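The Delta policy can be approximated in plain Scala: close the current window whenever a value drifts more than a threshold from the last trigger point. This is our own simplified reading of delta-based triggering, not Flink's exact implementation:

```scala
// Simplified delta windowing: a window is closed whenever a price differs from the
// reference value by more than `threshold`; the triggering element ends the window
// and becomes the new reference.
def deltaWindows(prices: Seq[Double], threshold: Double, start: Double): Seq[Seq[Double]] = {
  val (closed, open, _) =
    prices.foldLeft((Vector.empty[Seq[Double]], Vector.empty[Double], start)) {
      case ((done, cur, ref), p) =>
        if (math.abs(p - ref) > threshold) (done :+ (cur :+ p), Vector.empty[Double], p)
        else (done, cur :+ p, ref)
    }
  if (open.nonEmpty) closed :+ open else closed
}
```

This is the kind of data-driven behavior a pure time or count window cannot express: the window boundaries depend on the values themselves.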
Combining with a Twitter stream
val tweetStream = env.addSource(generateTweets _)

val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
  .map(_.toUpperCase())
  .filter(symbols.contains(_))

val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
[Diagram: the tweets "hdp is on the rise!" and "I wish I bought more YHOO and HDP stocks" produce Count(HDP, 2) and Count(YHOO, 1) within the window.]
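The symbol-extraction logic on this slide runs unchanged as plain Scala collection operations, so it can be checked without a cluster. The `symbols` watch list and the batch-style counting are illustrative stand-ins for the windowed sum:

```scala
// Count stock-symbol mentions in tweets: split into words, upper-case them,
// keep known symbols, and count per symbol.
val symbols = Set("SPX", "FTSE", "HDP", "YHOO") // illustrative watch list

def countMentions(tweets: Seq[String]): Map[String, Int] =
  tweets.flatMap(_.split(" "))
    .map(_.toUpperCase)
    .filter(symbols.contains)
    .groupBy(identity)
    .map { case (s, ws) => (s, ws.size) }
```

Note that "rise!" is not counted: the filter only passes exact symbol tokens, matching the slide's output of Count(HDP, 2) and Count(YHOO, 1).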
Streaming joins
val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
  .onWindow(30, SECONDS)
  .where("symbol")
  .equalTo("symbol") { (c1, c2) => (c1.count, c2.count) }

val rollingCorrelation = tweetsAndWarning
  .window(Time.of(30, SECONDS))
  .mapWindow(computeCorrelation _)
[Diagram: joining Count(HDP, 1) warnings with Count(HDP, 2) tweet counts yields the pair (1, 2); computeCorrelation then produces a rolling correlation value such as 0.5.]
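Within a single window, the equi-join on "symbol" reduces to pairing the counts that share a key. A plain-Scala sketch of that step, with our own names rather than the Flink API:

```scala
case class Count(symbol: String, count: Int)

// Per-window equi-join on symbol: emit (warningCount, tweetCount) for each
// matching pair, mirroring the slide's where/equalTo/apply step.
def joinOnSymbol(warnings: Seq[Count], tweets: Seq[Count]): Seq[(Int, Int)] =
  for {
    w <- warnings
    t <- tweets
    if w.symbol == t.symbol
  } yield (w.count, t.count)
```

For the window shown, joining Count(HDP, 1) against Count(HDP, 2) and Count(YHOO, 1) produces just the pair (1, 2), since YHOO has no matching warning.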
Fault tolerance
 Exactly-once semantics
• Asynchronous barrier snapshotting
• Checkpoint barriers streamed from the sources
• Operator state checkpointing + source backup
• Pluggable backend for state management
[Diagram: the job manager (JM) injects checkpoint barriers at the sources; barriers flow along the data channels through the operators, and each operator checkpoints its state to the state manager (SM).]
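As a toy model of barrier snapshotting, think of an operator folding state over a stream in which barriers are interleaved with events; each barrier records the state reached so far. This is a deliberately minimal sketch of the idea, not Flink's actual mechanism:

```scala
// Toy barrier snapshotting: sum events into operator state; whenever a barrier
// arrives, record the current state as a checkpoint. Exactly the events seen
// before a barrier are reflected in that barrier's checkpoint.
sealed trait Msg
final case class Event(v: Int) extends Msg
case object Barrier extends Msg

def run(stream: Seq[Msg]): (Int, Seq[Int]) =
  stream.foldLeft((0, Vector.empty[Int])) {
    case ((state, checkpoints), Event(v)) => (state + v, checkpoints)
    case ((state, checkpoints), Barrier)  => (state, checkpoints :+ state)
  }
```

Because the barrier travels in-band with the data, no event is ever counted in two checkpoints or missed, which is what makes the exactly-once guarantee possible without stopping the stream.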
Performance
 Performance optimizations
• Effective serialization due to strongly typed topologies
• Operator chaining (thread sharing/no serialization)
• Different automatic query optimizations
 Competitive performance
• ~ 1.5m events / sec / core
• As a comparison Storm promises ~ 1m tuples / sec / node
Roadmap
 Persistent, high-throughput state backend
 Job manager high availability
 Application libraries
• General statistics over streams
• Pattern matching
• Machine learning pipelines library
• Streaming graph processing library
 Integration with other frameworks
• Zeppelin (Notebook)
• SAMOA (Online ML)
Summary
 Flink is a use-case complete framework to unify batch
and stream processing
 True streaming runtime with high-level APIs
 Flexible, data-driven windowing semantics
 Competitive performance
 We are just getting started!
Flink Community
[Chart: unique git contributors over time, July 2009 – May 2016.]
flink.apache.org
@ApacheFlink
Editor's Notes

  • #14: Three main components make up the system: connectors (sources), operators (transformations), and sinks (outputs). The source interface is as general as it gets, plus pre-implemented connectors. A rich set of operators is designed for true streaming analytics (long-standing, stateful, windowing). Sinks are just as general as sources: simple interfaces plus pre-implemented ones.
  • #15: The goal is to showcase the main features of the API on a "real world" example. Use case: analyze streams of stock market data consisting of (stock symbol, stock price) pairs. Sentiment analysis: combine the market information with information acquired from social media feeds (here, the number of times a stock symbol was mentioned in the Twitter stream), using stream joins.
  • #16: As a first step we need to connect to our data streams and parse the inputs. Here we use a simple socket stream and convert it to a case class; we could have used Kafka or any other message queue, or a more advanced stock representation.
  • #17: Policy-based windowing: eviction (window size) and trigger (slide size). Window operations: reduce, mapWindow. Grouped vs. non-grouped windowing.
  • #18: Flexible windowing enables powerful features such as the Delta policy; the Count case class is for convenience. Another good use case: detecting user sessions.
  • #19: A simple windowed word count on the filtered tweet stream for each symbol.