An Introduction to Data Stream Analytics using Apache Flink
SeRC Big Data Workshop
Paris Carbone <parisc@kth.se>
PhD Candidate
KTH Royal Institute of Technology
1
Motivation
• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
2
more like First-World Problems..
How about Tsunamis?
3
4
[Figure: Q = Deploy Sensors, Collect Data, Analyse Data Regularly; earth & wave activity; evacuation window]
Motivation
5
[Figure: queries Q, Q]
Motivation
6
[Figure: Q = Standing Query; evacuation window]
Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the most recent views of the stream (windows)
7
Data Stream Basics
• Events/Tuples : elements of computation - respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
8
[Figure: stream operator f with streams S1, S2, So, S’1, S’2]
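A minimal sketch of the operator idea above, in plain Java rather than Flink: an operator wraps an input stream and lazily emits a new one, and each event is consumed exactly once. The names `StreamOperatorSketch`, `operator` and `drain` are hypothetical, and a finite `Iterator` stands in for an unbounded stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class StreamOperatorSketch {
    /** Consumes the input stream and lazily generates a new stream by applying f. */
    static <I, O> Iterator<O> operator(final Iterator<I> input, final Function<I, O> f) {
        return new Iterator<O>() {
            @Override public boolean hasNext() { return input.hasNext(); }
            @Override public O next() { return f.apply(input.next()); }
        };
    }

    /** Collects a (finite) stream into a list, for demonstration only. */
    static <T> List<T> drain(Iterator<T> stream) {
        List<T> out = new ArrayList<>();
        while (stream.hasNext()) {
            out.add(stream.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<Integer> s1 = Arrays.asList(1, 2, 3).iterator();
        Iterator<Integer> doubled = operator(s1, x -> x * 2);
        System.out.println(drain(doubled)); // [2, 4, 6]
        System.out.println(s1.hasNext());   // false: events were consumed once, no backtracking
    }
}
```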
Streaming Pipelines
9
[Figure: sources (stream1, stream2) → Q → sinks (approximations, predictions, alerts, ……)]
Stream Analytics Systems
10
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark
Programming Models
11
Compositional
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation
Declarative
• Expose a high-level API
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Introducing Apache Flink
[Figure: #unique contributor ids by git commits, Jul 2009 to May 2016, y-axis 0 to 120]
• A Top-level project
• Community-driven open source software development
• Publicly open to new contributors
Native Workload Support
Apache Flink
Stream Pipelines
Batch Pipelines
Scalable
Machine Learning
Graph Analytics
14
The Apache Flink Stack
APIs: DataSet, DataStream
Execution: Distributed Dataflow
Deployment
DataSet
• Bounded Data Sources
• Blocking Operations
• Structured Iterations
DataStream
• Unbounded Data Sources
• Continuous Operations
• Asynchronous Iterations
The Big Picture
DataStream, DataSet
Distributed Dataflow
Deployment
Graph-Gelly
Table
ML
Hadoop M/R
Table
CEP
SQL
SQL
ML
Graph-Gelly
16
Basic API Concept
Source → DataStream → Operator → DataStream → Sink
Source → DataSet → Operator → DataSet → Sink
Writing a Flink Program
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatMap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
17
DataStream
Example
18
textStream
.flatMap {_.split("\\W+")}
.map {(_, 1)}
.keyBy(0)
.sum(1)
.print()
“live and let live”
“live”	“and”	“let”	“live”
(live,1)	(and,1)	(let,1)	(live,1)
(live,1)
(and,1)
(let,1)
(live,2)
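The trace above can be reproduced with a small plain-Java sketch (not Flink code): split the line into words, keep one running count per key, and emit the updated count for every incoming word. `RunningWordCount` and `process` are hypothetical names, and the `HashMap` stands in for the per-key state that `keyBy(0).sum(1)` maintains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RunningWordCount {
    /** Emits one (word,count) record per input word, like a keyed running sum. */
    static List<String> process(String line) {
        Map<String, Integer> state = new HashMap<>(); // count kept per key
        List<String> emitted = new ArrayList<>();
        for (String word : line.split("\\W+")) {      // flatMap: line -> words
            if (word.isEmpty()) {
                continue;
            }
            int count = state.merge(word, 1, Integer::sum); // map to (word,1), then sum per key
            emitted.add("(" + word + "," + count + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(process("live and let live"));
        // [(live,1), (and,1), (let,1), (live,2)]
    }
}
```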
Working with Windows
19
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
[Figure: two timelines in #sec (0 to 120) with elements 15, 38, 65, 88, 110, 120 grouped into window buckets/panes, producing SUM #1, SUM #2, SUM #3]
1) Sliding windows
myKeyedStream.timeWindow(
Time.seconds(60),
Time.seconds(20));
2) Tumbling windows
myKeyedStream.timeWindow(
Time.seconds(60));
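To make the difference between the two window types concrete, here is a hedged plain-Java sketch of window assignment. It is not Flink's implementation; `WindowAssignmentSketch` and `windowStarts` are hypothetical names, and windows are assumed to be aligned to multiples of the slide with non-negative timestamps. A tumbling window is treated as the special case where slide equals size.

```java
import java.util.ArrayList;
import java.util.List;

class WindowAssignmentSketch {
    /**
     * Start times of all windows [start, start + size) that contain ts,
     * when a new window starts every `slide` time units.
     * For early timestamps the sketch may return negative start times.
     */
    static List<Long> windowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long latest = ts - (ts % slide); // latest window start that is <= ts
        for (long start = latest; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sliding: size 60s, slide 20s. An element at second 65 falls into three windows.
        System.out.println(windowStarts(65, 60, 20)); // [60, 40, 20]
        // Tumbling: size 60s (slide == size). The same element falls into exactly one window.
        System.out.println(windowStarts(65, 60, 60)); // [60]
    }
}
```

With a slide of 20 and a size of 60, every element belongs to size / slide = 3 overlapping windows, which is why sliding windows are usually computed from shared buckets/panes.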
Example
20
textStream
.flatMap {_.split("\\W+")}
.map {(_, 1)}
.keyBy(0)
.timeWindow(Time.minutes(5))
.sum(1)
.print()
counting words over windows
10:48 “live and” → Window (10:45-10:50): (live,1) (and,1)
11:01 “let live” → Window (11:00-11:05): (let,1) (live,1)
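The window boundaries in this example follow from aligning each event time to a multiple of the window size. A plain-Java sketch under that assumption (hypothetical class `WindowedWordCount`, not Flink code), using 5-minute tumbling windows and minute-level granularity:

```java
import java.time.LocalTime;
import java.util.Map;
import java.util.TreeMap;

class WindowedWordCount {
    private final int windowMinutes;
    // window start -> (word -> count); TreeMap keeps output ordered
    private final Map<LocalTime, Map<String, Integer>> windows = new TreeMap<>();

    WindowedWordCount(int windowMinutes) {
        this.windowMinutes = windowMinutes;
    }

    /** Aligns a time of day to the start of its tumbling window. */
    static LocalTime windowStart(LocalTime t, int windowMinutes) {
        int minuteOfDay = t.getHour() * 60 + t.getMinute();
        int start = minuteOfDay - (minuteOfDay % windowMinutes);
        return LocalTime.of(start / 60, start % 60);
    }

    /** Counts the words of one event inside the window its timestamp belongs to. */
    void add(LocalTime eventTime, String text) {
        Map<String, Integer> counts =
            windows.computeIfAbsent(windowStart(eventTime, windowMinutes), k -> new TreeMap<>());
        for (String word : text.split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    Map<LocalTime, Map<String, Integer>> result() {
        return windows;
    }

    public static void main(String[] args) {
        WindowedWordCount wc = new WindowedWordCount(5);
        wc.add(LocalTime.of(10, 48), "live and");
        wc.add(LocalTime.of(11, 1), "let live");
        System.out.println(wc.result());
        // {10:45={and=1, live=1}, 11:00={let=1, live=1}}
    }
}
```

Because the two events land in different windows, "live" is counted once in each window instead of reaching 2 as in the unwindowed example.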
Example
21
[Figure: dataflow flatMap → map → window sum → print; counts are kept in state in window sum]
textStream
.flatMap {_.split("\\W+")}
.map {(_, 1)}
.keyBy(0)
.timeWindow(Time.minutes(5))
.sum(1)
.print()
Example
22
[Figure: dataflow flatMap → map → window sum → print]
textStream
.flatMap {_.split("\\W+")}
.map {(_, 1)}
.keyBy(0)
.timeWindow(Time.minutes(5))
.sum(1)
.setParallelism(4)
.print()
Making State Explicit
23
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
• Operator State - full state
• Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS
Fault Tolerance
24
[Figure: stream of events with snapshots taken at t1 and t2 (snap - t1, snap - t2)]
State is not affected by failures.
When failures occur, we revert computation and state back to a snapshot.
Also part of Apache Storm
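A toy illustration of the revert-to-snapshot idea, in plain Java and not Flink's actual snapshotting protocol. It assumes a replayable source (here a `List` indexed by position), which the slide does not state; `SnapshotRecovery`, `Counter` and `runWithFailure` are hypothetical names. A snapshot stores the state together with the input position, so recovery restores both and replays the events that came after the snapshot.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SnapshotRecovery {
    /** Immutable copy of the state plus the input position it corresponds to. */
    static class Snapshot {
        final Map<String, Integer> state;
        final int position;
        Snapshot(Map<String, Integer> state, int position) {
            this.state = state;
            this.position = position;
        }
    }

    /** A stateful task counting events per key. */
    static class Counter {
        private Map<String, Integer> state = new HashMap<>();
        private int position = 0; // number of events consumed so far

        void process(String key) {
            state.merge(key, 1, Integer::sum);
            position++;
        }
        Snapshot snapshot() {
            return new Snapshot(new HashMap<>(state), position);
        }
        void restore(Snapshot s) {
            state = new HashMap<>(s.state);
            position = s.position;
        }
    }

    /** Takes one snapshot after `snapshotAt` events and simulates one failure at `failAt`. */
    static Map<String, Integer> runWithFailure(List<String> events, int snapshotAt, int failAt) {
        Counter task = new Counter();
        Snapshot last = task.snapshot(); // initial, empty snapshot
        boolean failed = false;
        while (task.position < events.size()) {
            if (!failed && task.position == failAt) {
                failed = true;
                task.restore(last); // revert computation and state, then replay
                continue;
            }
            task.process(events.get(task.position));
            if (task.position == snapshotAt) {
                last = task.snapshot();
            }
        }
        return task.state;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "b", "a", "c", "a");
        // Same counts as a failure-free run: a=3, b=1, c=1
        System.out.println(runWithFailure(events, 2, 4));
    }
}
```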
Performance
• Twitter Hack Week - Flink as an in-memory data store
25
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
So how is Flink different from Spark?
26
Two major differences
1) Stream Execution
2) Mutable State
Flink vs Spark
27
Flink
• dedicated resources
• mutable state
Spark (Spark Streaming)
• leased resources
• immutable state
• dstream.updateStateByKey(…): put new states in output RDD
[Figure: In, S, S’]
What about DataSets?
28
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations
Some Interesting Libraries
29
Detecting Patterns
30
CEP Java library Example
// A "seismic" event of class B, directly followed by a "tidal" event above 500
PatternStream<Event> tsunamiPattern =
    CEP.pattern(sensorStream,
        Pattern.<Event>begin("seismic").where(evt -> evt.motion.equals("ClassB"))
            .next("tidal").where(evt -> evt.elevation > 500));
// Turn every matched pattern into an evacuation alert
DataStream<Alert> result = tsunamiPattern.select(
    pattern -> {
        return getEvacuationAlert(pattern);
    });
Scala DSL coming soon
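What the pattern above expresses can be sketched without the CEP library: `next` is read here as strict contiguity, so a match is a "ClassB" motion event immediately followed by an event with elevation above 500. The `Event` class and `matchStarts` below are hypothetical stand-ins for illustration, not the library's types, and the stream is a finite list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TsunamiPatternSketch {
    /** Hypothetical sensor event carrying both fields used by the pattern. */
    static class Event {
        final String motion;
        final int elevation;
        Event(String motion, int elevation) {
            this.motion = motion;
            this.elevation = elevation;
        }
    }

    /** Returns the index of every "seismic" event that is directly followed by a "tidal" event. */
    static List<Integer> matchStarts(List<Event> stream) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + 1 < stream.size(); i++) {
            boolean seismic = "ClassB".equals(stream.get(i).motion);
            boolean tidal = stream.get(i + 1).elevation > 500;
            if (seismic && tidal) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<Event> stream = Arrays.asList(
            new Event("ClassA", 100),
            new Event("ClassB", 120),  // seismic
            new Event("ClassA", 650),  // tidal, directly after: match
            new Event("ClassB", 300),  // seismic
            new Event("ClassA", 200),  // not tidal: no match
            new Event("ClassA", 700)); // tidal, but predecessor is not seismic
        System.out.println(matchStarts(stream)); // [1]
    }
}
```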
Mining Graphs with Gelly
31
• Iterative Graph Processing
• Scatter-Gather
• Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc…
Coming Soon : Real-time graph stream support
Machine Learning Pipelines
32
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS
Relational Queries
33
Table table = tableEnv.fromDataSet(input);
Table filtered = table
.groupBy("word")
.select("word.count as count, word")
.filter("count = 2");
DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
Table API Example
SQL and Stream SQL coming soon
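For readers unfamiliar with the Table API, the query above (group by word, count, keep words whose count is 2) corresponds roughly to the following plain-Java stream pipeline. This is an analogy over an in-memory list, not Flink code; `RelationalQuerySketch` and `wordsWithCount` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class RelationalQuerySketch {
    /** groupBy(word) -> select(word.count as count, word) -> filter(count = target) */
    static Map<String, Long> wordsWithCount(List<String> words, long target) {
        return words.stream()
            // groupBy("word") with a count aggregate per group
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()))
            .entrySet().stream()
            // filter("count = 2") when target is 2
            .filter(e -> e.getValue() == target)
            .collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, (a, b) -> a, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("live", "and", "let", "live");
        System.out.println(wordsWithCount(input, 2)); // {live=2}
    }
}
```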
Real-Time Monitoring
34
…for real-time processing
Coming Soon
35
• SQL and Stream SQL
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots

Data Stream Analytics - Why they are important
