© 2015 IBM Corporation
! Agenda
- Spark Streaming 1.X
•  Features
•  Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming
•  Addressing the Improvement Areas
•  API
•  Fault Tolerance
•  Event Time
•  Managing Streaming queries
- Structured Streaming Examples
https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts
Spark Streaming 1.X
! Features of Spark Streaming
-  High-level API (stateful operations, joins, aggregates, windows etc.)
•  Overlap with the RDD API (batch)
-  Fault-tolerant (exactly-once semantics achievable)
-  Back pressure
-  Deep integration with the Spark ecosystem (MLlib, SQL, GraphX etc.)
Apache Hadoop Day 2015
Spark Streaming 1.X – Areas of improvement
! Fault-tolerance
For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the sink.
Can this be made simple for the end user?
Fault-Tolerant Semantics
-  Exactly once, if outputs are idempotent or transactional
-  Exactly once, as long as received data is not lost
-  Exactly once needs re-playable sources (e.g. Kafka Direct)
Pipeline stages: Source, Receiver, Transforming, Outputting, Sink
Spark Streaming 1.X – Areas of improvement
! Fault-tolerance
-  For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the sink
! API
-  Requests for a more seamless API between batch and stream
-  Reduce the complexity of streaming applications *
! No event-time support
-  Hard to support when processing time/batch time is exposed in the external (user-facing) API
! Streaming query management
! Micro-batch
Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits
- Extend the primary batch API to streaming as well
- Gain the optimizer and all other enhancements made in Spark SQL
! Challenge
- Remove streaming complexities, or keep them to a minimum
Let's dive in
SQL Batch vs SQL Streaming - Conceptually
Batch vs Streaming - Programmatically
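A rough plain-Python illustration of the idea, not Spark's API: the same aggregation logic is written once and used both on a bounded input (batch) and incrementally on chunks of an unbounded input (stream). The names `count_by_key` and `IncrementalCount` are hypothetical, and the in-memory `Counter` only stands in for the Result table.

```python
from collections import Counter


def count_by_key(rows):
    """Batch: aggregate a complete, bounded input in one pass."""
    return Counter(key for key, _ in rows)


class IncrementalCount:
    """Streaming: keep a result table and fold in each new chunk of rows."""

    def __init__(self):
        self.result_table = Counter()

    def process(self, new_rows):
        # Reuse the batch logic on the new rows only, then merge the counts.
        self.result_table.update(count_by_key(new_rows))
        return self.result_table


rows = [("spark", 1), ("kafka", 2), ("spark", 3), ("beam", 4)]

batch_result = count_by_key(rows)

stream = IncrementalCount()
for chunk in (rows[:2], rows[2:]):  # the same data arriving over two triggers
    stream.process(chunk)

print(batch_result == stream.result_table)
```

Once all the data has arrived, the incrementally maintained table matches the batch result, which is the property the batch/stream API unification relies on.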
Output Modes - Sink
! The output mode defines what gets written from the Result table to external storage (the sink)
! Output modes
-  Complete – The entire updated Result table is written to external storage.
-  Append – Only the new rows added to the Result table since the last incremental query execution are written to external storage.
-  Update – Only the rows updated in the Result table since the last incremental query execution are written to external storage.
It is up to the storage connector's implementation to decide how to write.
* Aggregate queries support only Complete mode, and non-aggregate queries only Append mode
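The three modes can be sketched in plain Python by comparing two snapshots of the Result table. This follows the definitions on this slide literally and is not Spark's implementation; `emit` and the dict-based table are hypothetical, and a real engine may treat details differently (for example whether newly added rows count as updated).

```python
def emit(mode, previous, current):
    """Return the rows a sink would receive for one trigger.

    previous and current map key -> value and represent the Result table
    before and after one incremental query execution.
    """
    if mode == "complete":
        # The entire updated table.
        return dict(current)
    if mode == "append":
        # Only rows whose key did not exist before this execution.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # Only rows that existed before and whose value changed.
        return {k: v for k, v in current.items()
                if k in previous and previous[k] != v}
    raise ValueError(f"unknown output mode: {mode}")


previous = {"spark": 2, "kafka": 1}
current = {"spark": 3, "kafka": 1, "beam": 1}

for mode in ("complete", "append", "update"):
    print(mode, emit(mode, previous, current))
```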
Supported Sinks & Modes in 2.0
* Debug only
Windowing in Structured Streaming
Window operations
!  Continuous time-based aggregations are the most common operations in streaming applications.
-  Sliding windows and tumbling windows
E.g. top x hashtags on Twitter in the last half hour, computed every 5 minutes
!  New function that treats windowing as a regular aggregation
!  Used in a Group By clause
Can be used in batch as well
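To make "windowing as a regular aggregation" concrete, here is a minimal plain-Python sketch (not Spark's window function): each event is assigned to every sliding window that contains its timestamp, and the window start simply becomes part of the grouping key. `windows_for` and `windowed_counts` are hypothetical names, and timestamps are plain integers in minutes.

```python
from collections import Counter


def windows_for(ts, length, slide):
    """Start times of every window [start, start + length) that contains ts."""
    start = ts - ts % slide  # latest window start at or before ts
    starts = []
    while start > ts - length:
        starts.append(start)
        start -= slide
    return sorted(starts)


def windowed_counts(events, length, slide):
    """Group by (window start, tag) and count, like any other aggregation."""
    counts = Counter()
    for ts, tag in events:
        for start in windows_for(ts, length, slide):
            counts[(start, tag)] += 1
    return counts


events = [(61, "#spark"), (63, "#spark"), (64, "#kafka"), (67, "#spark")]

# 10-minute windows sliding every 5 minutes.
counts = windowed_counts(events, length=10, slide=5)
for (start, tag), n in sorted(counts.items()):
    print(f"[{start}, {start + 10}) {tag}: {n}")
```

A tumbling window is the special case where `slide` equals `length`, so each event falls into exactly one window.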
Event Time Windows
! Event time is the time embedded within the data itself
It is not the time at which Spark received the data
! What about processing-time windows, if you want them?
Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, late data is put in its correct window group
! Use a normal filter in the SQL?
! Watermarks
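A small plain-Python sketch of both points, under simplifying assumptions: a late event still updates the window its event time belongs to, and an optional watermark (here, the maximum event time seen minus an allowed lateness) bounds how late an event may be. The class `EventTimeCounts`, the `allowed_lateness` parameter, and the drop rule are hypothetical and are not Spark's exact semantics.

```python
from collections import Counter


class EventTimeCounts:
    """Tumbling event-time window counts with an optional watermark."""

    def __init__(self, window, allowed_lateness=None):
        self.window = window
        self.allowed_lateness = allowed_lateness  # None: never drop late data
        self.max_event_time = 0
        self.result_table = Counter()
        self.dropped = []

    def add(self, event_time, key):
        self.max_event_time = max(self.max_event_time, event_time)
        if self.allowed_lateness is not None:
            watermark = self.max_event_time - self.allowed_lateness
            if event_time < watermark:
                # Assumed rule: events older than the watermark are ignored.
                self.dropped.append((event_time, key))
                return
        # The window is chosen by event time, not by arrival order.
        window_start = event_time - event_time % self.window
        self.result_table[(window_start, key)] += 1


events = [(12, "a"), (25, "a"), (14, "a")]  # the last event arrives late

no_watermark = EventTimeCounts(window=10)
with_watermark = EventTimeCounts(window=10, allowed_lateness=5)
for agg in (no_watermark, with_watermark):
    for event_time, key in events:
        agg.add(event_time, key)

print(dict(no_watermark.result_table))
print(dict(with_watermark.result_table), with_watermark.dropped)
```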
Fault Tolerance
! Why care?
! Different guarantees for data loss
-  At least once
-  Exactly once
! What can fail?
-  Driver
-  Executor
Spark 1.x Best Fault Tolerance - Kafka Direct API
Benefits of this approach
•  Simplified parallelism
•  Less storage needed
•  Exactly-once semantics (source & processing)
Fault Tolerance in Structured Streaming
Diagram: active driver, checkpoint to HDFS
! Structured Streaming checkpointing
The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing is started for that trigger
The Nth entry in the log indicates the data that is currently being processed
The (N-1)th entry in the log indicates offsets that have been idempotently written to the sink
Log entries are numbered with monotonically increasing integers
! On recovery
Restart processing of the Nth entry in the write-ahead log (WAL)
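The checkpointing and recovery steps above can be sketched as a toy simulation in plain Python. This is not Spark's implementation: `OffsetLog`, `run_trigger`, and `recover` are hypothetical names, the log lives in memory instead of a checkpoint directory, and the sink is a dict keyed by batch id so that rewriting a batch is idempotent.

```python
class OffsetLog:
    """In-memory stand-in for the offset log in the checkpoint directory."""

    def __init__(self):
        self.entries = []  # list index = batch id, value = (start, end) offsets

    def append(self, offsets):
        self.entries.append(offsets)
        return len(self.entries) - 1  # monotonically increasing batch id

    def latest(self):
        if not self.entries:
            return None
        return len(self.entries) - 1, self.entries[-1]


def run_trigger(log, source, sink, batch_size, crash=False):
    latest = log.latest()
    start = latest[1][1] if latest else 0
    end = min(start + batch_size, len(source))
    batch_id = log.append((start, end))  # log offsets *before* processing
    if crash:
        raise RuntimeError("simulated driver failure")
    sink[batch_id] = source[start:end]  # write keyed by batch id
    return batch_id


def recover(log, source, sink):
    """Replay the last logged entry; earlier entries are already in the sink."""
    latest = log.latest()
    if latest is None:
        return None
    batch_id, (start, end) = latest
    sink[batch_id] = source[start:end]
    return batch_id


source = list(range(10))
log, sink = OffsetLog(), {}

run_trigger(log, source, sink, batch_size=3)  # batch 0: offsets [0, 3)
try:
    run_trigger(log, source, sink, batch_size=3, crash=True)  # logged, then fails
except RuntimeError:
    pass
recovered = recover(log, source, sink)  # replays batch 1: offsets [3, 6)
run_trigger(log, source, sink, batch_size=3)  # batch 2: offsets [6, 9)

print(log.entries)
print(sink)
```

Because the offsets were logged before the failure, recovery reprocesses exactly the same range, and the keyed write means a replay cannot duplicate output.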
Fault Tolerance in Structured Streaming
! End-to-end exactly-once guarantees with
-  Idempotent sinks (built in for commonly used sinks, e.g. files / JDBC)
-  Replayable sources: built-in sources will *mostly* be only ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842
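Why idempotence at the sink matters when a source is replayed can be shown with a minimal plain-Python comparison. Both sink classes and the `add_batch(batch_id, rows)` method are hypothetical illustrations, not Spark's sink interface; the idempotent variant simply remembers the last batch id it committed.

```python
class AppendOnlySink:
    """Writes whatever it is given, so a replayed batch is duplicated."""

    def __init__(self):
        self.rows = []

    def add_batch(self, batch_id, rows):
        self.rows.extend(rows)
        return True


class IdempotentSink:
    """Skips any batch id it has already committed."""

    def __init__(self):
        self.rows = []
        self.last_committed = -1

    def add_batch(self, batch_id, rows):
        if batch_id <= self.last_committed:
            return False  # replay of an already written batch: ignore
        self.rows.extend(rows)
        self.last_committed = batch_id
        return True


# Batch 1 is delivered twice, as happens when it is replayed after a failure.
deliveries = [(0, ["a", "b"]), (1, ["c"]), (1, ["c"]), (2, ["d"])]

naive, idempotent = AppendOnlySink(), IdempotentSink()
for batch_id, rows in deliveries:
    naive.add_batch(batch_id, rows)
    idempotent.add_batch(batch_id, rows)

print(naive.rows)
print(idempotent.rows)
```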
Managing Streaming Queries
!  Streaming in 1.x was definitely lacking in
-  Starting / stopping individual streaming queries
-  Changing the computation done in a query
-  Handling an abnormally terminated streaming query more gracefully than an application crash
Summary
!  Overall, it has a good set of features
-  Easier code sharing between batch and streaming (no separate type hierarchies)
-  Windows are not tied to the batch interval
-  No StreamingContext
-  The optimizer is now available for your queries
!  Getting started
-  The combination of three things (output mode, sink type, and query type) takes some time to wrap your head around *
and there is not much control over them
-  You only get runtime exceptions when you get the combination wrong
!  How does it compare to Apache Beam?
For Each Sink
Thank YOU


Introduction to Structured Streaming
