Spark Streaming
Large-scale near-real-time stream
processing
Tathagata Das (TD)
UC Berkeley
What is Spark Streaming?
 Framework for large-scale stream processing
- Scales to 100s of nodes
- Can achieve second scale latencies
- Integrates with Spark’s batch and interactive processing
- Provides a simple batch-like API for implementing complex algorithms (see the sketch after this list)
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
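To make this concrete, here is a minimal sketch (not from the deck) of how such a program is set up; the master URL, app name, and batch interval are illustrative placeholders:

import spark.streaming.{Seconds, StreamingContext}

// Spark 0.7-era constructor: master URL, app name, batch duration.
// The live stream is chopped into 1-second batches.
val ssc = new StreamingContext("local[2]", "HashTagApp", Seconds(1))

// Attach a source (here Twitter, as in the examples later in the deck) ...
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

// ... declare transformations and output operations, then start the job.
ssc.start()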
Motivation
 Many important applications must process large streams of live data and provide
results in near-real-time
- Social network trends
- Website statistics
- Intrusion detection systems
- etc.
 Require large clusters to handle workloads
 Require latencies of a few seconds
Need for a framework …
… for building such complex stream processing applications
But what are the requirements
from such a framework?
Requirements
 Scalable to large clusters
 Second-scale latencies
 Simple programming model
Case study: Conviva, Inc.
 Real-time monitoring of online video metadata
- HBO, ESPN, ABC, SyFy, …
 Two processing stacks
Custom-built distributed stream processing system
• 1000s of complex metrics on millions of video sessions
• Requires many dozens of nodes for processing
Hadoop backend for offline analysis
•Generating daily and monthly reports
•Similar computation as the streaming system
Case study: XYZ, Inc.
 Any company that wants to process live streaming data has this problem
 Twice the effort to implement any new function
 Twice the number of bugs to solve
 Twice the headache
 Two processing stacks
Requirements
 Scalable to large clusters
 Second-scale latencies
 Simple programming model
 Integrated with batch & interactive processing
Stateful Stream Processing
 Traditional streaming systems have an event-driven, record-at-a-time processing model
- Each node has mutable state
- For each record, update state & send new records
 State is lost if node dies!
 Making stateful stream processing fault-tolerant is challenging
[Diagram: input records flowing into nodes 1, 2, and 3, each node holding mutable state]
Existing Streaming Systems
 Storm
- Replays records if not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
 Trident – uses transactions to update state
- Processes each record exactly once
- Per-state transaction updates are slow
Requirements
 Scalable to large clusters
 Second-scale latencies
 Simple programming model
 Integrated with batch & interactive processing
 Efficient fault-tolerance in stateful computations
Spark Streaming
Discretized Stream Processing
Run a streaming computation as a series of very
small, deterministic batch jobs
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
 Chop up the live stream into batches of X seconds
 Spark treats each batch of data as an RDD and processes it using RDD operations
 Finally, the processed results of the RDD operations are returned in batches
Discretized Stream Processing
Run a streaming computation as a series of very
small, deterministic batch jobs
 Batch sizes as low as ½ second, latency ~ 1 second
 Potential for combining batch processing and
streaming processing in the same system
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the Twitter Streaming API feeds the tweets DStream, one RDD per batch (@ t, t+1, t+2), each stored in memory as an immutable, distributed RDD]
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap on every batch of the tweets DStream creates the new hashTags DStream; new RDDs created for every batch, e.g. [#cat, #dog, …]]
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: push data to external storage
[Diagram: flatMap per batch as before, then a save per batch; every batch of the hashTags DStream saved to HDFS]
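getTags is left undefined in the deck; a plausible sketch, assuming twitter4j's Status as the tweet type:

import twitter4j.Status

// Hypothetical helper: pull the #hashtags out of the tweet text.
def getTags(status: Status): Seq[String] =
  status.getText.split(" ").filter(_.startsWith("#"))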
Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });
hashTags.saveAsHadoopFiles("hdfs://...");
Function object to define the transformation
Fault-tolerance
 RDDs remember the sequence of operations that created them from the original fault-tolerant input data
 Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
 Data lost due to worker failure can be recomputed from the replicated input data
[Diagram: input data replicated in memory; flatMap lineage from the tweets RDD to the hashTags RDD; lost partitions recomputed on other workers]
Key concepts
 DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
 Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
 Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results (sketch below)
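As a hedged sketch of that escape hatch (later releases rename foreach to foreachRDD), assuming a tagCounts DStream of (tag, count) pairs like the one built in Example 2 below:

// Run arbitrary RDD code on each batch of results,
// e.g. print the first ten (tag, count) pairs every interval.
tagCounts.foreach(rdd => rdd.take(10).foreach(println))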
Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.countByValue()
[Diagram: every batch runs flatMap (tweets → hashTags), then map and reduceByKey to produce tagCounts, e.g. [(#cat, 10), (#dog, 25), …]]
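The flatMap, map, and reduceByKey stages in the diagram reflect what countByValue does on each batch; an equivalent hedged sketch:

// countByValue() on a DStream[String] is effectively a map to
// (tag, 1) pairs followed by a per-batch reduceByKey.
val tagCounts = hashTags.map(tag => (tag, 1L)).reduceByKey(_ + _)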
Example 3 – Count the hashtags over last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
sliding window operation: window length = Minutes(10), sliding interval = Seconds(1)
Example 3 – Counting the hashtags over last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
[Diagram: the naive approach recomputes countByValue over all the data in the window every time it slides]
Smart window-based countByValue
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram: the incremental approach adds the counts from the new batch entering the window and subtracts the counts from the batch leaving the window]
Smart window-based reduce
 The technique for incrementally computing counts generalizes to many reduce operations
- Need a function to “inverse reduce” (“subtract” for counting)
 Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), …)
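Spelled out as a sketch (hashTags must first be keyed into pairs, since this is a ByKey operation):

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    _ + _,        // reduce: add counts from the batch entering the window
    _ - _,        // inverse reduce: subtract counts from the batch leaving it
    Minutes(10),  // window length
    Seconds(1))   // sliding interval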
Demo
Fault-tolerant Stateful Processing
All intermediate data are RDDs, hence can be recomputed if lost
[Diagram: any lost tagCounts RDD can be recomputed from the hashTags RDDs in its window]
Fault-tolerant Stateful Processing
 State data not lost even if a worker node dies
- Does not change the value of your result
 Exactly-once semantics for all transformations
- No double counting!
Other Interesting Operations
 Maintain arbitrary state, track sessions
- Maintain per-user mood as state, and update it with their tweets
tweets.updateStateByKey(tweet => updateMood(tweet))
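The one-liner above is schematic; updateStateByKey actually operates on key-value DStreams, folding each key's new values into its old state. A sketch, with Mood and updateMood as hypothetical stand-ins:

// Key tweets by user, then fold each user's new tweets into their mood state.
val moods = tweets
  .map(status => (status.getUser.getId, status))
  .updateStateByKey[Mood] { (newTweets: Seq[Status], oldMood: Option[Mood]) =>
    Some(updateMood(newTweets, oldMood))  // hypothetical update function
  }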
 Do arbitrary Spark RDD computation within DStream
- Join incoming tweets with a spam file to filter out bad tweets
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
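Note that RDD join expects pair RDDs on both sides, so in practice each side is keyed first. A hedged sketch with hypothetical names (spamUsers standing in for spamHDFSFile, and assuming a SparkContext sc alongside the StreamingContext):

// Hypothetical spam list, loaded once as (userId, flag) pairs.
val spamUsers = sc.textFile("hdfs://...").map(id => (id.trim, true))

val cleanTweets = tweets.transform { tweetsRDD =>
  tweetsRDD
    .map(status => (status.getUser.getId.toString, status))
    .leftOuterJoin(spamUsers)
    .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }  // keep non-spam users
    .map { case (_, (status, _)) => status }
}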
Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
- Tested with 100 streams of data on 100 EC2 instances with 4 cores each
Comparison with Storm and S4
Higher throughput than Storm
Spark Streaming: 670k records/second/node
Storm: 115k records/second/node
Apache S4: 7.5k records/second/node
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
Real Applications: Conviva
Real-time monitoring of video metadata
• Achieved 1-2 second latency
• Millions of video sessions processed
• Scales linearly with cluster size
Real Applications: Mobile Millennium Project
Traffic transit time estimation using online
machine learning on GPS observations
• Markov chain Monte Carlo simulations on GPS
observations
• Very CPU intensive, requires dozens of
machines for useful computation
• Scales linearly with cluster size
Vision - one stack to rule them all
Spark + Shark + Spark Streaming
Spark program vs Spark Streaming program
Spark Streaming program on Twitter stream
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Spark program on Twitter log file
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Vision - one stack to rule them all
 Explore data interactively using Spark
Shell / PySpark to identify problems
 Use same code in Spark stand-alone
programs to identify problems in
production logs
 Use similar code in Spark Streaming to
identify problems in live log streams
$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = file.map(...)
...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = file.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(...)
    val stream = ssc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = stream.map(...)
    ...
  }
}
Alpha Release with Spark 0.7
 Integrated with Spark 0.7
- Import spark.streaming to get all the functionality (sketch below)
 Both Java and Scala APIs
 Give it a spin!
- Run locally or in a cluster
 Try it out in the hands-on tutorial later today
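A hedged sketch of those imports under the pre-Apache 0.7 package layout (later releases move everything under org.apache.spark):

import spark.streaming.{Seconds, StreamingContext}
import spark.streaming.StreamingContext._  // implicit pair-DStream operations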
Summary
 Stream processing framework that is ...
- Scalable to large clusters
- Achieves second-scale latencies
- Has simple programming model
- Integrates with batch & interactive workloads
- Ensures efficient fault-tolerance in stateful computations
 For more information, check out our paper: http://tinyurl.com/dstreams