Apache Spark Streaming: Architecture and Fault Tolerance
© 2015 IBM Corporation
Apache Hadoop Day 2015
Paranth Thiruvengadam – Architect @ IBM
Sachin Aggarwal – Developer @ IBM
Spark Streaming
• Features of Spark Streaming
  • High-level API (joins, windows, etc.)
  • Fault-tolerant (exactly-once semantics achievable)
  • Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)
Architecture
High Level Overview
Receiving Data
[Diagram: input source → receiver running on an executor → data blocks replicated to a second executor]
• The driver runs receivers as long-running tasks on executors.
• The receiver divides the stream into blocks and keeps them in memory.
• Data blocks are replicated to another executor for fault tolerance.
Processing Data
[Diagram: driver launching tasks on executors to process data blocks and store results]
• Every batch interval, the driver launches tasks to process the blocks.
• Results are written to the output data store.
What’s different from other streaming applications?
Traditional Stream Processing
Load Balancing…
Node failure / Stragglers…
Word Count with Kafka
Fault Tolerance
Fault Tolerance
• Why care?
• Different guarantees against data loss:
  • At-least-once
  • Exactly-once
• What can fail?
  • Driver
  • Executor
What happens when an executor fails?
What happens when the driver fails?
Recovering Driver – Checkpointing
Driver restart
Driver restart – To-Do List
• Configure automatic driver restart
  • Spark Standalone
  • YARN
• Set the checkpoint directory in an HDFS-compatible file system
  • streamingContext.checkpoint(hdfsDirectory)
• Ensure the code uses checkpoints for recovery:

def setupStreamingContext(): StreamingContext = {
  val context = new StreamingContext(…)
  val lines = KafkaUtils.createStream(…)
  …
  context.checkpoint(hdfsDir)
  context
}
val context = StreamingContext.getOrCreate(hdfsDir, setupStreamingContext)
context.start()
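For Spark Standalone, automatic restart typically means submitting the driver in cluster deploy mode with the --supervise flag; on YARN (cluster mode), the ApplicationMaster, and with it the driver, is restarted automatically, subject to the configured maximum attempts (e.g., yarn.resourcemanager.am.max-attempts).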
Write-Ahead Log (WAL) for no data loss
Recover using WAL
Configuration – Enabling WAL
• Enable checkpointing.
• Enable the WAL in the Spark configuration:
  • sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
• The receiver should acknowledge the input source only after the data is written to the WAL.
• Disable in-memory replication (the WAL already provides durability); a combined sketch follows.
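A minimal sketch putting these settings together, assuming a Kafka receiver and a single-replica storage level (the checkpoint path, ZooKeeper quorum, group and topic names are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
  .setAppName("WALExample")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable the WAL
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///checkpoints/wal-example")               // checkpointing is required

// Single-replica storage level: durability comes from the WAL, not a second in-memory copy
val lines = KafkaUtils.createStream(
    ssc, "localhost:2181", "wal-group", Map("test" -> 1),
    StorageLevel.MEMORY_AND_DISK_SER)
  .map(_._2)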
Normal Processing
Restarting Failed Driver
Fault-Tolerant Semantics
Source → Receiving → Transforming → Outputting → Sink
• Receiving: at-least-once, with checkpointing / WAL
• Transforming: exactly-once, as long as received data is not lost
• Outputting: exactly-once, if outputs are idempotent or transactional
Fault-Tolerant Semantics
Source → Receiving → Transforming → Outputting → Sink
• Receiving: exactly-once, with the Kafka Direct API
• Transforming: exactly-once, as long as received data is not lost
• Outputting: exactly-once, if outputs are idempotent or transactional
How to achieve an “exactly once” guarantee?
Before Kafka Direct API
Kafka Direct API
Benefits of this approach:
• Simplified parallelism
• Less storage needed
• Exactly-once semantics
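A hedged sketch of the direct approach with the Spark 1.x API (broker address and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// No receiver and no WAL: each batch reads an exact, driver-tracked offset range from Kafka
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val directLines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("test"))
  .map(_._2)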
Demo
DEMO: SPARK STREAMING
OVERVIEW OF SPARK STREAMING
DISCRETIZED STREAMS (DSTREAMS)
• A DStream is the basic abstraction in Spark Streaming.
• It is represented by a continuous series of RDDs (of the same type).
• Each RDD in a DStream contains data from a certain interval.
• DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) using a StreamingContext, or generated by transforming existing DStreams using operations such as `map`, `window` and `reduceByKeyAndWindow`.
DISCRETIZED STREAMS (DSTREAMS)
WORD COUNT
// Batch word count with the core RDD API
val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WordCount")
val sc = new SparkContext(sparkConf)
val file = sc.textFile("filePath")
val words = file
  .flatMap(_.split(" "))
val pairs = words
  .map(x => (x, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile(args(1))

// Streaming word count over a TCP socket
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("SocketStreaming")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc
  .socketTextStream("localhost", 9998)
val words = lines
  .flatMap(_.split(" "))
val pairs = words
  .map(word => (word, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
DEMO
KAFKA STREAM
// Socket source (previous example)
val lines = ssc
  .socketTextStream("localhost", 9998)
val words = lines
  .flatMap(_.split(" "))
val pairs = words
  .map(word => (word, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)

// Kafka source: only the input stream changes, the rest of the pipeline stays the same
val zkQuorum = "localhost:2181"
val group = "test"
val topics = "test"
val numThreads = "1"
val topicMap = topics
  .split(",")
  .map((_, numThreads.toInt))
  .toMap
val lines = KafkaUtils
  .createStream(ssc, zkQuorum, group, topicMap)
  .map(_._2)
val words = lines
  .flatMap(_.split(" "))
……..
DEMO
OPERATIONS
• Repartition
• Operation on an RDD (example: print the partition count of each RDD)

val re_lines = lines
  .repartition(5)
re_lines
  .foreachRDD(x => fun(x))

def fun(rdd: RDD[String]) = {
  print("partition count " + rdd.partitions.length)
}
DEMO
STATELESS TRANSFORMATIONS
• map() Apply a function to each element in the DStream and return a DStream of the result.
  • ds.map(x => x + 1)
• flatMap() Apply a function to each element in the DStream and return a DStream of the contents of the iterators returned.
  • ds.flatMap(x => x.split(" "))
• filter() Return a DStream consisting of only elements that pass the condition passed to filter.
  • ds.filter(x => x != 1)
• repartition() Change the number of partitions of the DStream.
  • ds.repartition(10)
• reduceByKey() Combine values with the same key in each batch.
  • ds.reduceByKey((x, y) => x + y)
• groupByKey() Group values with the same key in each batch.
  • ds.groupByKey()
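A small sketch chaining several of these stateless operations on the `lines` stream from the earlier examples (the threshold of 3 is arbitrary, for illustration):

// Stateless pipeline: each operation is applied independently to every batch
val frequentWords = lines
  .flatMap(_.split(" "))                      // split each line into words
  .filter(_.nonEmpty)                         // drop empty tokens
  .map(word => (word, 1))                     // pair each word with a count of 1
  .reduceByKey(_ + _)                         // sum counts within the batch
  .filter { case (_, count) => count >= 3 }   // keep words seen at least 3 times in the batch
frequentWords.print()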
DEMO
STATEFUL TRANSFORMATIONS
Stateful transformations require checkpointing to be enabled in your StreamingContext for fault tolerance.
• Windowed transformations: windowed computations allow you to apply transformations over a sliding window of data.
• updateStateByKey transformation: provides access to a state variable for DStreams of key/value pairs.
DEMO
WINDOW OPERATIONS
Any window operation needs to specify two parameters:
• window length - the duration of the window.
• sliding interval - the interval at which the window operation is performed.
Both parameters must be multiples of the batch interval of the source DStream.
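For example, with the 2-second batch interval used earlier, a 30-second window sliding every 10 seconds is valid (both are multiples of 2); a minimal sketch:

// Batches from the last 30 seconds, re-evaluated every 10 seconds
val windowedWords = words.window(Seconds(30), Seconds(10))
windowedWords.count().print()   // number of words currently in the window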
DEMO
WINDOWED TRANSFORMATIONS
• window(windowLength, slideInterval)
  • Return a new DStream, computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval)
  • Return a sliding window count of elements in the stream.
  • val totalWordCount = words.countByWindow(Seconds(30), Seconds(10))
• reduceByWindow(func, windowLength, slideInterval)
  • Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func.
  • The function should be associative so that it can be computed correctly in parallel.
  • val totalWordCount = pairs.map(_._2).reduceByWindow(_ + _, Seconds(30), Seconds(10))
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • Returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
  • val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
• countByValueAndWindow(windowLength, slideInterval)
  • Returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
  • val eachWordCount = words.countByValueAndWindow(Seconds(30), Seconds(10))
DEMO
UPDATE STATE BY KEY TRANSFORMATION
• updateStateByKey()
  • Provides access to a state variable for DStreams of key/value pairs.
  • The user provides a function updateFunc(events, oldState) and, optionally, an initialRDD.
  • val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
  • val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
  • val stateCount = pairs.updateStateByKey[Int](updateFunc)
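Checkpointing must be enabled before this runs; a minimal sketch wiring it together (the checkpoint path is a placeholder):

ssc.checkpoint("hdfs:///checkpoints/stateful")   // required for updateStateByKey

// Running total of each word across all batches seen so far
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
val stateCount = pairs.updateStateByKey[Int](updateFunc)
stateCount.print()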
DEMO
TRANSFORM OPERATION
• The transform operation allows arbitrary RDD-to-RDD functions to be applied on a DStream.
• It can be used to apply any RDD operation that is not exposed in the DStream API.
• For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API.
• val cleanedDStream = wordCounts.transform(rdd => {
    rdd.join(data)
  })
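A self-contained sketch of the same idea, using a hypothetical blacklist RDD as the dataset joined against each batch (names are illustrative):

// Hypothetical static dataset: words to drop, paired with a flag
val blacklist = ssc.sparkContext.parallelize(Seq(("spamword", true)))

// Join every batch of word counts against the static RDD and filter out flagged words
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.leftOuterJoin(blacklist)
     .filter { case (_, (_, flagged)) => flagged.isEmpty }   // keep words not in the blacklist
     .mapValues { case (count, _) => count }
}
cleanedDStream.print()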
DEMO
JOIN OPERATIONS
• Stream-stream joins:
  • Streams can be very easily joined with other streams.
  • val stream1: DStream[(String, String)] = ...
  • val stream2: DStream[(String, String)] = ...
  • val joinedStream = stream1.join(stream2)
• Windowed join
  • val windowedStream1 = stream1.window(Seconds(20))
  • val windowedStream2 = stream2.window(Minutes(1))
  • val joinedStream = windowedStream1.join(windowedStream2)
• Stream-dataset joins
  • val dataset: RDD[(String, String)] = ...
  • val windowedStream = stream.window(Seconds(20))
  • val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
DEMO
USING FOREACHRDD()
• foreachRDD is a powerful primitive that allows data to be sent out to external systems.
• dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      ConnectionPool.returnConnection(connection)
    }
  }
• Using foreachRDD, each RDD is converted to a DataFrame, registered as a temporary table and then queried using SQL.
• words.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    val wordsDataFrame = rdd.toDF("word")
    wordsDataFrame.registerTempTable("words")
    val wordCountsDataFrame =
      sqlContext.sql("select word, count(*) as total from words group by word")
    wordCountsDataFrame.show()
  }
DEMO
DSTREAMS (SPARK CODE)
• A DStream is internally characterized by a few basic properties:
  • A list of other DStreams that the DStream depends on
  • A time interval at which the DStream generates an RDD
  • A function that is used to generate an RDD after each time interval
• Methods that should be implemented by subclasses of DStream:
  • Time interval after which the DStream generates an RDD
    • def slideDuration: Duration
  • List of parent DStreams on which this DStream depends
    • def dependencies: List[DStream[_]]
  • Method that generates an RDD for the given time
    • def compute(validTime: Time): Option[RDD[T]]
• This class contains the basic operations available on all DStreams, such as `map`, `filter` and `window`. In addition, PairDStreamFunctions contains operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and `join`. These operations are automatically available on any DStream of pairs (e.g., DStream[(Int, Int)]) through implicit conversions.
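A minimal, hypothetical sketch of a DStream subclass implementing those three members; it simply re-emits the same RDD every interval (illustrative only, not a production pattern):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream

class RepeatingDStream[T: ClassTag](ssc: StreamingContext, rdd: RDD[T], interval: Duration)
  extends DStream[T](ssc) {

  // Time interval at which this DStream generates an RDD
  override def slideDuration: Duration = interval

  // No parent DStreams: this stream behaves like an input source
  override def dependencies: List[DStream[_]] = List()

  // The RDD produced for each batch time
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}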
Editor's Notes

  • #5, #9, #10, #11: Continuous operator processing model: each node continuously receives records, updates internal state, and emits new records. Latency is low, but fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model: in each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.
  • #19: Have to have a sample code before coming to this slide.
  • #23: (i) reference IDs of the blocks for locating their data in executor memory, (ii) offset information of the block data in the logs.
  • #26: Have to read on Kafka Direct API.