Stream processing from single node to a cluster
What are we going to talk about?
• What is stream processing?
• What are the challenges?
• Reactive streams
• Implementing reactive streams with Akka streams
• Spark streaming
• Questions?
What is a stream?
• A sequence of data elements that becomes available over time
• Can be finite (not interesting)
• List of items
• or infinite
• A live video stream
• Web analytics stream
• IoT event stream
• Processed one by one
• So what is the best way to process a stream?
Synchronous processing
• Items in the stream are processed one by one
• Every processing action blocks and waits to finish
• Plus: easy to implement
• Minus: can’t handle load
Asynchronous processing
• Items in the stream are stored in a buffer
• The consumer fetches items from the buffer at its own pace
• Plus: no longer blocking
• Minus: what happens if the buffer fills up?
Solving the fast publisher problem
1. Increase the buffer size
• a temporary solution
• good for peaks
• may cause an OOM (out-of-memory) error
2. Drop messages and signal the publisher to resend
• Messages are “wasted”
• TCP works this way
Reactive streams
• Ask the publisher for a specific number of messages
• No out of memory
• No messages wasted
• Part of the Java 9 JDK (java.util.concurrent.Flow):
• Processor
• Publisher
• Subscriber
• Subscription
Reactive streams
@FunctionalInterface
public static interface Flow.Publisher<T> {
    public void subscribe(Flow.Subscriber<? super T> subscriber);
}

public static interface Flow.Subscriber<T> {
    public void onSubscribe(Flow.Subscription subscription);
    public void onNext(T item);
    public void onError(Throwable throwable);
    public void onComplete();
}

public static interface Flow.Subscription {
    public void request(long n);
    public void cancel();
}

public static interface Flow.Processor<T,R> extends Flow.Subscriber<T>, Flow.Publisher<R> {}
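These four interfaces are enough to implement backpressure by hand. Below is a minimal sketch (class name and logic are illustrative, not part of the JDK) of a subscriber that requests one element at a time, so the publisher can never flood it:

import java.util.concurrent.Flow

class OneAtATimeSubscriber[T] extends Flow.Subscriber[T] {
  private var subscription: Flow.Subscription = _

  override def onSubscribe(s: Flow.Subscription): Unit = {
    subscription = s
    subscription.request(1) // ask for the first element only
  }
  override def onNext(item: T): Unit = {
    println(item)
    subscription.request(1) // request the next element only after processing this one
  }
  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("stream completed")
}

Pairing this with a java.util.concurrent.SubmissionPublisher gives a complete back-pressured pipeline with no extra libraries.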
Akka streams
• A high-level stream API that implements Reactive Streams
• Based on the Akka actor toolkit
[Diagram: Actor A sends a "Hello" message to Actor B]
Talk streams to me
• Graph – a description of how the stream is processed, composed of processing stages
• Processing stage – the basic unit of the graph; may transform, receive or emit elements, and must not block
• Source – a processing stage with a single output; emits elements when the downstream stages are ready
• Sink – a processing stage with a single input; requests and accepts data
• Flow – a processing stage with a single input and output (see the sketch after this list)
Demo
Runnable Graph
• Runnable Graph = Source + Flow + Sink
• Executed by calling run()
• Until run() is called, the graph does not execute
• Materialization is when the materializer takes the stream "recipe" and actually executes it
• How? Remember the Akka actors? (see the sketch below)
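A minimal runnable sketch, assuming Akka 2.6+ where the implicit ActorSystem provides the materializer:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system: ActorSystem = ActorSystem("streams-demo")

val graph = Source(1 to 10)
  .via(Flow[Int].map(_ * 2))
  .to(Sink.foreach(println)) // a RunnableGraph: still only a recipe

graph.run() // materialization: actors are created and data starts flowing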
Complex stream graphs
• We want the lines of the file to reach two different flows
• This is called "Broadcast" in Akka Streams
• The "~>" sign is used as a connector in the GraphDSL
• Once the graph is fully connected we can return ClosedShape
[Diagram: File → Lines mapper → Cleaner → Broadcast, branching into a Word counter → Top words → Print flow and a Longest line flow]
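A sketch of such a broadcast graph with the GraphDSL, assuming Akka 2.6+; the file source is replaced by an in-memory list of lines to keep it self-contained:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, GraphDSL, RunnableGraph, Sink, Source}

implicit val system: ActorSystem = ActorSystem("graph-demo")

val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  val lines     = Source(List("foo bar", "hello wide world", "hi"))
  val broadcast = builder.add(Broadcast[String](2)) // one input, two outputs

  lines ~> broadcast ~> Sink.foreach[String](l => println(s"words: ${l.split(" ").length}"))
           broadcast ~> Sink.foreach[String](l => println(s"chars: ${l.length}"))

  ClosedShape // every port is connected, so the graph can be materialized
})

graph.run()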
Demo
Batching
• There are some cases where we want to collect several items and only then apply our business logic
• Aggregative logic
• Batched writes to a DB
• We can use batch(max, seedFunction)(aggFunction) – in case of backpressure it aggregates elements until max is reached (see the sketch below)
• max – defines the maximal number of elements in a batch
• seed – a function that creates a batch from a single element
• aggFunction – combines the existing batch with the next element
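A minimal sketch of batch, assuming Akka 2.6+; the "database write" is just a println standing in for a real batched save:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("batch-demo")

Source(1 to 10000)
  .batch(max = 100, seed = first => Vector(first))(_ :+ _) // under backpressure, aggregate up to 100 elements
  .runWith(Sink.foreach(batch => println(s"writing ${batch.size} rows"))) // e.g. one DB write per batch

If the downstream keeps up, batches may hold a single element; aggregation only kicks in when the consumer is slower than the producer.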
To summarize
• Backpressure enables us to handle streams in an efficient manner
• Akka Streams implements the Reactive Streams API using Source, Flow, Graph and Sink
• A Graph is a blueprint ("recipe") of processing stages
• We can build complex flows using the GraphDSL
• We can also batch
Stream processing requirements
• What if I need to have the same logic for stream processing and
batch processing?
• I want to run a cluster of stream processors
• I want it to recover from failures automatically
• Handle multiple stream sources out of the box
• High level API
Spark streaming
• A Spark module for building scalable, fault tolerant stream
processing
[Diagram taken from the official Spark documentation]
Remember Spark?
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a distributed collection of items whose sources may be, for example: Hadoop (HDFS), Kafka, Kinesis …
D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
[Diagram: a text file's lines divided into partitions, e.g. lines 1–100, 101–200 and 201–300]
RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD.
•count() – returns the number of elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the RDD (see the sketch below).
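A short sketch of these actions, assuming an existing SparkContext named sc:

val nums = sc.parallelize(1 to 5) // a small RDD for illustration
nums.count()          // 5
nums.first()          // 1
nums.collect()        // Array(1, 2, 3, 4, 5), brought entirely to the driver
nums.foreach(println) // runs on the executors, so output lands in executor logs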
RDD Transformations
•Return a pointer to a new RDD carrying the transformation metadata (lazy)
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
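A short sketch chaining transformations, again assuming an existing SparkContext sc; nothing is computed until the final action:

val lines = sc.textFile("hdfs://path/to/file") // lazy: only metadata so far
val words = lines
  .flatMap(_.split(" ")) // one line becomes many words
  .filter(_.nonEmpty)    // drop empty tokens
  .map(_.toLowerCase)    // normalize
words.count() // the action that actually triggers the computation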
Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time – usually seconds
• DStream – composed of RDD slices, each containing a collection of items
[Diagram taken from the official Spark documentation]
Example
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(1)) // conf: an existing SparkConf
ssc.checkpoint(checkpoint.toString())            // checkpoint: a directory path
val dstream: DStream[Int] =
  ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()
ssc.start()
ssc.awaitTermination()
DStream operations
• Similar to RDD operations, with small changes
•map(func) – returns a new DStream by applying func to every element of the original stream.
•filter(func) – returns a new DStream formed by selecting those elements of the source stream on which func returns true.
•reduce(func) – returns a new DStream of single-element RDDs by applying the reduce func on every source RDD (see the sketch below).
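Applied to the DStream[Int] from the earlier example, a short sketch:

val evens     = dstream.filter(_ % 2 == 0) // keep the even numbers of each batch
val doubled   = evens.map(_ * 2)           // transform every element
val batchSums = doubled.reduce(_ + _)      // one single-element RDD per batch: the batch sum
batchSums.print()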
Using your existing batch logic
• transform(func) – an operation that creates a new DStream by applying func to the DStream's RDDs.
dstream.transform(existingBusinessFunction)
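Here existingBusinessFunction can be any RDD-to-RDD function; a hypothetical sketch of one (name and logic are illustrative):

import org.apache.spark.rdd.RDD

// The same code a plain batch job would run, reused on the stream unchanged
val existingBusinessFunction: RDD[Int] => RDD[Int] =
  rdd => rdd.filter(_ > 0).map(_ * 2)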
Updating the state
• All the operations so far didn't have state
• How do I accumulate results across batches?
• updateStateByKey(updateFunc) – a transformation that creates a new key-value DStream where each key's value is updated according to the previous state and the new values.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
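A usage sketch, assuming a DStream[String] of words and a checkpoint directory already configured (updateStateByKey requires one):

val runningCounts = words           // words: an assumed DStream[String]
  .map(word => (word, 1))           // pair each word with a count of 1
  .updateStateByKey(updateFunction) // merge each batch's counts into the per-key running state
runningCounts.print()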
Checkpoints
• Checkpoints – periodically save the data necessary for failure recovery to reliable storage (HDFS/S3/…)
• Metadata checkpoints
• Configuration of the stream context
• DStream definition and operations
• Incomplete batches
• Data checkpoints
• saving stateful RDD data
Checkpoints
• To configure checkpoint usage:
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application:
• StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext)
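A sketch of the recoverable pattern; conf and checkpointDirectory are assumed to already exist:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1)) // conf: an existing SparkConf
  // ...define the sources and DStream operations here...
  ssc.checkpoint(checkpointDirectory)
  ssc
}

// Rebuilds the context from the checkpoint after a failure, or creates a fresh one on the first run
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
context.start()
context.awaitTermination()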
Working with the foreach RDD
• A common practice is to use the foreachRDD(func) to push
data to an external system.
• Don’t do:
dstream.foreachRDD { rdd =>
  val myExternalResource = ... // created on the driver: it must be serialized and shipped to every executor
  rdd.foreachPartition { partition =>
    myExternalResource.save(partition)
  }
}
Working with the foreach RDD
• Instead do:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val myExternalResource = ... // created on the executor, where it is actually used
    myExternalResource.save(partition)
  }
}
To summarize
• Spark Streaming provides a high-level micro-batch API
• It is distributed because it builds on RDDs
• It is fault tolerant thanks to checkpoints
• You can keep state that is updated over time
• Use foreachRDD carefully
Questions?