Stream processing from single node to a cluster
What are we going to talk about?
• What is stream processing?
• What are the challenges?
• Reactive streams
• Implementing reactive streams with Akka streams
• Spark streaming
• Questions?
What is a stream?
• A sequence of data elements that becomes available over time
• Can be finite (not interesting)
• List of items
• or infinite
• A live video stream
• Web analytics stream
• IoT event stream
• Processed one by one
• So what is the best way to process a stream?
Synchronous processing
• Items in the stream are processed one by one
• Every processing action blocks and waits to finish
• Plus: easy to implement
• Minus: can’t handle load
Asynchronous processing
• Items in the stream are stored in a buffer
• The consumer fetches items from the buffer at its own pace
• Plus: no longer blocking
• Minus: what happens if the buffer fills up?
Solving the fast publisher problem
1. Increase the buffer size
• a temporary solution
• good for peaks
• may cause an OOM (out-of-memory) error
2. Drop messages and signal the publisher to resend
• Messages are “wasted”
• TCP works this way
Reactive streams
• Ask the publisher for a specific number of messages
• No out of memory
• No messages wasted
• Part of the Java 9 JDK (java.util.concurrent.Flow):
• Processor
• Publisher
• Subscriber
• Subscription
Reactive streams
@FunctionalInterface
public static interface Flow.Publisher<T> {
    public void subscribe(Flow.Subscriber<? super T> subscriber);
}

public static interface Flow.Subscriber<T> {
    public void onSubscribe(Flow.Subscription subscription);
    public void onNext(T item);
    public void onError(Throwable throwable);
    public void onComplete();
}

public static interface Flow.Subscription {
    public void request(long n);
    public void cancel();
}

public static interface Flow.Processor<T,R> extends Flow.Subscriber<T>, Flow.Publisher<R> {}
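These four interfaces are enough to implement backpressure by hand. Below is a minimal sketch (class name and logic are illustrative, not part of the JDK) of a subscriber that requests one element at a time, so the publisher can never flood it:

import java.util.concurrent.Flow

class OneAtATimeSubscriber[T] extends Flow.Subscriber[T] {
  private var subscription: Flow.Subscription = _

  override def onSubscribe(s: Flow.Subscription): Unit = {
    subscription = s
    subscription.request(1) // ask for the first element only
  }
  override def onNext(item: T): Unit = {
    println(item)
    subscription.request(1) // request the next element only after processing this one
  }
  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("stream completed")
}

Pairing this with a java.util.concurrent.SubmissionPublisher gives a complete back-pressured pipeline with no extra libraries.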
Akka streams
• A high-level stream API that implements Reactive Streams
• Based on the Akka actor toolkit
[Diagram: Actor A sends a "Hello" message to Actor B]
Talk streams to me
• Graph – a description of how the stream is processed, composed of processing stages
• Processing stage – the basic unit of the graph; may transform, receive or emit elements, and must not block
• Source – a processing stage with a single output; emits elements when the downstream stages are ready
• Sink – a processing stage with a single input; requests and accepts data
• Flow – a processing stage with a single input and output (see the sketch after this list)
Demo
Runnable Graph
• Runnable Graph = Source + Flow + Sink
• Executed by calling run()
• Until run() is called, the graph does not execute
• Materialization is when the materializer takes the stream "recipe" and actually executes it
• How? Remember the Akka actors? (see the sketch below)
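A minimal runnable sketch, assuming Akka 2.6+ where the implicit ActorSystem provides the materializer:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system: ActorSystem = ActorSystem("streams-demo")

val graph = Source(1 to 10)
  .via(Flow[Int].map(_ * 2))
  .to(Sink.foreach(println)) // a RunnableGraph: still only a recipe

graph.run() // materialization: actors are created and data starts flowing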
Complex stream graphs
• We want the lines of the file to reach two different flows
• This is called "Broadcast" in Akka Streams
• The "~>" sign is used as a connector in the GraphDSL
• Once the graph is fully connected we can return ClosedShape
[Diagram: File → Lines mapper → Cleaner → Broadcast, branching into a Word counter → Top words → Print flow and a Longest line flow]
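A sketch of such a broadcast graph with the GraphDSL, assuming Akka 2.6+; the file source is replaced by an in-memory list of lines to keep it self-contained:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, GraphDSL, RunnableGraph, Sink, Source}

implicit val system: ActorSystem = ActorSystem("graph-demo")

val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  val lines     = Source(List("foo bar", "hello wide world", "hi"))
  val broadcast = builder.add(Broadcast[String](2)) // one input, two outputs

  lines ~> broadcast ~> Sink.foreach[String](l => println(s"words: ${l.split(" ").length}"))
           broadcast ~> Sink.foreach[String](l => println(s"chars: ${l.length}"))

  ClosedShape // every port is connected, so the graph can be materialized
})

graph.run()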
Demo
Batching
• There are some cases where we want to collect several items and only then apply our business logic
• Aggregative logic
• Batched writes to a DB
• We can use batch(max, seedFunction)(aggFunction) – in case of backpressure it aggregates elements until max is reached (see the sketch below)
• max – defines the maximal number of elements in a batch
• seed – a function that creates a batch from a single element
• aggFunction – combines the existing batch with the next element
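A minimal sketch of batch, assuming Akka 2.6+; the "database write" is just a println standing in for a real batched save:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("batch-demo")

Source(1 to 10000)
  .batch(max = 100, seed = first => Vector(first))(_ :+ _) // under backpressure, aggregate up to 100 elements
  .runWith(Sink.foreach(batch => println(s"writing ${batch.size} rows"))) // e.g. one DB write per batch

If the downstream keeps up, batches may hold a single element; aggregation only kicks in when the consumer is slower than the producer.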
To summarize
• Backpressure enables us to handle streams in an efficient manner
• Akka Streams implements the Reactive Streams API using Source, Flow, Graph and Sink
• A Graph is a blueprint ("recipe") of processing stages
• We can build complex flows using the GraphDSL
• We can also batch
Stream processing requirements
• What if I need to have the same logic for stream processing and
batch processing?
• I want to run a cluster of stream processors
• I want it to recover from failures automatically
• Handle multiple stream sources out of the box
• High level API
Spark streaming
• A Spark module for building scalable, fault tolerant stream
processing
[Diagram taken from the official Spark documentation]
Remember Spark?
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a distributed collection of items whose sources may be, for example: Hadoop (HDFS), Kafka, Kinesis …
D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
[Diagram: a text file's lines divided into partitions, e.g. lines 1–100, 101–200 and 201–300]
RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD.
•count() – returns the number of elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the RDD (see the sketch below).
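A short sketch of these actions, assuming an existing SparkContext named sc:

val nums = sc.parallelize(1 to 5) // a small RDD for illustration
nums.count()          // 5
nums.first()          // 1
nums.collect()        // Array(1, 2, 3, 4, 5), brought entirely to the driver
nums.foreach(println) // runs on the executors, so output lands in executor logs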
RDD Transformations
•Return a pointer to a new RDD carrying the transformation metadata (lazy)
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
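A short sketch chaining transformations, again assuming an existing SparkContext sc; nothing is computed until the final action:

val lines = sc.textFile("hdfs://path/to/file") // lazy: only metadata so far
val words = lines
  .flatMap(_.split(" ")) // one line becomes many words
  .filter(_.nonEmpty)    // drop empty tokens
  .map(_.toLowerCase)    // normalize
words.count() // the action that actually triggers the computation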
Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time – usually seconds
• DStream – composed of RDD slices, each containing a collection of items
[Diagram taken from the official Spark documentation]
Example
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(1)) // conf: an existing SparkConf
ssc.checkpoint(checkpoint.toString())            // checkpoint: a directory path
val dstream: DStream[Int] =
  ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()
ssc.start()
ssc.awaitTermination()
DStream operations
• Similar to RDD operations, with small changes
•map(func) – returns a new DStream by applying func to every element of the original stream.
•filter(func) – returns a new DStream formed by selecting those elements of the source stream on which func returns true.
•reduce(func) – returns a new DStream of single-element RDDs by applying the reduce func on every source RDD (see the sketch below).
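Applied to the DStream[Int] from the earlier example, a short sketch:

val evens     = dstream.filter(_ % 2 == 0) // keep the even numbers of each batch
val doubled   = evens.map(_ * 2)           // transform every element
val batchSums = doubled.reduce(_ + _)      // one single-element RDD per batch: the batch sum
batchSums.print()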
Using your existing batch logic
• transform(func) – an operation that creates a new DStream by applying func to the DStream's RDDs.
dstream.transform(existingBusinessFunction)
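Here existingBusinessFunction can be any RDD-to-RDD function; a hypothetical sketch of one (name and logic are illustrative):

import org.apache.spark.rdd.RDD

// The same code a plain batch job would run, reused on the stream unchanged
val existingBusinessFunction: RDD[Int] => RDD[Int] =
  rdd => rdd.filter(_ > 0).map(_ * 2)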
Updating the state
• All the operations so far didn't have state
• How do I accumulate results across batches?
• updateStateByKey(updateFunc) – a transformation that creates a new key-value DStream where each key's value is updated according to the previous state and the new values.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
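A usage sketch, assuming a DStream[String] of words and a checkpoint directory already configured (updateStateByKey requires one):

val runningCounts = words           // words: an assumed DStream[String]
  .map(word => (word, 1))           // pair each word with a count of 1
  .updateStateByKey(updateFunction) // merge each batch's counts into the per-key running state
runningCounts.print()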
Checkpoints
• Checkpoints – periodically save the data necessary for failure recovery to reliable storage (HDFS/S3/…)
• Metadata checkpoints
• Configuration of the stream context
• DStream definition and operations
• Incomplete batches
• Data checkpoints
• saving stateful RDD data
Checkpoints
• To configure checkpoint usage:
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application:
• StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext)
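A sketch of the recoverable pattern; conf and checkpointDirectory are assumed to already exist:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1)) // conf: an existing SparkConf
  // ...define the sources and DStream operations here...
  ssc.checkpoint(checkpointDirectory)
  ssc
}

// Rebuilds the context from the checkpoint after a failure, or creates a fresh one on the first run
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
context.start()
context.awaitTermination()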
Working with the foreach RDD
• A common practice is to use the foreachRDD(func) to push
data to an external system.
• Don’t do:
dstream.foreachRDD { rdd =>
  val myExternalResource = ... // created on the driver: it must be serialized and shipped to every executor
  rdd.foreachPartition { partition =>
    myExternalResource.save(partition)
  }
}
Working with the foreach RDD
• Instead do:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val myExternalResource = ... // created on the executor, where it is actually used
    myExternalResource.save(partition)
  }
}
To summarize
• Spark Streaming provides a high-level micro-batch API
• It is distributed because it builds on RDDs
• It is fault tolerant thanks to checkpoints
• You can keep state that is updated over time
• Use foreachRDD carefully
Questions?