Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API and the HBase API

© 2016 MapR Technologies
Exploring Data Pipelines for Spark Streaming Applications
Carol McDonald, Industry Solutions Architect
2016

What is Streaming Data? Got Some Examples?
Data collection devices:
•  Smart machinery
•  Phones and tablets
•  Home automation
•  RFID systems
•  Digital signage
•  Security systems
•  Medical devices

Why Stream Processing?
Temperature readings:
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Analyzed later as a batch: "It was hot at 6:05 yesterday!"
Batch processing may be too late for some events

Why Stream Processing?
[Figure: the 6:05 P.M. reading of 90° is published to the Temperature topic of a stream and immediately triggers "Turn on the air conditioning!"]
It's becoming important to process events as they arrive

Key to Real Time: Event-based Data Flows
•  Web events
•  Machine sensors
•  Biometrics
•  Mobile events
•  etc.

What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!

Use Case: Time Series Data
[Figure: sensors publish time-stamped data to a stream topic; Spark Streaming reads and processes it, producing data for real-time monitoring.]

Schema
•  All events are stored in CF data, which can be set to expire old data
•  Filtered alerts are put in CF alerts
•  Daily summaries are put in CF stats
•  The row key contains the oil pump name, date, and a time stamp

Row key                 CF data          CF alerts   CF stats
                        hz     …  psi    psi  …      hz_avg  …  psi_min
COHUTTA_3/10/14_1:01    10.37     84     0
COHUTTA_3/10/14                                      10         0
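The row key layout above can be sketched in a few lines of Java. This is only an illustration: the class `RowKeys` and its method names are hypothetical, and the keys are built as plain strings rather than the byte arrays the HBase API works with.

```java
// Hypothetical helper, not from the original deck: composes row keys in the
// layout shown in the schema (pump name, date, optional time stamp).
final class RowKeys {
    private RowKeys() {}

    // Key for a single event row (CF data / CF alerts),
    // e.g. COHUTTA_3/10/14_1:01
    static String eventKey(String pump, String date, String time) {
        return pump + "_" + date + "_" + time;
    }

    // Key for a daily summary row (CF stats), e.g. COHUTTA_3/10/14
    static String dailyKey(String pump, String date) {
        return pump + "_" + date;
    }

    // True when an event key belongs to the given pump and day. Sharing a
    // common prefix is what lets one pump's readings for a day be scanned
    // together.
    static boolean sameDay(String eventKey, String pump, String date) {
        return eventKey.startsWith(dailyKey(pump, date) + "_");
    }
}
```

Because rows are sorted by key, putting the pump name and date first keeps all readings for one pump and day next to each other.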


What Do We Need to Do?
Data Sources -> Collect Data -> Process Data -> Store Data -> Serve Data

How do we do this with High Performance at Scale?
•  Run operations in parallel and minimize disk read/write time

Collect the Data
[Figure: data from a source is ingested into a stream topic and into MapR-FS.]
•  Data Ingest:
–  File Based: NFS with MapR-FS, HDFS
–  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets...

MapR Streams Publish Subscribe Messaging
Topics organize events into categories and decouple producers from consumers

Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
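As a rough illustration of why partitioning helps, the sketch below models a topic as a list of partitions and assigns each message to a partition by hashing its key. The class `PartitionedTopic` is hypothetical and uses `String.hashCode()` purely for illustration; it is not how MapR Streams or Kafka is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy topic whose messages are spread over partitions
// by hashing the message key. Real brokers use their own hash function.
final class PartitionedTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedTopic(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // The same key always maps to the same partition, so messages for one
    // key stay in order while different keys can be handled in parallel.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends the message to the partition chosen by its key.
    void publish(String key, String message) {
        partitions.get(partitionFor(key)).add(message);
    }

    // Returns the messages of one partition in publish order.
    List<String> read(int partition) {
        return partitions.get(partition);
    }
}
```

Each partition can then be written and read independently, which is where the throughput and scalability come from.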

•  Parallel, Partitioned = fast, scalable
–  Messaging with MapR Streams

Process the Data with Spark Streaming
[Figure: collected data flows from the stream topic and MapR-FS into the Process Data stage.]
•  Extension of the core Spark API
•  Enables scalable, high-throughput, fault-tolerant stream processing of live data

Processing Spark DStreams
The data stream is divided into batches of X milliseconds; this sequence of batches is a DStream
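The batching idea can be sketched in plain Java. The class `MicroBatcher` is hypothetical, and it assumes each event carries a millisecond timestamp; Spark Streaming itself assigns events to batches by arrival time, so this only illustrates the fixed-interval grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: groups events into fixed-length batches.
final class MicroBatcher {
    private MicroBatcher() {}

    // Groups event timestamps (ms) into consecutive windows of batchMillis.
    // Key = batch start time; value = the timestamps that fall in
    // [start, start + batchMillis), in input order.
    static Map<Long, List<Long>> batch(List<Long> timestamps, long batchMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            long start = Math.floorDiv(ts, batchMillis) * batchMillis;
            batches.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }
}
```

With a batch length of 5000 ms, events at 100 ms and 4900 ms land in the first batch and an event at 5000 ms starts the second.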

Spark Resilient Distributed Datasets
[Figure: an RDD partitioned across three workers (W), whose executors hold partitions P4, P1 and P3, and P2 respectively.]
Partition 1: 8213034705, 95, 2.927373, jake7870, 0…
Partition 2: 8213034705, 115, 2.943484, Davidbresler2, 1…
Partition 3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
Partition 4: 8213034705, 117, 2.998947, daysrus, 95…
Spark revolves around RDDs
•  Read only collection of elements
•  Partitioned across a cluster
•  Operated on in parallel
•  Cached in memory


•  Parallel, Partitioned = fast, scalable
–  Processing with Spark

Two types of operations on DStreams:
•  Transformations:
–  Create new DStreams
–  map, filter, reduceByKey, SQL...
•  Output Operations
Transformations -> create new RDDs
[Figure: a DStream is a sequence of RDDs: RDD @ time 1 holds data from time 0 to 1, RDD @ time 2 holds data from time 1 to 2, and RDD @ time 3 holds data from time 2 to 3. A transform applied to the DStream turns each RDD into a new RDD in a new DStream.]

Two types of operations on DStreams
•  Transformations
•  Output Operations: trigger computation
–  Save to File, HBase...
•  saveAsHadoopFiles
•  saveAsHadoopDataset
•  saveAsTextFiles
Output operations -> trigger computation
[Figure: each RDD in the DStream (data from time 0 to 1, 1 to 2, and 2 to 3) is mapped and then saved to MapR-FS or MapR-DB.]


MapR-DB (HBase API) is Designed to Scale
[Figure: a table split by key range into three partitions, each holding rows with Key, colB, and colC values.]
Fast reads and writes by key! Data is automatically partitioned by key range!
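A minimal sketch of key-range routing, assuming each partition is identified by the first key it owns. The class `KeyRangeRouter` and the partition names are hypothetical; the point is only that finding the owner of a row key is a sorted lookup, not a scan over all the data.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: routes a row key to the partition whose key range
// contains it.
final class KeyRangeRouter {
    // Start key of each range -> name of the partition that owns it.
    private final TreeMap<String, String> rangeStarts = new TreeMap<>();

    void addRange(String startKey, String partitionName) {
        rangeStarts.put(startKey, partitionName);
    }

    // The owning range is the one with the greatest start key <= rowKey.
    String partitionFor(String rowKey) {
        Map.Entry<String, String> e = rangeStarts.floorEntry(rowKey);
        if (e == null) {
            throw new IllegalArgumentException("no range covers key " + rowKey);
        }
        return e.getValue();
    }
}
```

For example, with ranges starting at "", "H", and "P", the key COHUTTA_3/10/14 sorts before "H" and therefore belongs to the first partition.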

Store Lots of Data with NoSQL MapR-DB
Storage model:
•  RDBMS: normalized schema -> joins for queries can cause a bottleneck
•  MapR-DB: de-normalized schema -> data that is read together is stored together

Key to Real Time: Event-based Data Flows
Key to Scale = Parallel Partitioned:
•  Messaging
•  Processing
•  Storage


Use Case Example Code

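KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// MapR Streams topic path: <stream path>:<topic name>
String topic = "/streams/pump:warning";
public static KafkaProducer<String, String> producer;
Properties properties = new Properties();
// Keys and values are both serialized as strings
properties.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// Instantiate KafkaProducer with properties
producer = new KafkaProducer<String, String>(properties);
String txt = "msg text";
// A record without a key: only the topic and the message value
ProducerRecord<String, String> rec =
    new ProducerRecord<String, String>(topic, txt);
producer.send(rec);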


Create a DStream
DStream: a sequence of RDDs representing a stream of data
// Batch interval of 5 seconds
val ssc = new StreamingContext(sparkConf, Seconds(5))
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
[Figure: dStream as a sequence of batches (time 0 to 1, 1 to 2, 2 to 3), each stored in memory as an RDD.]

Process DStream
// Keep only the message value of each (key, value) pair, then parse it
val sensorDStream = dStream.map(_._2).map(parseSensor)
[Figure: map is applied to each batch RDD of dStream (time 0 to 1, 1 to 2, 2 to 3); new RDDs are created in sensorDStream for every batch.]

Message Data to Sensor Object
// One sensor reading, parsed from a comma-separated message
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

// Split the CSV line and convert the six numeric fields to Double
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}

DataFrame and SQL Operations
// For each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._ // needed for rdd.toDF()
  rdd.toDF().registerTempTable("sensor")
  // Triple quotes allow the query to span several lines
  val res = sqlContext.sql(
    """SELECT resid, date,
      max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
      max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
      max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
      max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
      FROM sensor GROUP BY resid, date""")
  res.show()
}
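To make the GROUP BY above concrete, here is a plain-Java sketch of the same per-pump, per-day max/min/avg calculation for a single column (psi). The classes `Reading` and `DailyStats` are hypothetical stand-ins and run on an in-memory list, not on Spark.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical, trimmed-down stand-in for the Sensor case class.
final class Reading {
    final String resid;
    final String date;
    final double psi;

    Reading(String resid, String date, double psi) {
        this.resid = resid;
        this.date = date;
        this.psi = psi;
    }
}

final class DailyStats {
    private DailyStats() {}

    // Groups readings by "resid_date" and collects count, min, max, and
    // average of psi for each group, like GROUP BY resid, date in SQL.
    static Map<String, DoubleSummaryStatistics> psiByPumpAndDay(List<Reading> readings) {
        return readings.stream().collect(Collectors.groupingBy(
                r -> r.resid + "_" + r.date,
                Collectors.summarizingDouble(r -> r.psi)));
    }
}
```

The group key has the same shape as the daily summary row key in the schema, for example COHUTTA_3/10/14.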

Streaming Application Output

Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Figure: output operation that persists data to external storage. Each batch RDD of the sensor DStream (time 0 to 1, 1 to 2, 2 to 3) is mapped to Put objects, which are saved to HBase.]

Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()

Building a Complete Data Architecture
[Figure: the MapR Converged Data Platform combines the MapR File System (MapR-FS), MapR Database (MapR-DB), and MapR Streams, supporting sources/apps, stream processing, and bulk processing.]

To Learn More:
•  Read an explanation of the code and download it
–  https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
–  https://www.mapr.com/blog/spark-streaming-hbase

To Learn More:
•  http://learn.mapr.com/

Q&A
@mapr
@caroljmcdonald
https://www.mapr.com/blog/author/carol-mcdonald
Engage with us!
mapr-technologies
