© 2016 MapR Technologies
Exploring Data Pipelines for Spark Streaming Applications
Carol McDonald, Industry Solutions Architect
2016
What is Streaming Data? Got Some Examples?
Data Collection Devices:
•  Smart Machinery
•  Phones and Tablets
•  Home Automation
•  RFID Systems
•  Digital Signage
•  Security Systems
•  Medical Devices
Why Stream Processing?
Temperature readings:
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Analyze (batch): "It was hot at 6:05 yesterday!" (peak of 90°)
Batch processing may be too late for some events
Why Stream Processing?
Stream topic "Temperature": 6:05 P.M.: 90° -> Turn on the air conditioning!
It’s becoming important to process events as they arrive
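The idea of reacting to each event as it arrives can be sketched in plain Java. This is a minimal illustration, not code from the deck: the class name `TemperatureAlert` and the 88-degree threshold are assumptions, and a simple loop stands in for a stream consumer.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: check every reading as it arrives instead of analyzing a finished batch. */
final class TemperatureAlert {
    // Hypothetical alert threshold in degrees; not a value from the slides.
    static final int THRESHOLD = 88;

    /** Returns one action per reading that reaches the threshold, in arrival order. */
    static List<String> process(String[] times, int[] temps) {
        List<String> actions = new ArrayList<>();
        for (int i = 0; i < temps.length; i++) {
            // In a real stream, this check would run when the event is delivered.
            if (temps[i] >= THRESHOLD) {
                actions.add(times[i] + " P.M.: " + temps[i]
                        + " degrees, turn on the air conditioning");
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        String[] times = {"6:01", "6:02", "6:03", "6:04", "6:05", "6:06", "6:07", "6:08"};
        int[] temps = {72, 75, 77, 85, 90, 85, 77, 75};
        for (String action : process(times, temps)) {
            System.out.println(action);
        }
    }
}
```

With the readings from the slide, only the 6:05 reading crosses the assumed threshold, so the action is produced at 6:05 rather than in a report the next day.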
Key to Real Time: Event-based Data Flows
•  Web events
•  Machine sensors
•  Biometrics
•  Mobile events
•  etc…
What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!
Use Case: Time Series Data
[Diagram: Sensor time-stamped data -> Stream Topic -> Spark Streaming (Spark processing) -> Data for real-time monitoring (read)]
Schema
•  All events are stored in CF data, which could be set to expire data
•  Filtered alerts are put in CF alerts
•  Daily summaries are put in CF stats
Event row keys contain the oil pump name, the date, and a time stamp; daily summary row keys contain the pump name and the date

Row key               | CF data       | CF alerts | CF stats
                      | hz    … psi   | psi …     | hz_avg … psi_min
COHUTTA_3/10/14_1:01  | 10.37   84    | 0         |
COHUTTA_3/10/14       |               |           | 10       0
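The row key layout shown in the schema can be sketched as simple string composition. This is an illustrative sketch only: the class and method names (`RowKeys`, `eventKey`, `dailyKey`) are hypothetical, and the underscore separator is taken from the sample keys on the slide.

```java
/** Sketch: composing row keys in the pump_date[_time] layout from the schema slide. */
final class RowKeys {
    /** Key for a single event row, e.g. COHUTTA_3/10/14_1:01. */
    static String eventKey(String pump, String date, String time) {
        return pump + "_" + date + "_" + time;
    }

    /** Key for a daily summary row, e.g. COHUTTA_3/10/14. */
    static String dailyKey(String pump, String date) {
        return pump + "_" + date;
    }

    public static void main(String[] args) {
        String event = eventKey("COHUTTA", "3/10/14", "1:01");
        String daily = dailyKey("COHUTTA", "3/10/14");
        System.out.println(event);
        System.out.println(daily);
        // The daily key is a prefix of every event key for the same pump and day.
        System.out.println(event.startsWith(daily));
    }
}
```

Because the daily key is a prefix of the event keys for the same pump and day, a store that sorts rows by key keeps those rows next to each other.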
What Do We Need to Do?
Data Sources -> Collect Data (?) -> Process Data (?) -> Store Data (?) -> Serve Data (?)
How do we do this with High Performance at Scale?
•  Run operations in parallel and minimize disk read/write time
Collect the Data
[Diagram: Source -> Data Ingest -> Stream Topic, MapR-FS]
•  Data Ingest:
–  File Based: NFS with MapR-FS, HDFS
–  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets...
MapR Streams Publish-Subscribe Messaging
Topics organize events into categories and decouple producers from consumers
Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
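One common way to spread a topic's messages over partitions is to hash the message key. The sketch below shows that idea under stated assumptions: `TopicPartitioner` is a hypothetical name, and `String.hashCode` is used only for illustration (real messaging clients typically use a different hash function).

```java
/** Sketch: assigning messages to topic partitions by hashing the message key. */
final class TopicPartitioner {
    /** Returns a partition index in the range [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 3;
        // Pump names other than COHUTTA are made up for this example.
        String[] keys = {"COHUTTA", "PUMP_A", "PUMP_B", "COHUTTA"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, partitions));
        }
    }
}
```

The same key always maps to the same partition, so events for one pump stay in order within their partition while different partitions can be read in parallel.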
How do we do this with High Performance at Scale?
•  Parallel, Partitioned = fast, scalable
–  Messaging with MapR Streams
Process the Data with Spark Streaming
[Diagram: Collect Data (Stream Topic, MapR-FS) -> Process Data]
•  Extension of the core Spark API
•  Enables scalable, high-throughput, fault-tolerant stream processing of live data
Processing Spark DStreams
DStream = a data stream divided into batches of X milliseconds
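Dividing a stream into fixed-length batches can be sketched without Spark. This is a simplified illustration, not Spark's implementation: `MicroBatcher` is a hypothetical name, events carry non-negative timestamps in milliseconds, and each event is assigned to the batch whose interval contains its timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: grouping timestamped events into fixed-length batches. */
final class MicroBatcher {
    /** Maps each batch start time to the values whose timestamps fall in that interval. */
    static SortedMap<Long, List<String>> batch(long[] timestamps, String[] values, long intervalMs) {
        SortedMap<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            // Round the timestamp down to the start of its interval.
            long batchStart = (timestamps[i] / intervalMs) * intervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(values[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 400L, 999L, 1000L, 2500L};
        String[] values = {"a", "b", "c", "d", "e"};
        // Prints {0=[a, b, c], 1000=[d], 2000=[e]}
        System.out.println(batch(timestamps, values, 1000L));
    }
}
```

Each map entry plays the role of one batch: a bounded collection that can be processed as a unit once its interval has passed.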
Spark Resilient Distributed Datasets
[Diagram: an RDD partitioned across three worker (W) executors, which hold partitions P4, P1 and P3, and P2]
Partition 1: 8213034705, 95, 2.927373, jake7870, 0……
Partition 2: 8213034705, 115, 2.943484, Davidbresler2, 1….
Partition 3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
Partition 4: 8213034705, 117, 2.998947, daysrus, 95….
Spark revolves around RDDs
•  Read only collection of elements
•  Partitioned across a cluster
•  Operated on in parallel
•  Cached in memory
How do we do this with High Performance at Scale?
•  Parallel, Partitioned = fast, scalable
–  Processing with Spark
Processing Spark DStreams
Transformations -> create new RDDs
Two types of operations on DStreams:
•  Transformations:
–  Create new DStreams
–  map, filter, reduceByKey, SQL. . .
•  Output Operations
[Diagram: a DStream is a series of RDDs (data from time 0 to 1 = RDD @ time 1, data from time 1 to 2 = RDD @ time 2, data from time 2 to 3 = RDD @ time 3); a transform applied to each RDD produces the RDDs of a new DStream]
Processing Spark DStreams
Output operations -> trigger computation
Two types of operations on DStreams:
•  Transformations
•  Output Operations: trigger computation
–  Save to File, HBase...
   •  saveAsHadoopFiles
   •  saveAsHadoopDataset
   •  saveAsTextFiles
[Diagram: each RDD of the DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3) goes through map and then save, writing to MapR-FS and MapR-DB]
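The split between transformations and output operations resembles lazy evaluation in Java streams, which can serve as a small analogy. This is not Spark code: `LazyPipeline` is a hypothetical name, `map` stands in for a transformation, and `collect` stands in for an output operation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Analogy: a lazy transformation does no work until a terminal operation runs. */
final class LazyPipeline {
    // Counts how many times the mapping function has actually been executed.
    static final AtomicInteger mapCalls = new AtomicInteger();

    /** Describes a transformation; nothing is computed yet. */
    static Stream<Integer> transform(List<Integer> batch) {
        return batch.stream().map(x -> {
            mapCalls.incrementAndGet();
            return x * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> doubled = transform(Arrays.asList(1, 2, 3));
        System.out.println(mapCalls.get()); // prints 0: map has not run yet

        // The terminal operation triggers the computation.
        List<Integer> result = doubled.collect(Collectors.toList());
        System.out.println(mapCalls.get()); // prints 3
        System.out.println(result);         // prints [2, 4, 6]
    }
}
```

In the same spirit, a DStream pipeline with only transformations describes work to be done, and an output operation such as a save is what causes each batch to be computed.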
What Do We Need to Do?
Data Sources -> Collect Data -> Process Data -> Store Data -> Serve Data
[Diagram components: Stream Topic, MapR-FS]
MapR-DB (HBase API) is Designed to Scale
[Diagram: a table split by key range into three partitions; each partition holds rows with Key, colB, and colC values]
Fast reads and writes by key! Data is automatically partitioned by key range!
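Partitioning by key range means a row key can be routed to its partition with a sorted lookup. The following sketch shows the idea with a `TreeMap`; it is not MapR-DB or HBase code, and the class name, partition names, split points, and pump names other than COHUTTA are all made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: routing a row key to the partition whose key range contains it. */
final class KeyRangeRouter {
    // Maps the first key of each range to a partition name.
    private final TreeMap<String, String> startKeyToPartition = new TreeMap<>();

    void addPartition(String startKey, String partitionName) {
        startKeyToPartition.put(startKey, partitionName);
    }

    /** Finds the partition with the greatest start key that is <= rowKey. */
    String partitionFor(String rowKey) {
        Map.Entry<String, String> entry = startKeyToPartition.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalArgumentException("no partition covers key " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        KeyRangeRouter router = new KeyRangeRouter();
        router.addPartition("", "partition-1");  // keys before "H"
        router.addPartition("H", "partition-2"); // keys from "H" up to "P"
        router.addPartition("P", "partition-3"); // keys from "P" onward

        System.out.println(router.partitionFor("COHUTTA_3/10/14_1:01")); // partition-1
        System.out.println(router.partitionFor("MOJO_3/10/14_1:01"));    // partition-2
        System.out.println(router.partitionFor("THERMALITO_3/10/14"));   // partition-3
    }
}
```

A lookup touches only one partition, which is why reads and writes by key stay fast as more partitions are added.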
Store Lots of Data with NoSQL MapR-DB
Storage Model:
•  RDBMS: Normalized schema -> Joins for queries can cause bottleneck
•  MapR-DB: De-Normalized schema -> Data that is read together is stored together
Key to Real Time: Event-based Data Flows
Key to Scale = Parallel Partitioned:
•  Messaging
•  Processing
•  Storage
Use Case Example Code
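KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Topic path in the form /stream-path:topic-name
String topic = "/streams/pump:warning";
public static KafkaProducer<String, String> producer;
Properties properties = new Properties();
// Assumption: keys are also Strings; the Kafka producer expects a key serializer to be configured
properties.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// Instantiate KafkaProducer with properties
producer = new KafkaProducer<String, String>(properties);
String txt = "msg text";
// Build a record for the topic and send it
ProducerRecord<String, String> rec =
    new ProducerRecord<String, String>(topic, txt);
producer.send(rec);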
Create a DStream
DStream: a sequence of RDDs representing a stream of data
// Streaming context with a 5-second batch interval
val ssc = new StreamingContext(sparkConf, Seconds(5))
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
[Diagram: dStream as a series of batches (time 0 to 1, time 1 to 2, time 2 to 3), each stored in memory as an RDD]
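Process DStream
// Keep the message value (second element of each key/value pair), then parse it
val sensorDStream = dStream.map(_._2).map(parseSensor)
[Diagram: map is applied to each batch RDD of dStream (time 0 to 1, time 1 to 2, time 2 to 3) to produce the RDDs of sensorDStream; new RDDs are created for every batch]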
Message Data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

// Parse one comma-separated message into a Sensor object
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
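The same message-to-object parsing can be sketched in plain Java with an added check for malformed input. This is an illustration under assumptions: `SensorReading` is a hypothetical class that mirrors the fields of `Sensor`, the field-count check is an addition, and the sample values other than hz 10.37 and psi 84 are made up.

```java
/** Sketch: parsing one comma-separated sensor message, rejecting malformed lines. */
final class SensorReading {
    final String resid, date, time;
    final double hz, disp, flo, sedPPM, psi, chlPPM;

    SensorReading(String resid, String date, String time, double hz, double disp,
                  double flo, double sedPPM, double psi, double chlPPM) {
        this.resid = resid; this.date = date; this.time = time;
        this.hz = hz; this.disp = disp; this.flo = flo;
        this.sedPPM = sedPPM; this.psi = psi; this.chlPPM = chlPPM;
    }

    /** Parses "resid,date,time,hz,disp,flo,sedPPM,psi,chlPPM". */
    static SensorReading parse(String line) {
        String[] p = line.split(",");
        if (p.length != 9) {
            throw new IllegalArgumentException("expected 9 fields, got " + p.length);
        }
        return new SensorReading(p[0], p[1], p[2],
                Double.parseDouble(p[3]), Double.parseDouble(p[4]),
                Double.parseDouble(p[5]), Double.parseDouble(p[6]),
                Double.parseDouble(p[7]), Double.parseDouble(p[8]));
    }

    public static void main(String[] args) {
        SensorReading r = parse("COHUTTA,3/10/14,1:01,10.37,1.73,119,1.59,84,1.55");
        System.out.println(r.resid + " hz=" + r.hz + " psi=" + r.psi);
    }
}
```

Checking the field count up front turns a short or truncated message into a clear error instead of an index-out-of-bounds failure deep inside a batch.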
DataFrame and SQL Operations
// for each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._  // needed for rdd.toDF()
  rdd.toDF().registerTempTable("sensor")
  // Triple quotes allow the SQL string to span several lines
  val res = sqlContext.sql(
    """SELECT resid, date,
      max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
      max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
      max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
      max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
      FROM sensor GROUP BY resid, date""")
  res.show()
}
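What the GROUP BY query computes can be sketched for a single column in plain Java. This is an illustration, not part of the application: `PsiSummary` and `Reading` are hypothetical names, only psi is aggregated, and the readings are made-up values.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Sketch: min, max, and average psi per (resid, date) group. */
final class PsiSummary {
    static final class Reading {
        final String resid;
        final String date;
        final double psi;

        Reading(String resid, String date, double psi) {
            this.resid = resid;
            this.date = date;
            this.psi = psi;
        }
    }

    /** Groups readings by resid and date, then collects summary statistics for psi. */
    static Map<String, DoubleSummaryStatistics> summarize(List<Reading> readings) {
        return readings.stream().collect(Collectors.groupingBy(
                r -> r.resid + "_" + r.date,
                TreeMap::new,
                Collectors.summarizingDouble(r -> r.psi)));
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("COHUTTA", "3/10/14", 84),
                new Reading("COHUTTA", "3/10/14", 80),
                new Reading("COHUTTA", "3/10/14", 88));
        DoubleSummaryStatistics stats = summarize(readings).get("COHUTTA_3/10/14");
        System.out.println("min=" + stats.getMin() + " max=" + stats.getMax()
                + " avg=" + stats.getAverage());
    }
}
```

The grouping key combines resid and date, which matches the daily summary row key layout used in the schema.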
Streaming Application Output
Save to HBase
// Convert each Sensor to an HBase Put, then write the batch to the table
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
Output operation: persist data to external storage
[Diagram: each batch (time 0 to 1, time 1 to 2, time 2 to 3) of the linesRDD DStream is mapped to the sensorRDD DStream and then saved; Put objects are written to HBase]
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Building a Complete Data Architecture
[Diagram: MapR Converged Data Platform, made up of MapR File System (MapR-FS), MapR Database (MapR-DB), and MapR Streams, with Sources/Apps, Stream Processing, and Bulk Processing]
To Learn More:
•  Read an explanation of the code and download it
–  https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
–  https://www.mapr.com/blog/spark-streaming-hbase
•  http://learn.mapr.com/
Q&A
@mapr
@caroljmcdonald
https://www.mapr.com/blog/author/carol-mcdonald
Engage with us!
mapr-technologies

Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API and the HBase API

  • 1. ® © 2016 MapR Technologies 1® © 2016 MapR Technologies 1© 2016 MapR Technologies ® Exploring Data Pipelines for Spark Streaming Applications Carol McDonald, Industry Solutions Architect 2016
  • 2. ® © 2016 MapR Technologies 2® © 2016 MapR Technologies 2 What is Streaming Data? Got Some Examples? Data Collection Devices Smart Machinery Phones and Tablets Home Automation RFID Systems Digital Signage Security Systems Medical Devices
  • 3. ® © 2016 MapR Technologies 3® © 2016 MapR Technologies 3 It was hot at 6:05 yesterday ! Why Stream Processing? Analyze 6:01 P.M.: 72° 6:02 P.M.: 75° 6:03 P.M.: 77° 6:04 P.M.: 85° 6:05 P.M.: 90° 6:06 P.M.: 85° 6:07 P.M.: 77° 6:08 P.M.: 75° 90°90° 6:01 P.M.: 72° 6:02 P.M.: 75° 6:03 P.M.: 77° 6:04 P.M.: 85° 6:05 P.M.: 90° 6:06 P.M.: 85° 6:07 P.M.: 77° 6:08 P.M.: 75° Batch processing may be too late for some events
  • 4. ® © 2016 MapR Technologies 4® © 2016 MapR Technologies 4 Why Stream Processing? 6:05 P.M.: 90° To pic Stream Temperature Turn on the air conditioning! It’s becoming important to process events as they arrive
  • 5. ® © 2016 MapR Technologies 5® © 2016 MapR Technologies 5 Key to Real Time: Event-based Data Flows web events etc… machine sensors Biometrics Mobile events
  • 6. ® © 2016 MapR Technologies 6® © 2016 MapR Technologies 6 What if BP had detected problems before the oil hit the water ? •  1M samples/sec •  High performance at scale is necessary!
  • 7. ® © 2016 MapR Technologies 7® © 2016 MapR Technologies 7 Use Case: Time Series Data Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
  • 8. ® © 2016 MapR Technologies 8® © 2016 MapR Technologies 8 Schema •  All events stored, CF data could be set to expire data •  Filtered alerts put in CF alerts •  Daily summaries put in CF stats Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0 Row Key contains oil pump name, date, and a time stamp
  • 9. ® © 2016 MapR Technologies 9® © 2016 MapR Technologies 9 Schema •  All events stored, CF data could be set to expire data •  Filtered alerts put in CF alerts •  Daily summaries put in CF stats Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0
  • 10. ® © 2016 MapR Technologies 10® © 2016 MapR Technologies 10 Schema •  All events stored, CF data could be set to expire data •  Filtered alerts put in CF alerts •  Daily summaries put in CF stats Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0
  • 11. ® © 2016 MapR Technologies 11® © 2016 MapR Technologies 11 Serve DataStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ?
  • 12. ® © 2016 MapR Technologies 12® © 2016 MapR Technologies 12 How do we do this with High Performance at Scale? •  Parallel operations and minimize disk read/write time
  • 13. ® © 2016 MapR Technologies 13® © 2016 MapR Technologies 13 Collect the Data Data Ingest MapR-FS Source Stream Topic •  Data Ingest: –  File Based: NFS with MapR-FS, HDFS –  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets...
  • 14. ® © 2016 MapR Technologies 14® © 2016 MapR Technologies 14 MapR Streams Publish Subscribe Messaging Topics Organize Events into Categories and decouple Producers from Consumers
  • 15. ® © 2016 MapR Technologies 15® © 2016 MapR Technologies 15 Scalable Messaging with MapR Streams Topics are partitioned for throughput and scalability
  • 16. ® © 2016 MapR Technologies 16® © 2016 MapR Technologies 16 How do we do this with High Performance at Scale? •  Parallel , Partitioned = fast , scalable –  Messaging with MapR Streams
  • 17. ® © 2016 MapR Technologies 17® © 2016 MapR Technologies 17 Collect Data Process the Data with Spark Streaming MapR-FS Process Data Stream Topic •  Extension of the core Spark AP •  Enables scalable, high-throughput, fault-tolerant stream processing of live data
  • 18. ® © 2016 MapR Technologies 18® © 2016 MapR Technologies 18 Processing Spark DStreams Data stream divided into batches of X milliseconds = DStreams
  • 19. ® © 2016 MapR Technologies 19® © 2016 MapR Technologies 19 Spark Resilient Distributed Datasets RDD W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. Spark revolves around RDDs •  Read only collection of elements •  Partitioned across a cluster •  Operated on in parallel •  Cached in memory
  • 20. ® © 2016 MapR Technologies 20® © 2016 MapR Technologies 20 Spark Resilient Distributed Datasets Spark revolves around RDDs •  Read only collection of elements •  Partitioned across a cluster •  Operated on in parallel •  Cached in memory
  • 21. ® © 2016 MapR Technologies 21® © 2016 MapR Technologies 21 How do we do this with High Performance at Scale? •  Parallel , Partitioned = fast , scalable –  Processing with Spark
  • 22. ® © 2016 MapR Technologies 22® © 2016 MapR Technologies 22 Processing Spark DStreams transformations à create new RDDs Two types of operations on DStreams: •  Transformations: –  Create new DStreams –  map, filter, reduceByKey, SQL. . . •  Output Operations DStream RDDs DStream RDDs transform  transform   data from time 0 to 1 RDD @ time 1 data from time 1 to 2 RDD @ time 2 data from time 2 to 3 RDD @ time 3 RDD @ time 3 transform   RDD @ time 1 RDD @ time 2
  • 23. ® © 2016 MapR Technologies 23® © 2016 MapR Technologies 23 Two types of operations on DStreams •  Transformations •  Output Operations: trigger Computation –  Save to File, HBase.. •  saveAsHadoopFiles •  saveAsHadoopDataset •  saveAsTextFiles Processing Spark DStreams Output operations à trigger computation MapR-FS MapR-DB DStream RDDs data from time 0 to 1 data from time 1 to 2 data from time 2 to 3 RDD @ time 3RDD @ time 1 RDD @ time 2 mapmap map savesave save
  • 24. ® © 2016 MapR Technologies 24® © 2016 MapR Technologies 24 Serve DataStore DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources MapR-FS Stream Topic
  • 25. ® © 2016 MapR Technologies 25® © 2016 MapR Technologies 25 MapR-DB (HBase API) is Designed to Scale Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Key colB col C val val val xxx val val Key colB col C val val val xxx val val Key colB col C val val val xxx val val Fast Reads and Writes by Key! Data is automatically partitioned by Key Range!
  • 26. ® © 2016 MapR Technologies 26® © 2016 MapR Technologies 26 Store Lots of Data with NoSQL MapR-DB bottleneck Key colB col C val val val xxx val val Key colB col C val val val xxx val val Key colB col C val val val xxx val val Storage ModelRDBMS MapR-DB Normalized schema à Joins for queries can cause bottleneck De-Normalized schema à Data that is read together is stored together
  • 27. ® © 2016 MapR Technologies 27® © 2016 MapR Technologies 27 Key to Real Time: Event-based Data Flows Key to Scale = Parallel Partitioned: •  Messaging •  Processing •  Storage
  • 28. ® © 2016 MapR Technologies 28® © 2016 MapR Technologies 28 Serve DataStore DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources MapR-FS Stream Topic
  • 29. ® © 2016 MapR Technologies 29® © 2016 MapR Technologies 29 Use Case Example Code Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
  • 30. ® © 2016 MapR Technologies 30® © 2016 MapR Technologies 30 Use Case Example Code Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
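  • 31. KafkaProducer
      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      String topic = "/streams/pump:warning";
      public static KafkaProducer<String, String> producer;
      Properties properties = new Properties();
      // The Kafka producer API expects a key serializer as well as a value serializer
      properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      // Instantiate KafkaProducer with properties
      producer = new KafkaProducer<String, String>(properties);
      String txt = "msg text";
      // Wrap the message text in a record addressed to the topic, then send it
      ProducerRecord<String, String> rec = new ProducerRecord<String, String>(topic, txt);
      producer.send(rec);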
  • 32. Use Case Example Code. [Diagram: Sensor → time-stamped data → Stream Topic → read → Spark Streaming → Spark processing → data for real-time monitoring]
  • 33. Create a DStream
      DStream: a sequence of RDDs representing a stream of data
      val ssc = new StreamingContext(sparkConf, Seconds(5))
      val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
      [Diagram: each batch interval of dStream (time 0 to 1, 1 to 2, 2 to 3) is stored in memory as an RDD]
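The batching behind a DStream can be illustrated without Spark. The sketch below is plain Scala under stated assumptions: `Event`, `batchIndex`, and `toBatches` are hypothetical names, and real Spark Streaming forms batches inside the streaming engine rather than through a helper like this. It only mirrors the idea from slide 33 that a stream becomes a sequence of per-interval collections, with 5000 ms standing in for `Seconds(5)`.

```scala
object MicroBatchSketch {
  // Hypothetical event: an arrival time in milliseconds plus a payload line.
  final case class Event(arrivalMs: Long, line: String)

  // Index of the batch interval an event falls into, for a given batch length.
  def batchIndex(e: Event, batchMs: Long): Long = e.arrivalMs / batchMs

  // Group events into per-interval batches, ordered by interval,
  // mirroring a DStream as a sequence of RDDs, one per batch interval.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => batchIndex(e, batchMs))
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) => (idx, es.sortBy(_.arrivalMs).map(_.line)) }
}
```

With a 5000 ms batch length, events arriving at 1000 ms and 4999 ms land in batch 0, and an event arriving at 5000 ms opens batch 1.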
  • 34. Process DStream
      val sensorDStream = dStream.map(_._2).map(parseSensor)
      [Diagram: map is applied to each dStream RDD (batch time 0 to 1, 1 to 2, 2 to 3), producing the sensorDStream RDDs. New RDDs are created for every batch.]
  • 35. Message Data to Sensor Object
      case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
        flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

      // Split a comma-separated message into its nine fields and build a Sensor
      def parseSensor(str: String): Sensor = {
        val p = str.split(",")
        Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
          p(6).toDouble, p(7).toDouble, p(8).toDouble)
      }
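As a usage sketch for the parser on slide 35, the block below repeats the slide's `Sensor` and `parseSensor` so that it runs on its own, then adds two things that are not in the deck: a hypothetical `parseSensorSafe` wrapper that returns `None` for malformed lines instead of throwing, and a made-up `sampleLine` whose values are for illustration only.

```scala
import scala.util.Try

object SensorParsing {
  case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                    flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

  // Same field order as the slide's parseSensor; throws on short or non-numeric input.
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }

  // Hypothetical wrapper (not in the deck): None instead of an exception for bad lines.
  def parseSensorSafe(str: String): Option[Sensor] = Try(parseSensor(str)).toOption

  // Made-up sample record, for illustration only.
  val sampleLine: String = "COHUTTA,3/10/14,1:01,10.37,1.73,17.0,1.42,84.0,0.0"
}
```

One design consideration: a single malformed message makes the slide's `parseSensor` throw, which would fail the whole batch, so a wrapper along these lines is one way to drop bad records instead.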
  • 36. DataFrame and SQL Operations
      // for each RDD in the DStream
      sensorDStream.foreachRDD { rdd =>
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._ // needed for rdd.toDF()
        // Expose this batch as a temporary table that SQL can query
        rdd.toDF().registerTempTable("sensor")
        val res = sqlContext.sql(
          """SELECT resid, date,
               max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
               max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
               max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
               max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
             FROM sensor GROUP BY resid, date""")
        res.show()
      }
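What the GROUP BY on slide 36 computes for each batch can be sketched with plain Scala collections. This is not how Spark SQL executes the query. `Reading`, `HzStats`, and `hzStats` are hypothetical names, and only the `hz` column is aggregated to keep the sketch short.

```scala
object SensorStatsSketch {
  // Reduced, hypothetical reading type: only the grouping columns and hz.
  final case class Reading(resid: String, date: String, hz: Double)
  final case class HzStats(maxhz: Double, minhz: Double, avghz: Double)

  // Collection equivalent of:
  //   SELECT resid, date, max(hz), min(hz), avg(hz) FROM sensor GROUP BY resid, date
  def hzStats(readings: Seq[Reading]): Map[(String, String), HzStats] =
    readings
      .groupBy(r => (r.resid, r.date))
      .map { case (key, rs) =>
        val hz = rs.map(_.hz)
        key -> HzStats(hz.max, hz.min, hz.sum / hz.size)
      }
}
```

Each key of the resulting map corresponds to one output row of the SQL query: one pump (`resid`) on one `date`.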
  • 37. Streaming Application Output
  • 38. Save to HBase
      Output operation: persist data to external storage. Put objects are written to HBase.
      rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
      [Diagram: the linesRDD DStream is mapped to the sensorRDD DStream, and each batch (time 0 to 1, 1 to 2, 2 to 3) is then saved]
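`Sensor.convertToPut` is not shown on the slide. One piece of it that can be sketched without the HBase client library is the row key. The sketch below assumes the format of the example keys on the schema slide (slide 8): pump name, date, and time stamp joined with underscores, for example `COHUTTA_3/10/14_1:01`. The function names `eventRowKey` and `dailyRowKey` are hypothetical, and the author's actual code may build the key differently.

```scala
object RowKeySketch {
  // Row key for one event: pump name, date, and time joined with underscores.
  def eventRowKey(resid: String, date: String, time: String): String =
    s"${resid}_${date}_${time}"

  // Row key for a daily summary row: pump name and date only.
  def dailyRowKey(resid: String, date: String): String =
    s"${resid}_${date}"
}
```

Because the daily key is a prefix of that day's event keys, a pump's summary row and its events for the same day sort next to each other in a store that orders rows by key.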
  • 39. Start Receiving Data
      sensorDStream.foreachRDD { rdd => . . . }
      // Start the computation
      ssc.start()
      // Wait for the computation to terminate
      ssc.awaitTermination()
  • 40. Building a Complete Data Architecture. [Diagram: Sources/Apps feed the MapR Converged Data Platform, which comprises MapR Streams, MapR Database (MapR-DB), and MapR File System (MapR-FS), and supports Stream Processing and Bulk Processing]
  • 41. To Learn More:
      •  Read the explanation and download the code:
         –  https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
         –  https://www.mapr.com/blog/spark-streaming-hbase
  • 42. To Learn More:
      •  http://learn.mapr.com/
  • 43. Q&A. Engage with us! @mapr, @caroljmcdonald, mapr-technologies, https://www.mapr.com/blog/author/carol-mcdonald