1© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Real Time Data Processing
using Spark Streaming
2© Cloudera, Inc. All rights reserved.
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at CAGR of 25%
How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
3© Cloudera, Inc. All rights reserved.
Use Cases Across Industries
• Credit: Identify fraudulent transactions as soon as they occur.
• Transportation: Dynamic re-routing of traffic or vehicle fleet.
• Retail: Dynamic inventory management; real-time in-store offers and recommendations.
• Consumer Internet & Mobile: Optimize user engagement based on user’s current behavior.
• Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.
• Manufacturing: Identify equipment failures and react instantly; perform proactive maintenance.
• Surveillance: Identify threats and intrusions in real time.
• Digital Advertising & Marketing: Optimize and personalize content based on real-time information.
4© Cloudera, Inc. All rights reserved.
From Volume and Variety to Velocity
Big Data has evolved
• Past: Big-Data = Volume + Variety
• Present: Big-Data = Volume + Variety + Velocity
Hadoop Ecosystem evolves as well…
• Past: Batch Processing, time to insight of hours
• Present: Batch + Stream Processing, time to insight of seconds
5© Cloudera, Inc. All rights reserved.
Key Components of Streaming Architectures
• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• System Management
• Security
• Data Management & Integration
6© Cloudera, Inc. All rights reserved.
Canonical Stream Processing Architecture
[Diagram labels: Data Sources; Data Ingest (Kafka, Flume); Kafka; App 1, App 2, ...; HDFS; HBase]
7© Cloudera, Inc. All rights reserved.
Spark: Easy and Fast Big Data
• Easy to Develop: 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run: up to 10× faster on disk, 100× in memory
• General execution graphs
• In-memory storage
8© Cloudera, Inc. All rights reserved.
Spark Architecture
[Diagram: one Driver and three Workers; each Worker holds Data in RAM]
9© Cloudera, Inc. All rights reserved.
RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• An operation on one RDD creates a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when data-set does not fit in memory
b. Provides fault-tolerance through concept of lineage
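The three properties above (immutability, lazy materialization, and recovery through lineage) can be illustrated with a small toy model. This is only a sketch in plain Scala, not Spark code: `MiniRDD`, `fromSource`, and `readLog` are hypothetical names invented for the illustration, and "stable storage" is modeled as a function that can be called again to re-read the data.

```scala
// Toy model of an RDD: immutable, lazily evaluated, recomputable from lineage.
// MiniRDD is a hypothetical name for this sketch; it is not part of Spark.
object LineageSketch {
  final class MiniRDD[A](val lineage: List[String], compute: () => Seq[A]) {
    // Each transformation returns a new MiniRDD and records one more lineage step.
    def map[B](f: A => B): MiniRDD[B] =
      new MiniRDD(lineage :+ "map", () => compute().map(f))

    def filter(p: A => Boolean): MiniRDD[A] =
      new MiniRDD(lineage :+ "filter", () => compute().filter(p))

    // Nothing runs until an action such as collect() is called.
    def collect(): Seq[A] = compute()
  }

  // "Stable storage" is modeled as a function that re-reads the source data.
  def fromSource[A](read: () => Seq[A]): MiniRDD[A] =
    new MiniRDD(List("source"), read)

  var reads = 0 // counts how often the source is read
  def readLog(): Seq[String] = {
    reads += 1
    Seq("INFO ok", "ERROR disk", "ERROR net")
  }
}
```

Because nothing is cached in this model, every `collect()` replays the whole lineage from the source, which is the same mechanism that lets a lost partition be rebuilt after a failure.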
10© Cloudera, Inc. All rights reserved.
Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The Framework Provides
Fault Tolerance
Scalability
High-Throughput
11© Cloudera, Inc. All rights reserved.
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
12© Cloudera, Inc. All rights reserved.
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
“Micro-batch” Architecture: stream composed of small (1-10s) batch computations
[Diagram: for each batch (t, t+1, t+2) of the tweets DStream, flatMap produces the matching batch of the hashTags DStream, which is then saved]
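To make the micro-batch idea concrete, here is a minimal sketch in plain Scala (no Spark dependency) that cuts a sequence of timestamped events into fixed-length batches and applies the same `flatMap` to each one. `Event`, `discretize`, and `hashTagsPerBatch` are hypothetical names, and this `getTags` is an assumed stand-in for the helper the slide references without defining.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, text: String)

  // Group events into consecutive batches of batchMs milliseconds,
  // keyed by the start time of each batch.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchMs * batchMs)
      .toSeq
      .sortBy(_._1)

  // Assumed stand-in for the slide's getTags: keep words starting with '#'.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(_.startsWith("#"))

  // Apply the same flatMap to every micro-batch, one batch at a time.
  def hashTagsPerBatch(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    discretize(events, batchMs).map { case (start, batch) =>
      (start, batch.flatMap(e => getTags(e.text)))
    }
}
```

A batch that contains no hashtags still shows up with an empty result, mirroring the way each interval yields its own (possibly empty) batch.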
13© Cloudera, Inc. All rights reserved.
Use DStreams for Windowing Functions
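The slide gives no detail on windowing, so the following is a sketch under stated assumptions rather than the speaker's example. In Spark Streaming, window operations such as `window` and `reduceByKeyAndWindow` combine several consecutive micro-batches into one result. The plain-Scala model below shows that idea with a window length and slide interval measured in batches; `windowedCounts` is a hypothetical name.

```scala
object WindowSketch {
  // Each inner Seq is one micro-batch of hashtags, oldest first.
  // windowLen and slide are counted in batches (an assumption of this sketch;
  // Spark Streaming expresses both as durations).
  def windowedCounts(
      batches: Seq[Seq[String]],
      windowLen: Int,
      slide: Int): Seq[Map[String, Int]] =
    batches
      .sliding(windowLen, slide)
      .filter(_.size == windowLen) // drop a trailing partial window
      .map(w => w.flatten.groupBy(identity).map { case (tag, occ) => tag -> occ.size })
      .toList
}
```

With a window of two batches sliding by one, every batch contributes to two consecutive results, which is why windowed counts change smoothly as new data arrives.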
14© Cloudera, Inc. All rights reserved.
Spark Streaming
• Runs as a Spark job
• YARN or standalone for scheduling
• YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs.
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, and others.
• Easy to write “Receivers” for custom messaging systems.
15© Cloudera, Inc. All rights reserved.
Sharing Code between Batch and Streaming
Library that filters “ERRORS”:
def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well
16© Cloudera, Inc. All rights reserved.
Sharing Code between Batch and Streaming
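Spark:
val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)
Spark Streaming:
val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
// Flume events carry a byte payload; decode it to a String before filtering
val lines = dStream.map(e => new String(e.event.getBody.array()))
// transform (unlike foreachRDD, which returns Unit) yields a new DStream
val filtered = lines.transform((rdd: RDD[String], time: Time) => filterErrors(rdd))
filtered.saveAsTextFiles(…)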
17© Cloudera, Inc. All rights reserved.
Reliability
• Received data automatically persisted to HDFS Write Ahead Log (WAL) to prevent data loss
• set spark.streaming.receiver.writeAheadLog.enable=true in spark conf
• When the YARN ApplicationMaster (AM) dies, the application is restarted by YARN
• Received, ack-ed and unprocessed data replayed from WAL (data that made it into blocks)
• Reliable Receivers can replay data from the original source, if required
• Un-acked data replayed from source.
• Kafka, Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = No data loss on driver or receiver failure!
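The recovery flow above can be sketched as a small in-memory model. This is an illustration in plain Scala, not Spark's implementation: `WriteAheadLog`, `Receiver`, and `recover` are hypothetical names, and a mutable map stands in for the log files that would live on HDFS. The point it shows is the ordering: a record is written to the log before it is acknowledged, so anything acknowledged but not yet processed can be replayed after a restart.

```scala
object WalSketch {
  import scala.collection.mutable

  // In-memory stand-in for a write-ahead log on durable storage such as HDFS.
  final class WriteAheadLog {
    private val entries = mutable.LinkedHashMap.empty[Long, String]
    def append(id: Long, record: String): Unit = entries.update(id, record)
    def markProcessed(id: Long): Unit = { entries.remove(id); () }
    def unprocessed: Seq[(Long, String)] = entries.toList
  }

  final class Receiver(wal: WriteAheadLog) {
    // Persist first, then acknowledge to the source by returning the id.
    def receive(id: Long, record: String): Long = {
      wal.append(id, record)
      id
    }
  }

  // After a restart, everything still in the log is handed back for processing.
  def recover(wal: WriteAheadLog): Seq[String] = wal.unprocessed.map(_._2)
}
```

Records that were never acknowledged are absent from the log by construction; those are the ones a reliable receiver asks the original source to send again.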
18© Cloudera, Inc. All rights reserved.
Kafka Connectors
• Reliable Kafka DStream
• Stores received data to Write Ahead Log on HDFS for replay
• No data loss
• Stable and supported!
• Direct Kafka DStream
• Uses low level API to pull data from Kafka
• Replays from Kafka on driver failure
• No data loss
• Experimental
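Why the direct approach can replay from Kafka without a write-ahead log is easier to see with a toy model. The sketch below is plain Scala and does not use the Kafka or Spark APIs: `OffsetRange` here is a hypothetical case class defined for the illustration, and a `Vector` stands in for one Kafka partition. It assumes that each batch is defined by a range of offsets, so re-reading the same range after a failure returns the same messages.

```scala
object DirectStreamSketch {
  // Offsets covered by one batch; until is exclusive.
  final case class OffsetRange(from: Long, until: Long)

  // The log stands in for one Kafka partition, which retains messages by offset.
  def read(log: Vector[String], r: OffsetRange): Seq[String] =
    log.slice(r.from.toInt, r.until.toInt)

  // Plan the next batch: start where the last completed batch ended.
  def nextRange(lastCompleted: Long, latest: Long, maxPerBatch: Long): OffsetRange =
    OffsetRange(lastCompleted, math.min(latest, lastCompleted + maxPerBatch))
}
```

If the driver fails midway through a batch, only the offset range has to be remembered; calling `read` with that range again yields an identical batch.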
19© Cloudera, Inc. All rights reserved.
Flume Connector
• Flume Polling DStream
• Copy the Spark sink (available from Maven) into Flume’s plugin directory
• Flume Polling Receiver polls the sink to receive data
• Replays received data from WAL on HDFS
• No data loss
• Stable and Supported!
20© Cloudera, Inc. All rights reserved.
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
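As one illustration of the "joins of multiple streams" use case, the sketch below joins two micro-batches that cover the same interval by key. It is plain Scala over ordinary collections, with `joinBatches` as a hypothetical helper; it assumes an inner join where a key that appears several times on one side produces one output row per matching pair.

```scala
object JoinSketch {
  // Inner-join two micro-batches of (key, value) pairs from the same interval.
  def joinBatches[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val rightByKey: Map[K, Seq[(K, B)]] = right.groupBy(_._1)
    for {
      (k, a) <- left
      (_, b) <- rightByKey.getOrElse(k, Seq.empty[(K, B)])
    } yield (k, (a, b))
  }
}
```

Keys present in only one of the two batches are dropped, so the composite output contains only users seen in both sources during that interval.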
21© Cloudera, Inc. All rights reserved.
What is coming?
• Run on Secure YARN for more than 7 days!
• Better Monitoring and alerting
• Batch-level and task-level monitoring
• SQL on Streaming
• Run SQL-like queries on top of Streaming (medium – long term)
• Python!
• Limited support coming in Spark 1.3
22© Cloudera, Inc. All rights reserved.
Current Spark project status
• 400+ contributors and 50+ companies contributing
• Includes: Databricks, Cloudera, Intel, Yahoo!, etc.
• Dozens of production deployments
• Spark Streaming Survived Netflix Chaos Monkey – production ready!
• Included in CDH!
23© Cloudera, Inc. All rights reserved.
More Info..
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• Github: https://github.com/apache/spark
24© Cloudera, Inc. All rights reserved.
Thank you
hshreedharan@cloudera.com
15% Discount Code for Cloudera Training PNWCUG_15
university.cloudera.com



Editor's Notes

  • #3: The Narrative: Vast quantities of streaming data are being generated, and more will be generated thanks to phenomena such as the internet of things. The motivation for Real-Time Stream processing is to turn all this data into valuable insights and actions, as soon as the data is generated. Instant processing of the data also opens the door to new use cases that were not possible before. NOTE: Feel free to remove the cheesy image of “The Flash”, if it feels unprofessional or overly cheesy.
  • #5: The Narrative: As you can see from the previous slides, lots of streaming data will be generated. Making this data actionable in real time is very valuable across industries. Our very own Hadoop is all you need. Previously Hadoop was associated just with “big unstructured data”. That was Hadoop’s selling point. But now, Hadoop can also handle real-time data (in addition to big unstructured data). So think Hadoop when you think Real-Time Streaming. Purpose of the slide: Goal is to associate Hadoop with real-time, to get people to think Hadoop when they think real-time streaming data.
  • #11: Purpose of this Slide: Make sure to associate Spark Streaming with Apache Spark, so folks know it is a part of THE Apache Spark that everyone is talking about. List some of the key properties that make Spark Streaming a good platform for stream processing. Touch upon the key attributes that make it good for stream processing. Note: If required, we can mention low latency as well.