Spark Streaming and IoT
Michael J. Freedman
iobeam
Technology confluence in IoT
UBIQUITOUS SENSORS
REAL-TIME SYSTEMS
MACHINE LEARNING
DATA ANALYSIS
INTERSECTION OF 3 MAJOR TRENDS
Data analysis is the killer app
CASE STUDY: PREDICTIVE MAINTENANCE
Predicting motor failure through analysis of vibration data
CASE STUDY: HEALTH & FITNESS
Exercise identification based on 3D motion data analysis
CASE STUDY: SMART CITIES
Traffic and air quality monitoring via GPS and environmental sensors
CASE STUDY: SMART GRID
Demand-response optimizations on supply-side capacity, spot prices
Challenges in applying Spark to IoT
REQUIREMENTS
1. One IoT app performs tasks at different time intervals
2. Devices send data at varying delays and rates
3. Within org, multiple IoT apps run concurrently
CHALLENGES
1. Supporting full spectrum of batch to real-time analysis
2. Handling delayed data transparently
3. Processing many low-volume, independent streams
4. Multi-tenancy with low-volume apps and high utilization
Potential economic impact of IoT is >$11 trillion per year, even while 99% of IoT data goes unused today. (2015 McKinsey study)
Required: Programming + data infra abstractions
Challenge 1: Supporting full spectrum of batch to real-time analysis
IoT analysis spans many intervals
BATCH PROCESSING (HOURS, NIGHTLY)
STREAM PROCESSING (REAL-TIME)
Fire / Hazard Detection: immediately
Bus Location Updates: 15 sec
Traffic Conditions: 1 min
Environmental Conditions: 15 min
Traffic Optimizations: daily
Spark simplifies programming across intervals
val readings = iobeamInterface.getInputStreamRecords()
// Trigger temperatures that fall outside acceptable conditions
val bad_temps = readings.filter(t => t > highTempThreshold || t < lowTempThreshold)
val triggers = new TriggerEventStream(bad_temps.map(t => new TriggerEvent("bad_temperature", t)))
// Compute mean temperatures over 5 min windows
val windows = readings.groupByKeyAndWindow(Seconds( 300 ), Seconds(60))
val mean_temps = new TimeSeriesStream("mean_temperature", windows.map(t => t.sum / t.length))
new OutputStreams(mean_temps, triggers)
[Slide annotation, presumably alternative window arguments in seconds: 30 and 1800]
But programming != data abstractions
DATA STREAMS (KAFKA, FLUME, SOCKETS, ETC.)
DATA FILES (HDFS, ETC.)
BATCH PROCESSING (HOURS, NIGHTLY)
STREAM PROCESSING (REAL-TIME)
Programming != data abstractions
Traffic Conditions: 30 sec
Traffic Conditions: 1 hour
1. Frequencies change as products evolve
Programming != data abstractions
1. Frequencies change as products evolve
2. Joining real-time with historical data
5 min mean vs.
- trailing hourly mean
- hourly mean from yesterday
- hourly mean from last week
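The comparison on this slide can be sketched in plain Scala collections (not Spark and not the iobeam API). The `Reading` type, the field names, and the second-based timestamps are assumptions made for illustration; the point is only that the recent mean and each historical mean are the same range query over generation timestamps, shifted by an offset.

```scala
object TrailingMeans {
  // Hypothetical reading: generation timestamp in seconds plus a measured value.
  final case class Reading(timestampSec: Long, value: Double)

  // Mean of the readings whose timestamp falls in [fromSec, untilSec); None if the range is empty.
  def meanBetween(readings: Seq[Reading], fromSec: Long, untilSec: Long): Option[Double] = {
    val values = readings.collect {
      case Reading(ts, v) if ts >= fromSec && ts < untilSec => v
    }
    if (values.isEmpty) None else Some(values.sum / values.size)
  }

  // Latest 5 min mean minus the hourly mean that ends `offsetSec` before now.
  // offsetSec = 0 gives the trailing hour, 86400 yesterday, 604800 last week.
  def recentVsHourly(readings: Seq[Reading], nowSec: Long, offsetSec: Long): Option[Double] =
    for {
      recent <- meanBetween(readings, nowSec - 300, nowSec)
      hourly <- meanBetween(readings, nowSec - offsetSec - 3600, nowSec - offsetSec)
    } yield recent - hourly
}
```

In this sketch the historical side is just an in-memory sequence; in practice it would come from an archive, which is where the programming model and the data abstraction diverge.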
Programming != data abstractions
1. Frequencies change as products evolve
2. Joining real-time with historical data
3. Supporting backfill for delayed data
Data Series Abstraction
Challenge 2: Handling delayed data transparently
Windows in streaming DBs
Tumbling windows
[Timeline figure: t_i, t_j, t_k]
Windows in streaming DBs
Sliding windows
• Defined over # of tuples
• Defined over time period
…using arrival_time of tuples
[Timeline figure: t_i, t_j, t_k]
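As a rough illustration of the two window types over a time period, the sketch below assigns a timestamp to its window(s). It is plain Scala rather than any streaming database's API, and the function names and half-open `[start, start + size)` convention are assumptions.

```scala
object Windows {
  // Tumbling windows do not overlap: each timestamp belongs to exactly one
  // window [start, start + sizeSec). Returns that window's start.
  def tumblingStart(tsSec: Long, sizeSec: Long): Long =
    Math.floorDiv(tsSec, sizeSec) * sizeSec

  // Sliding windows overlap: windows of length sizeSec begin every slideSec.
  // Returns the starts of all windows that contain tsSec, oldest first.
  def slidingStarts(tsSec: Long, sizeSec: Long, slideSec: Long): Seq[Long] = {
    val latestStart = Math.floorDiv(tsSec, slideSec) * slideSec
    Iterator
      .iterate(latestStart)(_ - slideSec)
      .takeWhile(start => start > tsSec - sizeSec)
      .toList
      .reverse
  }
}
```

With a 300 s window sliding every 60 s, each timestamp lands in five overlapping windows; with a tumbling window it lands in exactly one.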
But IoT data is often delayed
Seconds due to network congestion
Minutes due to duty cycling for energy savings
Minutes to hours due to intermittent connectivity
Windowing data by arrival time has no semantic meaning
[Timeline figure: t_i, t_j, t_k]
Wanted: Data generation time, not arrival time
[Timeline figure: t_i, t_j, t_k]
filter by timestamp
Data semantics defined over timestamp
e.g., aggregation
JOIN ( , historical data)
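A small sketch of why the choice of clock matters, in plain Scala with a hypothetical `Record` type that carries both the device's generation time and the server's arrival time. The same aggregation is run twice; only the timestamp used for windowing changes.

```scala
object EventTimeWindows {
  // Hypothetical record: when the device measured the value vs. when the server received it.
  final case class Record(generatedSec: Long, arrivedSec: Long, value: Double)

  // Mean value per tumbling window, keyed by window start.
  // `timeOf` selects which clock defines the window.
  def windowMeans(records: Seq[Record], sizeSec: Long)(timeOf: Record => Long): Map[Long, Double] =
    records
      .groupBy(r => Math.floorDiv(timeOf(r), sizeSec) * sizeSec)
      .map { case (start, rs) => start -> (rs.map(_.value).sum / rs.size) }
}
```

If one reading generated at second 50 arrives at second 200, windowing by `arrivedSec` moves it into a later 60 s window and changes two means, while windowing by `generatedSec` keeps it with the readings taken at the same time.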
Wanted: Backfill does not change semantics
[Timeline figure: t_i, t_j, t_k]
filter by timestamp
Data semantics defined over timestamp
e.g., aggregation
JOIN ( , historical data)
…from recent streaming data…
…from historical archive…
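One way to read "backfill does not change semantics" is that a query over a timestamp range returns the same answer no matter which source a point came from or when it arrived. The sketch below is a plain Scala illustration under that reading; `Point`, `query`, and the two in-memory sequences standing in for the archive and the recent stream are all hypothetical.

```scala
object DataSeries {
  // Hypothetical data point identified by its generation timestamp.
  final case class Point(timestampSec: Long, value: Double)

  // Query [fromSec, untilSec) across both sources and order by timestamp,
  // so callers cannot tell where a point came from or when it arrived.
  def query(archive: Seq[Point], recent: Seq[Point], fromSec: Long, untilSec: Long): Seq[Point] =
    (archive ++ recent)
      .filter(p => p.timestampSec >= fromSec && p.timestampSec < untilSec)
      .sortBy(_.timestampSec)

  // Example aggregation defined purely over the queried points.
  def mean(points: Seq[Point]): Option[Double] =
    if (points.isEmpty) None else Some(points.map(_.value).sum / points.size)
}
```

A point that shows up late in the recent data yields the same query result as if it had already been archived on time.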
Wanted: Better data infra abstractions
[Timeline figure: t_i, t_j, t_k]
filter by timestamp
e.g., aggregation
JOIN ( , historical data)
Data Series Abstraction
…from recent streaming data…
…from historical archive…
Challenge 3: Processing many low-volume, independent streams
IoT Device Streams
Wanted: Maintain state across batches
Map to good/bad conditions
[Timeline figure: t_i, t_j, t_k]
Alert on condition transition
Spark: Share state through RDDs
Click streams
Ad impressions
Market feeds
Shared state b/w stream partitions
‣ Transforms RDD, makes state available across cluster
‣ Many great uses, e.g., learning parameters in iterative ML
‣ But increases data lineage, which increases checkpointing cost
Maintain shared state via updateStateByKey()
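Spark Streaming's `updateStateByKey` takes an update function of the shape `(Seq[V], Option[S]) => Option[S]`: the new values for a key in the current batch plus the previous state, returning the new state. The sketch below shows such a function for the good/bad condition example in plain Scala, without a Spark dependency. The thresholds, the `Condition` type, and the `transitioned` helper are assumptions for illustration.

```scala
object ConditionState {
  sealed trait Condition
  case object Good extends Condition
  case object Bad extends Condition

  // Hypothetical thresholds; the names echo the deck's temperature example.
  val lowTempThreshold = 0.0
  val highTempThreshold = 40.0

  def classify(temp: Double): Condition =
    if (temp > highTempThreshold || temp < lowTempThreshold) Bad else Good

  // Same shape as an updateStateByKey update function:
  // (values seen for this key in the batch, previous state) => new state.
  // An empty batch keeps the previous state.
  def update(newTemps: Seq[Double], previous: Option[Condition]): Option[Condition] =
    newTemps.lastOption.map(classify).orElse(previous)

  // An alert fires only when a known condition changes between batches.
  def transitioned(previous: Option[Condition], current: Option[Condition]): Boolean =
    previous.isDefined && current.isDefined && previous != current
}
```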
IoT: Many independent streams
‣ Each worker handles 1+ streams, not multiple workers per stream
‣ Use language data structures (e.g., Java Map) to maintain state within worker
‣ No RDD transform, so no lineage increase and no increased checkpointing cost
Independent state per stream
IoT Device Streams
Often only need to maintain state within individual streams
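A minimal sketch of the per-worker approach, assuming each device stream is always routed to the same worker. The `Worker` class, the device-id keys, and the `isBad` predicate are hypothetical; the state is an ordinary mutable map that never becomes part of an RDD.

```scala
import scala.collection.mutable

object StreamWorker {
  // Hypothetical worker that owns one or more device streams.
  // State lives in a local map and is never shared with other workers.
  final class Worker(isBad: Double => Boolean) {
    private val lastWasBad = mutable.Map.empty[String, Boolean]

    // Process one batch of (deviceId, reading) pairs and return the ids of
    // devices whose condition changed relative to their previous reading.
    def process(batch: Seq[(String, Double)]): Seq[String] =
      batch.flatMap { case (deviceId, reading) =>
        val bad = isBad(reading)
        val changed = lastWasBad.get(deviceId).exists(_ != bad)
        lastWasBad.update(deviceId, bad)
        if (changed) Seq(deviceId) else Seq.empty[String]
      }
  }
}
```

The trade-off in this sketch is that the map is not fault tolerant on its own; the first reading seen for a device only initializes its state and raises no alert.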
Challenge 4: Multi-tenancy with low-volume apps and high utilization
Multi-tenancy for batch processing
Job Queue
Server Cores
Spark: 1 worker = 1 server core
Goal: Minimize time-to-completion
Multi-tenancy for stream processing
Job Queue
Server Cores
Spark: 1 worker = 1 server core
Problem: Low utilization with low-rate apps
Multi-tenancy for stream processing
Job Queue
Server Cores
Virtual Cores (e.g., resource-limited containers)
1 worker = 1 virtual core
N workers = 1 server core
Goal: Improve utilization with low-rate apps
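The utilization argument can be made concrete with a small calculation. The sketch below is plain Scala and uses first-fit packing purely as an illustrative placement policy; it is an assumption, not a description of how Spark or iobeam schedules workers. `App` and `coreFraction` (the share of one server core an app keeps busy) are hypothetical names.

```scala
object CorePacking {
  // Hypothetical app description: fraction of one server core it keeps busy, in (0.0, 1.0].
  final case class App(name: String, coreFraction: Double)

  // Average core utilization when every app gets a dedicated server core.
  def dedicatedUtilization(apps: Seq[App]): Double =
    if (apps.isEmpty) 0.0 else apps.map(_.coreFraction).sum / apps.size

  // First-fit packing: place each app on the first server core with enough
  // spare capacity, opening a new core only when none fits.
  def pack(apps: Seq[App]): Vector[Vector[App]] =
    apps.foldLeft(Vector.empty[Vector[App]]) { (cores, app) =>
      val idx = cores.indexWhere(core => core.map(_.coreFraction).sum + app.coreFraction <= 1.0)
      if (idx >= 0) cores.updated(idx, cores(idx) :+ app) else cores :+ Vector(app)
    }
}
```

For four low-rate apps needing 0.5, 0.25, 0.5, and 0.25 of a core, dedicated cores sit at 37.5% average utilization, while packing them as virtual cores needs two server cores instead of four.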
Spark + Unified Data Infrastructure
Required: Programming + data infra abstractions
Device-Model-Infra (DMI) framework for IoT
Questions?
Developers: docs.iobeam.com
Whitepaper: www.iobeam.com/docs/iobeam-DMI.pdf
