Spark Streaming and IoT by Mike Freedman

Spark Streaming and IoT
Michael J. Freedman
iobeam

Technology confluence in IoT
UBIQUITOUS SENSORS
REAL-TIME SYSTEMS
MACHINE LEARNING
DATA ANALYSIS
INTERSECTION OF 3 MAJOR TRENDS

Data analysis is the killer app
CASE STUDY: PREDICTIVE MAINTENANCE
Predicting motor failure through analysis of vibration data
CASE STUDY: HEALTH & FITNESS
Exercise identification based on 3D motion data analysis
CASE STUDY: SMART CITIES
Traffic and air quality monitoring via GPS and environmental sensors
CASE STUDY: SMART GRID
Demand-response optimizations on supply-side capacity, spot prices

Challenges in applying Spark to IoT
REQUIREMENTS
1 One IoT app performs tasks at different time intervals
2 Devices send data at varying delays and rates
3 Within org, multiple IoT apps run concurrently
CHALLENGES
1 Supporting full spectrum of batch to real-time analysis
2 Handling delayed data transparently
3 Processing many low-volume, independent streams
4 Multi-tenancy with low-volume apps and high utilization
Potential economic impact of IoT is >$11 trillion per year,
even while 99% of IoT data goes unused today.
— 2015 McKinsey study

Required: Programming + data infra abstractions

1 Supporting full spectrum of batch to real-time analysis

IoT analysis spans many intervals
BATCH PROCESSING (HOURS, NIGHTLY)
STREAM PROCESSING (REAL-TIME)
Fire / Hazard Detection: Immediately
Bus Location Updates: 15 sec
Traffic Conditions: 1 min
Environmental Conditions: 15 min
Traffic Optimizations: Daily

Spark simplifies programming across intervals
val readings = iobeamInterface.getInputStreamRecords()
// Trigger on temperatures that fall outside acceptable conditions
val bad_temps = readings.filter(t => t > highTempThreshold || t < lowTempThreshold)
val triggers = new TriggerEventStream(bad_temps.map(t => new TriggerEvent("bad_temperature", t)))
// Compute mean temperatures over 5 min (300 sec) windows, sliding every 60 sec
val windows = readings.groupByKeyAndWindow(Seconds(300), Seconds(60))
val mean_temps = new TimeSeriesStream("mean_temperature", windows.map(t => t.sum / t.length))
// Emit both the time series and the trigger events
new OutputStreams(mean_temps, triggers)
Other values overlaid on the slide: 30, 1800

But programming != data abstractions
DATA STREAMS (KAFKA, FLUME, SOCKETS, ETC.)
DATA FILES (HDFS, ETC.)
BATCH PROCESSING (HOURS, NIGHTLY)
STREAM PROCESSING (REAL-TIME)

Programming != data abstractions
1 Frequencies change as products evolve
Traffic Conditions: 30 sec
Traffic Conditions: 1 hour

2 Joining real-time with historical data
5 min mean vs.
- trailing hourly mean
- hourly mean from yesterday
- hourly mean from last week
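
The comparison above can be sketched in plain Scala, without Spark. This is an illustration only: the `Reading` type, the function names, and the choice of seconds as the time unit are assumptions, not part of the talk or of any iobeam API. It computes the latest 5 min mean and subtracts each of the three hourly baselines from it.

```scala
object BaselineSketch {
  // Hypothetical record: t is the generation time in seconds (assumed unit).
  final case class Reading(t: Long, value: Double)

  val Hour = 3600L
  val Day = 24 * Hour
  val Week = 7 * Day

  // Mean of readings with from <= t < until, or None if the range is empty.
  def meanBetween(rs: Seq[Reading], from: Long, until: Long): Option[Double] = {
    val vs = rs.collect { case Reading(t, v) if t >= from && t < until => v }
    if (vs.isEmpty) None else Some(vs.sum / vs.size)
  }

  // Latest 5 min mean minus each hourly baseline; None where data is missing.
  def deviations(rs: Seq[Reading], now: Long): Map[String, Option[Double]] = {
    val current = meanBetween(rs, now - 300, now)
    // Baseline: the one-hour range ending `shift` seconds before `now`.
    def versus(shift: Long): Option[Double] =
      for (c <- current; b <- meanBetween(rs, now - shift - Hour, now - shift))
        yield c - b
    Map(
      "trailing hour" -> versus(0),
      "yesterday" -> versus(Day),
      "last week" -> versus(Week)
    )
  }
}
```

Note that two of the three baselines reach back a day or more, so they would normally come from stored files rather than from the live stream, which is the mismatch this slide points at.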

3 Supporting backfill for delayed data
Data Series Abstraction

2 Handling delayed data transparently

Windows in streaming DBs
Tumbling windows
[Timeline: t_i, t_j, t_k]

Windows in streaming DBs
Sliding windows
• Defined over # of tuples
• Defined over time period
…using arrival_time of tuples
[Timeline: t_i, t_j, t_k]
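
A minimal sketch of the two window types, covering only the time-based variant. The `Reading` type and function names are hypothetical, and times are assumed to be non-negative seconds. A tumbling window puts each reading in exactly one bucket. A sliding window starts a new bucket every `slide` seconds, so buckets overlap.

```scala
object WindowSketch {
  // Hypothetical record: a value observed at time t (seconds, t >= 0).
  final case class Reading(t: Long, value: Double)

  // Tumbling: each reading belongs to exactly one window of `size` seconds,
  // keyed by the window's start time.
  def tumbling(rs: Seq[Reading], size: Long): Map[Long, Seq[Reading]] =
    rs.groupBy(r => (r.t / size) * size)

  // Start times of every window [start, start + size) that contains t,
  // where starts are non-negative multiples of `slide`.
  def windowStarts(t: Long, size: Long, slide: Long): Seq[Long] = {
    val last = (t / slide) * slide
    Iterator.iterate(last)(_ - slide)
      .takeWhile(s => s >= 0 && s + size > t)
      .toList
  }

  // Sliding: a reading is copied into every overlapping window it falls in.
  def sliding(rs: Seq[Reading], size: Long, slide: Long): Map[Long, Seq[Reading]] = {
    val pairs = for {
      r <- rs
      start <- windowStarts(r.t, size, slide)
    } yield (start, r)
    pairs.groupBy(_._1).map { case (start, ps) => start -> ps.map(_._2) }
  }
}
```

Which timestamp is stored in `t` is left open here; the slide's point is that streaming databases typically use the arrival time.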

But IoT data is often delayed
Seconds due to network congestion
Minutes due to duty cycling for energy savings
Minutes to hours due to intermittent connectivity
Windowing data by arrival time has no semantic meaning
[Timeline: t_i, t_j, t_k]
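
To make the problem concrete, here is a small sketch with a hypothetical `Reading` type that carries both a device-side generation time and a server-side arrival time (field names and the seconds unit are assumptions). The same readings are aggregated twice, once per timestamp, and a single delayed reading is enough to change the arrival-time result.

```scala
object EventTimeSketch {
  // Hypothetical record: generatedAt is set by the device,
  // arrivedAt by the receiving server (both in seconds).
  final case class Reading(device: String, generatedAt: Long, arrivedAt: Long, value: Double)

  // Mean value per tumbling window, keyed by window start time.
  // `timeOf` selects which timestamp decides window membership.
  def windowMeans(rs: Seq[Reading], size: Long)(timeOf: Reading => Long): Map[Long, Double] =
    rs.groupBy(r => (timeOf(r) / size) * size)
      .map { case (start, group) => start -> group.map(_.value).sum / group.size }
}
```

For example, take two readings generated 10 seconds apart where the second arrives several minutes late. With 5 min windows keyed on `generatedAt` they share one window and are averaged together. Keyed on `arrivedAt`, they land in different windows and each "mean" covers a single reading.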

Wanted: Data generation time, not arrival time
[Timeline: t_i, t_j, t_k]
filter by timestamp
Data semantics defined over timestamp
e.g., aggregation
JOIN ( , historical data)

Wanted: Backfill does not change semantics
[Timeline: t_i, t_j, t_k]
filter by timestamp
Data semantics defined over timestamp
e.g., aggregation
…from recent streaming data…
…from historical archive…

Wanted: Better data infra abstractions
[Timeline: t_i, t_j, t_k]
filter by timestamp
e.g., aggregation
Data Series Abstraction
…from recent streaming data…
…from historical archive…
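
One way to picture such an abstraction, as a sketch under stated assumptions rather than the design presented in the talk: a single logical series backed by two stores, queried only by generation timestamp. `DataSeries`, `backfill`, and `between` are hypothetical names. The property of interest is that an aggregate gives the same answer whether a reading arrived on time or was backfilled later.

```scala
object DataSeriesSketch {
  // Hypothetical record: t is the generation time in seconds.
  final case class Reading(t: Long, value: Double)

  // One logical series backed by a historical archive and recent streaming data.
  final case class DataSeries(archive: Seq[Reading], recent: Seq[Reading]) {
    // Late data is simply added; in this sketch it goes into the archive.
    def backfill(late: Seq[Reading]): DataSeries = copy(archive = archive ++ late)

    // Queries filter by timestamp and never ask which store holds a reading.
    def between(from: Long, until: Long): Seq[Reading] =
      (archive ++ recent).filter(r => r.t >= from && r.t < until).sortBy(_.t)

    // Example aggregation defined purely over the timestamp range.
    def mean(from: Long, until: Long): Option[Double] = {
      val vs = between(from, until).map(_.value)
      if (vs.isEmpty) None else Some(vs.sum / vs.size)
    }
  }
}
```

A real system would also have to decide when a previously emitted aggregate is recomputed after backfill; this sketch recomputes on every query and ignores that question.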

3 Processing many low-volume, independent streams
IoT Device Streams

Wanted: Maintain state across batches
Map to good/bad conditions
[Timeline: t_i, t_j, t_k]
Alert on condition transition

Spark: Share state through RDDs
Click streams
Ad impressions
Market feeds
Shared state between stream partitions
‣ Transforms RDD, makes state available across cluster
‣ Many great uses, e.g., learning parameters in iterative ML
‣ But increases data lineage, which increases checkpointing cost
Maintain shared state via updateStateByKey()

IoT: Many independent streams
‣ Each worker handles 1+ streams, not multiple workers per stream
‣ Use language data structures (e.g., Java Map) to maintain state within worker
‣ No RDD transform, so no lineage increase and no increased checkpointing cost
Independent state per stream
IoT Device Streams
Often only need to maintain state within individual streams
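
A minimal sketch of per-stream state kept in an ordinary in-memory map, as the bullets above suggest. It is plain Scala, not Spark code, and every name in it (`TransitionDetector`, `Alert`, `process`) is hypothetical. Each value is mapped to a good or bad condition, and an alert is produced only when a stream's condition changes between batches.

```scala
import scala.collection.mutable

object TransitionSketch {
  sealed trait Condition
  case object Good extends Condition
  case object Bad extends Condition

  final case class Alert(stream: String, from: Condition, to: Condition)

  // One instance per worker; it tracks only the streams that worker handles.
  final class TransitionDetector(isBad: Double => Boolean) {
    // Last known condition per stream, kept across batches.
    private val last = mutable.Map.empty[String, Condition]

    // Process one batch of (streamId, value) pairs and
    // return an alert for each change of condition.
    def process(batch: Seq[(String, Double)]): Seq[Alert] =
      batch.flatMap { case (stream, value) =>
        val now: Condition = if (isBad(value)) Bad else Good
        val previous = last.put(stream, now) // old condition, if any
        previous.filter(_ != now).map(old => Alert(stream, old, now))
      }
  }
}
```

The first reading of a stream sets its condition without raising an alert. Because the map lives only in worker memory, this sketch loses its state if the worker restarts or a stream moves to another worker, a trade-off against the checkpointed shared-state approach.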

4 Multi-tenancy with low-volume apps and high utilization

Multi-tenancy for batch processing
Job Queue
Server Cores
Spark: 1 worker = 1 server core
Goal: Minimize time-to-completion

Multi-tenancy for stream processing
Job Queue
Server Cores
Spark: 1 worker = 1 server core
Problem: Low utilization with low-rate apps

Multi-tenancy for stream processing
Job Queue
Server Cores
Virtual Cores (e.g., resource-limited containers)
1 worker = 1 virtual core
N workers = 1 server core
Goal: Improve utilization with low-rate apps
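
The utilization argument can be illustrated with some simple arithmetic. The sketch below is not how Spark or any container scheduler assigns work; it assumes each app's load is a known fraction of one server core and packs apps onto cores first-fit, to compare against giving every app a dedicated core. All names are hypothetical.

```scala
object PackingSketch {
  // Assign each app's load (fraction of one core, assumed 0 < load <= 1)
  // to the first core with room; returns the total load per core used.
  def firstFit(loads: Seq[Double]): Vector[Double] =
    loads.foldLeft(Vector.empty[Double]) { (cores, load) =>
      val i = cores.indexWhere(used => used + load <= 1.0)
      if (i >= 0) cores.updated(i, cores(i) + load) else cores :+ load
    }

  // Average fraction of each allocated core that is actually busy.
  def utilization(coreLoads: Seq[Double]): Double =
    if (coreLoads.isEmpty) 0.0 else coreLoads.sum / coreLoads.size
}
```

With four apps needing 0.25, 0.25, 0.5, and 0.25 of a core, one worker per server core occupies four cores, and `utilization` of those raw loads is about 31%. Packed first-fit, the same apps fit on two cores at about 63%.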

Spark + Unified Data Infrastructure

Device-Model-Infra (DMI) framework for IoT

Questions?
Developers: docs.iobeam.com
Whitepaper: www.iobeam.com/docs/iobeam-DMI.pdf
