Building scalable IoT apps
using OSS technologies
Pavel Hardak
Basho Technologies
Disclaimer: some of the opinions expressed here are mine and might not fully agree with those of
IOT & INDUSTRY VERTICALS
IoT market - growth prediction
Number of connected “things”
• 2016 – about 6.4 B
• 30% YoY growth, 5.5M activations per day
• 2020 – about 21 B
“By 2020 more than half of new major business processes
and systems will incorporate some element of Internet of
Things”
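A quick sanity check of these figures, as a sketch that uses only the numbers quoted on the slide. The variable names are illustrative. It compares the growth rate implied by the 2016 and 2020 endpoints with the growth implied by the daily activation count.

```python
# Figures quoted on the slide.
things_2016 = 6.4e9
things_2020 = 21e9
activations_per_day = 5.5e6

years = 2020 - 2016

# Compound annual growth rate implied by the two endpoints.
implied_cagr = (things_2020 / things_2016) ** (1 / years) - 1

# Devices added in one year at the quoted daily activation rate,
# expressed as a fraction of the 2016 installed base.
added_per_year = activations_per_day * 365
first_year_growth = added_per_year / things_2016

print(f"implied CAGR 2016-2020: {implied_cagr:.1%}")
print(f"growth from daily activations: {first_year_growth:.1%}")
```

Both results land in the low-to-mid 30% range, so the quoted "30% YoY" is a rounded figure rather than an exact one.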
Reality Check - let us get a second opinion
Building Scalable IoT Apps (QCon S-F)
IoT Project Plan
• Investigate those “things” and figure out
  • What protocols they support (CoAP, MQTT, HTTP, …)
  • What data they generate (temperature, humidity, location, speed, ...)
• Collect this data in our data center
  • Implement protocols and parsing routines
  • Store into persistent storage (“Data Lake” architecture)
• Once stored in Data Lake
  • Analyze, summarize, “slice and dice”
  • Predict, discover insights
• Declare victory – make profit & go for IPO
REFERENCE ARCHITECTURE (?)
[Diagram labels: IoT Devices; MQTT, CoAP and HTTP; Data Lake; SQL; Apps & Analytics]
Not so fast, my friend.
What is wrong with “Data Lake” for IoT?
Auto Insurance - Micro Case Study
• One of the top 5 auto insurance companies in the USA, on the Fortune 500 list
• More than $10B in annual revenue, above $15B in assets
• About 20,000 employees and 50,000 insurance agents
• More than 19 million individual policies across all 50 states
How does this “rating info” influence your payment?
• Garaging Zip – in what neighborhood is the car parked when it is not in use? There is a high correlation between Zip code and the probability of the car being stolen or vandalized.
• Current and Previous Annual Mileage – if the insured drives longer distances, the probability of road accidents or car malfunctions is higher.
• Vehicle Usage – do you use your car for work or pleasure? Are you a commuter, a student, a stay-at-home parent or an Uber driver? Depending on your usage, the company will calculate the risk and adjust the rate.
• Years of Driving Experience – young drivers are put into higher risk categories, whereas older drivers are considered safer due to more time behind the wheel. Note: this compares the average young driver with the average experienced driver.
Sampling Frequency and Dataset Size
• Mileage
  • From one sample per year to 52 (weekly) or 365 (daily)
  • Better - let us do hourly to “see” the car usage (commuter, …)
• Location (used to be “Garaging Zip”)
  • From one sample per year to 365 (daily)
  • Better - hourly, which shows when the car is parked for several hours
• New factors for the rating algorithm, based on weekly summaries
  • Hard brakes, hard accelerations, going above the speed limit, …
• Amount of time series data to be stored and analyzed
  • Grows by a factor of 365x, then by another 24x: 365 × 24 = 8760x
Each week – at least 50x more data than the whole previous year.
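The arithmetic behind these factors can be written out as a short sketch. It assumes exactly one sample per sampling interval, as in the bullets above; the constant and variable names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7
DAYS_PER_YEAR = 365

samples_per_year_before = 1                               # one mileage or location sample per year
samples_per_year_daily = DAYS_PER_YEAR                    # daily sampling
samples_per_year_hourly = DAYS_PER_YEAR * HOURS_PER_DAY   # hourly sampling
samples_per_week_hourly = DAYS_PER_WEEK * HOURS_PER_DAY

# Growth of the yearly dataset when moving from yearly to hourly sampling.
growth_factor = samples_per_year_hourly // samples_per_year_before

print(growth_factor)             # 8760
print(samples_per_week_hourly)   # 168
```

With hourly sampling a single week already holds 168 samples per metric, compared with one sample for the whole year before, which is consistent with the "at least 50x" claim.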
What is special about IoT?
It is about the “things”… and more.
IoT Data Categories
Category | Subcategory | Description
Metadata & Profiles | Devices | Device info (model, SN, firmware, sensors, ...), configuration, owner, …
Metadata & Profiles | Users | Personal info, preferences, billing info, registered devices, …
Time Series | Ingested (“Raw”) | Measurements, statuses and events from devices.
Time Series | Aggregated (“Derived”) | Calculated data - from devices & profiles
  • Rollups – aggregate metrics from finer time resolution to coarser ones (minute → hour → day) using min, max, avg, ...
  • Aggregations – aggregate measurements, configuration and profiles (model, region, …) over time ranges
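As an illustration of a rollup, here is a minimal sketch that groups raw samples into hourly buckets and computes min, max and avg for each. The function name, the `(timestamp, value)` sample shape and the sample data are assumptions made for this example, not something taken from the talk.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def rollup_hourly(samples):
    """Group (timestamp, value) samples by hour and compute min, max, avg, count."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour: {"min": min(vals), "max": max(vals), "avg": mean(vals), "count": len(vals)}
        for hour, vals in sorted(buckets.items())
    }


# Made-up temperature samples from one device.
samples = [
    (datetime(2016, 11, 7, 9, 0), 20.0),
    (datetime(2016, 11, 7, 9, 1), 22.0),
    (datetime(2016, 11, 7, 9, 59), 21.0),
    (datetime(2016, 11, 7, 10, 0), 25.0),
]
hourly = rollup_hourly(samples)
for hour, stats in hourly.items():
    print(hour, stats)
```

The same function applied to the hourly output (with the truncation changed to the start of the day) would produce the hour-to-day step, although an average of averages is only exact when the counts are carried along and used as weights.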
IOT - NETWORKING TECHNOLOGIES
NETWORK WISH LIST
• Extreme Reliability
• Guaranteed Delivery
• End-to-End Low Latency
• Quality of Service
• Engineered Topology
• Committed Bandwidth (CIR)
• Fiber-optic network
• Dedicated Channel
• Strong Signal
• Interference and Crosstalk Resistant
• High SNR (Signal to Noise Ratio)
• Very Low BER (Bit Error Rate)
REALITY CHECK - LET US LOOK AGAIN
IOT & NETWORK - REALITY
• Wireless technologies
• Shared transmission media
• Limited bandwidth
• Mesh or Ad-hoc Topology
• Possible signal interference
• Out-of-order or lost packets
• Low cost hardware components
• Low power radio transmitters
• Very small antennas
• “Custom-made” firmware
• Constrained Application Protocol (CoAP)
• “Best Effort” QoS (“fire and forget”)
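One practical consequence of this list is that the ingestion side has to tolerate duplicated, out-of-order and missing readings. The sketch below shows one way to do that for a batch. It assumes each reading carries a per-device sequence number set by the firmware, which is a hypothetical field; the function names are illustrative too.

```python
def clean_batch(readings):
    """Drop duplicate deliveries and restore per-device order within a batch.

    Each reading is a (device_id, sequence_number, value) tuple. The sequence
    number is a hypothetical per-device counter assigned by the device.
    """
    seen = set()
    unique = []
    for device_id, seq, value in readings:
        key = (device_id, seq)
        if key in seen:
            continue  # retransmitted packet that was already received
        seen.add(key)
        unique.append((device_id, seq, value))
    unique.sort(key=lambda r: (r[0], r[1]))
    return unique


def missing_sequences(readings, device_id):
    """Return sequence numbers between the first and last seen that never arrived."""
    seqs = sorted(seq for dev, seq, _ in readings if dev == device_id)
    if not seqs:
        return []
    expected = set(range(seqs[0], seqs[-1] + 1))
    return sorted(expected - set(seqs))


batch = [
    ("dev-1", 3, 21.5),
    ("dev-1", 1, 21.0),
    ("dev-1", 3, 21.5),  # duplicate delivery
    ("dev-1", 4, 21.7),
]
cleaned = clean_batch(batch)
gaps = missing_sequences(cleaned, "dev-1")
print(cleaned)
print(gaps)
```

Whether a gap should trigger a retransmission request or simply be recorded depends on the application; with “best effort” delivery the second option is often the only one available.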
IoT is “Big Data” - by definition.
Actually, lots and lots of Big Data.
Five “V”s of IoT data
Velocity: Torrent of small writes (sensors). Reads – millions of low-latency queries: user and device profiles, range queries for TS data (slices). Stream of updates (profiles) - beware of conflicts.
Variety: Sensor data (time series), user and device profiles, also “derived” time series data (e.g. rollups, aggregations).
Volume: Starts small, grows quickly, keeps coming 24x7x365 (nights, weekends and holidays). Spikes on new model launches or successful marketing campaigns. Growth can slow down, but the volume keeps increasing. An efficient data retention policy is critical to prevent overflows.
Veracity: Generally trustworthy, but beware of “low cost” sensors with low accuracy. Sent over not-so-reliable transport - expect that some data will be corrupted, arrive late or be lost. (Hopefully the devices were not hijacked or impersonated by hackers.)
Value: Profiles and summaries are much more valuable than raw data samples. The value of “raw” time series quickly goes down once the data has been processed and as the clock advances. Aggregated (“derived”) data are more valuable than raw data. Exceptions: financial transactions, life support, nuclear plants, oil rigs, …
Complexity: Poly-structured, using simple schemas and simple relations (usually implicit). Some data is treated as unstructured (“opaque”) for speed or flexibility.
What architecture would work?
Architectural Blueprints
• Lambda Architecture by Nathan Marz (ex-Twitter)
• Kappa Architecture by Jay Kreps (Confluent)
• Zeta Architecture by Jim Scott (MapR)
• … and their variants
[Diagrams: Lambda, Kappa and Zeta architectures]
Data Processing Framework for IoT
• Uses “best of breed” OSS technologies
• Combines two paradigms
  • “Speed Layer” – pipeline for Stream Processing for “Data in Motion”
  • “Serving Layer” – analytics for “Data in Motion” and “Data at Rest”
• Every component is “Distributed by Design”
  • Collection Layer
  • Message Queue
  • Stream Processing
  • Data Storage (Database, Object System, Data Warehouse)
  • Query and Analytics Engines
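To make the flow between these components concrete, here is a single-process toy sketch. It is not a distributed implementation: a `queue.Queue` stands in for the message queue, a Python list stands in for the time series store, and all function names and the JSON payload shape are assumptions for illustration.

```python
import json
import queue

message_queue = queue.Queue()  # stands in for a message queue such as Kafka or RabbitMQ
time_series_store = []         # stands in for a time series database


def collect(raw_payload):
    """Collection layer: parse a device payload and enqueue it."""
    message_queue.put(json.loads(raw_payload))


def process_stream():
    """Stream processing: drain the queue, drop malformed events, store the rest."""
    while not message_queue.empty():
        event = message_queue.get()
        if "device" in event and "ts" in event and "value" in event:
            time_series_store.append(event)


def query_slice(device, start_ts, end_ts):
    """Query layer: time range ("slice") read for one device."""
    return [
        e["value"]
        for e in time_series_store
        if e["device"] == device and start_ts <= e["ts"] < end_ts
    ]


collect('{"device": "car-42", "ts": 100, "value": 61.5}')
collect('{"device": "car-42", "ts": 160, "value": 63.0}')
collect('{"device": "car-42"}')  # malformed: no timestamp or value
process_stream()
recent = query_slice("car-42", 100, 200)
print(recent)
```

In a real deployment each of these functions would be a separate, independently scalable service, which is the point of the “Distributed by Design” requirement.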
Data Access Patterns
Category | Subcategory | Description | R:W %
Metadata & Profiles | Devices & Users | Many low-latency small reads, all over the dataset. Occasional updates, possibly by different “actors” (web, device, app); conflicts need to be prevented or resolved. Fewer creates and deletes. | 90:10
Time Series | Ingested (“Raw”) | Very high throughput of relatively small writes. Most reads are over a recent time range “slice”. Updates are rare (corrections). This category is the biggest part of the IoT application dataset. | 10:90
Time Series | Aggregated (“Derived”) | Mostly reads – users, platform services, reports. Writes are periodic, on each time interval or from batch jobs. | 80:20
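For the profile updates that arrive from several “actors”, one common way to prevent lost updates is an optimistic version check. The sketch below is an in-memory illustration under that assumption; `ProfileStore` and `ConflictError` are hypothetical names, and a real data store would offer its own mechanism for this.

```python
class ConflictError(Exception):
    """Raised when a writer's copy of the profile is stale."""


class ProfileStore:
    def __init__(self):
        self._data = {}  # key -> (version, profile dict)

    def read(self, key):
        """Return (version, profile); version 0 means the key does not exist yet."""
        return self._data.get(key, (0, {}))

    def write(self, key, expected_version, profile):
        """Store the profile only if nobody else has written since it was read."""
        current_version, _ = self.read(key)
        if expected_version != current_version:
            raise ConflictError(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, dict(profile))
        return current_version + 1


store = ProfileStore()
store.write("device-7", 0, {"owner": "alice"})

# The web UI and the mobile app both read the same version of the profile.
version, profile = store.read("device-7")
store.write("device-7", version, {**profile, "firmware": "1.2"})  # web update succeeds

try:
    store.write("device-7", version, {**profile, "owner": "bob"})  # stale app update
    conflict_detected = False
except ConflictError:
    conflict_detected = True

print(conflict_detected, store.read("device-7"))
```

The rejected writer can re-read the profile and retry. The alternative named on the slide, resolving conflicts after the fact, trades this retry loop for merge logic.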
Data store for IoT – “Wish list”
• Ingested (Raw) Time Series
• Very high write throughput
• Fast slice (time range) reads
• Aggregated (Derived) Time Series
• Auto-distributed + slice locality
• SQL-like queries
• Aggregations
• Bulk queries (analytics)
• Secondary Indexes (Tags)
• Efficient Storage
• Auto Data Retention (TTL)
• Built-in anti-entropy
• Compression
• Hot Backups
• Profiles and Metadata
• Many concurrent reads with low latency
• Reliable writes (ACID or conflict resolution)
• Unstructured or partially structured
• Secondary Indexes + Text Search
• Scalability and Availability
• Distributed architecture, no SPoF
• Linearly scalable - up and down
• Operational simplicity
• Master-less architecture
• Automatic rebalancing
• Metrics, logs, events
• Rolling upgrades
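The “auto-distributed + slice locality” item in the wish list can be illustrated with a partition key that combines the device with a time quantum, so that readings from the same device and time window land on the same partition. This is a hypothetical scheme for illustration only, not the algorithm of any particular database; the quantum size and the partition count are assumed values.

```python
import hashlib

QUANTUM_SECONDS = 15 * 60  # assumed quantum: 15 minutes
NUM_PARTITIONS = 64        # assumed number of partitions in the cluster


def partition_for(device_id, epoch_seconds):
    """Map a reading to a partition based on its device and time quantum."""
    # Round the timestamp down to the start of its quantum.
    quantum_start = epoch_seconds - (epoch_seconds % QUANTUM_SECONDS)
    key = f"{device_id}:{quantum_start}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Two readings in the same 15-minute window share a partition (slice locality).
p1 = partition_for("car-42", 900)
p2 = partition_for("car-42", 1799)
# A reading in the next window is hashed independently (distribution).
p3 = partition_for("car-42", 1800)
print(p1, p2, p3)
```

A larger quantum keeps longer slices together and makes range reads cheaper, while a smaller one spreads write load more evenly; the right value depends on the workload.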
What DB type is a good fit for TS use cases?
Database Type For IoT or Time Series
Relational   Key Value   Document    Wide Column   Graph
MySQL        Riak KV     MongoDB     Cassandra     Neo4j
PostgreSQL   DynamoDB    Couchbase   HBase         Titan
Oracle       Voldemort   RethinkDB   Accumulo      InfiniteGraph
There is a need for a new type of NoSQL database – Time Series
None of the existing DB types was designed to handle time series data
• Wide column DBs have high write throughput, but reads and updates are not their strength
• Key Value and Document DBs handle metadata well, but struggle with heavy writes and time-slicing reads
• Relational - good with metadata (unless the number of updates is high), but a bad choice for TS data
• Graph DB – not a good choice for either time series or metadata, can be added later on
Database Type For IoT or Time Series
Time Series   InfluxDB   Riak TS      Blueflood
              KairosDB   Prometheus   Druid
              OpenTSDB   Dalmatiner   Graphite
IoT Sensor Data – Hot to Cold
SENSOR DATA – HOT N’ COLD
Temp | Purpose | Description | Immutable?
Boiling Hot | App usage | Last known value(s) and/or values for the last N minutes; useful for immediate responses; very frequently accessed | No
Hot | Operational dataset | Last 24 hours to several days or weeks (rarely months); frequently accessed; dashboards and online analytics | Almost*
Warm | Historical data | Older data, less frequently accessed, used mostly for offline analytics and historical analysis | Yes
Cold | Archives | Used only in rare situations; kept in long-term storage for regulatory or unpredicted purposes | Yes
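A tiering job could classify data by age along these lines. The sketch below is illustrative only: the slide gives rough ranges ("last N minutes", "24 hours to several days or weeks"), so the three cut-off values are assumptions chosen for the example.

```python
from datetime import timedelta

# Assumed cut-offs; the table above only gives rough ranges.
BOILING_HOT_MAX_AGE = timedelta(minutes=15)
HOT_MAX_AGE = timedelta(days=30)
WARM_MAX_AGE = timedelta(days=365)


def temperature(age):
    """Classify a data point by its age (a timedelta) into a temperature tier."""
    if age <= BOILING_HOT_MAX_AGE:
        return "boiling hot"
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"


for age in (timedelta(minutes=5), timedelta(days=2), timedelta(days=90), timedelta(days=800)):
    print(age, "->", temperature(age))
```

In practice the cut-offs follow from how the application reads its data, for example the longest time range a dashboard can display.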
STORAGE TIERS – FROM HOT TO COLD
RAM → Database (TSDB) → Object Storage → Archive
Data Lake
Temp | Purpose | Storage Products | Immutable?
Boiling Hot | App usage | Internal app cache, Redis or Memcached | No
Hot | Operational dataset | NoSQL Database (preferably Time Series DB): Riak TS, OpenTSDB, KairosDB, Cassandra, HBase | Almost*
Warm | Historical data | Object storage – HDFS (Hadoop), Ceph, Minio, Riak S2 or AWS S3 | Yes
Cold | Archives | Various | Yes
STORAGE TIERS – REALITY CHECK
RAM → Database (TSDB) → Object Storage → Archive
ElastiCache (Redis) → Database (Postgres, DynamoDB) → AWS S3 → Glacier
Data Lake
Temp | AWS Service | Storage price (USD per GB per month)
Boiling Hot | ElastiCache (Redis) | $15-45
Hot | DynamoDB, RDS (Postgres) | $0.25-0.35 (SSD), from $0.10 (Magnetic)
Warm | Simple Storage Service (S3) | $0.024 to $0.030
Cold | Glacier | $0.007
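To see what these price differences mean for a concrete dataset, the sketch below multiplies the per-GB prices from the table by a dataset size. The prices are the ones quoted on the slide and will be out of date; the tier labels, the function name and the 1,000 GB example size are assumptions.

```python
# (low, high) price in USD per GB per month, as quoted on the slide.
PRICE_PER_GB_MONTH = {
    "boiling hot (Redis cache)": (15.0, 45.0),
    "hot (database, SSD)": (0.25, 0.35),
    "warm (S3)": (0.024, 0.030),
    "cold (Glacier)": (0.007, 0.007),
}


def monthly_cost(size_gb, tier):
    """Return the (low, high) monthly storage cost in USD for a dataset in a tier."""
    low, high = PRICE_PER_GB_MONTH[tier]
    return size_gb * low, size_gb * high


for tier in PRICE_PER_GB_MONTH:
    low, high = monthly_cost(1000, tier)
    print(f"{tier}: ${low:,.0f} to ${high:,.0f} per month for 1,000 GB")
```

The spread covers more than three orders of magnitude between the cache and the archive tier, which is the economic argument for moving data down the tiers as it cools.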
OSS technologies for scalable IoT apps
Component | Open Source Technologies
Load balancer | NGINX, HAProxy
Ingestion | Kafka, RabbitMQ, ZeroMQ, Flume
Stream Computing | Spark Streaming, Apache Flink, Kafka Streams, Samza
Time Series Store | InfluxDB, KairosDB, Riak, Cassandra, OpenTSDB
Profiles Store | Couchbase, Riak, MySQL, Postgres, MongoDB
Search | Solr, Elasticsearch
Object Storage | HDFS (Hadoop), Minio, Riak S2, Ceph
Analytics Framework | Apache Spark, MapReduce, Hive
SQL Query Engine | Spark SQL, Presto, Impala, Drill
Cluster Manager | Mesosphere DC/OS or Mesos, Kubernetes, Docker Swarm
Checklist for IoT technology stack
❑ Is it vendor lock-in or open source software? Are there open APIs?
❑ Can it be deployed in the cloud? At the edge? In a data center? Using a hybrid approach?
❑ Can it be used for free or at low cost (no big upfront investment)?
❑ Can you develop your app on your laptop? How many “moving parts”?
❑ Are the components pre-integrated, or can they be easily integrated together?
❑ Can you easily scale each component in this architecture by 10x? 20x? 50x?
❑ Is there a roadmap, actively worked on, which is aligned with your vision?
❑ Is there a company behind the technology to provide 24x7 support when needed?
Come to the Basho booth to learn about
• Riak TS (Time Series) - highly scalable NoSQL database for IoT and Time Series
… and more
• Riak Spark Connector for Apache Spark
• Riak Integrations with Redis and Kafka
• Riak Mesos Framework (RMF) for DC/OS
QUESTIONS?
  • 1. Building scalable IoT apps using OSS technologies Pavel Hardak Basho Technologies Disclaimer: some of the opinions expressed here are mine and might not fully agree with those of
  • 2. IOT & INDUSTRY VERTICALS
  • 3. IoT market - growth prediction Number of connected “things” • 2016 – about 6.4 B • 30% YoY growth, 5.5M activations per day • 2020 – about 21 B “By 2020 more than half of new major business processes and systems will incorporate some element of Internet of Things”
  • 4. Reality Check - let us get a second opinion
  • 6. IoT Project Plan • Investigate those “things” and figure out • What protocols they support (CoAP, MQTT, HTTP, …) • What data they generate (temperature, humidity, location, speed, ...) • Collect this data in our data center • Implement protocols and parsing routines • Store into persistent storage (“Data Lake” architecture) • Once stored in Data Lake • Analyze, summarize, “slice and dice” • Predict, discover insights • Declare a victory – make profit & go for IPO
  • 7. Data Lake IoT Devices SQL Apps & AnalyticsMQTT, CoAP and HTTP REFERENCE ARCHITECTURE (?) Not so fast, my friend.
  • 8. What is wrong with “Data Lake” for IoT ?
  • 13. Auto Insurance - Micro Case Study • One of top 5 auto insurance companies in USA, appears in Fortune-500 list • More than $10B in annual revenue, above $15B in assets • About 20,000 employees and 50,000 insurance agents • More than 19 million individual policies across all 50 states
  • 14. How this “rating info” influences your payment ? • Garaging Zip – what neighborhood is the car parked when it is not used? There is a high correlation between Zip code and the probability of car being stolen or vandalized. • Current and Previous Annual Mileage – if the insured drives for longer distances, it leads to the higher probability of road accidents or car malfunctions. • Vehicle Usage – do you use your car for work or pleasure? Are you commuter, student, stay-at-home parent or Uber driver? Depending on your usage, the company will calculate the risk and adjust the rate. • Years of Driving Experience – young drivers are put into higher risk categories, where older people are considered safer drivers due to more time behind the wheel. Note - average young driver vs. average experienced driver.
  • 16. Sampling Frequency and Dataset Size • Mileage • From one sample per year to 52 (weekly) or 365 (daily) • Better - let us do hourly to “see” the car usage (commuter, …) • Location (used to be “Garaging Zip”) • From one sample per year to 365 (daily) • Better - hourly, allows to learn when car is parked for several hours • New factors for rating algorithm based on weekly summaries • Hard brakes, hard accelerations, going above the speed limit, … • Amount of time series data to be stored and analyzed • Grows by factor of 365x, then by another 24x = 8760x Each week – at least 50x more data than the whole previous year.
  • 18. What is different special about IoT? It is about the “things”… and more.
  • 21. IoT Data Categories Category Description Metadata & Profiles Devices Device info (model, SN, firmware, sensors, ..), configuration, owner, … Users Personal info, preferences, billing info, registered devices, … Time Series Ingested (“Raw”) Measurements, statuses and events from devices. Aggregated (“Derived”) Calculated data - from devices & profiles • Rollups – aggregate metrics from low resolution to higher ones (min - hour – day) using min, max, avg, ... • Aggregations – aggregate measurements, configuration and profiles (model, region, …) over time ranges
  • 22. IOT - NETWORKING TECHNOLOGIES
  • 23. NETWORK WISH LIST • Extreme Reliability • Guaranteed Delivery • End-to-End Low Latency • Quality of Service • Engineered Topology • Committed Bandwidth (CIR) • Fiber-optic network • Dedicated Channel • Strong Signal • Interference and Crosstalk Resistant • High SNR (Signal to Noise Ratio) • Very Low BER (Bit Error Rate)
  • 24. REALITY CHECK - LET US LOOK AGAIN
  • 25. IOT & NETWORK - REALITY • Wireless technologies • Shared transmission media • Limited bandwidth • Mesh or Ad-hoc Topology • Possible signals interference • Mis-ordered or lost packets • Low cost hardware components • Low power radio transmitters • Very small antennas • “Custom-made” firmware • Constrained Application Protocol (CoAP) • “Best Effort” QoS (“shoot and forget”)
  • 26. IoT is “Big Data” - by definition. Actually, lots and lots of Big Data.
  • 32. Five “V”s of IoT data • Velocity: Torrent of small writes (sensors). Reads – millions of low-latency queries: user and device profiles, range queries for TS data (slices). Stream of updates (profiles) - beware of conflicts. • Variety: Sensor data (time series), user and device profiles, and “derived” time series data (e.g. rollups, aggregations). • Volume: Starts small, grows quickly, keeps coming 24x7x365 (nights, weekends and holidays). Spikes on new model launches or successful marketing campaigns. Growth can slow down, but the volume keeps growing. An efficient data retention policy is critical to prevent overflows. • Veracity: Generally trustworthy, but beware of “low cost” sensors with low accuracy. Sent over not-so-reliable transport - expect that some data will be corrupted, arrive late or be lost. (Hopefully the devices were not hijacked or impersonated by hackers.) • Value: Profiles and summaries are much more valuable than raw data samples. The value of “raw” time series drops quickly once the data has been processed and as the clock advances. Aggregated (“derived”) data are more valuable than raw data. Exceptions: financial transactions, life support, nuclear plants, oil rigs, … • Complexity: Poly-structured using simple schemas and simple relations (usually implicit). Some data is treated as unstructured (“opaque”) for speed or flexibility.
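The Veracity point (corrupted, late or duplicated data) suggests a cleaning step before raw readings are stored. The following is a minimal sketch under stated assumptions: the `(device_id, timestamp, value)` tuple layout, the function name `clean_readings`, and the plausibility range are all hypothetical, and a range check is only one of many possible ways to spot corrupted values.

```python
def clean_readings(readings, low, high):
    """Drop implausible and duplicate readings, then restore time order.

    Each reading is a hypothetical (device_id, timestamp, value) tuple.
    """
    seen = set()
    cleaned = []
    for device_id, ts, value in readings:
        if not (low <= value <= high):
            continue  # outside the plausible range: likely corrupted or a faulty sensor
        key = (device_id, ts)
        if key in seen:
            continue  # same device and timestamp already accepted: duplicate delivery
        seen.add(key)
        cleaned.append((device_id, ts, value))
    # Late arrivals are put back in order, per device and then by timestamp.
    return sorted(cleaned, key=lambda r: (r[0], r[1]))

# Hypothetical batch: one late reading, one duplicate, one corrupted value.
batch = [("a", 3, 21.0), ("a", 1, 20.0), ("a", 3, 21.0), ("a", 2, 999.0)]
result = clean_readings(batch, low=-40.0, high=85.0)
```

Lost data cannot be recovered this way; the sketch only handles what actually arrives.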
  • 34. Architectural Blueprints • Lambda Architecture by Nathan Marz (ex-Twitter) • Kappa Architecture by Jay Kreps (Confluent) • Zeta Architecture by Jim Scott (MapR) • … and their variants
  • 35. Data Processing Framework for IoT • Uses “Best of breed” OSS technologies • Combines two paradigms • “Speed Layer” – pipeline for Stream Processing for “Data in Motion” • “Serving Layer” – analytics for “Data in Motion” and “Data at Rest” • Every component is “Distributed by Design” • Collection Layer • Message Queue • Stream Processing • Data Storage (Database, Object System, Data Warehouse) • Query and Analytics Engines
  • 38. Data Access Patterns (Category: Description, R:W %) • Metadata & Profiles (Devices & Users): Many low-latency small reads all over the dataset. Occasional updates, possibly by different “actors” (web, device, app); conflicts need to be prevented or resolved. Fewer creates and deletes. R:W 90:10 • Time Series, Ingested (“Raw”): Very high throughput of relatively small writes. Most reads are over a recent time range “slice”. Updates are rare (corrections). This category is the biggest part of the IoT application dataset. R:W 10:90 • Time Series, Aggregated (“Derived”): Mostly reads – users, platform services, reports. Writes are periodic, on each time interval or from batch jobs. R:W 80:20
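The slide notes that profile updates from different “actors” can conflict and must be prevented or resolved. One simple resolution strategy, shown here only as an illustration and not as the approach recommended in the talk, is a per-field "last writer wins" merge. The profile layout (field name mapped to a `(value, updated_at)` pair) and the function name `merge_profiles` are assumptions.

```python
def merge_profiles(a, b):
    """Merge two versions of a profile, keeping the newer value of each field.

    Each profile maps field -> (value, updated_at). On equal timestamps the
    value from `a` is kept, so the merge is deterministic for a given call.
    """
    merged = dict(a)
    for field, (value, updated_at) in b.items():
        if field not in merged or updated_at > merged[field][1]:
            merged[field] = (value, updated_at)
    return merged

# Hypothetical concurrent updates to the same device profile.
from_web = {"name": ("Thermostat", 10), "target_temp": (21, 10)}
from_app = {"name": ("Thermo", 5), "target_temp": (23, 12)}
merged = merge_profiles(from_web, from_app)
```

Last writer wins depends on reasonably synchronized clocks and silently discards the older write, so it fits preferences better than billing data.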
  • 39. Data store for IoT – “Wish list” • Ingested (Raw) Time Series • Very high write throughput • Fast slice (time range) reads • Aggregated (Derived) Time Series • Auto-distributed + slice locality • SQL-like queries • Aggregations • Bulk queries (analytics) • Secondary Indexes (Tags) • Efficient Storage • Auto Data Retention (TTL) • Built-in anti-entropy • Compression • Hot Backups • Profiles and Metadata • Many concurrent reads with low latency • Reliable writes (ACID or conflict resolution) • Unstructured or partially structured • Secondary Indexes + Text Search • Scalability and Availability • Distributed architecture, no SPoF • Linearly scalable - up and down • Operational simplicity • Master-less architecture • Automatic rebalancing • Metrics, logs, events • Rolling upgrades
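Two items on this wish list, slice locality and automatic retention (TTL), can be sketched in a few lines of Python. This is a generic illustration, not the mechanism of any specific product: the function names, the 15-minute default quantum and the key layout are assumptions.

```python
def partition_key(device_id, ts, quantum_seconds=900):
    """Key that groups all readings of one device within one time quantum."""
    return (device_id, ts - ts % quantum_seconds)

def partitions_for_range(device_id, start, end, quantum_seconds=900):
    """Partitions that a time-range ("slice") read over [start, end] must touch."""
    first = start - start % quantum_seconds
    return [(device_id, q) for q in range(first, end + 1, quantum_seconds)]

def is_expired(ts, now, ttl_seconds):
    """True if a reading is older than the retention period."""
    return now - ts > ttl_seconds

# A read of seconds 1000..2000 for device "d1" touches only two partitions.
touched = partitions_for_range("d1", 1000, 2000)
```

Because readings that are close in time share a partition key, a range query touches a few partitions rather than the whole cluster, and expired data can be dropped one whole quantum at a time.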
  • 40. What DB type is a good fit for TS use cases?
  • 41. Database Type For IoT or Time Series • Relational: MySQL, PostgreSQL, Oracle • Key Value: Riak KV, DynamoDB, Voldemort • Document: MongoDB, Couchbase, RethinkDB • Wide Column: Cassandra, HBase, Accumulo • Graph: Neo4j, Titan, InfiniteGraph • There is a need for a new type of NoSQL database – Time Series. None of the existing DB types was designed to handle time series data • Wide column DBs have high write throughput, but reads and updates are not their strength • Key Value and Document DBs handle metadata well, but struggle with heavy writes and time-slicing reads • Relational - good with metadata (unless the number of updates is high), but a bad choice for TS data • Graph DB – not a good choice for either time series or metadata, can be added later on
  • 42. Database Type For IoT or Time Series • Relational: MySQL, PostgreSQL, Oracle • Key Value: Riak KV, DynamoDB, Voldemort • Document: MongoDB, Couchbase, RethinkDB • Wide Column: Cassandra, HBase, Accumulo • Graph: Neo4j, Titan, InfiniteGraph • Time Series: InfluxDB, Riak TS, Blueflood, KairosDB, Prometheus, Druid, OpenTSDB, Dalmatiner, Graphite
  • 43. IoT Sensor Data – Hot to Cold
  • 44. SENSORS DATA – HOT N’ COLD (Temp: Purpose; Description; Immutable?) • Boiling Hot: App usage; last known value(s) and/or values for the last N minutes, useful for immediate responses, very frequently accessed; No • Hot: Operational dataset; last 24 hours to several days or weeks (rarely months), frequently accessed, dashboards and online analytics; Almost* • Warm: Historical data; older data, less frequently accessed, used mostly for offline analytics and historical analysis; Yes • Cold: Archives; used only in rare situations, kept in long term storage for regulatory or unpredicted purposes; Yes
  • 45. STORAGE TIERS – FROM HOT TO COLD RAM → Database (TSDB) → Object Storage → Archive Data Lake (Temp: Purpose; Storage; Products; Immutable?) • Boiling Hot: App usage; internal app cache; Redis or Memcached; No • Hot: Operational dataset; NoSQL Database (preferably Time Series DB); Riak TS, OpenTSDB, KairosDB, Cassandra, HBase; Almost* • Warm: Historical data; object storage; HDFS (Hadoop), Ceph, Minio, Riak S2 or AWS S3; Yes • Cold: Archives; various; Yes
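The hot-to-cold tiering can be expressed as a simple age-based routing rule. The sketch below is illustrative only: the age thresholds (15 minutes, 30 days, 365 days) are assumptions chosen to match the rough ranges on the slides ("last N minutes", "days or weeks", "older data"), and the names `TIERS` and `tier_for_age` are hypothetical.

```python
from datetime import timedelta

# (maximum age, tier name, storage type from the slide); thresholds are assumed.
TIERS = [
    (timedelta(minutes=15), "boiling hot", "in-memory cache"),
    (timedelta(days=30), "hot", "time series database"),
    (timedelta(days=365), "warm", "object storage"),
]

def tier_for_age(age):
    """Return (tier, storage) for a reading of the given age."""
    for max_age, tier, storage in TIERS:
        if age <= max_age:
            return tier, storage
    return "cold", "archive"

recent = tier_for_age(timedelta(minutes=5))
last_week = tier_for_age(timedelta(days=7))
```

In practice a background job would apply such a rule periodically to move data from one tier to the next as it ages.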
  • 46. STORAGE TIERS – REALITY CHECK RAM → Database (TSDB) → Object Storage → Archive; ElastiCache (Redis) → Database (Postgres, DynamoDB) → AWS S3 → Glacier Data Lake (Temp: AWS Service; Storage price, GB per month) • Boiling Hot: ElastiCache (Redis); $15-45 • Hot: DynamoDB, RDS (Postgres); $0.25-0.35 (SSD), from $0.1 (Magnetic) • Warm: Simple Storage Service (S3); $0.024 to $0.030 • Cold: Glacier; $0.007
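The price gap between tiers is easier to appreciate with a small calculation. The sketch below uses the low end of each price range quoted on this slide (prices as of the talk, not current list prices); the 1,000 GB dataset and its split across tiers are made-up numbers for illustration.

```python
# Low end of the per-GB monthly prices quoted on the slide (USD).
PRICE_PER_GB_MONTH = {
    "boiling hot": 15.0,   # in-memory cache
    "hot": 0.25,           # database on SSD
    "warm": 0.024,         # object storage
    "cold": 0.007,         # archive
}

def monthly_cost(gb_per_tier):
    """Total monthly storage cost for a {tier: gigabytes} allocation."""
    return sum(PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_per_tier.items())

# Hypothetical 1,000 GB dataset: everything in the database vs. tiered by age.
all_hot = monthly_cost({"hot": 1000})
tiered = monthly_cost({"boiling hot": 1, "hot": 99, "warm": 400, "cold": 500})
```

With these assumed numbers the all-database layout costs about $250 per month and the tiered layout about $53, which is the economic argument for moving data from hot to cold as it ages.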
  • 47. OSS technologies for scalable IoT apps (Component: Open Source Technologies) • Load balancer: Nginx, HAProxy • Ingestion: Kafka, RabbitMQ, ZeroMQ, Flume • Stream Computing: Spark Streaming, Apache Flink, Kafka Streams, Samza • Time Series Store: InfluxDB, KairosDB, Riak, Cassandra, OpenTSDB • Profiles Store: Couchbase, Riak, MySQL, Postgres, MongoDB • Search: Solr, Elasticsearch • Object Storage: HDFS (Hadoop), Minio, Riak S2, Ceph • Analytics Framework: Apache Spark, MapReduce, Hive • SQL Query Engine: Spark SQL, Presto, Impala, Drill • Cluster Manager: Mesosphere DC/OS or Mesos, Kubernetes, Docker Swarm
  • 48. Checklist for IoT technology stack ❑ Is it vendor lock-in or open source software? Are there open APIs? ❑ Can it be deployed in the cloud? At the edge? In a data center? Using a hybrid approach? ❑ Can it be used for free or at low cost (no big upfront investment)? ❑ Can you develop your app on your laptop? How many “moving parts”? ❑ Are the components pre-integrated, or can they be easily integrated together? ❑ Can you easily scale each component in this architecture by 10x? 20x? 50x? ❑ Is there an actively developed roadmap that is aligned with your vision? ❑ Is there a company behind the technology to provide 24x7 support when needed?
  • 49. Come to the Basho booth to learn about • Riak TS (Time Series) - highly scalable NoSQL database for IoT and Time Series … and more • Riak Spark Connector for Apache Spark • Riak Integrations with Redis and Kafka • Riak Mesos Framework (RMF) for DC/OS