1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
IoT: what about data storage?
Vladimir Rodionov
Staff Software Engineer
IoT data stream
• Sequence of data points
• Triplet: [ID][TIME][VALUE] – basic time series
• Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time series with tags
• Sometimes with location – spatial data
• But strictly time series
• Do we have a good time-series data store?
• Open source?
• But commercially supported?
Apache HBase
• Open source
• Scalable
• Distributed
• NoSQL data store
• Commercially supported
• Temporal? Sure, you can do temporal stuff!
• Out of the box?
Time Series DB requirements
• Data store MUST preserve temporal locality of data for better in-memory caching
• Data store MUST provide efficient compression
– Time series are highly compressible (less than 2 bytes per data point in some cases)
– Facebook's custom compression codec produces less than 1.4 bytes per data point
• Data store MUST provide automatic, configurable time-based rollup aggregations: sum, count, avg, min, max, etc., by minute, hour, day and so on. Most of the time it is the aggregated data we are interested in.
• Efficient caching policy (RAM/SSD)
• SQL API (nice to have, but optional)
• Support for IoT use cases (write/read ratio up to 99/1, millions of ops)
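To illustrate why regularly sampled time series compress so well, here is a minimal sketch that stores timestamps as delta-of-delta values (mostly zero for a fixed sampling interval) and varint-encodes them. It is an illustration only: it is not Facebook's codec or anything HBase ships, it covers timestamps but not values, and all function names are hypothetical.

```python
from typing import List, Tuple


def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def put_varint(out: bytearray, value: int) -> None:
    """Append an unsigned integer using 7 payload bits per byte."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return


def get_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_timestamps(timestamps: List[int]) -> bytes:
    """Encode each timestamp as the change in delta relative to the previous one."""
    out = bytearray()
    prev = 0
    prev_delta = 0
    for ts in timestamps:
        delta = ts - prev
        put_varint(out, zigzag(delta - prev_delta))
        prev, prev_delta = ts, delta
    return bytes(out)


def decode_timestamps(data: bytes) -> List[int]:
    """Rebuild the original timestamps from the delta-of-delta stream."""
    result = []
    pos = 0
    prev = 0
    prev_delta = 0
    while pos < len(data):
        z, pos = get_varint(data, pos)
        delta = prev_delta + unzigzag(z)
        prev += delta
        result.append(prev)
        prev_delta = delta
    return result
```

With a fixed sampling interval every delta-of-delta after the first two entries is zero, so each further timestamp costs a single byte in this sketch. Values need their own encoding, which is not shown.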
Ideal HBase Time Series DB
• Keeps raw data for hours
• Does not compact raw data at all
• Preserves raw data in the memory cache for periodic compactions and time-based rollup aggregations
• Stores full-resolution data only in compressed form
• Has different TTLs for different aggregation resolutions:
– Days for by_min, by_10min, etc.
– Months or years for by_hour
• Compaction should preserve the temporal locality of both full-resolution data and aggregated data.
• Integration with Phoenix (SQL)
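A time-based rollup can be sketched as follows. This is a minimal in-memory illustration, not the coprocessor itself: the resolution names by_min, by_10min and by_hour come from the slide, while the bucket widths, the `Rollup` class and the `rollup` function are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Rollup:
    """Running aggregates for one (series, time bucket) pair."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def avg(self) -> float:
        return self.total / self.count


# Resolution name -> bucket width in seconds (widths are assumed).
RESOLUTIONS = {"by_min": 60, "by_10min": 600, "by_hour": 3600}


def rollup(points: Iterable[Tuple[int, int, float]],
           resolution: str) -> Dict[Tuple[int, int], Rollup]:
    """Aggregate (id, timestamp_seconds, value) points into fixed-width buckets."""
    width = RESOLUTIONS[resolution]
    buckets: Dict[Tuple[int, int], Rollup] = defaultdict(Rollup)
    for series_id, ts, value in points:
        bucket_start = ts - ts % width  # align to the start of the bucket
        buckets[(series_id, bucket_start)].add(value)
    return dict(buckets)
```

Each resolution would be written to its own column family, so that a different TTL can apply to each.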
Write Path (for 99%)
Time Series DB HBase
[Diagram: raw events flow into a Region Server holding CF:Raw, CF:Compressed and CF:Aggregates, with a Compressor coprocessor (C) and an Aggregator coprocessor (A); store files live on HDFS.]
• CF:Raw – TTL hours
• CF:Compressed – TTL days/months
• CF:Aggregates – TTL months/years (CF per resolution)
HBASE-14468 FIFO compaction
• First-In-First-Out
• No compaction at all
• TTL-expired data just gets archived
• Ideal for raw data storage
• No compaction means no block cache trashing
• Raw data can be cached on write or on read
• Sustains 100s of MB/s of write throughput per RS
• Available in 0.98.17, 1.1+, 1.2+, HDP-2.4+
• Can easily be backported to 1.0 (do we need this?)
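The idea behind FIFO compaction can be sketched in a few lines: files are never merged, and a store file is dropped whole once its newest cell is older than the TTL. This is a simplified illustration, not the HBase implementation; `StoreFile` and `fifo_select_expired` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StoreFile:
    name: str
    max_timestamp: int  # newest cell in the file, seconds since epoch


def fifo_select_expired(files: List[StoreFile], ttl_seconds: int,
                        now: int) -> Tuple[List[StoreFile], List[StoreFile]]:
    """Split files into (expired, live). A file is expired only when even
    its newest cell has outlived the TTL, so no rewriting is ever needed."""
    expired = [f for f in files if f.max_timestamp + ttl_seconds < now]
    live = [f for f in files if f.max_timestamp + ttl_seconds >= now]
    return expired, live
```

Because nothing is rewritten, write amplification stays at the flush itself, which is what makes the policy attractive for short-lived raw data.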
Exploring (Size-Tiered) Compaction
• Does not preserve temporal locality of data.
• Compaction trashes the block cache
• No efficient caching of data is possible
• It hurts the most-recent-most-valuable data access pattern.
• Compression/aggregation is very heavy.
• Reading back recent raw data and running it through the compressor requires many IO operations, because …
• We can't guarantee that recent data is in the block cache.
HBASE-15181 Date Tiered Compaction
• DateTieredCompactionPolicy
• CASSANDRA-6602
• Works better for time series than ExploringCompactionPolicy
• Better temporal locality helps with reads
• Good choice for compressed full-resolution and aggregated data.
• Available in 0.98.17, 1.2+; HDP-2.4 has it as well
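The temporal-locality idea can be sketched by grouping store files into time windows and compacting only within a window. This is a deliberately simplified sketch with fixed-width windows; the real policy uses windows that grow with age, which is omitted here. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (file name, max timestamp in seconds): hypothetical file descriptors
FileInfo = Tuple[str, int]


def group_by_window(files: List[FileInfo],
                    window_seconds: int) -> Dict[int, List[str]]:
    """Bucket files by the time window that contains their newest cell."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for name, max_ts in files:
        windows[max_ts - max_ts % window_seconds].append(name)
    return dict(windows)


def compaction_candidates(files: List[FileInfo], window_seconds: int,
                          min_files: int = 2) -> List[List[str]]:
    """Return one candidate set per window that holds enough files.
    Files from different windows are never merged together."""
    grouped = group_by_window(files, window_seconds)
    return [names for _, names in sorted(grouped.items())
            if len(names) >= min_files]
```

Since files from different windows are never mixed, a read for a given time range touches only the files of the matching windows.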
Exploring Compaction + Max Size
• Set hbase.hstore.compaction.max.size
• This emulates Date-Tiered Compaction
• Preserves temporal locality of data: data points that are close in time are stored in the same file, distant ones in separate files.
• Compaction works better with the block cache
• More efficient caching of recent data is possible
• Good for the most-recent-most-valuable data access pattern.
• Use it for compressed and aggregated data
• Helps keep recent data in the block cache.
• Abbreviated ECPM
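A small simulation shows why a size cap emulates date tiering: once a merged file exceeds the cap it is never selected again, so each large file ends up covering its own contiguous time range. This is a sketch under simplifying assumptions: it ignores the size-ratio test of the real exploring policy, and `StoreFile`, `compact_with_max_size` and `simulate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreFile:
    min_ts: int
    max_ts: int
    size_mb: int


def compact_with_max_size(files: List[StoreFile], max_size_mb: int,
                          min_files: int = 3) -> List[StoreFile]:
    """Merge all files at or below the size cap once there are enough of them.
    Files above the cap are left alone (simplified selection)."""
    small = [f for f in files if f.size_mb <= max_size_mb]
    if len(small) < min_files:
        return files
    merged = StoreFile(min(f.min_ts for f in small),
                       max(f.max_ts for f in small),
                       sum(f.size_mb for f in small))
    large = [f for f in files if f.size_mb > max_size_mb]
    return sorted(large + [merged], key=lambda f: f.min_ts)


def simulate(flushes: int, flush_mb: int, max_size_mb: int) -> List[StoreFile]:
    """Flush one file per time step and run the selection after every flush."""
    files: List[StoreFile] = []
    for step in range(flushes):
        files.append(StoreFile(min_ts=step, max_ts=step, size_mb=flush_mb))
        files = compact_with_max_size(files, max_size_mb)
    return files
```

With 12 flushes of 100 MB and a 250 MB cap, the simulation ends with four files covering steps 0-2, 3-5, 6-8 and 9-11. With an effectively unlimited cap, old and new data keep being rewritten into one ever-growing file.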
HBASE-14496 Delayed compaction
• Files are eligible for minor compaction only if their age > delay
• Good for applications where the most recent data is the most valuable.
• Prevents the block cache from being trashed for recent data by frequent minor compactions of fresh store files
• Will enable this feature for the Exploring Compaction Policy
• Improves read latency for the most recent data.
• ECP + Max + Delay (1-2 days) is a good option for compressed full-resolution and aggregated data (ECPMD).
• Patch available.
• HBase 1.0+ (can be backported to 0.98)
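The eligibility rule amounts to a simple age filter, sketched below. The function name and the dictionary of creation times are assumptions made for the example; the actual patch is not reproduced here.

```python
from typing import Dict, List


def eligible_for_minor_compaction(created_at: Dict[str, int], now: int,
                                  delay_seconds: int) -> List[str]:
    """Return the store files whose age exceeds the delay, oldest first.
    Fresher files are left untouched so their cached blocks stay valid."""
    eligible = [name for name, ts in created_at.items()
                if now - ts > delay_seconds]
    return sorted(eligible, key=lambda name: created_at[name])
```

With a delay of one day, a file written an hour ago is skipped, and whatever was cached for it on write remains usable for reads.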
Time Series DB HBase
[Diagram: raw events flow into a Region Server holding CF:Raw, CF:Compressed and CF:Aggregates, with a Compressor coprocessor (C) and an Aggregator coprocessor (A); store files live on HDFS. Each column family is annotated with a compaction policy.]
• CF:Raw – TTL hours – FIFO
• CF:Compressed – TTL days/months – ECPM or DTCP
• CF:Aggregates – TTL months/years (CF per resolution) – ECPM or DTCP
HBase Block Cache and Time Series
• The current policy (LRU) is not optimal for time-series applications
• We need something similar to FIFO (both in RAM and on SSD)
• We need support for TB-size RAM/SSD-based caches
• The current off-heap bucket cache does not scale well (it keeps keys in the Java heap)
• For an SSD cache we could mirror the most recent store files, thus providing FIFO semantics without the complexity of disk-based cache management.
• All of the above are work items for the future, but today …
– Disable the cache for raw data (prevents extreme cache churn)
– Enable cache on write/read for compressed data and aggregations
Flexible Retention Policies
[Chart: retention period per data tier. Raw: hours; Compressed: months; Aggregates: years.]
Read/Write IO Reduction (estimate for 250K data points/sec)
[Bar chart: relative read/write IO per configuration]
• Base: 100 (50-100 MB/s)
• FIFO+ECPM: ~50 (25-50 MB/s)
• +Compaction: ~10 (5-10 MB/s)
Summary
• Disable major compaction
• Do not run the HDFS balancer
• Disable HBase automatic region balancing: balance_switch false
• Disable region splits (DisabledRegionSplitPolicy)
• Pre-split the table in advance.
• Have separate column families for raw, compressed and aggregated data (each aggregate resolution gets its own family)
• Increase hbase.hstore.blockingStoreFiles for all column families
• FIFO for raw; ECPM(D) or DTCP (next session) for compressed and aggregated data
Summary (continued)
• Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations.
• Do not cache raw data; enable write/read caching for the others (if ECPM(D))
• Enable WAL compression to decrease write IO.
• Use maximum compression (GZ) for raw data to decrease write IO.
Read Path (for 1%)
SQL (Phoenix) integration
• Each time series has a set of named attributes, which we call meta (tags in OpenTSDB)
• Keep time-series meta in Phoenix table(s).
• Adding, deleting or updating a time series is a DML/DDL operation on a Phoenix table.
• Meta is (mostly) static
• Define the set of meta attributes that form the PK
• Translate the PK to a unique ID.
• Store ID, RTS (reversed timestamp), VALUE in HBase
• Now you can index time series by any attribute(s) in Phoenix
• A query is a two-step process: Phoenix first, to select a list of IDs, then HBase, to run the query on the ID list
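One possible row-key layout for the ID plus reversed timestamp is sketched below. The slide only names the fields; the 8-byte big-endian encoding and the use of Long.MAX_VALUE minus the timestamp are assumptions (a common HBase convention), and both function names are hypothetical.

```python
import struct
from typing import Tuple

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def row_key(series_id: int, ts_millis: int) -> bytes:
    """8-byte big-endian ID followed by an 8-byte big-endian reversed timestamp.
    Reversing the timestamp makes newer points sort first within a series."""
    return struct.pack(">QQ", series_id, LONG_MAX - ts_millis)


def parse_row_key(key: bytes) -> Tuple[int, int]:
    """Recover (series_id, ts_millis) from a key built by row_key()."""
    series_id, reversed_ts = struct.unpack(">QQ", key)
    return series_id, LONG_MAX - reversed_ts
```

Because rows are sorted by key bytes, a scan that starts at the ID prefix returns the most recent points of that series first, which suits the most-recent-most-valuable access pattern.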
Query Flow
1) Phoenix SQL: SELECT ID FROM META WHERE MFG = 'SA' AND Version = '1.1'
Time-Series Definition - META (Phoenix SQL)
ID    Active  Version  …  MFG
11    true    1.1         SA
12    true    1.3         SA
15    true    1.4         GE
17    true    1.1         GE
…     …       …        …  …
345   false   1.0         SA
2) HBase Time Series DB: GetAvgByIdSet(ID set, now(), now() - 24h)
Time-Series Data (HBase Time Series DB)
ID    Timestamp   Value
11    143897653   10.0
12    143897753   11.3
15    143897953   11.6
17    143897853   11.9
…     …           …
345   143897753   11.0
The ID set returned by step 1 is the input to step 2.
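The two-step flow can be sketched end to end with stand-ins: sqlite3 plays the role of Phoenix and a plain dictionary plays the role of the HBase time-series store. Both substitutions are assumptions made so the example runs on its own; the sample rows are taken from the slide, and the function names are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple

META_ROWS = [(11, 1, "1.1", "SA"), (12, 1, "1.3", "SA"), (15, 1, "1.4", "GE"),
             (17, 1, "1.1", "GE"), (345, 0, "1.0", "SA")]

# ID -> [(timestamp, value)], a stand-in for the time-series data in HBase.
DATA: Dict[int, List[Tuple[int, float]]] = {
    11: [(143897653, 10.0)], 12: [(143897753, 11.3)], 15: [(143897953, 11.6)],
    17: [(143897853, 11.9)], 345: [(143897753, 11.0)],
}


def build_meta() -> sqlite3.Connection:
    """Create the META table in memory (sqlite3 standing in for Phoenix)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE META (ID INTEGER PRIMARY KEY, Active INTEGER, "
                 "Version TEXT, MFG TEXT)")
    conn.executemany("INSERT INTO META VALUES (?, ?, ?, ?)", META_ROWS)
    return conn


def select_ids(conn: sqlite3.Connection, mfg: str, version: str) -> List[int]:
    """Step 1: resolve attribute predicates to a set of series IDs."""
    rows = conn.execute(
        "SELECT ID FROM META WHERE MFG = ? AND Version = ? ORDER BY ID",
        (mfg, version))
    return [row[0] for row in rows]


def get_avg_by_id_set(data: Dict[int, List[Tuple[int, float]]],
                      ids: Iterable[int], start: int, end: int) -> Dict[int, float]:
    """Step 2: average each selected series over the range [start, end)."""
    result = {}
    for series_id in ids:
        values = [v for ts, v in data.get(series_id, []) if start <= ts < end]
        if values:
            result[series_id] = sum(values) / len(values)
    return result
```

For the slide's query (MFG = 'SA' and Version = '1.1'), step 1 returns only ID 11, and step 2 then reads just that series instead of scanning all of them.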
Time-Series DB API
• Group operations on ID sets by time range
– Min, Max, Avg, Count, Sum, other aggregations
• Pluggable aggregation functions
• Support for different time resolutions
• With different approximations (linear, cubic, bi-cubic)
• Batch load support (for writes)
• Can be implemented in an HBase coprocessor layer
• Can work much, much faster than a regular SQL DBMS
• Because we have already aggregated the data
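Serving a series at a different time resolution with a linear approximation could look like the sketch below. It shows only the linear case, assumes a non-empty list of points sorted by strictly increasing timestamp, and uses hypothetical function names.

```python
from bisect import bisect_left
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


def linear_value_at(points: List[Point], t: int) -> float:
    """Linearly interpolate the series value at time t.
    points must be non-empty and sorted by strictly increasing timestamp."""
    times = [ts for ts, _ in points]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the sampled range")
    i = bisect_left(times, t)
    if times[i] == t:
        return points[i][1]  # exact sample, no interpolation needed
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def resample(points: List[Point], start: int, end: int, step: int) -> List[Point]:
    """Produce one interpolated point every `step` time units in [start, end]."""
    return [(t, linear_value_at(points, t)) for t in range(start, end + 1, step)]
```

Cubic or bi-cubic approximations would plug in at the same place as `linear_value_at`, in line with the idea of pluggable functions.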
Thank you
• Q&A

More Related Content

PPT
Windowsforensics
Santosh Khadsare
 
PPTX
From Zero to Data Flow in Hours with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PPTX
Data streaming fundamentals
Mohammed Fazuluddin
 
PPTX
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
PPT
HadoooIO.ppt
Sheba41
 
PDF
Simplifying Big Data Analytics with Apache Spark
Databricks
 
PPTX
Data science.chapter-1,2,3
varshakumar21
 
PDF
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
Windowsforensics
Santosh Khadsare
 
From Zero to Data Flow in Hours with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Data streaming fundamentals
Mohammed Fazuluddin
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
HadoooIO.ppt
Sheba41
 
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Data science.chapter-1,2,3
varshakumar21
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 

What's hot (20)

PPT
Seminar Presentation Hadoop
Varun Narang
 
PPTX
Elastic stack Presentation
Amr Alaa Yassen
 
PDF
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
PPTX
Storage As A Service (StAAS)
Shreyans Jain
 
PPTX
Introduction to Storage technologies
Kaivalya Shah
 
PPTX
Ozone: scaling HDFS to trillions of objects
DataWorks Summit
 
PPTX
Hadoop Oozie
Madhur Nawandar
 
PPTX
MongoDB
nikhil2807
 
PPTX
MongoDB presentation
Hyphen Call
 
PPTX
Hadoop Security Today & Tomorrow with Apache Knox
Vinay Shukla
 
PPTX
Trusted Platform Module (TPM)
k33a
 
PDF
Dev Ops Training
Spark Summit
 
PPSX
Cassandra and Riak at BestBuy.com
joelcrabb
 
PDF
Apache ZooKeeper
Scott Leberknight
 
PPT
Ch08 Authentication
Information Technology
 
PPTX
Catalyst optimizer
Ayub Mohammad
 
PDF
Big Data Analytics with Spark
Mohammed Guller
 
PPTX
Hadoop distributed file system
Anshul Bhatnagar
 
PPTX
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Seminar Presentation Hadoop
Varun Narang
 
Elastic stack Presentation
Amr Alaa Yassen
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
Storage As A Service (StAAS)
Shreyans Jain
 
Introduction to Storage technologies
Kaivalya Shah
 
Ozone: scaling HDFS to trillions of objects
DataWorks Summit
 
Hadoop Oozie
Madhur Nawandar
 
MongoDB
nikhil2807
 
MongoDB presentation
Hyphen Call
 
Hadoop Security Today & Tomorrow with Apache Knox
Vinay Shukla
 
Trusted Platform Module (TPM)
k33a
 
Dev Ops Training
Spark Summit
 
Cassandra and Riak at BestBuy.com
joelcrabb
 
Apache ZooKeeper
Scott Leberknight
 
Ch08 Authentication
Information Technology
 
Catalyst optimizer
Ayub Mohammad
 
Big Data Analytics with Spark
Mohammed Guller
 
Hadoop distributed file system
Anshul Bhatnagar
 
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Ad

Viewers also liked (20)

PDF
Big Data Ready Enterprise
DataWorks Summit/Hadoop Summit
 
PDF
Timeline service V2 at the Hadoop Summit SJ 2016
Vrushali Channapattan
 
PPTX
The Stream Processor as a Database Apache Flink
DataWorks Summit/Hadoop Summit
 
PDF
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
DataWorks Summit/Hadoop Summit
 
PPTX
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
PDF
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Daniel Madrigal
 
PPTX
Apache Phoenix + Apache HBase
DataWorks Summit/Hadoop Summit
 
PPTX
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB
 
PPTX
Streaming map reduce
danirayan
 
PPTX
Zero Downtime App Deployment using Hadoop
DataWorks Summit/Hadoop Summit
 
PDF
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
DataWorks Summit/Hadoop Summit
 
PDF
HBase Consistency and Performance Improvements
DataWorks Summit
 
PDF
Apache HBase 0.98
AndrewPurtell
 
PPTX
JustWatch - Culture & Core Values
David Croyé
 
PDF
Hbase Nosql
elliando dias
 
PPTX
Launching your advanced analytics program for success in a mature industry
DataWorks Summit/Hadoop Summit
 
PDF
IOT Paris Seminar 2015 - Storage Challenges in IOT
MongoDB
 
PPTX
Lambda-less Stream Processing @Scale in LinkedIn
DataWorks Summit/Hadoop Summit
 
PDF
Big Data Heterogeneous Mixture Learning on Spark
DataWorks Summit/Hadoop Summit
 
PPTX
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Cloudera, Inc.
 
Big Data Ready Enterprise
DataWorks Summit/Hadoop Summit
 
Timeline service V2 at the Hadoop Summit SJ 2016
Vrushali Channapattan
 
The Stream Processor as a Database Apache Flink
DataWorks Summit/Hadoop Summit
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Daniel Madrigal
 
Apache Phoenix + Apache HBase
DataWorks Summit/Hadoop Summit
 
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB
 
Streaming map reduce
danirayan
 
Zero Downtime App Deployment using Hadoop
DataWorks Summit/Hadoop Summit
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
DataWorks Summit/Hadoop Summit
 
HBase Consistency and Performance Improvements
DataWorks Summit
 
Apache HBase 0.98
AndrewPurtell
 
JustWatch - Culture & Core Values
David Croyé
 
Hbase Nosql
elliando dias
 
Launching your advanced analytics program for success in a mature industry
DataWorks Summit/Hadoop Summit
 
IOT Paris Seminar 2015 - Storage Challenges in IOT
MongoDB
 
Lambda-less Stream Processing @Scale in LinkedIn
DataWorks Summit/Hadoop Summit
 
Big Data Heterogeneous Mixture Learning on Spark
DataWorks Summit/Hadoop Summit
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Cloudera, Inc.
 
Ad

Similar to IoT:what about data storage? (20)

PPTX
Time-Series Apache HBase
HBaseCon
 
PPTX
Dancing elephants - efficiently working with object stores from Apache Spark ...
DataWorks Summit
 
PPTX
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
DataWorks Summit
 
PPTX
Druid Scaling Realtime Analytics
Aaron Brooks
 
PPTX
ACID Transactions in Hive
Eugene Koifman
 
PPTX
Apache Hive ACID Project
DataWorks Summit/Hadoop Summit
 
PPTX
Taming the Elephant: Efficient and Effective Apache Hadoop Management
DataWorks Summit/Hadoop Summit
 
PPTX
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
PPTX
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
PPTX
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
PPTX
Put is the new rename: San Jose Summit Edition
Steve Loughran
 
PPTX
Big data spain keynote nov 2016
alanfgates
 
PPTX
Hive acid and_2.x new_features
Alberto Romero
 
PPTX
Meet HBase 2.0 and Phoenix 5.0
DataWorks Summit
 
PPTX
Apache Hive on ACID
DataWorks Summit/Hadoop Summit
 
PPTX
PUT is the new rename()
Steve Loughran
 
PPTX
Hive ACID Apache BigData 2016
alanfgates
 
PPTX
Apache Hive on ACID
Hortonworks
 
PPTX
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
PDF
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
Time-Series Apache HBase
HBaseCon
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
DataWorks Summit
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
DataWorks Summit
 
Druid Scaling Realtime Analytics
Aaron Brooks
 
ACID Transactions in Hive
Eugene Koifman
 
Apache Hive ACID Project
DataWorks Summit/Hadoop Summit
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Put is the new rename: San Jose Summit Edition
Steve Loughran
 
Big data spain keynote nov 2016
alanfgates
 
Hive acid and_2.x new_features
Alberto Romero
 
Meet HBase 2.0 and Phoenix 5.0
DataWorks Summit
 
Apache Hive on ACID
DataWorks Summit/Hadoop Summit
 
PUT is the new rename()
Steve Loughran
 
Hive ACID Apache BigData 2016
alanfgates
 
Apache Hive on ACID
Hortonworks
 
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 

More from DataWorks Summit/Hadoop Summit (20)

PPT
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
PPT
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
PDF
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
PDF
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
PDF
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
PDF
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
PDF
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PPTX
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
PPTX
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
PDF
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
PPTX
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
PPTX
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
PPTX
HBase in Practice
DataWorks Summit/Hadoop Summit
 
PPTX
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
PDF
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
PPTX
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
PPTX
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 

Recently uploaded (20)

PDF
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
PPTX
The Future of AI & Machine Learning.pptx
pritsen4700
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PDF
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
PDF
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PDF
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Precisely
 
PPTX
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
PDF
AI-Cloud-Business-Management-Platforms-The-Key-to-Efficiency-Growth.pdf
Artjoker Software Development Company
 
PPTX
Simple and concise overview about Quantum computing..pptx
mughal641
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PDF
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
PPTX
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PDF
The Future of Artificial Intelligence (AI)
Mukul
 
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
The Future of AI & Machine Learning.pptx
pritsen4700
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Precisely
 
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
AI-Cloud-Business-Management-Platforms-The-Key-to-Efficiency-Growth.pdf
Artjoker Software Development Company
 
Simple and concise overview about Quantum computing..pptx
mughal641
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
The Future of Artificial Intelligence (AI)
Mukul
 

IoT:what about data storage?

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT: what about data storage? Vladimir Rodionov Staff Software Engineer
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series  Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time-series with tags
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series  Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time-series with tags  Sometimes with location – spatial data
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series  Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time-series with tags  Sometimes with location – spatial data  But, strictly time-series
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series  Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time-series with tags  Sometimes with location – spatial data  But, strictly time-series  Do we have good time series data store?
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series  Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time-series with tags  Sometimes with location – spatial data  But, strictly time-series  Do we have good time series data store?  Open source?
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved IoT data stream  Sequence of data points  Triplet: [ID][TIME][VALUE] – basic time-series  Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time-series with tags  Sometimes with location – spatial data  But, strictly time-series  Do we have good time series data store?  Open source?  But commercially supported?
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache HBase  Open Source  Scalable  Distributed  NoSQL Data Store  Commercially supported  Temporal?
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache HBase  Open Source  Scalable  Distributed  NoSQL Data Store  Commercially supported  Temporal? Sure, you can do temporal
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache HBase  Open Source  Scalable  Distributed  NoSQL Data Store  Commercially supported  Temporal? Sure, you can do temporal stuff!  Out of box?
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Time Series DB requirements  Data Store MUST preserve temporal locality of data for better in-memory caching  Data Store MUST provide efficient compression – Time – series are highly compressible (less than 2 bytes per data point in some cases) – Facebook custom compression codec produces less than 1.4 bytes per data point  Data Store MUST provide automatic time-based rollup aggregations: sum, count, avg, min, max, etc., by min, hour, day and so on – configurable. Most of the time its aggregated data we are interested in.  Efficient caching policy (RAM/SSD)  SQL API (nice to have, but it is optional)  Support IoT use cases ( write/read ratio up to 99/1, millions ops)
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Ideal HBase Time Series DB  Keeps raw data for hours  Does not compact raw data at all  Preserves raw data in memory cache for periodic compactions and time-based rollup aggregations  Stores full resolution data only in compressed form  Has different TTL for different aggregation resolutions: – Days for by_min, by_10min etc. – Months, years for by_hour  Compaction should preserve temporal locality of both: full resolution data and aggregated data.  Integration with Phoenix (SQL)
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Write Path (for 99%)
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Time Series DB HBase Raw Events Region Server HDFS CF:Compressed CF:Raw CF:Aggregates C A C A Compressor Coprocessor Aggregator Coprocessor CF:Aggregates CF:Compressed – TTL days/months CF:Aggregates – TTL months/years (CF per resolution) CF:Raw – TTL hours
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HBASE-14468 FIFO compaction  First-In-First-Out  No compaction at all  TTL expired data just get archived  Ideal for raw data storage  No compaction – no block cache trashing  Raw data can be cached on write or on read  Sustains 100s MB/s write throughput per RS  Available 0.98.17, 1.1+, 1.2+, HDP-2.4+  Can be easily back ported to 1.0 (do we need this?)
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Exploring (Size-Tiered) Compaction  Does not preserve temporal locality of data.  Compaction trashes block cache  No efficient caching of data is possible  It hurts most-recent-most-valuable data access pattern.  Compression/Aggregation is very heavy.  To read back recent raw data and run it through compressor, many IO operations are required, because …  We can’t guarantee recent data in a block cache.
  • 23. HBASE-15181 Date Tiered Compaction
      - DateTieredCompactionPolicy
      - CASSANDRA-6602
      - Works better for time series than ExploringCompactionPolicy
      - Better temporal locality helps with reads
      - Good choice for compressed full-resolution and aggregated data
      - Available in 0.98.17, 1.2+; HDP-2.4 has it as well
  • 24. Exploring Compaction + Max Size (ECPM)
      - Set hbase.hstore.compaction.max.size
      - This emulates Date-Tiered Compaction
      - Preserves temporal locality of data: data points that are close in time are stored in the same file, distant ones in separate files
      - Compaction works better with the block cache
      - More efficient caching of recent data is possible
      - Good for the most-recent-most-valuable data access pattern
      - Use it for compressed and aggregated data
      - Helps to keep recent data in the block cache
  • 25. HBASE-14496 Delayed compaction
      - Files are eligible for minor compaction if their age > delay
      - Good for applications where the most recent data is the most valuable
      - Prevents block cache thrashing for recent data caused by frequent minor compactions of fresh store files
      - Will enable this feature for Exploring Compaction Policy
      - Improves read latency for the most recent data
      - ECP + Max + Delay (1-2 days), abbreviated ECPMD, is a good option for compressed full-resolution and aggregated data
      - Patch available
      - HBase 1.0+ (can be backported to 0.98)
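The two eligibility rules described for Exploring Compaction (a maximum file size and a minimum file age) can be combined in a short Python sketch. It is a simplified model, not HBase's selection algorithm: `StoreFile` and `eligible_for_minor_compaction` are hypothetical names, and the real policy applies further ratio-based checks that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    name: str
    size_bytes: int
    created_at: int  # epoch seconds


def eligible_for_minor_compaction(files, now, max_size_bytes, delay_seconds):
    """Keep files that are small enough and old enough to be compacted."""
    return [
        f for f in files
        if f.size_bytes <= max_size_bytes       # size cap: big (old) files stay as they are
        and now - f.created_at > delay_seconds  # delay: fresh files are not rewritten yet
    ]
```

In this model, large files that already cover an older time range are never merged with newer ones, and freshly flushed files stay in place (and in cache) until the delay has passed.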
  • 26. Time Series DB on HBase, with compaction policies [architecture diagram]
      - Components: Raw Events, Region Server, HDFS
      - Coprocessors: Compressor Coprocessor (C), Aggregator Coprocessor (A)
      - Column families:
          - CF:Raw, TTL hours, FIFO
          - CF:Compressed, TTL days/months, ECPM or DTCP
          - CF:Aggregates, TTL months/years (one CF per resolution), ECPM or DTCP
  • 27. HBase Block Cache and Time Series
      - The current policy (LRU) is not optimal for time-series applications
      - We need something similar to FIFO (both in RAM and on SSD)
      - We need support for TB-size RAM/SSD-based caches
      - The current off-heap bucket cache does not scale well (it keeps keys in the Java heap)
      - For an SSD cache we could mirror the most recent store files, thus providing FIFO semantics without the complexity of disk-based cache management
      - All of the above are work items for the future, but today:
          - Disable the cache for raw data (prevents extreme cache churn)
          - Enable cache on write/read for compressed data and aggregations
  • 28. Flexible Retention Policies [figure]
      - Raw: hours
      - Compressed: months
      - Aggregates: years
  • 31. Read/Write IO Reduction (estimate for 250K/sec data points) [bar chart]
      - Base: 100 (50-100 MB/s)
      - FIFO+ECPM: ~50 (25-50 MB/s)
      - +Compaction: ~10 (5-10 MB/s)
  • 32. Summary
      - Disable major compaction
      - Do not run the HDFS balancer
      - Disable HBase automatic region balancing: balance_switch false
      - Disable region splits (DisabledRegionSplitPolicy)
      - Presplit the table in advance
      - Have separate column families for raw, compressed and aggregated data (each aggregate resolution gets its own family)
      - Increase hbase.hstore.blockingStoreFiles for all column families
      - FIFO for raw data; ECPM(D) or DTCP (next session) for compressed and aggregated data
  • 33. Summary (continued)
      - Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations
      - Do not cache raw data; enable cache on write/read for the others (if using ECPM(D))
      - Enable WAL compression to decrease write IO
      - Use maximum compression (GZ) for raw data to decrease write IO
  • 34. Read Path (for 1%)
  • 35. SQL (Phoenix) integration
      - Each time series has a set of named attributes, which we call meta (tags in OpenTSDB)
      - Keep time-series meta in Phoenix table(s)
      - Adding, deleting or updating a time series is a DML/DDL operation on a Phoenix table
      - Meta is (mostly) static
      - Define the set of meta attributes that form the primary key (PK)
      - Translate the PK to a unique ID
      - Store ID, RTS (reversed time stamp) and VALUE in HBase
      - Now you can index time series by any attribute(s) in Phoenix
      - A query is a two-step process: Phoenix first, to select a list of IDs, then HBase, to run the query on that ID list
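One common way to build an [ID][RTS] row key is sketched below in Python. The slide does not spell out the encoding, so this is an assumption: the reversed time stamp is taken to be Long.MAX_VALUE minus the timestamp in milliseconds, IDs are assumed non-negative, and `row_key` and `decode_row_key` are hypothetical helper names.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def row_key(series_id: int, timestamp_ms: int) -> bytes:
    """Encode [ID][RTS] as 16 big-endian bytes.

    Byte-wise ordering groups keys by ID first; within one ID the
    reversed timestamp makes the newest point sort first.
    """
    reversed_ts = LONG_MAX - timestamp_ms
    return struct.pack(">qq", series_id, reversed_ts)


def decode_row_key(key: bytes):
    """Recover (series_id, timestamp_ms) from a key built by row_key."""
    series_id, reversed_ts = struct.unpack(">qq", key)
    return series_id, LONG_MAX - reversed_ts
```

With such a key, a forward scan that starts at the first key of a given ID returns the most recent data points first, which suits the most-recent-most-valuable access pattern discussed earlier.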
  • 36. Query Flow [diagram]
      - Phoenix SQL: Time-Series Definition (META)

            ID    Active  Version  …   MFG
            11    true    1.1          SA
            12    true    1.3          SA
            15    true    1.4          GE
            17    true    1.1          GE
            …     …       …        …   …
            345   false   1.0          SA

      - HBase Time Series DB: Time-Series Data

            ID    Timestamp   Value
            11    143897653   10.0
            12    143897753   11.3
            15    143897953   11.6
            17    143897853   11.9
            …     …           …
            345   143897753   11.0

      - Step 1 (Phoenix): SELECT ID FROM META WHERE MFG = 'SA' AND Version = '1.1', which yields the ID set
      - Step 2 (HBase): GetAvgByIdSet(ID set, now(), now() - 24h)
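The two-step query flow can be mimicked with plain Python collections, using the rows visible on the slide. This is only an illustration: the lists stand in for the Phoenix META table and the HBase data table, and `select_ids` and `avg_by_id_set` are hypothetical functions (the latter loosely modelled on the slide's GetAvgByIdSet), not real Phoenix or HBase calls.

```python
# Stand-in for the Phoenix META table (rows taken from the slide).
META = [
    {"id": 11, "active": True, "version": "1.1", "mfg": "SA"},
    {"id": 12, "active": True, "version": "1.3", "mfg": "SA"},
    {"id": 15, "active": True, "version": "1.4", "mfg": "GE"},
    {"id": 17, "active": True, "version": "1.1", "mfg": "GE"},
    {"id": 345, "active": False, "version": "1.0", "mfg": "SA"},
]

# Stand-in for the HBase time-series table: (id, timestamp, value).
DATA = [
    (11, 143897653, 10.0),
    (12, 143897753, 11.3),
    (15, 143897953, 11.6),
    (17, 143897853, 11.9),
    (345, 143897753, 11.0),
]


def select_ids(meta, **attrs):
    """Step 1: filter meta rows by attribute equality, return the ID set."""
    return {row["id"] for row in meta
            if all(row[k] == v for k, v in attrs.items())}


def avg_by_id_set(data, ids, start, end):
    """Step 2: average the values of the given IDs within [start, end]."""
    values = [v for sid, ts, v in data if sid in ids and start <= ts <= end]
    return sum(values) / len(values) if values else None


ids = select_ids(META, mfg="SA", version="1.1")
average = avg_by_id_set(DATA, ids, start=0, end=200_000_000)
```

The attribute filter touches only the small, mostly static meta table; the time-series store is then queried by ID and time range alone, which is the access path its row key is designed for.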
  • 38. Time-Series DB API
      - Group operations on ID sets by time range
          - Min, Max, Avg, Count, Sum, other aggregations
      - Pluggable aggregation functions
      - Support for different time resolutions
      - With different approximations (linear, cubic, bi-cubic)
      - Batch load support (for writes)
      - Can be implemented in an HBase coprocessor layer
      - Can work much faster than a regular SQL DBMS
      - Because the data is already aggregated
  • 39. Thank you
      - Q&A