1 © 2016, Conversant, LLC. All rights reserved.
DATA MODELING FOR IOT
APACHECON IOT NORTH AMERICA 2017 PRESENTED BY:
JAYESH THAKRAR
SENIOR SOFTWARE ENGINEER
2
WHY DATA MODELING FOR IOT?
1. IoT is the next big wave after social media
(e.g. connected cars, smart homes & appliances)
2. Interesting challenges of volume, velocity and variety
3. Can be applied to other big data problems
3
DATA MODELING FOR IOT
1. Discuss sample IoT application
2. Discuss data model
3. Discuss application architecture
4
Sample
Application
5
INTELLIGENT VEHICLES
Communication endpoints:
• V2V: Vehicle to Vehicle
• V2C: Vehicle to Cloud
• V2I: Vehicle to Infrastructure
• Event = single, discrete communication message exchanged between a vehicle and infrastructure
Diagram labels: Cloud (Internet), Road-side infrastructure
6
V2I: DATA & APPLICATION ASSUMPTIONS
• 1+ billion vehicles
• 500+ events per vehicle/day, based on:
– avg. time on road = 3 hours = 180 min
– 1 event per 10-30 seconds (avg = 3 per min), so 180 × 3 = 540 events/vehicle/day
• Avg. event size = 250-500+ bytes
– Total raw data size = 150-300 TB/day
• Cassandra datastore
– The approach can be applied to HBase or another similarly scalable datastore, with appropriate testing
• Streaming for ingestion/processing/ETL
• Ad hoc and batched analytics, extraction, etc.
• Avoid schema-level indexes (for maintainability, efficiency, size, storage, etc.)
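The sizing assumptions above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation; the object and method names are illustrative, and decimal terabytes (10^12 bytes) are assumed.

```scala
object V2ISizing {
  val vehicles: Long = 1000000000L       // 1+ billion vehicles
  val minutesOnRoad: Int = 3 * 60        // avg. 3 hours on the road per day
  val eventsPerMinute: Int = 3           // 1 event per 10-30 s, i.e. about 3 per minute
  val eventsPerVehiclePerDay: Int = minutesOnRoad * eventsPerMinute

  // Raw bytes per day for a given average event size (in bytes).
  def rawBytesPerDay(avgEventBytes: Int): Long =
    vehicles * eventsPerVehiclePerDay * avgEventBytes

  // Decimal terabytes (assumption: 1 TB = 10^12 bytes).
  def toTerabytes(bytes: Long): Double = bytes / 1e12
}
```

With 250-byte and 500-byte events this gives 135 TB and 270 TB per day, the same order of magnitude as the 150-300 TB/day quoted on the slide, which presumably rounds up for the "500+" byte case.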
7
SAMPLE APPLICATION ARCHITECTURE
Ingestion pipeline
Stream processing and analytics
Data storage
8
DATA MODEL CONSTRAINTS / REQUIREMENTS
• Efficient, low-latency writes and reads
• Sample queries:
- Events for a vehicle between two dates (or timestamps)
- Events for an infrastructure between two dates (or timestamps)
- Events by all infrastructure on a specific road-segment in a region
• Short, ad hoc query characteristics/needs (guesstimate)
- volume = 100 – 100,000 rows
- response time = 100 ms – 100 seconds (proportional to result size)
9
SCHEMA VISUALIZATION: STAR SCHEMA
Vehicle
Event
Infrastructure
Road Segment
Time / Calendar
Region
10
CAN ALSO BE APPLIED TO: ADVERTISING/SEARCH
Cookie
Event
URL
Location
Time / Calendar
Region
11
CAN ALSO BE APPLIED TO: SOCIAL NETWORKS
User
Action
Page
Location
Time / Calendar
Region
12
IoT Data Model
13
INSPIRATION: UNIX FILESYSTEM INODE
14
CASSANDRA: TABLE BASICS
• Data stored in tables with pre-defined schema
• Data types: primitives, collections, user-defined types
– Collections = sets, maps, lists
– Map keys and set values are kept sorted; lists keep insertion order
• Every table has a primary key (PK)
– PK = single column or multi-column (composite)
– Data distributed across cluster nodes based on a hash of the first part of the PK (the partition key)
• Keyspace = collection of (related) tables
• PK-based queries = very fast because of the key cache, bloom filters, and SSTable indexes
15
DATA ASSUMPTIONS (SIMPLISTIC MODEL)
16
TABLE DESIGN OPTIONS
Traditional table structure: a column for each field
INSERT INTO event(id, timestamp, vehicle_id, infra_id, ...) VALUES (...)
INSERT INTO event JSON '{ "id" : 1234, "timestamp" : "...", ... }'
All data fields serialized into a single column
INSERT INTO event(id, data)
VALUES (1234, 'JSON/blob/serialized avro/etc') // data = blob or text
All data fields stored in a collection column (e.g. map and/or set)
INSERT INTO event(id, data)
VALUES (1234, {'timestamp': ...}) // data = map<text, text>
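To make the three options concrete on the application side, here is a minimal sketch of one event held in each shape. The field names and the `EventShapes` object are assumptions for illustration, and the JSON rendering is deliberately naive (no escaping); a real application would use a proper serializer.

```scala
object EventShapes {
  // Column-per-field shape: one typed field per column (hypothetical field names).
  final case class Event(id: Long, timestamp: String, vehicleId: String, infraId: String)

  // Collection shape: all data fields in a map<text, text>-style value.
  def toMap(e: Event): Map[String, String] = Map(
    "timestamp"  -> e.timestamp,
    "vehicle_id" -> e.vehicleId,
    "infra_id"   -> e.infraId
  )

  // Serialized shape: all data fields in a single text value.
  // Naive JSON with keys in sorted order; values are not escaped.
  def toJson(e: Event): String =
    toMap(e).toSeq.sortBy(_._1)
      .map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }
      .mkString("{", ",", "}")
}
```

The typed shape gives the datastore the most schema information, while the map and serialized shapes keep the table definition stable when fields are added.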
17
STAR SCHEMA: DIMENSION TABLES
18
STAR SCHEMA: EVENT NAVIGATION TABLES
19
VEHICLE -> EVENTS : VEH_EVENT
CREATE TABLE veh_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7
event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
Level 0: Map of pointers to hourly data for each vehicle
eb5071d8-0e35-4a82-ad37-543d3da66de7 set_data: (2017062408, 2017062409, ...)
Level 1: Map of pointers to actual event data for a vehicle for a given hour interval
eb5071d8-0e35-4a82-ad37-543d3da66de7, 2017062408 map_data: (08:23:16.732 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...)
Actual event data
25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ......
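The two-level navigation on this slide can be read as three keyed lookups in sequence. The sketch below models the tables as in-memory maps purely for illustration; it does not use a Cassandra driver, and `eventsForVehicle` and the type aliases are hypothetical names. Hour buckets are assumed to be `yyyyMMddHH` strings, so plain string comparison orders them chronologically.

```scala
object VehEventNav {
  // Level 0: vehicle id -> hour buckets (yyyyMMddHH) that contain events.
  type Level0 = Map[String, Set[String]]
  // Level 1: "<vehicle id>,<hour bucket>" -> (time of day -> event id).
  type Level1 = Map[String, Map[String, String]]
  // Event table: event id -> event payload.
  type Events = Map[String, String]

  // Events for one vehicle between two hour buckets (inclusive), in time order.
  def eventsForVehicle(vehicleId: String, fromHour: String, toHour: String,
                       l0: Level0, l1: Level1, events: Events): Seq[String] = {
    val hours = l0.getOrElse(vehicleId, Set.empty[String])
      .filter(h => h >= fromHour && h <= toHour)
      .toSeq.sorted
    for {
      hour    <- hours
      entry   <- l1.getOrElse(s"$vehicleId,$hour", Map.empty[String, String]).toSeq.sortBy(_._1)
      payload <- events.get(entry._2).toSeq   // skip pointers whose event row is missing
    } yield payload
  }
}
```

Every step is a lookup by primary key, which is the access pattern the data model aims for.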
20
INFRASTRUCTURE -> EVENTS: INFRA_EVENT
CREATE TABLE infra_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559
vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7
event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
Level 0: Map of pointers to hourly data for each infrastructure
ffe0bdbb-3b89-4337-a477-4a17f719b559 set_data: (2017062408, 2017062409, ...)
Level 1: Map of pointers to actual event data by vehicle for an infrastructure for a given hour interval
ffe0bdbb-3b89-4337-a477-4a17f719b559, 2017062408 map_data: (23:16.732, eb5071d8-0e35-4a82-ad37-543d3da66de7 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...)
Actual event data
25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ......
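In the infrastructure table the level 1 map key combines a time within the hour and a vehicle id. A minimal sketch of building and parsing such a composite key follows; the comma separator mirrors the slide's notation, while `Level1Key`, `decode`, and `eventIdsForVehicle` are hypothetical helper names, not part of any Cassandra API.

```scala
object InfraEventKey {
  // Composite level 1 map key: "<time within hour>,<vehicle id>".
  final case class Level1Key(timeInHour: String, vehicleId: String) {
    def encode: String = s"$timeInHour,$vehicleId"
  }

  // Splits on the first comma only; returns None for malformed keys.
  def decode(raw: String): Option[Level1Key] = raw.split(",", 2) match {
    case Array(t, v) => Some(Level1Key(t.trim, v.trim))
    case _           => None
  }

  // Event ids recorded for one vehicle within a single level 1 row, ordered by map key.
  def eventIdsForVehicle(level1Row: Map[String, String], vehicleId: String): Seq[String] =
    level1Row.toSeq.sortBy(_._1).collect {
      case (k, eventId) if decode(k).exists(_.vehicleId == vehicleId) => eventId
    }
}
```

Putting the time first keeps the map entries in time order within the hour, at the cost of a client-side filter when only one vehicle is of interest.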
21
LOCATION -> EVENTS: LOC_INFRA_EVENT
CREATE TABLE loc_infra_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
region_id = 3aa40699-357e-48db-888b-af2ff7856949
road_seg_id = 60b57655-0670-4969-9eec-99bcf8c8a034
infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559
Level 0: Map of pointers to road-segments by region
3aa40699-357e-48db-888b-af2ff7856949 set_data: (60b57655-0670-4969-9eec-99bcf8c8a034, ...)
Level 1: Map of pointers to infrastructure by road-segment
60b57655-0670-4969-9eec-99bcf8c8a034 set_data: (ffe0bdbb-3b89-4337-a477-4a17f719b559, ...)
map_data can be used above if there is a need to store any data (e.g. a timestamp) along with the road-segment or infra id
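Resolving a region to its infrastructure ids is a two-hop walk through the same table. The sketch below again uses an in-memory map as a stand-in for the table; `SetTable`, `infraForRoadSegment`, and `infraForRegion` are illustrative names only.

```scala
object LocInfraNav {
  // One logical table: row id -> set_data.
  // A region row holds road-segment ids; a road-segment row holds infrastructure ids.
  type SetTable = Map[String, Set[String]]

  def infraForRoadSegment(table: SetTable, roadSegId: String): Set[String] =
    table.getOrElse(roadSegId, Set.empty[String])

  // Level 0 lookup (region -> road segments), then one level 1 lookup per segment.
  def infraForRegion(table: SetTable, regionId: String): Set[String] =
    table.getOrElse(regionId, Set.empty[String])
      .flatMap(seg => infraForRoadSegment(table, seg))
}
```

Each infrastructure id returned here can then be used as the starting key for the infrastructure-to-events navigation.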
22
LOGICAL & PHYSICAL DESIGN CONSIDERATIONS
• Split each "level" of the (logical) event navigation table into separate physical tables
– E.g. vehicle_event into vehicle_event_l0, vehicle_event_l1
– Allows tuning parameters like cache, partition size, and bloom filter per table, and eases maintenance, etc.
• Primary keys for tables: combine a process-level UUID with a counter
– E.g. <uuid>-<NNNN> (reduces number of UUID generation calls)
– Further compact the primary key by using binary encoding instead of string (e.g. 16 bytes for UUID + 8 bytes for counter)
• Short column names and appropriate data formats
– CREATE TABLE vehicle_event(id BLOB PRIMARY KEY, m MAP <TEXT, TEXT>, s SET <TEXT>, ...)
– Compact data (e.g. time-of-day timestamps as an integer, i.e. ms of the day)
• Data immutability (helps reduce Cassandra entropy & ghost data concerns)
– Immutable event-level data (insert-only into event and navigation tables)
– TTL to "age-out/purge" old data
• Keyspace sharding by time period and Cassandra compaction strategy
– Keyspace by day/week
– Compaction strategy = STCS vs. TWCS
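The compact key and compact timestamp ideas above can be sketched with the JDK alone. The layout below (16 bytes of UUID followed by an 8-byte big-endian counter) is one possible reading of the slide, not a prescribed format, and `CompactKeys`, `KeyGenerator`, `encode`, `decode`, and `millisOfDay` are hypothetical names.

```scala
import java.nio.ByteBuffer
import java.time.LocalTime
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

object CompactKeys {
  // One UUID per process; each new key only increments a counter.
  final class KeyGenerator(processId: UUID = UUID.randomUUID()) {
    private val counter = new AtomicLong(0L)
    def next(): Array[Byte] = encode(processId, counter.getAndIncrement())
  }

  // 24-byte binary key: 16 bytes of UUID + 8 bytes of counter.
  def encode(id: UUID, n: Long): Array[Byte] =
    ByteBuffer.allocate(24)
      .putLong(id.getMostSignificantBits())
      .putLong(id.getLeastSignificantBits())
      .putLong(n)
      .array()

  def decode(key: Array[Byte]): (UUID, Long) = {
    val buf = ByteBuffer.wrap(key)
    val msb = buf.getLong()
    val lsb = buf.getLong()
    (new UUID(msb, lsb), buf.getLong())
  }

  // Time of day as an integer number of milliseconds since midnight.
  def millisOfDay(t: LocalTime): Int = (t.toNanoOfDay() / 1000000L).toInt
}
```

For comparison, the string form `<uuid>-<NNNN>` needs 36 characters for the UUID alone, so the binary form is noticeably shorter as a primary key.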
23
KEY TAKEAWAYS OF DATA MODEL
• Single column primary keys
• Short primary key and column names
• All access (single row or range scan) via primary keys only
• Range scan (when necessary) appropriately paginated
• Immutable data (no updates/deletes) and idempotent inserts
• Data purge (TTL v/s keyspace by time period)
24
The Big Picture
Data Architecture +
App Architecture
25
SINGLE CLUSTER, CENTRALIZED INGESTION & PROCESSING
Single, centralized Cassandra cluster with
data-pipeline from different locations
26
MULTI-DATACENTER CLUSTER, INGESTION & PROCESSING
27
MULTIPLE INDEPENDENT, MODULAR SYSTEMS
Multiple, independent Cassandra clusters at different
datacenters along with an optional central cluster
containing select and/or aggregated data.
28
Reference & Misc
29
SAMPLE OF V2I REFERENCE INFORMATION
• https://www.its.dot.gov/index.htm
• https://www.its.dot.gov/v2i/
• https://www.its.dot.gov/communications/media/15cv_future.htm
• https://www.iso.org/committee/54706/x/catalogue/
• https://www.iso.org/standard/69897.html
30
SCALA SAMPLE TO MAP SET DATA INTO INDIVIDUAL CASSANDRA ROW ACCESS
// Expands one (key, set of values) row into individual (key, value) pairs,
// e.g. to turn a set_data row into one lookup per member.
case class Data(key: String, values: Set[String]) extends Iterator[(String, String)] {
  private val i = values.iterator
  def hasNext: Boolean = i.hasNext
  def next(): (String, String) = (key, i.next())
}

val d = Seq[(String, Set[String])](("a", Set("a-1", "a-2", "a-3")))

scala> d.flatMap(i => Data(i._1, i._2))
res3: Seq[(String, String)] = List((a,a-1), (a,a-2), (a,a-3))
31
