Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra
Sam Bisbee, Threat Stack CTO
Typical [time series] problems on C*
● Disk utilization creates a scaling pattern of lighting money on fire
– Only works for a month or two, even with 90% disk utilization
● Every write-up we found focused on schema design for tracking integers across time
– There are days we wish we only tracked integers
● Data drastically loses value over time, but C*'s design doesn't acknowledge this
– TTLs only address zero-value states, not partial value
– Ex.: 99% of reads are for data in its first day
● Not all sensors are equal
Categories of Time Series Data
[2×2 chart: volume of Tx's vs. size of Tx's, with labeled regions: CRUD/Web 2.0; System Monitoring (CPU, etc.); Traditional object store; Threat Stack]
[Same chart, annotated: System Monitoring is traditional time series on C*, what everyone writes about; for Threat Stack's region, “We're going to need a bigger boat. Or disks.”]
We care about this thing called margins
(see: we're in Boston, not the Valley)
Data at Threat Stack
● 5 to 10 TB per day of raw data
– Crossed several TB per day in first few months of production with ~4 people
● 80,000 to 150,000 Tx per second, analyzed in real time
– Internal goal of analyzing, persisting, and firing alerts in <1s
● 90% write to 10% read tx
● Pre-compute query results for 70% of queries for the UI
– Optimized lookup tables & complex data structures, not just “query & cache”
● 100% AWS, distrust of remote storage in our DNA
– This is not just EBS bashing. This applies to all databases on all platforms, even a cage in a data center.
● By the way, we're on DSE 4.8.4 (C* 2.1)
Generic data model
● Entire platform assumes that events form a partially ordered, eventually consistent, write-ahead log
– A wonderful C* use case, so long as you only INSERT
● UPDATE is a dirty word and C* counters are “banned”
– We do our big counts elsewhere (“right tool for the right job”)
● No DELETEs: too many key permutations, and we don't want tombstones
● Duplicate writes will happen
– Legitimate: fully or partially failed batches of writes
– Legitimate: sensor resends data because it doesn't see the platform's acknowledgement of the data
– How-do-you-even-computer: people cannot configure NTP, so have fun constantly receiving data from 1970
● TTL on insert time; store and query on event time
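The last two bullets can be sketched as a small ingest helper. This is a minimal illustration, not Threat Stack's actual code: the retention window, the sanity cutoff, and all field names are assumptions.

```python
import time

EVENT_TTL_SECONDS = 14 * 24 * 3600   # hypothetical retention window
MIN_SANE_EVENT_TS = 946684800        # 2000-01-01: reject clock-skewed "1970" data

def prepare_row(sensor_id, event_id, event_ts, payload, now=None):
    """Build an insert-only row for the event log, or return None to drop it.

    - TTL is anchored on *insert* time (now), so retention is predictable
      regardless of what the sensor's clock claims.
    - The row stores, and is queried on, *event* time.
    - Duplicate writes (retried batches, sensor resends) are harmless here:
      they re-insert the same primary key, which is idempotent INSERT-only.
    """
    now = now if now is not None else time.time()
    if event_ts < MIN_SANE_EVENT_TS:
        return None  # unconfigured NTP: data "from 1970"
    return {
        "sensor_id": sensor_id,
        "event_day": time.strftime("%Y-%m-%d", time.gmtime(event_ts)),
        "event_id": event_id,
        "event_ts": event_ts,
        "payload": payload,
        "ttl_seconds": EVENT_TTL_SECONDS,  # from insert time, not event time
    }
```

The key design point is the asymmetry: the TTL clock starts when *we* see the data, while all reads slice on the event's own timestamp.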
We need to show individual events or slices, so we cannot use time-granularity rows (1min, 15min, 30min, 1hr, etc.)
Creating and updating tables' schema
● ALTER TABLE isn't fun, so we support dual writes instead
– Create the new schema, performing dual reads for new & old
– Cut writes over to the new schema
– After the TTL elapses, DROP TABLE old
● Each step is verifiable with unit tests and metrics
● Maintains the insert-only data model, at a temporary disk-utilization cost
● Allows trivial testing of analysis and A/B'ing of schema
– Just toss a new schema in, gather some insights, and then feel free to drop it
AWS Instance Types & EBS
● EBS is generally banned on our platform
– Too many of us lived through the great outage
– Too many of us cannot live with unpredictable I/O patterns
– Biggest reason: you cannot RI EBS (there is no Reserved Instance pricing for it)
● Originally used i2.2xlarge's in 2014/2015
– Considering the amount of “learning” we did, we were very grateful for SSDs due to the amount of streaming we had to do
● Moved to d2.xlarge's and d2.2xlarge's in 2015
– RAID 0 the spindles with xfs
– We like the CPU and RAM to disk ratio, especially since compaction stops after a few hours
$/TB on AWS
                 i2.2xlarge          d2.2xlarge          c3.2xlarge + 6 x 2TB io1 EBS
No Prepay        $619.04 / 1.6TB     $586.92 / 12TB      $1,713.16 / 12TB
                 = $386.90/TB/month  = $48.91/TB/month   = $142.77/TB/month
Partial Prepay   $530.37 / 1.6TB     $502.12 / 12TB      $1,684.59 / 12TB
                 = $331.48/TB/month  = $41.85/TB/month   = $140.39/TB/month
Full Prepay      $519.17 / 1.6TB     $492.00 / 12TB      $1,680.84 / 12TB
                 = $324.48/TB/month  = $41.00/TB/month   = $140.07/TB/month
● Amortizes the one-time RI payment across 1yr, focusing on cost instead of cash out of pocket
● Does not account for N=3 in the cluster, so x3 for each record, then x2 for worst-case compaction headroom (realistically you need MUCH LESS)
● The c3 column assumes the d2 comparison on disk size; not fair versus i2
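The per-TB math in the table, including the replication and compaction multipliers from the bullets, can be reproduced with a short calculator. The prices are copied from the table; the function and its parameter names are just illustrative.

```python
def dollars_per_tb(monthly_cost, usable_tb, replication=1, compaction_headroom=1.0):
    """Monthly $/TB of *logical* data on one instance type.

    replication=3 and compaction_headroom=2.0 model the worst case called
    out above (3 replicas in the cluster, 2x disk headroom for compaction);
    the defaults of 1 reproduce the raw table cells.
    """
    return monthly_cost * replication * compaction_headroom / usable_tb

# Raw monthly instance costs from the table (amortized RI, 1yr).
pricing = {
    "i2.2xlarge":       {"full_prepay": 519.17, "tb": 1.6},
    "d2.2xlarge":       {"full_prepay": 492.00, "tb": 12.0},
    "c3.2xlarge+io1":   {"full_prepay": 1680.84, "tb": 12.0},
}
```

For example, the d2.2xlarge full-prepay worst case is $492 x 3 replicas x 2 headroom / 12 TB = $246/TB/month of logical data.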
We only store some raw data in C*
● Deleting data proved too difficult in the early days, even with DTCS (slides coming on how we solved this)
● Re-streaming due to regular maintenance could take a week or more
– Dropping instance size doesn't solve the throughput problem, since all resources are cut, not just disk size
– Another reason not to use EBS, since you'll “never” get close to 100% disk utilization
● Due to the aforementioned C* durability design, the cost of data for day 2..N is too high even if you drop the replica count
Tying C* to raw data
● Every query must constrain a minimum of:
– Sensor ID
– Event Day
● Every query result must include a minimum of:
– Sensor ID
– Event Day
– Event ID
● Batches of (sensor_id, event_day, event_id) triples are then used to look up the raw events from raw data storage
– This isn't always necessary (aggregates, correlations, etc.)
– Even with additional hops, full reads are still <1s
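The two-hop read described above can be sketched as follows. The grouping by (sensor_id, event_day) mirrors the partition key; `raw_store` stands in for whatever raw-data storage backs the second hop and is purely hypothetical.

```python
def collect_triples(rows):
    """Extract the (sensor_id, event_day, event_id) triples that every C*
    query result must carry, grouped by (sensor_id, event_day) so each
    group maps onto one raw-storage lookup."""
    groups = {}
    for row in rows:
        key = (row["sensor_id"], row["event_day"])
        groups.setdefault(key, []).append(row["event_id"])
    return groups

def fetch_raw_events(rows, raw_store):
    """Second hop: resolve triples against raw storage. raw_store is a
    hypothetical dict-like object keyed by the full triple; this hop is
    skipped entirely for aggregates, correlations, etc."""
    out = []
    for (sensor_id, event_day), event_ids in collect_triples(rows).items():
        for eid in event_ids:
            ev = raw_store.get((sensor_id, event_day, eid))
            if ev is not None:
                out.append(ev)
    return out
```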
Using triples to batch writes
● Partition key starts with sensor id and event day
– Bonus: you get a fresh ring location every day! Helps average out your schema mistakes over the TTL
● Event batches off of RabbitMQ are already constrained to a single sensor id and event day
– Allows mapping a single AMQP read to a single C* write (RabbitMQ is podded, not clustered)
– Flow state of the pipeline becomes trivial to understand
● Batch C* writes on partition key, then on data size (soft cap at 5120 bytes, C*'s internal warn threshold)
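The last bullet's batching policy, group by partition key first and then split on size, looks roughly like this. A sketch only: the event dicts and their precomputed `size` field are assumptions, not the real pipeline's types.

```python
BATCH_SOFT_CAP_BYTES = 5120  # the C* warn threshold called out above

def batch_writes(events):
    """Group events by partition key (sensor_id, event_day), then greedily
    split each group into batches that stay under the soft size cap.

    Every resulting batch touches exactly one partition, so no C* batch
    ever fans out across ring locations.
    """
    by_partition = {}
    for ev in events:
        key = (ev["sensor_id"], ev["event_day"])
        by_partition.setdefault(key, []).append(ev)

    batches = []
    for key, evs in by_partition.items():
        cur, cur_size = [], 0
        for ev in evs:
            if cur and cur_size + ev["size"] > BATCH_SOFT_CAP_BYTES:
                batches.append((key, cur))  # flush before exceeding the cap
                cur, cur_size = [], 0
            cur.append(ev)
            cur_size += ev["size"]
        if cur:
            batches.append((key, cur))
    return batches
```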
Compaction woes, STCS & DTCS
● Used STCS in 2014/2015; expired data would get stuck ∞
– “We could rotate tables” → eh, no
– “We could rotate clusters” → oh c'mon, hell no
– “We could generate every historic permutation of keys within that time bucket with Spark and run DELETEs” → ...
● Used DTCS in 2015, but expired data still got stuck ∞
– When deciding whether an SSTable is too old to compact, DTCS compares “now” against the max timestamp (most recent write)
– If you write constantly (time series), then SSTables will rarely or never stop compacting
– This means you never realize the true value of DTCS for time series: the ability to unlink whole SSTables from disk
Cluster disk states, assuming a constant sensor count
[Chart: Disk Util vs. Time, contrasting “what you want” with “what you get” after the initial build-up to the retention period]
MTCS, fixing DTCS
https://github.com/threatstack/mtcs
Now compares against the min timestamp (oldest write)
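The DTCS-vs-MTCS difference boils down to which timestamp the age check uses. A toy sketch of just that comparison (timestamps in epoch seconds; the threshold corresponds to max_sstable_age_days = 1, and the function names are mine, not the strategies' internals):

```python
MAX_SSTABLE_AGE_SECONDS = 24 * 3600  # max_sstable_age_days = 1

def dtcs_stops_compacting(now, max_write_ts):
    """DTCS: an SSTable leaves the compaction window only when its *newest*
    write is old enough. Under constant time-series writes, compaction keeps
    producing SSTables with recent max timestamps, so tables near the write
    head rarely or never age out."""
    return now - max_write_ts > MAX_SSTABLE_AGE_SECONDS

def mtcs_stops_compacting(now, min_write_ts):
    """MTCS: compare against the *oldest* write instead, so an SSTable ages
    out of compaction, and can eventually be unlinked whole, once its oldest
    data crosses the threshold."""
    return now - min_write_ts > MAX_SSTABLE_AGE_SECONDS
```

Same SSTable, same clock, opposite answers: that is the whole fix.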
MTCS settings
● Never run repairs (they never worked on STCS or DTCS anyway) and hinted handoff is off (a great way to kill a cluster anyway)
● max_sstable_age_days = 1, base_time_seconds = 3600 (1 hour)
● Results in roughly hour-bucket sequential SSTables
– Reads are happy at day or hour resolution, which you have to provide in the partition key anyway
● Rest of the DTCS sub-properties are default
● Not worried about really old and small SSTables, since those are simply unlinked “soon”
MTCS + sstablejanitor.sh
● Even with MTCS, SSTables were still not getting unlinked
● So enters sstablejanitor.sh
– Cron job fires it once per hour
– Iterates over each SSTable on disk for MTCS tables (chef/cron feeds it a list of tables and their TTLs)
– Uses sstablemetadata to determine the max timestamp
– If past TTL, uses JMX to invoke CompactionManager's forceUserDefinedCompaction on the table
● Hack? Yes, cron + sed + awk + JMX qualifies as a hack, but it works like a charm and we don't carry expired data
● Bonus: you don't need to reserve half your disks for compaction
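The janitor's decision logic is small enough to sketch. This is an approximation, not the real sstablejanitor.sh: the `sstablemetadata` output parsing assumes the 2.1-era "Maximum timestamp" line (microseconds since the epoch), and the JMX forceUserDefinedCompaction call itself is not shown.

```python
import re
import subprocess
import time

def sstable_max_timestamp(path):
    """Shell out to sstablemetadata and parse the max write timestamp.
    Returns epoch seconds, or None if the line isn't found (output format
    varies across C* versions)."""
    out = subprocess.run(["sstablemetadata", path],
                         capture_output=True, text=True).stdout
    m = re.search(r"Maximum timestamp:\s*(\d+)", out)
    return int(m.group(1)) / 1_000_000 if m else None

def is_expired(max_ts_seconds, ttl_seconds, now=None):
    """The core check: if even the *newest* write in the SSTable is past
    the table's TTL, everything in it is expired and the whole SSTable can
    be force-compacted away (via JMX, not shown)."""
    now = now if now is not None else time.time()
    return now - max_ts_seconds > ttl_seconds
```

Cron feeds a loop of SSTable paths and per-table TTLs through these two functions once per hour; anything `is_expired` gets handed to forceUserDefinedCompaction.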
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra (Sam Bisbee, Threat Stack) | C* Summit 2016
Discussion
@threatstack
@sbisbee
