Cassandra serving netflix @ scale
CDE
• Cloud Database Engineering
• Responsible for providing data stores as services @ Netflix
CDE Services
Agenda
• Cassandra @ Netflix
• Challenges
• Certification and benchmarking
• CDE Architecture
Cassandra @ Netflix
• 98% of streaming data is stored in Cassandra
• Data ranges from customer details to viewing history / streaming bookmarks to billing and payment
Cassandra Footprint
• Hundreds of clusters
• Tens of thousands of nodes
• PBs of data
• Millions of transactions / sec
Challenges
• Monitoring
• Maintenance
• Open source product
• Production readiness
Monitoring
• What do we monitor?
– Latencies
• Coordinator read latency: 99th and 95th percentile, based on cluster configuration
• Coordinator write latency: 99th and 95th percentile, based on cluster configuration
– Health
• Health check (powered by Mantis)
• Gossip issues
• Thrift / binary service status
• Heap
• dmesg: hardware and network issues
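The 95th and 99th percentile latencies listed above can be derived from raw samples with a nearest-rank rule. This is only a minimal sketch: the function name `percentile` and the sample latencies are hypothetical and are not taken from the Netflix tooling.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]

# Hypothetical coordinator read latencies in milliseconds.
latencies_ms = [1.2, 0.9, 1.1, 15.0, 1.0, 1.3, 0.8, 1.4, 2.0, 40.0]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With only ten samples the 95th and 99th percentiles both land on the slowest request, which is why tail percentiles are usually computed over much larger windows.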
Monitoring
• Recent maintenances
– Jenkins
– User initiated maintenances
• Wide row metrics
• Log file warnings / errors / exceptions
Common Approach
[Diagram: Common Architecture, showing a CRON system and several Job Runners]
Problems inherent in polling
● Point-in-time snapshot, no state
● Establishing a connection to a cluster when it’s under heavy load is problematic
● Not resilient to network hiccups, especially for large clusters
A different approach
What if we had a continuous stream of fine-grained snapshots?
Mantis Streaming System
Stream processing system built on Apache Mesos
– Provides a flexible programming model
– Models computation as a distributed DAG
– Designed for high throughput, low latency
Health Check using Mantis
[Diagram: one Source Job and one Local Ring Agg per region (eu-west-1, us-east-1, us-west-2), a Global Ring Agg, and Score (S) events. A Health Evaluator consumes the Scores and uses an FSM to produce a Health Status.]
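The health evaluator in the diagram consumes a stream of scores and keeps state between them, which a point-in-time poll cannot do. A minimal sketch of such a finite state machine follows. The state names, the 0.0 to 1.0 score scale, the threshold, and the `required` streak length are all assumptions made for illustration, and this is plain Python, not the Mantis API.

```python
from enum import Enum

class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"

class HealthEvaluator:
    """Changes state only after `required` consecutive scores disagree with the current state."""

    def __init__(self, threshold=0.5, required=3):
        self.threshold = threshold  # assumed cut-off between healthy and unhealthy scores
        self.required = required    # assumed number of consecutive disagreeing scores
        self.state = Health.HEALTHY
        self._streak = 0

    def consume(self, score):
        observed = Health.HEALTHY if score >= self.threshold else Health.UNHEALTHY
        if observed == self.state:
            self._streak = 0        # agreement resets the streak
        else:
            self._streak += 1
            if self._streak >= self.required:
                self.state = observed
                self._streak = 0
        return self.state

evaluator = HealthEvaluator()
scores = [0.9, 0.2, 0.9, 0.1, 0.2, 0.3, 0.8]  # hypothetical score stream
states = [evaluator.consume(s) for s in scores]
```

In this sketch the single low score at position 1 does not change the status, while the run of three low scores does. That is one way a stateful evaluator can avoid flagging a transient blip.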
That’s great, but...
Now the health of the fleet is encapsulated in a single data stream, so how do we make sense of that?
Real Time Dash (Macro View)
Macro View of the fleet
Real Time Dash (Cluster View)
Real Time Dash (Perspective)
Benefits
● Faster detection of issues
● Greater accuracy
● Massive reduction in false positives
● Separation of concerns (decouples detection from remediation)
Known problems
• Distributed persistent stores (Not stateless)
• Unresponsive nodes
• Cloud
• Configurations setup and tuning
• Hot nodes / token distribution
• Resiliency
Building C* in cloud with Priam
• Bootstrapping and automated token assignment
• Backup and recovery/restore
• Centralized configuration management
• REST API for most nodetool commands
• C* JMX metrics collection
• Monitor C* health
(1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers.
(2) Survive the loss of a data center by ensuring that we only lose one node from each replication set.
[Diagram: token ring of 12 nodes with AZs alternating A, B, C around the ring]
Priam runs on each node and will:
* Assign tokens to each node, alternating (1) the AZs around the ring (2).
* Perform nightly snapshot backup to S3
* Perform incremental SSTable backups to S3
* Bootstrap replacement nodes to use vacated tokens
* Collect JMX metrics for our monitoring systems
* Expose REST API calls for most nodetool functions
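The effect of alternating AZs around the ring can be illustrated with a small sketch. This is not Priam's actual token algorithm. The evenly spaced tokens, the `2**127` token range (a RandomPartitioner-style assumption), and the helper names `assign_tokens` and `replica_set` are all hypothetical. Replica placement is simplified to "the next three nodes on the ring".

```python
AZS = ["a", "b", "c"]
TOKEN_RANGE = 2**127  # assumption: RandomPartitioner-style token space

def assign_tokens(node_count, azs=AZS):
    """Return (token, az) pairs, evenly spaced, with AZs alternating around the ring."""
    if node_count % len(azs) != 0:
        raise ValueError("node_count must be a multiple of the number of AZs")
    return [(i * TOKEN_RANGE // node_count, azs[i % len(azs)])
            for i in range(node_count)]

def replica_set(ring, start, rf=3):
    """Simplified placement: the rf consecutive ring positions from `start`, wrapping around."""
    return [ring[(start + k) % len(ring)] for k in range(rf)]

ring = assign_tokens(12)  # 12 nodes, as in the ring diagram

# Every group of three consecutive nodes spans all three AZs, so losing one AZ
# removes exactly one node from each replica set.
all_sets_span_azs = all(
    {az for _, az in replica_set(ring, i)} == set(AZS)
    for i in range(len(ring))
)
```

Under these assumptions every replica set contains one node per AZ, which is the property callout (2) relies on.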
Putting it all together
Constructing a cluster in AWS
• AMI contains OS, base Netflix packages, Cassandra, and Priam
[Diagram: instance running Cassandra and Priam (Tomcat), with S3]
AWS Terminology
Constructing a cluster in AWS

Address          DC       Rack  Status  State   Load       Owns    Token
…
###.##.##.###    eu-west  1a    Up      Normal  108.97 GB  16.67%  …
###.##.#.##      us-east  1e    Up      Normal  103.72 GB  0.00%   …
##.###.###.###   eu-west  1b    Up      Normal  104.82 GB  16.67%  …
##.##.##.###     us-east  1c    Up      Normal  111.87 GB  0.00%   …
###.##.##.###    us-east  1e    Up      Normal  102.71 GB  0.00%   …
##.###.###.###   eu-west  1b    Up      Normal  101.87 GB  16.67%  …
##.##.###.##     us-east  1c    Up      Normal  102.83 GB  0.00%   …
###.##.###.##    eu-west  1c    Up      Normal  96.66 GB   16.67%  …
##.##.##.###     us-east  1d    Up      Normal  99.68 GB   0.00%   …
##.###.##.###    eu-west  1c    Up      Normal  95.51 GB   16.67%  …
##.##.##.##      us-east  1d    Up      Normal  105.85 GB  0.00%   …
##.###.##.###    eu-west  1a    Up      Normal  91.25 GB   16.67%  …

Terms annotated on the output:
• Instance (Address column)
• Region (DC column)
• Availability Zone (AZ) (Rack column)
• Autoscaling Groups: ASGs do not map directly to nodetool ring output, but are used to define the cluster (# of instances, AZs, etc).
• Amazon Machine Image: image loaded onto an AWS instance; all packages needed to run an application.
• Security Group: defines access control between ASGs
Resiliency
• Instance
• AZ
• Multiple AZ
• Region
Resiliency - Instance
• RF=AZ=3
• Cassandra bootstrapping works really well
• Replace nodes immediately
• Repair at regular intervals
Resiliency - One AZ
• RF=AZ=3
• Alternating AZs ensures that each AZ has a full replica of data
• Provision cluster to run at 2/3 capacity
• Ride out a zone outage; do not move to another zone
• Bootstrap one node at a time
• Repair after recovery
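One reading of the 2/3 capacity figure above: with three AZs, a zone outage leaves two thirds of the nodes to carry the whole load. The sketch below works through that arithmetic. It assumes load redistributes evenly across the surviving nodes, and the function name is hypothetical.

```python
from fractions import Fraction

def utilization_after_az_loss(utilization, az_count=3, azs_lost=1):
    """Per-node utilization on surviving nodes, assuming load spreads evenly across them."""
    surviving = az_count - azs_lost
    if surviving <= 0:
        raise ValueError("no surviving AZs")
    return utilization * Fraction(az_count, surviving)

# Running at 2/3 before the outage puts the survivors at exactly full capacity.
at_two_thirds = utilization_after_az_loss(Fraction(2, 3))
# Running hotter than 2/3 overloads the survivors.
at_three_quarters = utilization_after_az_loss(Fraction(3, 4))
```

Under this assumption, any steady-state utilization above 2/3 means the remaining two zones cannot absorb the traffic of the lost one.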
Resiliency - Multiple AZ
• Outage; can no longer satisfy quorum
• Restore from backup and repair
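The quorum arithmetic behind this case is short. A Cassandra quorum is `floor(RF / 2) + 1` replicas, so with RF=3 and one replica per AZ, two replicas must respond. The sketch below is illustrative only. The helper names are hypothetical, and it assumes each lost AZ takes exactly one replica of every replica set with it.

```python
def quorum(rf):
    """Number of replicas that must respond for a QUORUM operation."""
    return rf // 2 + 1

def quorum_available(rf, replicas_lost):
    """True if enough replicas survive to satisfy quorum."""
    return rf - replicas_lost >= quorum(rf)

one_az_down = quorum_available(rf=3, replicas_lost=1)   # 2 of 3 replicas remain
two_azs_down = quorum_available(rf=3, replicas_lost=2)  # 1 of 3 replicas remains
```

With one AZ down, two replicas remain and quorum reads and writes still succeed. With two AZs down, only one replica remains, which is below the quorum of two.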
Resiliency - Region
• Connectivity loss between regions – operate as island clusters until service restored
• Repair data between regions
NDBench - Netflix Data Benchmark
Stitching it together
C* as a Service - Architecture
[Architecture diagram. Components shown: Jenkins, Winston, Eunomia, Alert, Atlas, Mantis, C* clusters, Priam, Bolt, Cluster Metadata / Advisor, Maintenance, Remediation, Capacity Prediction, Outlier detection, C* Forklifter, NDBench, C* Explorer, Client Drivers, Log analysis. CDE is paged ("Alert if needed").]