©2013 DataStax Confidential. Do not distribute without consent.
Jon Haddad, Technical Evangelist
@rustyrazorblade
Diagnosing Problems in Production
First Step: Preparation
DataStax OpsCenter
• Will help with 90% of problems you encounter
• Should be the first place you look when there's an issue
• Community version is free
• Enterprise version has additional features
Server Monitoring & Alerts
• Monit
• monitor processes
• monitor disk usage
• send alerts
• Munin / collectd
• system perf statistics
• Nagios / Icinga
• Various 3rd party services
• Use whatever works for you
Application Metrics
• Statsd / Graphite
• Grafana
• Gather constant metrics from your application
• Measure anything & everything
• Microtimers, counters
• Graph events
• user signup
• error rates
• Cassandra Metrics Integration
• jmxtrans
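As a rough illustration of the counters and microtimers mentioned above, here is a minimal sketch of a StatsD client using only the standard library. It relies on the plain-text StatsD datagram format (`name:value|c` for counters, `name:value|ms` for timers). The class name, metric names, host, and port are assumptions made for this example; in practice an existing StatsD client library is probably the better choice.

```python
import socket
import time
from contextlib import contextmanager


def format_metric(name, value, metric_type):
    """Build one StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


class StatsdClient:
    """Fire-and-forget UDP sender (hypothetical helper, not a real library)."""

    def __init__(self, host="127.0.0.1", port=8125):
        # Host and port are assumptions; point them at your StatsD daemon.
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, count=1):
        """Counter, e.g. one user signup or one error."""
        self._sock.sendto(format_metric(name, count, "c"), self._addr)

    @contextmanager
    def timer(self, name):
        """Microtimer: report how long the wrapped block took, in ms."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = int((time.perf_counter() - start) * 1000)
            self._sock.sendto(format_metric(name, elapsed_ms, "ms"), self._addr)
```

A caller would create one `StatsdClient()`, call `incr("app.user.signup")` on each signup, and wrap a query in `with client.timer("app.query.fetch_user"):` to graph its latency.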
Log Aggregation
• Hosted - Splunk, Loggly
• OSS - Logstash + Kibana, Graylog
• Many more…
• For best results all logs should be aggregated here
• Oh yeah, and log your errors.
Gotchas
Incorrect Server Times
• Everything is written with a timestamp
• Last write wins
• Usually supplied by coordinator
• Can also be supplied by client
• What if your timestamps are wrong because your clocks are off?
• Always install ntpd!
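[Diagram: two servers with skewed clocks (server time 10 and server time 20). An INSERT at real time 12 is stamped 20; a DELETE at real time 15 is stamped 10. Result: insert:20 beats delete:10.]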
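The clock-skew scenario on this slide can be reproduced with a toy last-write-wins model. This is a sketch of the idea only, not Cassandra's reconciliation code: the `Cell` type and `apply_write` function are invented for illustration, and tie-breaking on equal timestamps is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: Optional[str]  # None stands in for a delete marker
    timestamp: int


def apply_write(current, incoming):
    """Last write wins: keep whichever cell carries the higher timestamp."""
    if current is None or incoming.timestamp > current.timestamp:
        return incoming
    return current


# INSERT arrives at real time 12, stamped by a coordinator whose clock reads 20.
cell = apply_write(None, Cell("hello", timestamp=20))
# DELETE arrives at real time 15, stamped by a coordinator whose clock reads 10.
cell = apply_write(cell, Cell(None, timestamp=10))

# The delete happened later in real time, but it loses: the data is still there.
print(cell)
```

With synchronized clocks the delete would have carried the higher timestamp and won, which is why the slide insists on ntpd.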
Tombstones
• Tombstones are a marker that data no longer exists
• Tombstones have a timestamp just like normal data
• They say "at time X, this no longer exists"
Tombstone Hell
• Queries on partitions with a lot of tombstones require a lot of filtering
• This can be reaaaaaaally slow
• Consider:
• 100,000 rows in a partition
• 99,999 are tombstones
• How long to get a single row?
• Cassandra is not a queue!
[Diagram: read 99,999 tombstones, finally get the right data]
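To make the cost concrete, here is a small simulation of the partition described on this slide. It is a deliberately simplified model (a Python list scanned in order), not how Cassandra stores or reads data; `read_first_live` and the sample payload are made up for the example.

```python
TOMBSTONE = None  # stand-in for a deletion marker


def read_first_live(partition):
    """Scan rows in order; return (first live row, tombstones skipped)."""
    skipped = 0
    for key, value in partition:
        if value is TOMBSTONE:
            skipped += 1  # every tombstone still has to be read and filtered
            continue
        return (key, value), skipped
    return None, skipped


# A queue-like partition: 99,999 consumed (deleted) rows, then one live row.
partition = [(i, TOMBSTONE) for i in range(99_999)] + [(99_999, "job-payload")]
row, skipped = read_first_live(partition)
print(row, skipped)
```

The single live row is only reached after all 99,999 tombstones have been filtered, which is the access pattern a delete-heavy queue produces.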
Not using a Snitch
• Snitch lets us distribute data in a fault tolerant way
• Changing this with a large cluster is time consuming
• Dynamic Snitching
• use the fastest replica for reads
• RackInferringSnitch (uses IP to pick replicas)
• DC aware
• PropertyFileSnitch (cassandra-topology.properties)
• Ec2Snitch & Ec2MultiRegionSnitch
• GoogleCloudSnitch
• GossipingPropertyFileSnitch (recommended)
Version Mismatch
• SSTable format changed between versions, making streaming incompatible
• Version mismatch can break bootstrap, repair, and decommission
• Introducing new nodes? Stick w/ the same version
• Upgrade nodes in place
• One at a time
• One rack / AZ at a time (requires proper snitch)
Disk Space not Reclaimed
• When you add new nodes, data is streamed from existing nodes
• … but it's not deleted from the existing nodes afterwards
• You need to run nodetool cleanup on the existing nodes
• Otherwise you'll run out of space just by adding nodes
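The effect can be sketched with a toy token ring. This is an illustrative model only: it assumes one token per node and a replication factor of 1, and `owner` and `cleanup` are hypothetical helpers, with `cleanup` standing in for what `nodetool cleanup` achieves.

```python
from bisect import bisect_left


def owner(ring, token):
    """Owner is the first node whose token is >= the key token, wrapping around."""
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token) % len(ring)][1]


def cleanup(ring, node, on_disk):
    """Toy stand-in for cleanup: drop keys the node no longer owns."""
    on_disk[node] = {t for t in on_disk[node] if owner(ring, t) == node}


# Two-node ring; each key token 0..99 is stored on its owner (RF = 1 assumed).
ring = [(50, "A"), (100, "B")]
on_disk = {"A": set(), "B": set()}
for token in range(100):
    on_disk[owner(ring, token)].add(token)

# Add node C at token 25: it streams the range it now owns from A.
ring = sorted(ring + [(25, "C")])
on_disk["C"] = {t for t in on_disk["A"] if owner(ring, t) == "C"}

before = len(on_disk["A"])  # A still holds everything it streamed away
cleanup(ring, "A", on_disk)
after = len(on_disk["A"])   # only the keys A still owns remain
print(before, after)
```

In this model node A keeps all 51 of its original keys after C joins and drops to 25 only once cleanup runs.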
Using Shared Storage
• Single point of failure
• High latency
• Expensive
• Performance is about latency
• Can increase throughput with more disks
• In general avoid EBS, SAN, NAS
Compaction
• Compaction merges SSTables
• Too much compaction?
• OpsCenter provides insight into compaction cluster wide
• nodetool
• compactionhistory
• getcompactionthroughput
• Leveled vs Size Tiered vs Date Tiered
• Leveled on SSD + Read Heavy
• Size tiered on Spinning rust
• Size tiered is great for write heavy time series workloads
• Date tiered is new and is showing HUGE promise
Diagnostic Tools
htop
• Process overview - nicer than top
iostat
• Disk stats
• Queue size, wait times
• Ignore %util
vmstat
• virtual memory statistics
• Am I swapping?
• Reports at an interval, to an optional count
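One way to answer "Am I swapping?" automatically is to look at the `si` (swap in) and `so` (swap out) columns of vmstat output. The sketch below parses captured text rather than running vmstat itself. The sample numbers are invented for illustration, and the column layout is assumed to match the common Linux procps format.

```python
def swapping_samples(vmstat_output):
    """Return (si, so) pairs for every sample where swap activity is non-zero."""
    rows = [line.split() for line in vmstat_output.strip().splitlines()]
    # Locate the column-name header rather than hard-coding positions.
    header = next(cols for cols in rows if "si" in cols and "so" in cols)
    si, so = header.index("si"), header.index("so")
    hits = []
    for cols in rows:
        # Skip banner lines and repeated headers; keep numeric sample rows.
        if len(cols) != len(header) or not cols[si].isdigit():
            continue
        swap_in, swap_out = int(cols[si]), int(cols[so])
        if swap_in or swap_out:
            hits.append((swap_in, swap_out))
    return hits


# Illustrative capture of `vmstat 1 2` (numbers are made up).
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     5    12  110  220  3  1 96  0  0
 2  1  20480 102400   8192 380000  120  340   900  1500  800 1600 20 10 40 30  0
"""

print(swapping_samples(SAMPLE))
```

Any non-empty result means the box was swapping during at least one interval, which for a Cassandra node is usually worth alerting on.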
dstat
• Flexible look at network, CPU, memory, disk
strace
• What is my process doing?
• See all system calls
• Filterable with -e
• Can attach to running processes
jstack
tcpdump
• Watch network traffic
nodetool tpstats
• What's blocked?
• MemtableFlushWriter? - Slow disks!
• also leads to GC issues
• Dropped mutations?
• need repair!
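The checks on this slide can be scripted against captured `nodetool tpstats` output. The sketch below assumes the six-column pool table followed by a two-column dropped-message table; the exact layout varies between Cassandra versions, and the sample figures are invented, so treat the parser as a starting point rather than a reference.

```python
def parse_tpstats(text):
    """Return (pools, dropped) parsed from tpstats-style text."""
    pools, dropped = {}, {}
    for line in text.splitlines():
        cols = line.split()
        if len(cols) == 6 and all(c.isdigit() for c in cols[1:]):
            name, active, pending, _completed, blocked, all_time_blocked = cols
            pools[name] = {
                "active": int(active),
                "pending": int(pending),
                "blocked": int(blocked),
                "all_time_blocked": int(all_time_blocked),
            }
        elif len(cols) == 2 and cols[1].isdigit():
            dropped[cols[0]] = int(cols[1])
    return pools, dropped


def find_problems(text):
    """List pools that have blocked tasks and message types that were dropped."""
    pools, dropped = parse_tpstats(text)
    problems = [f"{name}: blocked tasks" for name, s in pools.items()
                if s["blocked"] or s["all_time_blocked"]]
    problems += [f"{msg}: {n} dropped" for msg, n in dropped.items() if n]
    return problems


# Illustrative output (numbers are made up).
SAMPLE = """
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        1234567         0                 0
ReadStage                         0         0         234567         0                 0
MemtableFlushWriter               1         4            321         1                57

Message type           Dropped
READ                         0
MUTATION                   112
"""

print(find_problems(SAMPLE))
```

Here the blocked flush writer points at slow disks and the dropped mutations point at a needed repair, matching the two cases called out above.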
Histograms
• proxyhistograms
• High level read and write times
• Includes network latency
• cfhistograms <keyspace> <table>
• reports stats for single table on a single node
• Used to identify tables with performance problems
Query Tracing
JVM Garbage Collection
JVM GC Overview
• What is garbage collection?
• Manual vs automatic memory management
• Generational garbage collection (ParNew & CMS)
• New Generation
• Old Generation
New Generation
• New objects are created in the new gen (eden)
• Comprised of Eden & 2 survivor spaces (SurvivorRatio)
• Size is set by HEAP_NEWSIZE in cassandra-env.sh
• Historically limited to 800MB
Minor GC
• Occurs when Eden fills up
• Stop the world
• Dead objects are removed
• Copy current survivor to empty survivor
• Live objects are promoted into survivor (S0 & S1) then old gen
• Some survivor objects promoted to old gen (MaxTenuringThreshold)
• Spillover promoted to old gen
• Removing objects is fast, promoting objects is slow
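The steps above can be illustrated with a toy model of a single minor collection. This is a simplification for intuition only and does not reproduce HotSpot's actual algorithm: object liveness is given up front, survivor capacity is counted in objects rather than bytes, and the names `Obj` and `minor_gc` are invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    alive: bool
    age: int = 0  # number of minor GCs survived so far


def minor_gc(eden, survivor, survivor_capacity, max_tenuring_threshold):
    """Return (new_survivor, promoted) after one toy minor collection."""
    # Dead objects cost nothing: they are simply not copied.
    live = [o for o in survivor + eden if o.alive]
    new_survivor, promoted = [], []
    for obj in live:
        obj.age += 1
        if obj.age > max_tenuring_threshold:
            promoted.append(obj)        # old enough: tenured into old gen
        elif len(new_survivor) < survivor_capacity:
            new_survivor.append(obj)    # copied into the empty survivor space
        else:
            promoted.append(obj)        # spillover: promoted early for lack of room
    return new_survivor, promoted


# 10 eden objects (4 still live) plus 2 live objects that already survived once.
eden = [Obj(alive=True) for _ in range(4)] + [Obj(alive=False) for _ in range(6)]
survivor = [Obj(alive=True, age=1), Obj(alive=True, age=1)]

new_survivor, promoted = minor_gc(eden, survivor, survivor_capacity=3,
                                  max_tenuring_threshold=1)
print(len(new_survivor), len(promoted))
```

In this run two objects are promoted by age and one young object spills over because the survivor space is full, which is the early promotion that later fills old gen with short lived objects.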
Old Generation
• Objects are promoted to old gen from new gen
• Major GC
• Mostly concurrent
• 2 short stop the world pauses
Full GC
• Occurs when old gen fills up or objects can’t be promoted
• Stop the world
• Collects all generations
• Defragments old gen
• These are bad!
• Massive pauses
Workload 1: Write Heavy
• Objects promoted: Memtables
• New gen too big
• Remember: promoting objects is slow!
• Huge new gen = potentially a lot of promotion
[Diagram: new gen to old gen, too much promotion]
Workload 2: Read Heavy
• Short lived objects being promoted into old gen
• Lots of minor GCs
• Read heavy workloads on SSD
• Results in frequent full GC
[Diagram: early promotion from new gen into old gen, which fills up quickly with short lived objects]
GC Profiling
• OpsCenter GC stats
• Look for correlations between GC spikes and read/write latency
• Cassandra GC Logging
• Can be activated in cassandra-env.sh
• jstat
• prints gc activity
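When the graphs are ambiguous, the correlation between GC pauses and latency can be estimated numerically. The sketch below computes a Pearson coefficient over two equally sampled series; the sample values are invented for illustration, and how you export the series (from OpsCenter, GC logs, or jstat) is left open.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up per-minute samples: total GC pause (ms) and p99 read latency (ms).
gc_pause_ms = [20, 25, 400, 30, 22, 380]
read_p99_ms = [5, 6, 90, 7, 5, 85]

r = pearson(gc_pause_ms, read_p99_ms)
print(f"correlation: {r:.2f}")
```

A coefficient close to 1 suggests latency spikes line up with GC pauses. A low value points toward other suspects such as compaction or disk contention.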
How much does it matter?
Stuff is broken, fix it!
Narrow Down the Problem
• Is it even Cassandra? Check your metrics!
• Nodes flapping / failing
• Check OpsCenter
• Dig into system metrics
• Slow queries
• Find your bottleneck
• Check system stats
• JVM GC
• Compaction
• Histograms
• Tracing
Cassandra Day London 2015: Diagnosing Problems in Production
