Cassandra at Glogster
Roman Komkov – roman@glogster.com
System Engineer at Glogster
Prague Cassandra Meet up
03.09.2015
About me
  2 years at Glogster EDU as System Engineer
  5+ years of Linux administration
  5+ years of Python development
  Cluster, HA, Orchestration
  CI, CD…
  Twitter - @alkoengineering
  GitHub, Freenode - decayofmind
About Glogster EDU
  Started in 2009
  Platform for presentation and interactive learning, mainly used by educators and students
  19 million users
  Over 45 million glogs
  40000 new glogs daily
  Web service, mobile applications
  http://edu.glogster.com
Cassandra at Glogster
  From 2011 as primary DB for initial Glogster.com
  From 2012 as backend (storage) DB for Glogster EDU
  Started from 0.6… or 0.8, I guess…
  10 nodes
  RF=5, QUORUM
  SATA disks
  OrderPreservingPartitioner ¯\_(ツ)_/¯
Architecture
Cassandra now
  5-node cluster
  ~600 GB average node size
  RF=5, QUORUM
  SSD disks
  VNodes
  OrderPreservingPartitioner…
  pycassa + datastax-driver
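A quick sketch of what RF=5 with QUORUM means in practice. It uses the standard quorum formula, floor(RF / 2) + 1; the function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must answer a QUORUM request: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


rf = 5  # value from the slide
print(quorum(rf), tolerated_replica_failures(rf))
```

With RF=5 on a 5-node cluster every node holds every row, so QUORUM reads and writes need 3 nodes and keep working with 2 nodes down.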
0.8 problems
  Migration with downtime by transferring a copy of data
  HintedHandoff hell
  No repairs, no cleanups
  Enormous HeapSize (20GB)
  Different time on servers
SOLUTION!
  Upgrade to 1.0
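Why "Different time on servers" hurts: Cassandra resolves conflicting writes to the same cell by timestamp, so a machine with a fast clock can make an older write win. The sketch below is a simplified model of that rule, not Cassandra code; the `Write` class, the helper names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # write timestamp in microseconds


def resolve(a: Write, b: Write) -> Write:
    """Keep the write with the higher timestamp (last write wins)."""
    return a if a.timestamp_us >= b.timestamp_us else b


def stamp(real_time_us: int, clock_skew_us: int) -> int:
    """Timestamp assigned by a machine whose clock is off by clock_skew_us."""
    return real_time_us + clock_skew_us


# Machine A runs 5 seconds fast, machine B is accurate (hypothetical numbers).
first = Write("old title", stamp(real_time_us=1_000_000, clock_skew_us=5_000_000))
second = Write("new title", stamp(real_time_us=2_000_000, clock_skew_us=0))

winner = resolve(first, second)
print(winner.value)  # the earlier write wins because its timestamp is higher
```

Keeping clocks in sync on every machine that assigns timestamps avoids this kind of silent overwrite.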
1.1 problems
  Cassandra guy left Glogster
  Don’t touch it while it works
BUT…
  Load averages like 14.0-16.0
  2 disks failed
  Everything is slow
  Repairs? Never heard!
1.1 solutions
  Replace disks, rebuild nodes.
  Don’t run repair on a new node instead of using ReplaceToken
  Move old Glogster.com keyspace to another cluster
  Load gone
  https://glogster.github.io/posts/2015/03/23/cassandra-migration.html
  Nodes are fast again
  Regular repairs and cleanups? Never did!
  OpsCenter installed
  Cluster upgraded to 1.2
1.2 and migration
  Cluster migrated to the new servers without downtime
  http://www.planetcassandra.org/blog/cassandra-migration-to-ec2/
  Vnodes
…
  Old datacenter (still connected to production) was disconnected from the new datacenter
  Forgot about Hints TTL (max_hint_window_in_ms ~ 3 hours)
  Forgot to run repair on the cluster afterwards
  Old DC was decommissioned
  Application switched to the new one
  …
DATA GONE
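The failure mode above comes down to one comparison: once a replica is unreachable for longer than max_hint_window_in_ms, no more hints are stored for it, and only a repair brings it back in sync. A minimal sketch of that check, assuming the 3-hour window from the slide (10,800,000 ms); `needs_repair` is a hypothetical helper, not part of any Cassandra tool.

```python
from datetime import timedelta

# ~3 hours, the window mentioned on the slide (10,800,000 ms).
DEFAULT_HINT_WINDOW = timedelta(milliseconds=10_800_000)


def needs_repair(outage: timedelta,
                 hint_window: timedelta = DEFAULT_HINT_WINDOW) -> bool:
    """True when the outage outlasted the hint window, so hinted handoff
    alone cannot bring the disconnected replicas up to date."""
    return outage > hint_window


print(needs_repair(timedelta(hours=1)))   # short blip: hints can cover it
print(needs_repair(timedelta(hours=12)))  # long split: run repair first
```

Running a check like this before decommissioning a datacenter would have flagged the missing repair.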
Here the hell begins
  ~1200 glogs remained on the old, decommissioned datacenter
  Thank God we had RF=<N of nodes>
  Transfer data from one old node to a new server
  Run Cassandra on it, add the node to the cluster
  Run repair on the entire cluster
  Increase the chance of read repair with read_repair_chance
  Peacefully wait until done…
  Run your complicated repairs through OpsCenter, because it can resume after a failure.
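For the read_repair_chance step, a small sketch that builds the CQL statement. The keyspace and table names are hypothetical, and it assumes a Cassandra version that still has the read_repair_chance table option (it was removed in 4.0) and a table reachable through CQL.

```python
def read_repair_statement(keyspace: str, table: str, chance: float) -> str:
    """Build an ALTER TABLE statement that sets read_repair_chance.

    chance is the probability (0.0 to 1.0) that a read triggers a
    background repair across replicas.
    """
    if not 0.0 <= chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0.0 and 1.0")
    return f"ALTER TABLE {keyspace}.{table} WITH read_repair_chance = {chance};"


# "glogster" and "glogs" are placeholder names for illustration only.
print(read_repair_statement("glogster", "glogs", 1.0))
```

A value of 1.0 makes every read check all replicas, which costs read latency, so it makes sense to lower it again once the cluster is consistent.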
Full repair?
10 DAYS!!!
Conclusions and Improvements
  Increase max_hint_window_in_ms value to something like 3 days
  Make use of parallel things
  CQL3 and datastax-driver
  Upgrade to Cassandra 2.2
  faster repairs and other operations
  New OpsCenter
  Schedule regular backups and repairs
  We still love Cassandra!
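The setting is given in milliseconds, so "something like 3 days" needs a conversion before it goes into cassandra.yaml. A minimal sketch; `hint_window_ms` is a hypothetical helper name.

```python
from datetime import timedelta


def hint_window_ms(window: timedelta) -> int:
    """Convert a duration to whole milliseconds for max_hint_window_in_ms."""
    return window // timedelta(milliseconds=1)


# Prints the line as it would appear in cassandra.yaml.
print(f"max_hint_window_in_ms: {hint_window_ms(timedelta(days=3))}")
```

A longer window means nodes may store more hints during a long outage, so disk usage for hints is worth watching after the change.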
Questions?
