© 2015 GridPoint, Inc. Proprietary and Confidential 1
Managing (Schema) Migrations in Cassandra
Mitch Gitman
senior software engineer
GridPoint, Inc.
10/23/2015
migration
A word with many meanings.
disclaimer…
image © Ana Camamiel
What I mean by migrations
• Live-data migrations
One-off as opposed to ETL
What I mean by migrations
• Source-driven migrations
− Schema migrations
− Reference data migrations
− Test/sample data migrations
• CQL commands as opposed to real data (sstables), generally
source control
versioning
artifact versioning
publish
Database refactoring
Source-driven DB refactoring: the benefits
• Integration test & functional test automation (bootstrap-ability)
• CI server pipelines
• Containerization??
• Consistency & repeatability across environments
− Local developer box
− Dev environments
− Integration & QA environments
− Staging
− Production
We need tools!
• Built into web application frameworks
• Standalone
What do (perhaps) all these tools have in common?
They’re relational. They’re for SQL.
NoSQL Distilled
Chapter 12. Schema Migrations
"We have seen that developing and maintaining
an application in the brave new world of
schemaless databases requires careful
attention to be given to schema migration."
either/or:
• RDBMS = strong schema
• NoSQL = no schema
Does Cassandra like weak schemas?
• partition keys & clustering keys
• table-per-query denormalization
• shift from Thrift to CQL
• Thrift: super columns & super column families
• CQL: collection types
“metadata-driven documents in columnar storage:”
CREATE TABLE entities (
  doc_id int,
  attribute_name text,
  attribute_value text,
  ...
  PRIMARY KEY(doc_id, attribute_name)
);
So how have teams been managing their keyspace & table definitions?
The Cassandra migration tools landscape
• Flyway: First-class Cassandra support.
− Requires JDBC.
− https://github.com/flyway/flyway/issues/823
• Pillar: Scala tool.
• mutagen-cassandra: Java tool, Astyanax driver.
• Trireme: Python tool.
• cql-migrate: Python tool.
• mschematool: Python tool.
What’s the secret behind DB migration tools?
The migrations version tracking table
Migration tool philosophies
© Martha Stewart Living Omnimedia Inc. © Harpo Print, LLC
Flyway for Cassandra
• First-class Flyway
• Faked-out Flyway
migrations (in SQL)
CQL
The tradeoff
• Store the migrations tracking table in an RDBMS
Programmatically invoke Flyway
CassandraFlywayCallback implements FlywayCallback
Two-step process
source control
artifact repository
MigrationsBuilder
FlywayMigrator
The migrations source
The input to
MigrationsBuilder
Run MigrationsBuilder for CQL:
Run MigrationsBuilder for SQL:
The generated migrations
The output from
MigrationsBuilder
The generated SQL script
Faking out Flyway
Run FlywayMigrator for CQL:
java -classpath /…/flyway-migrator-cassandra.jar \
  com.gridpoint.tools.migrator.flyway.FlywayMigrator cassandra
Run FlywayMigrator for SQL:
java -classpath /…/flyway-migrator-postgresql.jar \
  com.gridpoint.tools.migrator.flyway.FlywayMigrator postgresql
flyway-migrator-postgresql.jar
flyway-migrator-cassandra.jar
The migrations version tracking table
The Cassandra incarnation
Best practices
• Variations on versions
− Version control: f94c7d7f8b130df360a4e9e4f586eafc618ddc50
− Artifact repository: 3.5.1
− Migration tool: 201505270800 or 10 or whatever you want
− Effective contract versions—multiple versions can coexist at runtime
• Consistent deployment across environments
• Failure handling
• Baselining
• Rollbacks?
• Check schema agreement
Schema agreement
https://datastax.github.io/java-driver/2.1.8/features/metadata/
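As a rough illustration of what a schema agreement check amounts to: after a DDL statement, each node reports a schema version, and the cluster is in agreement once all reachable nodes report the same one. The sketch below is self-contained and does not call the DataStax driver. SchemaAgreement, inAgreement, and waitForAgreement are hypothetical names, and the supplier stands in for however the per-node versions are actually fetched.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical sketch of a schema agreement check; not the driver's API.
final class SchemaAgreement {

    // True when every reported schema version is identical.
    // An empty collection is treated as agreement.
    static boolean inAgreement(Collection<UUID> versions) {
        return new HashSet<>(versions).size() <= 1;
    }

    // Polls the supplier until the versions agree or the attempts run out.
    static boolean waitForAgreement(Supplier<? extends Collection<UUID>> versions,
                                    int maxAttempts,
                                    long sleepMillis) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (inAgreement(versions.get())) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        UUID oldSchema = UUID.randomUUID();
        UUID newSchema = UUID.randomUUID();
        // One node has not yet picked up the new schema version.
        System.out.println(inAgreement(Arrays.asList(newSchema, newSchema, oldSchema)));
        // All nodes report the same version.
        System.out.println(waitForAgreement(
            () -> Arrays.asList(newSchema, newSchema, newSchema), 3, 10L));
    }
}
```

A migration runner could call something like waitForAgreement between DDL statements and fail the run if agreement is not reached, rather than piling further schema changes onto a cluster that is still catching up.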
Cassandra… migrations… limitations
• Limitations of our Flyway-based solution
− You need a relational database
− Not open-sourced
• Limitations of source-driven migrations, in general
Static vs. dynamic tables
Deploy time vs. runtime
Dedicated migration application vs. part of main application
Source-driven, but…
• The orchestration is in source control
• Actual data rather than CQL commands
− Not necessarily live data
− Maybe doesn’t need to be in source control
Embracing polyglot persistence
A unified migrations solution
Takeaways
• challenging
• exciting
• routine
• boring
Thank you!
Mitch Gitman
mgitman@gridpoint.com
mgitman@nilistics.net
mgitman@gmail.com
skeletal presence @ LinkedIn



Editor's Notes

  • #2: We've had some sexy, exciting, cutting-edge topics today. This is not one of them. ... This is more the sort of routine, good-housekeeping, foundational work that can make the exciting stuff a little less exciting. I’m going to be talking about managing migrations in Cassandra and in particular schema migrations.
  • #3: Let me give a nod to my employer. From the web site: “GridPoint is a leader in comprehensive, data-driven energy management solutions (EMS) that leverage the power of real-time data collection, big data analytics and cloud computing to maximize energy savings, operational efficiency, capital utilization and sustainability benefits.” The company is based in Arlington, VA, with a development office in Seattle.
  • #6: Disclaimer… This is my perspective. Oh, the statue you see is from Bonn, Germany, according to the photographer.
  • #7: A live-data migration is the process that runs to take the data in one table and adapt it to another table, such that the data in the first table can eventually be retired. I’m not going to be focusing so much on live-data migrations.
  • #8: I’m going to be focusing instead on what I would call source-driven migrations. For schema migrations, think DDL. The migrations are stored in source control and subject to source control versioning. They may be published to an artifact repository, where artifact versioning and release versioning can be applied. I’ll be focusing in particular on schema migrations.
  • #9: These sorts of problems are covered in depth in this book from the Martin Fowler series that came out in 2006.
  • #10: I can’t speak to containerizing migrations. We haven’t explored that.
  • #11: A couple other established standalone tools are DBMaintain and DBDeploy, although those projects have not been active in recent years.
  • #13: 12.2. Schema Changes in RDBMS: Liquibase, MyBatis Migrator, DBDeploy, DBMaintain.
    12.3. Schema Changes in a NoSQL Data Store: the schema needs to change frequently in response to changing business requirements | can use similar techniques as with databases with strong schemas.
    With schemalessness at the database level, the burden of supporting the effective schema is shifted up to the application | the application still needs to be able to (un)marshal the data.
  • #14: With this slide, I hope you can see that I’m setting up a bit of a straw man. (A straw man with a strong man.) There was a Stack Overflow thread on schema migration tools for Cassandra (http://stackoverflow.com/questions/25286273/is-there-a-schema-versioning-tool-for-cassandra), and there was an erroneous answer I found amusing: "Cassandra is by its nature… 'schemaless.' It is a structured key-value store, so it is very different from a traditional rdbms in that regard."

    Think about it though. With Cassandra as much as with a relational database, you pay a bitter price for getting your schema wrong.

    You end up defining a good number of tables.

    I have the good fortune of not having worked much with Thrift. But I know that with Thrift, you'd be in the business of manipulating the contents of messages, which obscures the database's desire to have a schema applied to it.

    With Thrift, you had super columns and super column families. With CQL, you have collections. But the collections still have to be part of a table. The things that might smack of schemalessness still come back to a schema.

    Thought experiment. Go into cqlsh and execute: describe keyspace keyspace_name

    How big is that output getting? How much is it changing over time?

    At last month's Cassandra Summit, there was an interesting talk by a company called Reltio, and they described how they were using Cassandra to support "metadata-driven documents in columnar storage." So they produced a keyspace that had a generic table like this. And maybe that schema only had one or two tables. But even they acknowledged that this is an atypical use case for Cassandra.

    So how have teams been managing their keyspace and table definitions? My anecdotal experience is that whenever the question has come up, teams have usually rolled their own, especially because, on the face of it, or in the simple case, this seems like such a simple thing.
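The point about a generic table pushing the schema up into the application can be illustrated with a small sketch. Assuming rows shaped like the entities table on the slide (doc_id, attribute_name, attribute_value), the application has to reassemble each document itself. The Row class and the unmarshal method below are hypothetical stand-ins, not part of any driver API, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rebuilding documents from a generic
// (doc_id, attribute_name, attribute_value) table in application code.
final class EntityUnmarshaller {

    // Stand-in for one row of the generic table; not a driver class.
    static final class Row {
        final int docId;
        final String attributeName;
        final String attributeValue;

        Row(int docId, String attributeName, String attributeValue) {
            this.docId = docId;
            this.attributeName = attributeName;
            this.attributeValue = attributeValue;
        }
    }

    // Groups rows by doc_id into one attribute map per document,
    // preserving the order in which rows were read.
    static Map<Integer, Map<String, String>> unmarshal(List<Row> rows) {
        Map<Integer, Map<String, String>> documents = new LinkedHashMap<>();
        for (Row row : rows) {
            documents
                .computeIfAbsent(row.docId, id -> new LinkedHashMap<>())
                .put(row.attributeName, row.attributeValue);
        }
        return documents;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(1, "name", "thermostat"),
            new Row(1, "site", "Seattle"),
            new Row(2, "name", "meter"));
        // prints {1={name=thermostat, site=Seattle}, 2={name=meter}}
        System.out.println(unmarshal(rows));
    }
}
```

Every value arrives as a string, so any typing or required-field checks that a table definition would otherwise enforce have to live in code like this.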
  • #15: Next I want to get into the tools that are out there for Cassandra migrations, and the roadblocks teams have faced trying to manage Cassandra schema migrations via Liquibase and Flyway.

    Some history. The obvious way to integrate Liquibase or Flyway with Cassandra comes back to the prospect of the DataStax Java Driver supporting JDBC. There’s this statement from the 2013 announcement of the introduction of the driver (http://www.datastax.com/dev/blog/new-datastax-drivers-a-new-face-for-cassandra): "Today, DataStax announces version 1.0.0 of a new Java Driver, designed for CQL and based on years of experience within the Cassandra community. This Java driver is a first step; an object mapping and a JDBC extension will be available soon…." Let’s keep that JDBC extension in mind.

    There was a liquibase-cassandra project that seemed to hit a wall. So some people gravitated toward Flyway.

    Then there was a GitHub issue for the Flyway project, “Cassandra support.” https://github.com/flyway/flyway/issues/823

    In January someone mentions a cassandra-jdbc project that’s out there and which also seems to have hit a wall. "I … recently looked into adding support for Cassandra to Flyway, but using the existing cassandra-jdbc driver from https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/, just to see how far I could get. I found a few issues:" Proceeds to list the issues. "I disabled or stubbed out code to get past these, but gave up soon after."

    That same poster referenced a thread he started on the DataStax Java Driver user mailing list.

    So if we go to that thread, which is from last December (https://groups.google.com/a/lists.datastax.com/forum/#!msg/java-driver-user/kspAx0neZlI/8A59HmYc-rwJ): Subject: "Timeline for JDBC support?"

    "Is there any timeline for JDBC support in the DataStax Java Driver for Cassandra, please?"

    Alex Popescu, Sen. Product Manager @ DataStax responds: "While I cannot (yet) promise an ETA for JDBC support, what I can say is that it's on our todo list (and very close to the top)."

    I look forward to seeing how DataStax pulls off the Cassandra JDBC support, but to my mind, trying to do JDBC against Cassandra seems like, I dunno, a bit of an uphill climb.

    So let's set aside the prospect of first-class Cassandra support in Flyway and see what else is out there.

    Toward the end of the DataStax Java Driver mailing list thread, someone else chimes in and mentions Pillar, which is a dedicated Cassandra migrations tool written in Scala.

    And here’s roughly what I wrote in my own internal tool evaluation: “Before settling on (our) Flyway design for Cassandra schema migrations, I evaluated various open-source Cassandra migration tools. They’re listed below. Of them, the most promising tool was Pillar, which is implemented in Scala. The problem with Pillar vs. (Flyway) was the risk. I was afraid I’d invest time with Pillar and come up emptyhanded, that it wouldn’t deliver the sort of contract I expect from Flyway.” That’s what I wrote. I’m happy we went down the road we did (if I weren’t I wouldn’t be here talking about it), but I’d still maintain that Pillar is worth checking out.

    There's mutagen-cassandra, which is a Java tool written against the Astyanax driver but which hasn't been adapted to the DataStax Java Driver.

    Then there are these three Python-based tools: Trireme, cql-migrate, mschematool.
  • #16: Here’s a view of a migrations table that’s responsible for several schemas in PostgreSQL, where PostgreSQL’s concept of a schema is analogous to a keyspace in Cassandra.
  • #17: So let’s get back to the two prominent database migration tools in the relational world. I think of Liquibase as the Martha Stewart of migration tools. It’s somewhat of a control freak. It wants to do everything itself. On the other hand, I think of Flyway as the Oprah of migration tools. It provides a framework and then gives you the space to figure things out for yourself. You see, Liquibase wants to generate the SQL from XML constructs. In the typical usage, the SQL is NOT a first-class citizen. You can define Liquibase migrations as SQL, but even then (to the best of my knowledge) you have to define it inline in the XML. With Flyway, though, SQL is a first-class citizen. You can make migrations out of straight .sql files. It’s Flyway’s lightweight, unobtrusive, extensible approach that’s going to provide the leverage for using it with Cassandra.
  • #18: So instead of first-class Flyway, we’re going to do faked-out Flyway. The idea is, let Flyway do what it knows, which is migrations. Let Cassandra do what it knows, which is CQL. All we need is an adapter or translator to connect the two. And one key point. When I say that Flyway knows migrations, I’m saying that Flyway knows migrations in SQL.
  • #19: So here’s the tradeoff. Or “the weird trick,” to use the parlance of an Internet ad. Here’s what I wrote in my own internal design doc: “The reality is that first-class Flyway support for Cassandra doesn’t really gain us anything more than our fake-Flyway solution does, especially considering that we’re fine with persisting the Flyway migrations table to PostgreSQL; once you’re embracing polyglot persistence, you’d realize that a relational database is a better fit anyway for keeping track of the migrations.” 
  • #22: Failure handling: If a migration produces invalid CQL, the driver throws a RuntimeException. The act of throwing a RuntimeException is the signal I need to tell the JDBC Connection to roll back the transaction. This emulates the JDBC contract where RuntimeExceptions cause the transaction in the actual migrate call to roll back. We do this with the beforeEachMigrate hook so that we have a chance to fail the migration before our dummy, token migration has a chance to run. Flyway will have succeeded with all the migrations up to that point; it will fail only with this particular migration. That preserves the expected Flyway behavior.
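A minimal sketch of the failure-handling contract described in #22, not the actual FlywayMigrator code. It does not use Flyway or a Cassandra driver: `fake_execute_cql`, `InvalidCqlError`, and `migrate` are hypothetical stand-ins, and the `applied` list plays the role of the migrations tracking table. The point it illustrates is that the CQL runs in a step before the token migration is recorded, so a failure leaves earlier versions in place and stops at the failing one.

```python
class InvalidCqlError(RuntimeError):
    """Hypothetical stand-in for the runtime exception a driver raises on bad CQL."""


def fake_execute_cql(statement):
    # Hypothetical executor: a real one would send the statement to Cassandra.
    known_verbs = ("CREATE", "ALTER", "DROP", "INSERT")
    if not statement.strip().upper().startswith(known_verbs):
        raise InvalidCqlError(f"cannot execute: {statement!r}")


def migrate(migrations, execute_cql, applied):
    """Apply (version, cql) pairs in order, recording each success in `applied`.

    The CQL runs in a before-each-migrate step. If that step raises, the
    version is never recorded, while earlier versions stay recorded.
    """
    for version, cql in migrations:
        execute_cql(cql)         # before-each-migrate step; may raise
        applied.append(version)  # stands in for the token SQL migration committing


applied = []
migrations = [
    ("1", "CREATE TABLE readings (id uuid PRIMARY KEY)"),
    ("2", "CREAT TABLE oops (id uuid PRIMARY KEY)"),  # deliberately invalid
    ("3", "ALTER TABLE readings ADD note text"),
]
try:
    migrate(migrations, fake_execute_cql, applied)
except InvalidCqlError as error:
    failure = str(error)
```

After this runs, only version 1 is recorded; version 2 failed before it could be recorded, and version 3 was never attempted.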
  • #23: Our migrations follow a two-step process. At build time, we produce an artifact that gets published to an artifact repository. That’s the work of a proprietary class called MigrationsBuilder. At runtime, we have another custom class called FlywayMigrator that runs the published migrations against the target database. In the simple case with Flyway, there’s only a single step, the deploy-time step, even if that might be executed at build time, or to be precise, by a build tool like Maven or Gradle. It’s worth noting that we use the same two-step process, with the same classes, just the same way if the destination database is PostgreSQL.
  • #24: We have the .cql files organized into directories according to our releases.
  • #25: Here you can see that MigrationsBuilder is executed in a maven build. And you can see that the execution for CQL as opposed to SQL differs only by some arguments.
  • #26: Here we can see the output of MigrationsBuilder. MigrationsBuilder creates .sql files in a package structure that Flyway expects. But our .cql files just show up in the root of the classpath. The generated .sql files have the same simple names as the generated .cql files, and those names have been tweaked from the names in source control to comply with Flyway conventions.
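As a rough illustration of the renaming step in #26, here is a sketch that maps a source-control path to a Flyway-style file name of the form `V<version>__<description>`. The source layout it assumes (`<release>/<NN>-<description>.cql`) and the function names are hypothetical; the talk does not show the real MigrationsBuilder rules.

```python
import re
from pathlib import PurePosixPath


def flyway_name(source_path, suffix):
    """Map a hypothetical '<release>/<NN>-<description>.cql' path to a Flyway-style name."""
    path = PurePosixPath(source_path)
    release = path.parent.name.replace(".", "_")  # "1.2" -> "1_2"
    match = re.fullmatch(r"(\d+)-(.+)", path.stem)
    if match is None:
        raise ValueError(f"unexpected migration file name: {source_path}")
    sequence, description = match.groups()
    # Flyway separates version and description with a double underscore.
    description = re.sub(r"[^A-Za-z0-9]+", "_", description).strip("_")
    return f"V{release}_{int(sequence)}__{description}{suffix}"


def build_outputs(source_path):
    """Return the (cql_name, sql_name) pair generated for one source file."""
    return flyway_name(source_path, ".cql"), flyway_name(source_path, ".sql")


cql_name, sql_name = build_outputs("1.2/03-add-readings-table.cql")
```

The generated .cql and .sql names differ only in their extension, which is what lets the token .sql migration point back at its CQL script.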
  • #27: Contains the CQL script’s contents. This is the dummy, token script that the Flyway class executes with its migrate method.
  • #28: Now, at deploy time, when we go to execute FlywayMigrator against the destination database, you can see that the CQL and SQL invocations are quite similar.
  • #29: Here we see the dependencies for the standalone JAR that’s executed at deploy time. Both JARs depend on the flywayMigrator library. The Cassandra JAR has only one other dependency because it has to support only one keyspace. The PostgreSQL JAR has numerous other dependencies because it has to support multiple schemas along with some migrations and constructs that don’t fit nicely in a schema.
  • #30: Here you can see how the migrations version tracking table for Cassandra has been populated after a FlywayMigrator execution.
  • #31: Now I want to go beyond our own Cassandra migration solution and share some best practices that I’ve arrived at and that I’d recommend however you do your migrations. First, it’s worth keeping in mind the distinction between different kinds of versioning. Regarding effective contract versions, there’s a nice discussion in Chapter 12 of “NoSQL Distilled” of making two schema versions coexist in a running application. Consistent deployment across environments: You should be trying to execute your migrations the same way on a local dev box as you do in production. Or at least isolate the differences. Failure handling: This goes back to the rollback semantics I was describing in beforeEachMigrate. The Flyway contract is that every migration up to the migration that failed sticks, because every migration up to that failure succeeded. Baselining: If you haven’t been doing formalized database migrations from the get-go, you can use the current state of production as the starting point for your migrations by taking the “describe keyspace” CQL from cqlsh and making that your initial migration, but only for installations that you want to create from scratch. And if you’ve made a lot of changes to your tables but your migrations haven’t made it to production yet, you can scrap all the history and start from your latest definitions. You get to call a mulligan. Declaring migration bankruptcy. Rollbacks: Something that Liquibase supports. Part of why Liquibase tries to be such a control freak. Flyway, on the other hand, purposely does not support rollbacks. When I first looked into Flyway, that to me was a downside. But I eventually came around to the Flyway way of thinking. You keep progressing forward, even if you’re semantically going backwards. A little like an event sourcing paradigm.
  • #32: The DataStax Java Driver has a nice mechanism for checking that your schema changes have propagated across the entire cluster. This snippet is taken from the DataStax Java Driver documentation.
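The slide's snippet is not reproduced in these notes. As a hedged illustration of the underlying idea only, the sketch below treats schema agreement as "every reachable node reports the same schema version" and polls until that holds. The function names, host addresses, and version strings are made up for the example, and this is not the driver's API.

```python
import time


def schema_in_agreement(versions_by_host):
    """True when every host that reported a version reports the same one."""
    versions = {v for v in versions_by_host.values() if v is not None}
    return len(versions) <= 1


def wait_for_agreement(fetch_versions, attempts=5, delay_seconds=0.0):
    """Poll `fetch_versions` until the cluster agrees or attempts run out."""
    for _ in range(attempts):
        if schema_in_agreement(fetch_versions()):
            return True
        time.sleep(delay_seconds)
    return False


# Hypothetical snapshots of host -> schema version, as a change propagates.
snapshots = iter([
    {"10.0.0.1": "a1", "10.0.0.2": "a0", "10.0.0.3": "a0"},
    {"10.0.0.1": "a1", "10.0.0.2": "a1", "10.0.0.3": "a0"},
    {"10.0.0.1": "a1", "10.0.0.2": "a1", "10.0.0.3": "a1"},
])
agreed = wait_for_agreement(lambda: next(snapshots))
```

In a migration runner, a check like this would sit between one schema-changing statement and the next, so that a later statement does not run against nodes that have not yet seen the earlier change.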
  • #33: The graphic is showing how a source-driven migration can inevitably expand into incorporating a live-data migration as well. Maybe you’re changing a column or moving from one table to another, and in the process, you need to copy over the data. This isn’t so much a limitation. In a way, it’s a strength. Because we’re doing everything programmatically, there’s nothing stopping us from coupling a live-data migration with a source-driven migration. It’s just an extra amount of complexity to account for.
  • #34: Now here is an actual limitation. The two tables you see represent the same data, but with one having the data clustered in ascending order and the other with the data clustered in descending order. We need to have a time bucket to keep the partitions from growing indefinitely. In the ascending table, we’re able to incorporate the bucket into the partition key. But with the descending table, we want to be able to drop the tables entirely after a certain amount of time. So with those tables, we make the effective bucket part of the table name. The ascending table, where the bucket is part of the partition key—that we’re able to create statically in the migrations. But the descending table we have to create dynamically on the fly in the application. So it falls outside the realm of the migrations. I’m sure there’s a better solution out there; we’re living with this solution for now.
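To make the two bucketing strategies in #34 concrete, here is a small sketch. The table names, column names, and the monthly bucket size are all assumptions for illustration; the talk does not give the real definitions. The ascending table keeps the bucket as a partition-key value in a table created once by the migrations, while the descending table bakes the bucket into a table name that the application has to create on the fly.

```python
from datetime import datetime, timezone


def bucket_for(timestamp):
    """Hypothetical monthly time bucket, e.g. '2015_10'."""
    return timestamp.strftime("%Y_%m")


def ascending_insert(sensor_id, timestamp, value):
    # Static table created by the migrations; the bucket is just a key value.
    cql = (
        "INSERT INTO readings_asc (sensor_id, bucket, ts, value) "
        "VALUES (%s, %s, %s, %s)"
    )
    return cql, (sensor_id, bucket_for(timestamp), timestamp, value)


def descending_table_name(timestamp):
    # The bucket becomes part of the table name, so old tables can be dropped whole.
    return f"readings_desc_{bucket_for(timestamp)}"


def create_descending_table_cql(timestamp):
    # Issued by the application at runtime, outside the migrations.
    return (
        f"CREATE TABLE IF NOT EXISTS {descending_table_name(timestamp)} ("
        "sensor_id text, ts timestamp, value double, "
        "PRIMARY KEY (sensor_id, ts)) "
        "WITH CLUSTERING ORDER BY (ts DESC)"
    )


when = datetime(2015, 10, 23, tzinfo=timezone.utc)
statement, params = ascending_insert("sensor-1", when, 21.5)
table_name = descending_table_name(when)
```

Because the descending table's name depends on the current time, no fixed migration file can declare it ahead of time, which is why it ends up outside the migrations.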
  • #35: Some other considerations… Making it part of the main app is what I believe a lot of teams do.
  • #36: Another use case is where you want to migrate not CQL but actual sstables. At this point you might consider storing the data in a filesystem like S3 or even a separate Cassandra cluster.
  • #37: I mentioned Chapter 12 of “NoSQL Distilled,” “Schema Migrations.” Well, Chapter 13 is “Polyglot Persistence.” And the authors proceed to state the obvious, that different databases solve different problems. Relational databases excel at enforcing the existence of relationships. Not good at discovering relationships or pulling data from different tables into a single object. (Of course, these days some folks will say relational databases aren’t good enough at anything to justify their existence, but even then, that doesn’t necessarily mean that Cassandra is the best fit for everything either.) 13.5. Choosing the Right Technology "Initially, the pendulum had shifted from specialty databases to a single RDBMS database which allows all types of data models to be stored, although with some abstraction. The trend is now shifting back to using the data storage that supports the implementation of solutions natively." "Encapsulating data access into services reduces the impact of data storage choices on other parts of a system." Our Flyway-based solution has the promise to be a unified migrations solution for disparate persistence stores. What you see here is the view in PostgreSQL’s pgAdmin3 GUI of our dedicated flyway schema. There are two tables, one for the Cassandra migration versions, the other for the PostgreSQL migration versions. The name of that one is flyway_schema_version; it should really be called postgresql_schema_version. Not that I want to be encouraging persistence store proliferation, but you could see how we could create another table for another RDBMS vendor or for another entirely different type of persistence store.
  • #38: I hope by now you can appreciate that I’m not trying to sell you on our particular solution. I am trying to sell you on the value of source-driven schema migrations for Cassandra, and more broadly on the value of adding automation in building blocks at the right granularity. I’d initially figured this talk would be a better fit for the beginners’ track. It’s not one of the more challenging and exciting things you’ll be doing with Cassandra, but it’s doing the routine, boring things like this which I believe will eventually pay off for you and your work with Cassandra.