Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Practices

Pascal Desmarets
Founder & CEO

Pascal Desmarets
■ Married, father of 2 boys in business school
■ Passionate about data, technology, and doing things right
■ Avid sailboat racer, preferably offshore
Founder & CEO

Why is Data Modeling a key success factor?

Data Modeling is a Key Success Factor
Data models and schemas are perhaps the
most important part of developing software,
because they have such a profound effect:
■ not only on how the software is written,
■ but also on how we think about the
problem that we are solving.
Martin Kleppmann,
Designing Data-Intensive Applications

The ideal ScyllaDB application has the following characteristics
■ Writes exceed reads by a large margin
■ Data is rarely updated, and when updates are made, they are idempotent (the
result of a successfully performed operation is independent of the number of
times it is executed)
■ Read access is by a known primary key
■ Data can be partitioned via a key that allows the database to be spread evenly
across multiple nodes
■ There is no need for joins or aggregates
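To make the idempotency point concrete, here is a minimal sketch using a plain Python dict as a stand-in for a table (it is not a ScyllaDB client, and the names `upsert` and `increment` are hypothetical). It contrasts a full-row write, which can be retried safely, with a read-modify-write, which cannot.

```python
# In-memory stand-in for a table keyed by primary key (not a ScyllaDB client).
table = {}

def upsert(primary_key, row):
    """Write the full row for a key; replaying it leaves the same state."""
    table[primary_key] = dict(row)

def increment(primary_key, column, delta):
    """Read-modify-write; replaying it changes the result each time."""
    row = table.setdefault(primary_key, {})
    row[column] = row.get(column, 0) + delta

# A retried upsert is harmless: same key, same values, same end state.
upsert(("order", 42), {"status": "shipped"})
upsert(("order", 42), {"status": "shipped"})
state_after_upserts = dict(table[("order", 42)])

# A retried increment is not: the second delivery is counted twice.
increment(("order", 42), "retries", 1)
increment(("order", 42), "retries", 1)
```

If a client times out and resends a write, the first style converges to the same row, while the second drifts.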

Excellent ScyllaDB Use Cases
■ Transaction logging: purchases, test scores, movies watched and the latest location within a movie
■ Recommendation and personalization engines
■ Fraud detection
■ Tracking pretty much anything including order status, packages, etc
■ Storing time series data (as long as you do your own aggregates)
• Health tracker data
• Weather service history
• Internet of things status and event history
• Sensor data in general
■ Messaging systems: chats, collaboration, and instant messaging apps, etc

It may be misleading that…
■ ScyllaDB tables look like RDBMS tables
■ CQL looks like SQL

Differences between ScyllaDB and relational databases
■ Denormalization is expected
■ Writes are (almost) free
■ No DB-level joins
■ No referential integrity
■ Indexing useful in specific circumstances

Mindshift from application-agnostic to application-specific modeling
■ Relational: Data → Data Model → Application
■ NoSQL: Application Design → Access patterns & Queries → Data Model → Data

ScyllaDB Data Model Principles (1 of 3)
■ Keyspace: container for tables in a ScyllaDB (Cassandra-compatible) data model
■ Table: container for an ordered collection of rows
■ Rows: made of a primary key plus an ordered set of columns, themselves
made of name/value pairs.
■ No need to store a value for every column each time a new row is stored.

■ Primary key: a composite made of a partition key plus an optional set of
clustering columns.
• Partition key: responsible for data distribution across the nodes. It determines which node
will store a given row. It can be one or more columns.
• Clustering columns: responsible for sorting the rows within the partition. There can be zero or
more of them.
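The two roles of the primary key can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the three node names are hypothetical, and an MD5-based modulo is used as a stand-in hash, whereas a real cluster uses its own partitioner and token ring. The sensor/timestamp columns are likewise made up for the example.

```python
import hashlib
from bisect import insort
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical 3-node cluster

def node_for(partition_key: str) -> str:
    """Map a partition key to a node with a stand-in hash (no real token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# partition key -> rows kept sorted by the clustering column (ts)
partitions = defaultdict(list)

def insert(sensor_id: str, ts: int, value: float) -> None:
    """Store a row in its partition, keeping rows ordered by ts."""
    insort(partitions[sensor_id], (ts, value))

# Rows arrive out of order but are stored sorted within the partition.
insert("sensor-1", 30, 0.7)
insert("sensor-1", 10, 0.5)
insert("sensor-1", 20, 0.6)
insert("sensor-2", 5, 1.2)
```

All rows for `sensor-1` live in one partition, on the node chosen from the partition key alone, and a range read by `ts` is a sequential scan of already-sorted rows.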

■ Data type: defined to constrain the values stored in a column. Data types include character and
numeric types, collections, and user-defined types. A column also has other attributes:
timestamps and time-to-live.
■ Secondary index: an index on any column that is not part of the primary key. Secondary indexes
are not recommended on columns with high cardinality or very low cardinality, or on columns that
are frequently updated or deleted.
■ Joins: cannot be performed at the database level. If there is need for a join, either it must be
performed at the application level, or preferably, the data model should be adapted to create a
denormalized table that represents the join results.
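The two options for joins can be compared with a small in-memory sketch. Everything here is illustrative: the `owners` and `pets` dicts stand in for normalized tables, and `pets_by_owner` is a hypothetical denormalized table whose rows already contain the join result.

```python
from collections import defaultdict

# Normalized entities, as they might arrive from a relational schema.
owners = {"o1": {"name": "Alice"}}
pets = {
    "p1": {"owner_id": "o1", "name": "Rex"},
    "p2": {"owner_id": "o1", "name": "Tom"},
}

def pets_with_owner_app_join(owner_id):
    """Application-level join: scan all pets, then look up the owner."""
    owner = owners[owner_id]
    return sorted(
        (owner["name"], p["name"]) for p in pets.values() if p["owner_id"] == owner_id
    )

# Denormalized table: one partition per owner, owner name copied into each row.
pets_by_owner = defaultdict(list)

def add_pet_denormalized(owner_id, pet_name):
    """Write the join result at insert time, so reads need a single partition."""
    pets_by_owner[owner_id].append((owners[owner_id]["name"], pet_name))

for p in pets.values():
    add_pet_denormalized(p["owner_id"], p["name"])
```

Both paths return the same rows, but the denormalized read touches one partition, at the cost of duplicating the owner name on every write.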

Data modeling for ScyllaDB is a
balancing act
■ Two primary rules of data modeling in ScyllaDB:
• each partition should hold roughly the same amount of data
• read operations should access as few partitions as possible, ideally only one
■ These two principles often conflict, so you have to find a
balance between them based on domain understanding and business needs
■ Anticipate growth: a data model that makes sense at a particular
transaction volume may no longer make sense when that volume is multiplied 100x or 1000x

5 steps to a data model
■ Step 1: Build the application workflow
■ Step 2: Model the queries required by the application
■ Step 3: Create the tables
■ Step 4: Get the primary key right
■ Step 5: Use data types effectively
■ Example derived from
https://care-pet.docs.scylladb.com/master/design_and_data_model.html

Step 1: Build the application workflow

Step 2a: Model the queries required by the application

Step 2b: Identify attributes for each entity

Step 3: Create the tables
■ In ScyllaDB, tables can be grouped into two distinct categories:
• Tables with single-row partitions:
• tables for which the primary key is also the partition key
• used to store entities and are usually normalized
• should be named after the entity for clarity (e.g., pet or owner)
• Tables with multi-row partitions:
• tables with primary keys composed of partition and clustering keys
• used to store relationships and related entities (Remember: ScyllaDB doesn’t support joins,
so developers need to structure tables to support queries that relate to multiple data items)
• give tables meaningful names so that people examining the schema can understand the
purpose of different tables (e.g., sensor, measurement)
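As a rough illustration of the two table categories, the sketch below generates CQL `CREATE TABLE` statements from a small Python description. The `Table` helper is hypothetical, and the column names and types are assumptions chosen for the example; only the table names `owner` and `measurement` come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    """Hypothetical helper describing a table and rendering it as CQL."""
    name: str
    columns: dict                 # column name -> CQL type
    partition_key: list
    clustering: list = field(default_factory=list)

    def to_cql(self) -> str:
        cols = ", ".join(f"{c} {t}" for c, t in self.columns.items())
        pk = "(" + ", ".join(self.partition_key) + ")"
        key = ", ".join([pk] + self.clustering)
        return f"CREATE TABLE {self.name} ({cols}, PRIMARY KEY ({key}));"

# Single-row partitions: the primary key is just the partition key.
owner = Table(
    "owner",
    {"owner_id": "uuid", "name": "text", "address": "text"},
    partition_key=["owner_id"],
)

# Multi-row partitions: partition key plus a clustering column.
measurement = Table(
    "measurement",
    {"sensor_id": "uuid", "ts": "timestamp", "value": "float"},
    partition_key=["sensor_id"],
    clustering=["ts"],
)
```

Calling `owner.to_cql()` and `measurement.to_cql()` yields the two statements; the only structural difference is the clustering column `ts` after the partition key.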

Step 4: Get the primary key right
■ The primary key is made up of
• a partition key. For most applications, this should be a unique key (UUID or custom)
• followed by one or more optional clustering columns that control how rows are laid out in a
ScyllaDB partition
■ Getting the primary key right for each table is one of the most crucial aspects
of designing a good data model
■ Remember the two primary rules of data modeling in ScyllaDB:
• each partition should hold roughly the same amount of data
• read operations should access as few partitions as possible, ideally only one
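The first rule can be checked on sample data before committing to a key. The sketch below is a simplified illustration: the rows are synthetic, the `species` column is an invented low-cardinality attribute, and `largest_partition_share` is a hypothetical helper that reports what fraction of all rows would land in the biggest partition for a candidate partition key.

```python
import uuid
from collections import Counter

# Synthetic rows: 1000 pets, 90% dogs and 10% cats (made-up distribution).
rows = [
    {"pet_id": str(uuid.UUID(int=i)), "species": "dog" if i % 10 else "cat"}
    for i in range(1000)
]

def largest_partition_share(rows, key):
    """Fraction of all rows that fall into the largest partition for this key."""
    counts = Counter(r[key] for r in rows)
    return max(counts.values()) / len(rows)

skewed = largest_partition_share(rows, "species")   # low-cardinality key
even = largest_partition_share(rows, "pet_id")      # unique key
```

With `species` as the partition key, one partition holds 90% of the data; with `pet_id`, every partition holds a single row. The unique key balances the cluster, but a query such as "all dogs" would then have to touch many partitions, which is the trade-off between the two rules.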

Step 5: Use data types effectively
■ String: ascii, text, varchar, inet
■ Numeric: int, bigint, smallint, tinyint, varint,
counter, decimal, double, float
■ UUIDs: uuid, timeuuid
■ Miscellaneous: boolean, blob
■ Date/time: timestamp, date, time, duration
■ Geospatial
■ Collections: list, map, set, tuple, nested
■ User-Defined Types (UDT)

Collections
■ List: ordered collection of one or more elements
■ Set: unordered collection of one or more unique elements
■ Map: collection of arbitrary key-value pairs
■ Tuple: holds fixed-length sets of typed positional fields
■ Frozen: serialization of multiple components into a single value. Updates to
individual fields are not possible; the value is treated as a blob, which makes it possible to nest
collections
■ User-Defined Type: re-usable set of multiple fields of related information,
e.g. an address
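The difference between a frozen value and a regular collection can be mimicked with Python's own types. This is an analogy only, not database behavior: a frozen dataclass stands in for a frozen user-defined type, a plain `set` stands in for a non-frozen collection, and the `Address` fields are invented for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Address:
    """Stand-in for a frozen user-defined type (fields are hypothetical)."""
    street: str
    city: str

owner = {"name": "Alice", "address": Address("1 Main St", "Paris"), "tags": {"vip"}}

# Non-frozen collection: individual elements can be added in place.
owner["tags"].add("newsletter")

# Frozen value: a single field cannot be changed in place.
try:
    owner["address"].city = "Lyon"
    updated_in_place = True
except AttributeError:
    updated_in_place = False

# The whole value has to be rewritten instead.
owner["address"] = replace(owner["address"], city="Lyon")
```

The same pattern applies to the data model: changing one field of a frozen value means writing the complete value again.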

Best Practices
■ A single table per query
■ Use denormalization to avoid joins
■ Ensure that the choice of primary key guarantees uniqueness
■ Break up large partitions into buckets
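Bucketing can be sketched as follows. The example assumes a time-series table partitioned by sensor, with a day-sized bucket added to the partition key; the bucket granularity and the function names are assumptions for illustration and would depend on the actual write volume.

```python
from datetime import date, datetime, timedelta

def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Composite partition key: sensor plus a day bucket, so one sensor's
    history is split into one partition per day instead of growing unbounded."""
    return (sensor_id, ts.date().isoformat())

def partitions_for_range(sensor_id: str, first: date, last: date) -> list:
    """Partition keys a reader must query to cover an inclusive date range."""
    days = (last - first).days
    return [
        (sensor_id, (first + timedelta(days=i)).isoformat())
        for i in range(days + 1)
    ]

key = partition_key("s1", datetime(2022, 2, 9, 13, 30))
to_read = partitions_for_range("s1", date(2022, 2, 9), date(2022, 2, 11))
```

Each write lands in a bounded partition, while a read over three days now touches three partitions. Smaller buckets mean smaller partitions but more partitions per read, so the bucket size is where the balancing act shows up.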

Migrating relational database structures to ScyllaDB
RDBMS ScyllaDB

Benefits of data modeling
■ While traditional data modeling may be perceived to get in
the way of development and take too much time…
■ Next-gen data modeling tools such as Hackolade are
recognized to:
• facilitate Agile development
• reduce development time
• increase application quality
• implement consistent definitions of data
• improve data quality
• enable better data governance and compliance
• facilitate documentation and communication
To leverage the dynamic schema of ScyllaDB, data
modeling turns out to be even more important than
with relational databases

Thank you!
Stay in touch
Pascal Desmarets
@Hackolade
pascal.desmarets@hackolade.com
