Apache Cassandra & Data Modeling

Apache Cassandra
and data modeling
© 2017 DataStax, All Rights Reserved. Company Confidential

About Me
Massimiliano Tomassi
Software Engineer at DataStax
max.tomassi@datastax.com
@max_tomassi
2

CONTEXTUAL
Characteristics of cloud applications
ALWAYS-ON DISTRIBUTED SCALABLEREAL-TIME

DataStax provides data management
for cloud applications.

From validation to momentum.
400+
Employees
$190M
Funding
500+
Customers
Founded in April 2010
Santa Clara • San Francisco • Austin •
London • Paris • Berlin • Tokyo • Sydney
(Series E – Sept. 2014) 30% +
2016 World’s Best
100 Cloud Companies
Ranked #1 in multiple operational
database categories

Products: DataStax Enterprise (DSE)
6

Distributed architecture
9
Masterless architecture
(easier to scale)

10
CREATE TABLE myapp.measurements (
sensor_id uuid,
time timestamp,
value double
PRIMARY KEY ((sensor_id), time)
);
PARTITION
KEY
CLUSTERNING
KEY

11
INSERT INTO myapp.measurements
(sensor_id, time, value) VALUES
(100, '2017-01-30 11:44:42', 980.50);
Cassandra Query Language (CQL)

12
(100, '2017-01-30 11:44:42', 980.50);
Coordinator

13
(100, '2017-01-30 11:44:42', 980.50);
Coordinator

14
(100, '2017-01-30 11:44:42', 980.50);
Coordinator
Hash function
59 token

15
(100, '2017-01-30 11:44:42', 980.50);
Coordinator
Hash function
59 token

16
Coordinator

19
Replication
RF = 3
Each token range is replicated
into RF numbers of nodes.

Replication
22
RF = 3
Consistency?
CL = ONE
CL = QUORUM
CL = ALL

Geographically distributed
24
New York London

Separate different workloads
25
OLTP Analytics

Hybrid on premise/cloud deployment
26

Automatic replication across datacenters
27
Data Center 1
Data Center 2

Automatic replication across datacenters
28
Data Center 1
Data Center 2

https://www.meetup.com/Rome-Cassandra-Users/
32

The relational way
34
Authors Books
1
*
CREATE TABLE books (
book_id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(50),
release_date date,
author_id int,
PRIMARY KEY (book_id),
FOREIGN KEY (author_id) REFERENCES authors(id)
)

The relational way
SELECT *
FROM books
WHERE id = 9876;
35
QUERY1: Find a book by its id
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(50),
release_date date,
author_id int,
)

The relational way
36
SELECT id, name, release_date
FROM books
WHERE author_id = 12345
ORDER BY release_date DESC;
QUERY2: Find all books for an author sorted by release date descending
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(50),
release_date date,
author_id int,
)

The Cassandra way
37
• No foreign keys
• No JOINS
• No filtering allowed on non primary key columns (could use secondary indexes but…)
So what?
• Denormalize!
• Write your data the way you want to read it, even if it means duplication
• Data should be read from a SINGLE node and in a SEQUENTIAL way (choose your
partition key and clustering key wisely)

The Cassandra way
38
QUERY 1: Find a book by its id
book_id uuid,
release_date timestamp,
name text,
PRIMARY KEY (book_id)
);
SELECT *
FROM books
WHERE book_id = 9876;

The Cassandra way
39
QUERY 2: Find all books for an author sorted by release date descending
CREATE TABLE books_by_author (
author_id uuid,
release_date timestamp,
book_id uuid,
name text
PRIMARY KEY ((author_id), release_date, book_id)
)
WITH CLUSTERING ORDER BY (release_date DESC);
SELECT book_id, name, release_date
FROM books_by_author
WHERE author_id = 12345;

40
sensor_id uuid,
time timestamp,
value double
PRIMARY KEY ((sensor_id), time)
);
PARTITION
KEY
CLUSTERNING
KEY
How’s data stored?

41
sensor_id 100
timestamp value
2017-01-30 11:00:00 75.90
2017-01-30 11:30:00 112.50
2017-01-30 12:00:00 45
2017-01-30 12:30:00 92.30
2017-01-30 13:00:00 67.15
2017-01-30 13:30:00 32.20
SINGLE LARGE PARTITION

Large partitions causing issues before C* 3.6
• Slow reads
• Compaction issues
• Repair issues
42
C* 3.6 mitigated this issue
• But still be aware of that and always TEST your data model against your workload

43
sensor_id uuid,
time_bucket timestamp,
time timestamp,
value double
PRIMARY KEY ((sensor_id, time_bucket), time)
);
PARTITION
KEY
CLUSTERNING
KEY
Possible alternative

sensor_id 100
timestamp value
time_bucket 2017-01-30 12:00:00
44
sensor_id 100
timestamp value
2017-01-30 11:00:00 75.90
2017-01-30 11:30:00 112.50
2017-01-30 12:00:00 45
2017-01-30 12:30:00 92.30
time_bucket 2017-01-30 11:00:00
MULTIPLE PARTITIONS

Caveat: choose your time_bucket wisely!
• Different partitions will be stored into different nodes
• Use a time_bucket that can satisfies the way you want to read your data
• (e.g. don’t use an hourly time bucket if you want to extract data by day)
45

A real(ish) example!
KillrVideo
killrvideo.github.io
46

Find video by id
CREATE TABLE videos (
videoid uuid,
userid uuid,
name text,
description text,
location text,
location_type int,
preview_image_location text,
tags set<text>,
added_date timestamp,
PRIMARY KEY (videoid)
);
48

Get number of views by video id
First attempt
CREATE TABLE video_playback_stats (
videoid uuid,
views int,
);
SELECT views
FROM video_playback_stats where videoid = 12345;
UPDATE video_playback_stats SET views = 101
WHERE videoid = 12345;
50
100

Problem: concurrent access
51
User 1: Read(views: 100) Write(views: 101)
User 2: Read(views: 100) Write(views: 101)
We lost an update

Better solution
CREATE TABLE video_playback_stats (
videoid uuid,
views counter,
);
UPDATE video_playback_stats
SET views = views + 1 WHERE videoid = 12345;
52

Counter type
53
• Special column used to store a number that is changed in
increments
• All non-counter columns in the table must be defined as part of the
primary key
• NON IDEMPOTENT OPERATIONS! Might lead to a non 100%
accurate count

Get average ratings by video id
CREATE TABLE video_ratings (
videoid uuid,
rating_counter counter,
rating_total counter,
);
UPDATE video_ratings
SET rating_counter = rating_counter + 1,
rating_total = rating_total + 5
55
• Cassandra doesn’t provide aggregation operations
(SUM/MIN/MAX)

Find latest comments by video id
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY ((videoid), commentid)
)
WITH CLUSTERING ORDER BY (commentid DESC);
57
Where do we get this from?

Client side JOINs
58
videoid K
commentid C ↓
userid
comment
comments_by_video
userid K
(… details …)
users
SELECT *
FROM comments_by_video
For each comment in result:
SELECT *
FROM users
WHERE userid = <comment.userid>
(Chebotko diagrams)

Find latest videos order by added_date desc
CREATE TABLE latest_videos (
yyyymmdd text,
videoid uuid,
userid uuid,
name text,
PRIMARY KEY ((yyyymmdd), added_date, videoid)
)
WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
60
• How can be the data partitioned?

Mind the hotspots!
• All the videos added the same day will be stored on the same node.
This means that all the writes into that table will go to the same node
for 24 hours.
• Can be mitigated by splitting the row using an arbitrary group
number, making the partition key (yyyymmdd, group_number)
61

Find videos by tag
CREATE TABLE videos_by_tag (
tag text,
videoid uuid,
name text,
tagged_date timestamp,
userid uuid,
PRIMARY KEY ((tag), videoid)
);
63

Attention: unbounded duplication
• If a user adds 50 tags to a video, the same data is duplicated 50
times for a single video.
• The duplication factor is not under our control: potential risk for data
growing very quickly
• Consider limiting the number of tags the user can use on the
application side
64

First attempt
66
CREATE TABLE IF NOT EXISTS users (
userid uuid,
firstname text,
lastname text,
email text,
password, text
created_date timestamp,
PRIMARY KEY (userid)
);
We have to support lookup by email

Better solution
67
CREATE TABLE user_credentials (
email text,
password text,
userid uuid,
PRIMARY KEY (email)
);
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);

Mind the overwrites!
68
INSERT INTO user_credentials (email, password, userid)
VALUES (’max.tomassi@datastax.com’, ‘xxx’, 123456);
VALUES (’max.tomassi@datastax.com’, ‘yyy’, 98765);
email password userid
max.tomassi@datastax.com xxx 123456
email password userid
max.tomassi@datastax.com yyy 98765

Lightweight Transactions
69
VALUES (’max.tomassi@datastax.com’, ‘xxx’, 123456)
IF NOT EXISTS

Find videos by user order by added_date desc
CREATE TABLE user_videos (
userid uuid,
videoid uuid,
name text,
PRIMARY KEY ((userid), added_date, videoid))
WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
71

Data duplication
72
• Tables videos, user_videos, latest_videos,
videos_by_tag all store similar information about videos
• The same data is duplicated 4 times: disk space consumed more
quickly
Alternative
• Store the video information in the videos table only
• The other 3 tables only store the video_id, the data is joined on the
client side
• We save disk space but we lose performance

Data consistency
73
• Every time a new video is added we need to insert a new record into
4 tables. Those records must be kept in sync
• We can use batches for this, Cassandra will ensure that all the
statements will succeed
BEGIN BATCH
INSERT INTO videos (…) VALUES (…);
INSERT INTO user_videos (…) VALUES (…);
INSERT INTO latest_videos (…) VALUES (…);
INSERT INTO videos_by_tag (…) VALUES (…);
APPLY BATCH;

Find latest comments by user
CREATE TABLE comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY ((userid), commentid))
WITH CLUSTERING ORDER BY (commentid DESC);
75

Thank you!
For more information and training
www.datastax.com
academy.datastax.com
77

Apache Cassandra & Data Modeling

More Related Content

What's hot (20)

Similar to Apache Cassandra & Data Modeling (20)

Recently uploaded (20)

Apache Cassandra & Data Modeling