Apache Cassandra and data modeling
©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
About  Me
Massimiliano  Tomassi
Software  Engineer  at  DataStax
max.tomassi@datastax.com
@max_tomassi
Characteristics of cloud applications
CONTEXTUAL • ALWAYS-ON • DISTRIBUTED • SCALABLE • REAL-TIME
DataStax provides data management for cloud applications.
From validation to momentum.
• 400+ employees
• $190M funding (Series E – Sept. 2014)
• 500+ customers
• Founded in April 2010
• Santa Clara • San Francisco • Austin • London • Paris • Berlin • Tokyo • Sydney
• 30% +
• 2016 World's Best 100 Cloud Companies
• Ranked #1 in multiple operational database categories
Products:  DataStax Enterprise  (DSE)  
Why Apache Cassandra?
Distributed
Distributed architecture
Masterless architecture (easier to scale)
CREATE TABLE myapp.measurements (
    sensor_id uuid,
    time timestamp,
    value double,
    PRIMARY KEY ((sensor_id), time)
);
PARTITION KEY: sensor_id
CLUSTERING KEY: time
Distributed architecture
Cassandra Query Language (CQL)
INSERT INTO myapp.measurements
    (sensor_id, time, value) VALUES
    (100, '2017-01-30 11:44:42', 980.50);
• The client can send the statement to any node: that node acts as the Coordinator
• The Coordinator applies a hash function to the partition key (sensor_id) and obtains a token (59 in this example)
• The Coordinator forwards the write to the node that owns the token range containing 59
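A minimal Python sketch of the routing step shown on these slides: hash the partition key to a token, then find the node whose range contains it. The ring size, the node names, and the use of MD5 are assumptions made for illustration and are not Cassandra's actual partitioner.

```python
import bisect
import hashlib

RING_SIZE = 100  # toy token space 0..99 (assumption); real token ranges are far larger


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token. MD5 is used here purely for illustration."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner_of(token: int, ring: list[tuple[int, str]]) -> str:
    """Return the node that owns `token`.

    `ring` is a list of (end_token, node) pairs sorted by end_token. Each node
    owns the tokens after the previous node's end_token up to and including
    its own end_token, wrapping around at the end of the ring.
    """
    end_tokens = [end for end, _ in ring]
    index = bisect.bisect_left(end_tokens, token)
    return ring[index % len(ring)][1]


# Hypothetical four-node ring.
RING = [(24, "node-a"), (49, "node-b"), (74, "node-c"), (99, "node-d")]
```

With this ring, the token 59 from the slides falls in the range owned by node-c, so any coordinator would forward the write there.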
Replicated
Replication
RF = 1
RF = 3
Each token range is replicated to RF nodes.
Consistency?
How many replicas must acknowledge a read or a write:
CL = ONE (one replica)
CL = QUORUM (a majority of the replicas)
CL = ALL (every replica)
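The relationship between RF and CL can be sketched in a few lines of Python. The function names are hypothetical, and the replica placement assumes the simplest case (the owner plus the next nodes on the ring, with RF not larger than the number of nodes). The overlap rule is the usual one: a read is guaranteed to see the latest acknowledged write when the write and read acknowledgements together exceed RF.

```python
def replicas_for(owner_index: int, node_count: int, rf: int) -> list[int]:
    """Indexes of the nodes storing a token range: the owner plus the next rf - 1 nodes."""
    return [(owner_index + i) % node_count for i in range(rf)]


def required_acks(cl: str, rf: int) -> int:
    """Number of replicas that must acknowledge a request at the given consistency level."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1
    if cl == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {cl}")


def read_sees_latest_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    return required_acks(write_cl, rf) + required_acks(read_cl, rf) > rf
```

With RF = 3, QUORUM means 2 replicas, so QUORUM writes combined with QUORUM reads always overlap on at least one replica, while ONE combined with ONE does not.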
Multi  data  center
Geographically distributed: New York, London
Separate different workloads: OLTP, Analytics
Hybrid on-premise/cloud deployment
Automatic replication across datacenters: Data Center 1, Data Center 2
Use Cases
https://www.meetup.com/Rome-Cassandra-Users/
Data Modeling
The relational way
Authors 1 : * Books (one author, many books)
CREATE TABLE books (
    id INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(50),
    release_date DATE,
    author_id INT,
    PRIMARY KEY (id),
    FOREIGN KEY (author_id) REFERENCES authors(id)
);
QUERY 1: Find a book by its id
SELECT *
FROM books
WHERE id = 9876;
QUERY 2: Find all books for an author sorted by release date descending
SELECT id, name, release_date
FROM books
WHERE author_id = 12345
ORDER BY release_date DESC;
The  Cassandra  way
• No foreign keys
• No JOINs
• No filtering allowed on non-primary-key columns (could use secondary indexes but…)
So what?
• Denormalize!
• Write your data the way you want to read it, even if it means duplication
• Data should be read from a SINGLE node and in a SEQUENTIAL way (choose your partition key and clustering key wisely)
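To make "write your data the way you want to read it" concrete, here is a small in-memory Python sketch. The two dictionaries stand in for two tables (one keyed by book, one keyed by author), and every name in it is hypothetical. Each write goes to both structures, and rows are kept pre-sorted so the read is a single lookup with no sorting or joining.

```python
from collections import defaultdict
from datetime import date

# "Table" 1: one entry per book, keyed by book_id.
books = {}
# "Table" 2: one entry per author, holding that author's books already sorted.
books_by_author = defaultdict(list)


def add_book(book_id: str, author_id: str, name: str, release_date: date) -> None:
    """Write the same book twice, once per query we want to serve."""
    books[book_id] = {"name": name, "release_date": release_date}
    rows = books_by_author[author_id]
    rows.append((release_date, book_id, name))
    # Keep rows ordered by release_date descending, then book_id ascending.
    rows.sort(key=lambda row: (-row[0].toordinal(), row[1]))


def books_for_author(author_id: str) -> list:
    """Read path: one lookup, rows come back already in the order we need."""
    return list(books_by_author.get(author_id, []))
```

The cost is duplication on the write path; the benefit is that each query touches exactly one key and reads its rows sequentially.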
The  Cassandra  way
QUERY  1:  Find  a  book  by  its  id
CREATE TABLE books (
book_id uuid,
release_date timestamp,
name text,
PRIMARY KEY (book_id)
);
SELECT *
FROM books
WHERE book_id = 9876;
The  Cassandra  way
QUERY 2: Find all books for an author sorted by release date descending
CREATE TABLE books_by_author (
    author_id uuid,
    release_date timestamp,
    book_id uuid,
    name text,
    PRIMARY KEY ((author_id), release_date, book_id)
)
WITH CLUSTERING ORDER BY (release_date DESC);
SELECT book_id, name, release_date
FROM books_by_author
WHERE author_id = 12345;
How's data stored?
CREATE TABLE myapp.measurements (
    sensor_id uuid,
    time timestamp,
    value double,
    PRIMARY KEY ((sensor_id), time)
);
PARTITION KEY: sensor_id
CLUSTERING KEY: time
SINGLE LARGE PARTITION (sensor_id 100)
timestamp            value
2017-01-30 11:00:00  75.90
2017-01-30 11:30:00  112.50
2017-01-30 12:00:00  45
2017-01-30 12:30:00  92.30
2017-01-30 13:00:00  67.15
2017-01-30 13:30:00  32.20
Large partitions caused issues before C* 3.6
• Slow reads
• Compaction issues
• Repair issues
C* 3.6 mitigated these issues
• Still, keep partition size in mind and always TEST your data model against your workload
Possible alternative
CREATE TABLE myapp.measurements (
    sensor_id uuid,
    time_bucket timestamp,
    time timestamp,
    value double,
    PRIMARY KEY ((sensor_id, time_bucket), time)
);
PARTITION KEY: (sensor_id, time_bucket)
CLUSTERING KEY: time
MULTIPLE PARTITIONS
Partition (sensor_id 100, time_bucket 2017-01-30 11:00:00)
timestamp            value
2017-01-30 11:00:00  75.90
2017-01-30 11:30:00  112.50
Partition (sensor_id 100, time_bucket 2017-01-30 12:00:00)
timestamp            value
2017-01-30 12:00:00  45
2017-01-30 12:30:00  92.30
Caveat: choose your time_bucket wisely!
• Different partitions may be stored on different nodes
• Use a time_bucket that matches the way you want to read your data (e.g. don't use an hourly time bucket if you want to extract data by day)
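The effect of the bucket size on reads can be checked with a short Python sketch. The helper names are hypothetical. It truncates a timestamp to its bucket and lists every bucket (that is, every partition) a time-range read has to visit.

```python
from datetime import datetime, timedelta


def hourly_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def daily_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def buckets_between(start: datetime, end: datetime, bucket_fn, step: timedelta) -> list[datetime]:
    """All buckets touched by a read over [start, end]; one partition per bucket."""
    buckets = []
    current = bucket_fn(start)
    while current <= end:
        buckets.append(current)
        current += step
    return buckets
```

Reading one full day for one sensor touches 24 partitions with an hourly bucket and a single partition with a daily bucket, which is the trade-off the caveat describes.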
A  real(ish)  example!
KillrVideo
killrvideo.github.io
Find  video  by  id
CREATE TABLE videos (
videoid uuid,
userid uuid,
name text,
description text,
location text,
location_type int,
preview_image_location text,
tags set<text>,
added_date timestamp,
PRIMARY KEY (videoid)
);
Get  number  of  views  by  video  id
First  attempt
CREATE TABLE video_playback_stats (
videoid uuid,
views int,
PRIMARY KEY (videoid)
);
SELECT views
FROM video_playback_stats WHERE videoid = 12345;
-- returns 100
UPDATE video_playback_stats SET views = 101
WHERE videoid = 12345;
Get number of views by video id
Problem: concurrent access
Time  User 1              User 2
t1    Read(views: 100)
t2                        Read(views: 100)
t3                        Write(views: 101)
t4    Write(views: 101)
We lost an update
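The interleaving above can be replayed deterministically in Python. This is a single-process sketch with hypothetical function names, not a real concurrent client. The second function shows the alternative of sending "add 1" instead of a precomputed value, which is the idea behind the counter column that follows.

```python
def read_then_write() -> int:
    """Replay the slide: both users read 100, then both write 101."""
    views = {"video": 100}
    user1_seen = views["video"]       # t1: User 1 reads 100
    user2_seen = views["video"]       # t2: User 2 reads 100
    views["video"] = user2_seen + 1   # t3: User 2 writes 101
    views["video"] = user1_seen + 1   # t4: User 1 writes 101, hiding User 2's update
    return views["video"]


def increment_in_place() -> int:
    """Each user sends an increment rather than a value computed from a stale read."""
    views = {"video": 100}
    for _user in ("user 1", "user 2"):
        views["video"] += 1
    return views["video"]
```

Two views were recorded in both cases, but only the increment-based version ends at 102.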
Get  number  of  views  by  video  id
Better  solution
CREATE TABLE video_playback_stats (
videoid uuid,
views counter,
PRIMARY KEY (videoid)
);
UPDATE video_playback_stats
SET views = views + 1 WHERE videoid = 12345;
Counter type
• Special column used to store a number that is changed in increments
• All non-counter columns in the table must be defined as part of the primary key
• NON-IDEMPOTENT OPERATIONS! A retried increment (e.g. after a timeout) may be applied twice, so the count might not be 100% accurate
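Why a non-idempotent operation can over-count is easy to show with a toy model. The class below is a hypothetical in-memory stand-in, not a driver API: the increment is always applied, but its acknowledgement may be lost, and a client that blindly retries then counts the same view twice.

```python
class FlakyCounter:
    """In-memory stand-in for a counter whose acknowledgements can get lost."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self, ack_lost: bool) -> bool:
        """Apply the increment; return False when the client never sees the ack."""
        self.value += 1
        return not ack_lost


def increment_with_retry(counter: FlakyCounter, ack_lost_first_time: bool) -> None:
    """Client that retries once when it does not receive an acknowledgement."""
    acknowledged = counter.increment(ack_lost_first_time)
    if not acknowledged:
        counter.increment(False)  # blind retry: the first increment was already applied
```

When the first acknowledgement is lost, one logical view ends up stored as 2. Writing a fixed value (SET views = 101) could be retried safely, but an increment cannot.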
Get  average  ratings  by  video  id
CREATE TABLE video_ratings (
videoid uuid,
rating_counter counter,
rating_total counter,
PRIMARY KEY (videoid)
);
UPDATE video_ratings
SET rating_counter = rating_counter + 1,
rating_total = rating_total + 5
WHERE videoid = 12345;
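• Cassandra doesn't aggregate (SUM/MIN/MAX) efficiently across partitions: keep a running rating_counter and rating_total, and compute the average as rating_total / rating_counter on the client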
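The client-side half of this design is a single division. A minimal Python sketch, with a hypothetical function name, taking the two counter values read from video_ratings:

```python
from typing import Optional


def average_rating(rating_total: int, rating_counter: int) -> Optional[float]:
    """Average rating from the two counters, or None when nobody has rated yet."""
    if rating_counter == 0:
        return None
    return rating_total / rating_counter
```

For example, two ratings of 5 and 4 give rating_total = 9 and rating_counter = 2, hence an average of 4.5.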
Find  latest  comments  by  video  id
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY ((videoid), commentid)
)
WITH CLUSTERING ORDER BY (commentid DESC);
The comment row only stores the userid: where do we get the user details from?
Client side JOINs
comments_by_video: videoid (K), commentid (C ↓), userid, comment
users: userid (K), (… details …)
(Chebotko diagrams: K = partition key, C ↓ = clustering key in descending order)
SELECT *
FROM comments_by_video
WHERE videoid = 12345;
For each comment in result:
SELECT *
FROM users
WHERE userid = <comment.userid>
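The loop above can be sketched in Python with two dictionaries standing in for the comments_by_video and users tables. The sample data and the function name are hypothetical. One small refinement over the slide: each distinct user is looked up once, even when the same user wrote several comments.

```python
# Stand-ins for the two tables, keyed by their partition keys.
comments_by_video = {
    "video-1": [
        {"commentid": 2, "userid": "u2", "comment": "Nice"},
        {"commentid": 1, "userid": "u1", "comment": "First"},
    ]
}
users = {"u1": {"firstname": "Ada"}, "u2": {"firstname": "Bob"}}


def comments_with_authors(videoid: str) -> list:
    """Client-side join: fetch the comments, then fetch each commenter once."""
    comments = comments_by_video.get(videoid, [])
    wanted = {comment["userid"] for comment in comments}
    found = {userid: users.get(userid) for userid in wanted}
    # Preserve the comment order returned by the first query.
    return [{**comment, "user": found[comment["userid"]]} for comment in comments]
```

Against a real cluster each lookup is a separate query, so the number of round trips grows with the number of distinct commenters.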
Find  latest  videos  order  by  added_date desc
CREATE TABLE latest_videos (
yyyymmdd text,
added_date timestamp,
videoid uuid,
userid uuid,
name text,
preview_image_location text,
PRIMARY KEY ((yyyymmdd), added_date, videoid)
)
WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
• How can the data be partitioned?
Mind the hotspots!
• All the videos added on the same day are stored in the same partition, hence on the same nodes. This means that all the writes into that table go to the same nodes for 24 hours.
• Can be mitigated by splitting the partition using an arbitrary group number, making the partition key (yyyymmdd, group_number)
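A Python sketch of the (yyyymmdd, group_number) idea follows. The number of groups, the CRC32-based assignment, and the function names are all assumptions made for illustration. Writes for one day are spread over several partitions; the price is that a read must query every group and merge the results.

```python
import zlib

GROUPS = 4  # hypothetical number of groups per day


def group_for(videoid: str) -> int:
    """Derive a stable group number from the video id."""
    return zlib.crc32(videoid.encode("utf-8")) % GROUPS


def partition_key(yyyymmdd: str, videoid: str) -> tuple:
    """Composite partition key (yyyymmdd, group_number)."""
    return (yyyymmdd, group_for(videoid))


def read_latest(table: dict, yyyymmdd: str, limit: int) -> list:
    """Query every group of the day, then merge by added_date descending."""
    rows = []
    for group in range(GROUPS):
        rows.extend(table.get((yyyymmdd, group), []))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]
```

Here `table` maps a partition key to a list of (added_date, videoid) rows. More groups mean better write distribution and more queries per read.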
Find  videos  by  tag
CREATE TABLE videos_by_tag (
tag text,
videoid uuid,
added_date timestamp,
name text,
preview_image_location text,
tagged_date timestamp,
userid uuid,
PRIMARY KEY ((tag), videoid)
);
Attention: unbounded duplication
• If a user adds 50 tags to a video, the same data is duplicated 50 times for a single video.
• The duplication factor is not under our control: potential risk for data growing very quickly
• Consider limiting the number of tags the user can use on the application side
First attempt
CREATE TABLE IF NOT EXISTS users (
    userid uuid,
    firstname text,
    lastname text,
    email text,
    password text,
    created_date timestamp,
    PRIMARY KEY (userid)
);
We have to support lookup by email, but this table can only be queried by userid
Better solution
CREATE TABLE user_credentials (
email text,
password text,
userid uuid,
PRIMARY KEY (email)
);
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
Mind the overwrites!
INSERT INTO user_credentials (email, password, userid)
VALUES ('max.tomassi@datastax.com', 'xxx', 123456);
email                     password  userid
max.tomassi@datastax.com  xxx       123456
INSERT INTO user_credentials (email, password, userid)
VALUES ('max.tomassi@datastax.com', 'yyy', 98765);
email                     password  userid
max.tomassi@datastax.com  yyy       98765
• An INSERT on an existing primary key is an upsert: the second statement silently overwrites the first
Lightweight Transactions
INSERT INTO user_credentials (email, password, userid)
VALUES ('max.tomassi@datastax.com', 'xxx', 123456)
IF NOT EXISTS;
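The difference between a plain INSERT and INSERT … IF NOT EXISTS can be modelled with a dictionary keyed by email. This is a single-process Python sketch with hypothetical function names. It only mirrors the outcome; it says nothing about how Cassandra reaches agreement between replicas, which is what makes lightweight transactions more expensive than regular writes.

```python
def insert(table: dict, email: str, row: dict) -> None:
    """Plain INSERT: an upsert that silently replaces any existing row."""
    table[email] = row


def insert_if_not_exists(table: dict, email: str, row: dict) -> bool:
    """INSERT ... IF NOT EXISTS: write only when the key is free.

    Returns True when the row was written, False when an existing row was kept.
    """
    if email in table:
        return False
    table[email] = row
    return True
```

The application should check the returned flag and report "email already registered" when it is False, instead of assuming the write succeeded.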
Find  videos  by  user  order  by  added_date desc
CREATE TABLE user_videos (
userid uuid,
added_date timestamp,
videoid uuid,
name text,
preview_image_location text,
PRIMARY KEY ((userid), added_date, videoid))
WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
Data duplication
• Tables videos, user_videos, latest_videos and videos_by_tag all store similar information about videos
• The same data is duplicated 4 times: disk space is consumed more quickly
Alternative
• Store the video information in the videos table only
• The other 3 tables only store the videoid; the data is joined on the client side
• We save disk space but we lose performance
Data consistency
• Every time a new video is added we need to insert a new record into 4 tables. Those records must be kept in sync
• We can use a (logged) batch for this: Cassandra ensures that if any statement in the batch is applied, all of them eventually will be
BEGIN BATCH
INSERT INTO videos (…) VALUES (…);
INSERT INTO user_videos (…) VALUES (…);
INSERT INTO latest_videos (…) VALUES (…);
INSERT INTO videos_by_tag (…) VALUES (…);
APPLY BATCH;
Find  latest  comments  by  user
CREATE TABLE comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY ((userid), commentid))
WITH CLUSTERING ORDER BY (commentid DESC);
Q&A
Thank  you!
For  more  information  and  training
www.datastax.com
academy.datastax.com