Cassandra 3 new features 2016

@doanduyhai
New Cassandra 3 Features
DuyHai DOAN
Apache Cassandra Evangelist

Who Am I?
Duy Hai DOAN
Apache Cassandra Evangelist
•  talks, meetups, confs …
•  open-source projects (Achilles, Apache Zeppelin ...)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai

DataStax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 400+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features

Agenda
•  Materialized Views
•  User Defined Functions (UDF) and Aggregates (UDA)
•  JSON Syntax
•  New SASI full text search index

Materialized Views (MV)
•  Why?
•  Gotchas

Why Materialized Views?
•  Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country( country text, id int, …,
    PRIMARY KEY(country, id));

Materialized View In Action
Instead of a manually maintained table:
CREATE TABLE user_by_country (
    country text, id int,
    firstname text, lastname text,
    PRIMARY KEY(country, id));
declare a view on the base table:
CREATE MATERIALIZED VIEW user_by_country
    AS SELECT country, id, firstname, lastname
    FROM user
    WHERE country IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY(country, id);

Materialized View Performance
•  Write performance
   •  slower than normal write
   •  local lock + read-before-write cost (but paid only once for all views)
   •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
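The worst case above (one DELETE plus one INSERT per view) can be sketched with a toy in-memory model. This is not Cassandra's implementation: the `base` and `view` dictionaries and the `upsert_user` helper are hypothetical names, used only to show why the old row has to be read before the view can be updated.

```python
# Toy model of a base table and one materialized view keyed by (country, id).
# All names are hypothetical; real Cassandra does this server-side.

base = {}   # id -> {"country": ..., "firstname": ...}
view = {}   # (country, id) -> {"firstname": ...}

def upsert_user(user_id, country, firstname):
    """Write to the base table and return the mutations applied to the view."""
    mutations = []
    old = base.get(user_id)  # read-before-write
    if old is not None and old["country"] is not None and old["country"] != country:
        # The view's partition key changed, so the stale view row must be removed.
        view.pop((old["country"], user_id), None)
        mutations.append(("DELETE", old["country"], user_id))
    base[user_id] = {"country": country, "firstname": firstname}
    if country is not None:  # mirrors WHERE country IS NOT NULL
        view[(country, user_id)] = {"firstname": firstname}
        mutations.append(("INSERT", country, user_id))
    return mutations

first = upsert_user(1, "FR", "Jean")   # new row: a single view mutation
moved = upsert_user(1, "US", "Jean")   # country changed: DELETE + INSERT
```

With several views, the same read of the old row would serve all of them, which is the "paid only once" point above.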

•  Write performance vs manual denormalization
   •  MV better because no client-server network traffic for read-before-write
   •  MV better because less network traffic for multiple views (client-side BATCH)
   •  Makes developer life easier → priceless

•  Read performance vs secondary index
   •  MV better because single node read (secondary index can hit many nodes)
   •  MV better because single read path (secondary index = read index + read data)
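The read fan-out difference can be illustrated with a small simulation. It assumes a hypothetical 4-node cluster without replication, where each node indexes only its own rows; `owner`, `query_by_index` and `query_by_view` are illustrative names, and `zlib.crc32` merely stands in for the partitioner.

```python
import zlib

NODE_COUNT = 4  # hypothetical cluster size, no replication

def owner(partition_key):
    """Pick the node that owns a partition key (stand-in for the partitioner)."""
    return zlib.crc32(str(partition_key).encode()) % NODE_COUNT

users = {1: "FR", 2: "US", 3: "FR", 4: "DE", 5: "FR"}  # id -> country

# The base table is partitioned by id: each node holds a slice of the users
# and a local index that covers only that slice.
local_index = [dict() for _ in range(NODE_COUNT)]
for user_id, country in users.items():
    local_index[owner(user_id)].setdefault(country, []).append(user_id)

# The view is partitioned by country: one node holds all users of a country.
view = [dict() for _ in range(NODE_COUNT)]
for user_id, country in users.items():
    view[owner(country)].setdefault(country, []).append(user_id)

def query_by_index(country):
    """Secondary index: every node is asked, since any may hold matches."""
    contacted, ids = 0, []
    for node in local_index:
        contacted += 1
        ids.extend(node.get(country, []))
    return sorted(ids), contacted

def query_by_view(country):
    """Materialized view: the partition key identifies the single node to ask."""
    node = view[owner(country)]
    return sorted(node.get(country, [])), 1

index_result = query_by_index("FR")
view_result = query_by_view("FR")
```

Both queries return the same ids; only the number of nodes contacted differs.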

Materialized Views Consistency
•  Consistency level
   •  CL honoured for base table, ONE for MV + local batchlog
   •  Weaker consistency guarantees for MV than for base table

User Defined Functions (UDF)
•  Why?
•  UDAs
•  Gotchas

Rationale
•  Push computation server-side
   •  save network bandwidth (1000 nodes!)
   •  simplify client-side code
   •  provide standard & useful functions (sum, avg …)
   •  accelerate analytics use-case (pre-aggregation for Spark)

How to create a UDF?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
// source code here
$$;

(param1 type1, param2 type2, …)
•  Param name to refer to in the code
•  Type = Cassandra type

CALLED ON NULL INPUT
•  Always called
•  Null-check mandatory in code

RETURNS NULL ON NULL INPUT
•  If any input is null, function execution is skipped and null is returned
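The two null-handling modes can be mimicked with plain Python wrappers. This is only a sketch of the semantics described above, with `None` standing in for CQL null; the decorator and function names are hypothetical and nothing here is Cassandra code.

```python
def called_on_null_input(fn):
    """CALLED ON NULL INPUT: the body always runs and must handle None itself."""
    return fn

def returns_null_on_null_input(fn):
    """RETURNS NULL ON NULL INPUT: skip the body if any argument is None."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return fn(*args)
    return wrapper

@called_on_null_input
def max_of_checked(a, b):
    # The null check is mandatory here, otherwise the comparison would fail.
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)

@returns_null_on_null_input
def max_of(a, b):
    # No null check needed: this body never sees None.
    return max(a, b)
```

Calling `max_of(None, 3)` yields `None` without running the body, while `max_of_checked(None, 3)` runs and decides for itself what a null argument means.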

RETURNS returnType
•  Cassandra types
   •  primitives (boolean, int, …)
   •  collections (list, set, map)
   •  tuples
   •  UDT

LANGUAGE language
•  JVM supported languages
   •  Java, Scala
   •  JavaScript (slow)
   •  Groovy, Jython, JRuby
   •  Clojure (JSR 223 impl issue)

User Defined Aggregates (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide behavior similar to GROUP BY, SUM, AVG, etc.

How to create a UDA?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
•  (type1, type2, …): only types, no param names
•  STYPE: state type
•  INITCOND: initial state, of the state type

SFUNC accumulatorFunction
•  Accumulator function. Signature:
   accumulatorFunction(stateType, type1, type2, …)
   RETURNS stateType

[FINALFUNC finalFunction]
•  Optional final function. Signature:
   finalFunction(stateType)
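How the pieces of an aggregate fit together can be sketched as a fold: start from INITCOND, apply the accumulator function to each row, then apply the optional final function. The helper `run_aggregate` and the average example below are hypothetical illustrations of that flow, not Cassandra's execution code.

```python
from functools import reduce

def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Fold rows into a state with sfunc, then apply the optional finalfunc."""
    state = reduce(sfunc, rows, initcond)  # calls sfunc(state, row) per row
    return finalfunc(state) if finalfunc is not None else state

def avg_state(state, value):
    """Accumulator: the state is a (count, total) pair, like a tuple STYPE."""
    count, total = state
    if value is None:  # ignore nulls
        return state
    return (count + 1, total + value)

def avg_final(state):
    """Final function: turn the (count, total) state into the average."""
    count, total = state
    return None if count == 0 else total / count

average = run_aggregate([10, 20, None, 30], avg_state, (0, 0), avg_final)
raw_state = run_aggregate([1, 2], avg_state, (0, 0))  # no final function
```

Without a final function, the result of the aggregate is simply the last state, here a `(count, total)` pair.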

Gotchas
•  UDA in Cassandra is not distributed!
•  Do not execute UDA on a large number of rows (10^6, for example)
   •  single fat partition
   •  multiple partitions
   •  full table scan
•  → Increase client-side timeout
   •  default Java driver timeout = 12 secs

Cassandra UDA or Apache Spark?
Consistency Level | Single/Multiple Partition(s) | Recommended Approach
------------------+------------------------------+------------------------------------------------
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely

JSON Syntax
•  Why?
•  Example

Why JSON?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have the best of both worlds?
   •  use Cassandra schema
   •  convert rows to JSON format

JSON syntax for INSERT/UPDATE/DELETE
CREATE TABLE users (
    id text PRIMARY KEY,
    age int,
    state text );
INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';
INSERT INTO users(id, age, state) VALUES('me', fromJson('20'), 'CA');
UPDATE users SET age = fromJson('25') WHERE id = fromJson('"me"');
DELETE FROM users WHERE id = fromJson('"me"');

JSON syntax for SELECT
> SELECT JSON * FROM users WHERE id = 'me';
[json]
----------------------------------------
{"id": "me", "age": 25, "state": "CA"}
> SELECT JSON age,state FROM users WHERE id = 'me';
[json]
----------------------------------------
{"age": 25, "state": "CA"}
> SELECT age, toJson(state) FROM users WHERE id = 'me';
 age | system.tojson(state)
-----+----------------------
  25 | "CA"
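The idea of keeping a typed schema while exchanging rows as JSON can be sketched in a few lines. This is a toy model only: `SCHEMA`, `insert_json` and `select_json` are hypothetical names that mirror the `users` table above, and the standard `json` module stands in for the server-side conversion.

```python
import json

SCHEMA = {"id": str, "age": int, "state": str}  # mirrors the users table

def insert_json(table, payload):
    """Like INSERT ... JSON: parse the document and check it against the schema."""
    row = json.loads(payload)
    for column, value in row.items():
        if column not in SCHEMA:
            raise ValueError(f"unknown column: {column}")
        if value is not None and not isinstance(value, SCHEMA[column]):
            raise TypeError(f"bad type for column: {column}")
    # Columns missing from the document are stored as None (null).
    table[row["id"]] = {column: row.get(column) for column in SCHEMA}

def select_json(table, key, columns=None):
    """Like SELECT JSON: render the selected columns of one row as JSON."""
    row = table[key]
    names = columns or list(SCHEMA)
    return json.dumps({name: row[name] for name in names})

users = {}
insert_json(users, '{"id": "user123", "age": 42, "state": "TX"}')
full = select_json(users, "user123")
partial = select_json(users, "user123", ["age", "state"])
state_json = json.dumps(users["user123"]["state"])  # analogous to toJson(state)
```

The schema still rejects badly typed documents, which is the "best of both worlds" point made earlier: JSON on the wire, typed columns in storage.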

SASI index, the search is over!
•  Why?
•  How?
•  Who?
•  Demo!
•  When?

Why SASI?
•  Searching (and full text search) was always a pain point for Cassandra
   •  limited search predicates (=, <=, <, > and >= only)
   •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
   •  reversed-index
   •  use Cassandra itself as index storage …
   •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

How?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don’t use Apache Lucene
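Why indexing suffixes makes `LIKE '%text%'` cheap can be shown with a toy index. The sketch below uses a sorted list of suffixes searched with `bisect` as a simple stand-in for a suffix tree; it is not SASI's actual data structure, and `build_suffix_index` and `like_contains` are hypothetical names.

```python
import bisect

def build_suffix_index(values):
    """Return a sorted list of (suffix, row_id) pairs for every value."""
    entries = []
    for row_id, text in values.items():
        lowered = text.lower()
        for start in range(len(lowered)):
            entries.append((lowered[start:], row_id))
    entries.sort()
    return entries

def like_contains(index, needle):
    """Rows whose value contains needle, i.e. LIKE '%needle%'."""
    needle = needle.lower()
    # A substring match is a prefix match on some suffix, and all suffixes
    # sharing a prefix are adjacent in the sorted list.
    pos = bisect.bisect_left(index, (needle,))
    matches = set()
    while pos < len(index) and index[pos][0].startswith(needle):
        matches.add(index[pos][1])
        pos += 1
    return sorted(matches)

names = {1: "Cassandra", 2: "Sandra Bullock", 3: "Alexander"}
index = build_suffix_index(names)
sand_rows = like_contains(index, "sand")
and_rows = like_contains(index, "and")
```

The lookup cost depends on the number of matches rather than on the number of rows, at the price of a larger index.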

Who?
•  Open source contribution by an engineering team from …

When?
•  Cassandra 3.5
•  Later
   •  support for OR clause: (aaa OR bbb) AND (ccc OR ddd)
   •  index on collections (Set, List, Map)

Comparison
SASI vs Solr/ElasticSearch?
•  Cassandra is not a search engine!!! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics
Still, SASI covers 80% of search use-cases and people are happy!

Thank You
duy_hai.doan@datastax.com
https://academy.datastax.com/
