@doanduyhai
New Cassandra 3 Features
DuyHai DOAN
Apache Cassandra Evangelist
Who Am I ?
Duy Hai DOAN
Apache Cassandra Evangelist
•  talks, meetups, confs …
•  open-source projects (Achilles, Apache Zeppelin ...)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
DataStax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 400+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features
Agenda
•  Materialized Views
•  User Defined Functions (UDF) and Aggregates (UDA)
•  JSON Syntax
•  New SASI full text search index
Materialized Views (MV)
•  Why ?
•  Gotchas
Why Materialized Views ?
•  Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country( country text, id int, …,
PRIMARY KEY(country, id));
Materialized View In Action
Instead of the manually maintained table:
CREATE TABLE user_by_country (
country text, id int,
firstname text, lastname text,
PRIMARY KEY(country, id));
declare a materialized view:
CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);
Materialized Views Demo
Materialized View Performance
•  Write performance
    •  slower than a normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
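The read-before-write and the worst-case DELETE + INSERT per view can be illustrated with a toy in-memory model. This is only a sketch in Python, not Cassandra code: the dictionaries `base_table` and `user_by_country` and the function `upsert_user` are hypothetical names standing in for the base table, one view, and the server-side write path.

```python
# Toy in-memory model of materialized view maintenance (not Cassandra code).
base_table = {}       # id -> row dict
user_by_country = {}  # (country, id) -> row dict, mimics the view


def upsert_user(user_id, row):
    """Apply a base-table write and return the view mutations it causes."""
    mutations = []
    old = base_table.get(user_id)  # read-before-write on the base table
    old_country = old.get("country") if old else None
    new_country = row.get("country")

    # Stale view row: the view's partition key changed, so delete it.
    if old_country is not None and old_country != new_country:
        del user_by_country[(old_country, user_id)]
        mutations.append(("DELETE", old_country, user_id))

    base_table[user_id] = row

    # The view only holds rows WHERE country IS NOT NULL.
    if new_country is not None:
        user_by_country[(new_country, user_id)] = row
        mutations.append(("INSERT", new_country, user_id))
    return mutations


print(upsert_user(1, {"country": "FR", "firstname": "Duy Hai"}))
print(upsert_user(1, {"country": "US", "firstname": "Duy Hai"}))
```

The first write produces a single INSERT for the view; the second changes the view's partition key and so produces the worst case of two view mutations for this one view.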
Materialized View Performance
•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developers' lives easier → priceless
Materialized View Performance
•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)
Materialized Views Consistency
•  Consistency level
    •  The requested consistency level (CL) is honoured for the base table; view updates use CL ONE + a local batchlog
    •  Weaker consistency guarantees for MV than for the base table
Q & A
User Defined Functions (UDF)
•  Why ?
•  UDAs
•  Gotchas
Rationale
•  Push computation server-side
    •  save network bandwidth (1000 nodes!)
    •  simplify client-side code
    •  provide standard & useful functions (sum, avg …)
    •  accelerate analytics use cases (pre-aggregation for Spark)
How to create a UDF?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
// source code here
$$;
•  param1, param2, …: parameter names to refer to in the code; each type is a Cassandra type
•  CALLED ON NULL INPUT: the function is always called, so a null check is mandatory in the code
•  RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
•  Cassandra types
    •  primitives (boolean, int, …)
    •  collections (list, set, map)
    •  tuples
    •  UDT
•  JVM supported languages
    •  Java, Scala
    •  JavaScript (slow)
    •  Groovy, Jython, JRuby
    •  Clojure (JSR 223 implementation issue)
UDF Demo
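The difference between CALLED ON NULL INPUT and RETURNS NULL ON NULL INPUT can be mimicked in a few lines of Python. This is a sketch of the semantics only: `call_udf` and `max_of` are hypothetical names, and a real UDF body would be written in one of the JVM languages listed in the deck.

```python
def call_udf(func, args, called_on_null_input):
    """Mimic the two null-handling modes of CREATE FUNCTION."""
    if not called_on_null_input and any(arg is None for arg in args):
        # RETURNS NULL ON NULL INPUT: the body is skipped entirely.
        return None
    # CALLED ON NULL INPUT: the body always runs, even with null arguments.
    return func(*args)


def max_of(a, b):
    """Example function body; the null checks matter in CALLED ON NULL INPUT mode."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


print(call_udf(max_of, (None, 3), called_on_null_input=True))
print(call_udf(max_of, (None, 3), called_on_null_input=False))
```

With the same arguments, the first call runs the body and relies on its null checks, while the second never reaches the body.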
User Defined Aggregates (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provides behavior similar to GROUP BY, SUM, AVG, etc.
How to create a UDA?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
•  aggregateName(type1, type2, …): only types, no parameter names
•  SFUNC: accumulator function. Signature:
   accumulatorFunction(stateType, type1, type2, …) RETURNS stateType
•  STYPE: state type
•  FINALFUNC: optional final function. Signature:
   finalFunction(stateType)
•  INITCOND: initial state (a value of the state type)
UDA Demo
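How SFUNC, STYPE, FINALFUNC and INITCOND fit together can be sketched as a plain fold over the selected rows. The Python below is an illustration under assumptions, not Cassandra's implementation: `run_aggregate`, `avg_state` and `avg_final` are hypothetical names, and the state is a `(count, total)` pair chosen for an average.

```python
def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Fold rows into a state, then optionally post-process the state."""
    state = initcond                   # INITCOND: initial value of the state
    for row in rows:
        state = sfunc(state, *row)     # SFUNC(state, col1, col2, ...) -> new state
    # FINALFUNC is optional; without it the raw state is the result.
    return finalfunc(state) if finalfunc is not None else state


def avg_state(state, value):
    """Accumulator: state is a (count, total) pair; null values are ignored."""
    count, total = state
    if value is None:
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn (count, total) into an average, or None if empty."""
    count, total = state
    return total / count if count else None


ages = [(10,), (20,), (None,), (30,)]  # one tuple of argument values per row
print(run_aggregate(ages, avg_state, (0, 0), avg_final))
```

Each row contributes one call to the accumulator, so the cost grows with the number of rows read, which is worth keeping in mind when sizing the query.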
Gotchas
•  UDA in Cassandra is not distributed!
•  Do not execute a UDA on a large number of rows (10^6, for example)
    •  single fat partition
    •  multiple partitions
    •  full table scan
•  → Increase the client-side timeout
    •  default Java driver timeout = 12 secs
Cassandra UDA or Apache Spark ?
Consistency Level | Partition(s)        | Recommended Approach
------------------+---------------------+--------------------------------------------------------------
ONE               | Single partition    | UDA with a token-aware driver, because the read is node-local
ONE               | Multiple partitions | Apache Spark, because reads are distributed
> ONE             | Single partition    | UDA, because data locality is lost with Spark
> ONE             | Multiple partitions | Apache Spark, definitely
Q & A
JSON Syntax
•  Why ?
•  Example
Why JSON ?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have the best of both worlds?
    •  use Cassandra schema
    •  convert rows to JSON format
JSON syntax for INSERT/UPDATE/DELETE
CREATE TABLE users (
id text PRIMARY KEY,
age int,
state text );
INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';
INSERT INTO users(id, age, state) VALUES('me', fromJson('20'), 'CA');
UPDATE users SET age = fromJson('25') WHERE id = fromJson('"me"');
DELETE FROM users WHERE id = fromJson('"me"');
JSON syntax for SELECT
> SELECT JSON * FROM users WHERE id = 'me';
[json]
----------------------------------------
{"id": "me", "age": 25, "state": "CA"}
> SELECT JSON age,state FROM users WHERE id = 'me';
[json]
----------------------------------------
{"age": 25, "state": "CA"}
> SELECT age, toJson(state) FROM users WHERE id = 'me';
age | system.tojson(state)
-----+----------------------
25 | "CA"
JSON Syntax Demo
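The idea of keeping the Cassandra schema while exchanging rows as JSON can be modelled in a few lines. The Python below is a toy sketch, not the server's behavior in every detail: `SCHEMA`, `insert_json` and `select_json` are hypothetical names, Python types stand in for CQL types, and storing missing columns as null is an assumption of this model.

```python
import json

# Column names and Python stand-ins for the CQL types of the users table.
SCHEMA = {"id": str, "age": int, "state": str}


def insert_json(table, payload):
    """Toy version of INSERT INTO users JSON '...': parse, validate, store."""
    row = json.loads(payload)
    unknown = set(row) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    for column, value in row.items():
        if value is not None and not isinstance(value, SCHEMA[column]):
            raise TypeError(f"{column} expects {SCHEMA[column].__name__}")
    # Assumption of this model: columns missing from the JSON become null.
    table[row["id"]] = {column: row.get(column) for column in SCHEMA}


def select_json(table, key, columns=None):
    """Toy version of SELECT JSON ... FROM users WHERE id = key."""
    row = table[key]
    selected = columns if columns is not None else list(SCHEMA)
    return json.dumps({column: row[column] for column in selected})


users = {}
insert_json(users, '{"id": "me", "age": 25, "state": "CA"}')
print(select_json(users, "me"))
print(select_json(users, "me", ["age", "state"]))
```

The schema rejects a payload such as `{"id": "x", "age": "old"}` before anything is stored, which is the "best of both worlds" point: JSON on the wire, typed columns in the table.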
Q & A
SASI index, the search is over!
•  Why ?
•  How ?
•  Who ?
•  Demo !
•  When ?
Why SASI ?
•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
    •  reversed (inverted) index
    •  uses Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱
How ?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don’t use Apache Lucene
Who ?
•  Open source contribution by a team of engineers from …
SASI Demo
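Why a suffix-based structure makes LIKE '%…%' queries cheap can be shown with a toy index. This Python sketch is not SASI's actual data structure or on-disk format: it simply stores every suffix of every value in a sorted list, so that a "contains" search becomes a prefix search over suffixes. The names `build_suffix_index` and `like_contains` are hypothetical.

```python
import bisect


def build_suffix_index(values):
    """values: dict of row_id -> text. Returns a sorted list of (suffix, row_id)."""
    entries = []
    for row_id, text in values.items():
        for start in range(len(text)):
            entries.append((text[start:], row_id))
    entries.sort()
    return entries


def like_contains(index, needle):
    """Row ids whose text contains `needle`, i.e. LIKE '%needle%'."""
    # Suffixes starting with `needle` form one contiguous run in sorted order.
    position = bisect.bisect_left(index, (needle,))
    matches = set()
    for suffix, row_id in index[position:]:
        if not suffix.startswith(needle):
            break
        matches.add(row_id)
    return matches


names = {1: "Duy Hai", 2: "Jonathan", 3: "Nathalie"}
index = build_suffix_index(names)
print(like_contains(index, "ath"))
```

The lookup is a binary search followed by a short scan, instead of testing every stored value; the price is a larger index, since each value contributes one entry per character.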
When ?
•  Cassandra 3.5
•  Later
    •  support for OR clause: (aaa OR bbb) AND (ccc OR ddd)
    •  index on collections (Set, List, Map)
Comparison
SASI vs Solr/ElasticSearch ?
•  Cassandra is not a search engine! (database = durability)
•  always slower because 2 passes (SASI index read + read of the original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics
Still, SASI covers 80% of search use-cases and people are happy!
Q & A
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
Cassandra 3 new features 2016
