Introduction to NOSQL And Cassandra @rantav  @outbrain
SQL is good
- Rich language
- Easy to use and integrate
- Rich toolset
- Many vendors
- The promise: ACID (Atomicity, Consistency, Isolation, Durability)
SQL Rules
BUT
The Challenge: Modern web apps
- Internet-scale data size
- High read-write rates
- Frequent schema changes
- "Social" apps, not banks: they don't need the same level of ACID
- SCALING
Scaling Solutions - Replication Scales Reads
Scaling Solutions - Sharding Scales also Writes
Brewer's CAP Theorem:  You can only choose two
CAP
Availability + Partition Tolerance (no Consistency)
Existing NOSQL Solutions
Taxonomy of NOSQL data stores
- Document oriented: CouchDB, MongoDB, Lotus Notes, SimpleDB, Orient
- Key-value: Voldemort, Dynamo, Riak (sort of), Redis, Tokyo
- Column: Cassandra, HBase, BigTable
- Graph databases: Neo4j, FlockDB, DEX, AllegroGraph
- http://en.wikipedia.org/wiki/NoSQL
Cassandra
- Developed at Facebook
- Follows the BigTable data model: column oriented
- Follows the Dynamo eventual consistency model
- Open-sourced at Apache
- Implemented in Java
N/R/W: consistency down to earth
- N: number of replicas (nodes) for any data item
- W: number of nodes a write operation blocks on
- R: number of nodes a read operation blocks on
N/R/W - Typical Values
- W=1 => block until the first node is written successfully
- W=N => block until all nodes are written successfully
- W=0 => async writes
- R=1 => block until the first node returns an answer
- R=N => block until all nodes return an answer
- R=0 => doesn't make sense
- QUORUM: R = N/2+1, W = N/2+1 => fully consistent, because R + W > N means every read overlaps the latest acknowledged write on at least one node
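The N/R/W arithmetic above can be sketched in a few lines of Java. This is only an illustration of the rule of thumb, not Cassandra code; the class and method names (`QuorumSketch`, `quorum`, `readsSeeLatestWrite`) are made up for this example.

```java
// Illustrative sketch only: names are hypothetical, not part of any Cassandra API.
class QuorumSketch {
    /** Smallest number of replicas that forms a majority of n (integer division). */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /** True when every read set must overlap every write set, i.e. R + W > N. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("N=" + n + " quorum=" + q);
        System.out.println("quorum/quorum overlap: " + readsSeeLatestWrite(n, q, q));
        System.out.println("R=1, W=1 overlap: " + readsSeeLatestWrite(n, 1, 1));
        System.out.println("R=1, W=N overlap: " + readsSeeLatestWrite(n, 1, n));
    }
}
```

With N=3, quorum reads and writes each block on 2 nodes, so any read shares at least one node with the latest write; R=1 with W=1 gives no such guarantee.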
Data Model - Forget SQL Do you know SQL?
Data Model - Vocabulary
- Keyspace: like a namespace for unique keys
- Column Family: very much like a table… but not quite
- Key: identifies a row (of columns)
- Column: a value represented by a column name, a value, and a timestamp
- Super Column: a column that holds a list of columns
Data Model - Columns
    struct Column {
        1: required binary name,
        2: optional binary value,
        3: optional i64 timestamp,
        4: optional i32 ttl,
    }
JSON-ish notation:
    {
        "name":      "emailAddress",
        "value":     "foo@bar.com",
        "timestamp": 123456789
    }
Data Model - Column Family Similar to SQL tables Has many columns Has many rows
Data Model - Rows
- Primary key for objects
- All keys are arbitrary-length binaries
    Users:                                  CF
        ran:                                ROW
            emailAddress: foo@bar.com,      COLUMN
            webSite: http://bar.com         COLUMN
        f.rat:                              ROW
            emailAddress: f.rat@mouse.com   COLUMN
    Stats:                                  CF
        ran:                                ROW
            visits: 243                     COLUMN
Data Model - Songs example
    Songs:
        Meir Ariel:
            Shir Keev: 6:13,
            Tikva: 4:11,
            Erol: 6:17
            Suetz: 5:30
            Dr Hitchakmut: 3:30
        Mashina:
            Rakevet Layla: 3:02
            Optikai: 5:40
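One way to picture the column family / row / column nesting above is as a map of sorted maps. The sketch below is an in-memory analogy only, assuming string keys and values for readability (the real model uses binaries and timestamps); `DataModelSketch` and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

// Analogy only: column family -> row key -> column name -> value.
class DataModelSketch {
    private final SortedMap<String, SortedMap<String, SortedMap<String, String>>> columnFamilies =
            new TreeMap<>();

    /** Creates the column family and row on demand, then stores the column. */
    void insert(String columnFamily, String rowKey, String columnName, String value) {
        columnFamilies
                .computeIfAbsent(columnFamily, cf -> new TreeMap<>())
                .computeIfAbsent(rowKey, row -> new TreeMap<>())
                .put(columnName, value);
    }

    /** All columns of one row, sorted by column name; empty if the row does not exist. */
    SortedMap<String, String> row(String columnFamily, String rowKey) {
        SortedMap<String, SortedMap<String, String>> rows = columnFamilies.get(columnFamily);
        if (rows == null || !rows.containsKey(rowKey)) {
            return Collections.emptySortedMap();
        }
        return rows.get(rowKey);
    }

    /** A single column value, or null if it is absent. */
    String get(String columnFamily, String rowKey, String columnName) {
        return row(columnFamily, rowKey).get(columnName);
    }

    public static void main(String[] args) {
        DataModelSketch keyspace = new DataModelSketch();
        keyspace.insert("Songs", "Meir Ariel", "Shir Keev", "6:13");
        keyspace.insert("Songs", "Meir Ariel", "Tikva", "4:11");
        keyspace.insert("Songs", "Meir Ariel", "Erol", "6:17");
        keyspace.insert("Songs", "Mashina", "Optikai", "5:40");
        // Columns come back sorted by name, regardless of insertion order.
        System.out.println(keyspace.row("Songs", "Meir Ariel"));
    }
}
```

Rows need not share the same column names, which is why a row is modelled as its own map rather than as a fixed record. A super column family would add one more map level between the row and its columns.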
Data Model - Super Columns Columns whose values are lists of columns
Data Model - Super Columns
    Songs:
        Meir Ariel:
            Shirey Hag:
                Shir Keev: 6:13,
                Tikva: 4:11,
                Erol: 6:17
            Vegluy Eynaim:
                Suetz: 5:30
                Dr Hitchakmut: 3:30
        Mashina:
            ...
The API - Read
- get
- get_slice
- get_count
- multiget
- multiget_slice
- get_range_slices
- get_indexed_slices
The True API
    get(keyspace, key, column_path, consistency)
    get_slice(ks, key, column_parent, predicate, consistency)
    multiget(ks, keys, column_path, consistency)
    multiget_slice(ks, keys, column_parent, predicate, consistency)
    ...
The API - Write
- insert
- add
- remove
- remove_counter
- batch_mutate
The API - Meta
- describe_schema_versions
- describe_keyspaces
- describe_cluster_name
- describe_version
- describe_ring
- describe_partitioner
- describe_snitch
The API - DDL
- system_add_column_family
- system_drop_column_family
- system_add_keyspace
- system_drop_keyspace
- system_update_keyspace
- system_update_column_family
The API - CQL
execute_cql_query
    cqlsh> SELECT key, state FROM users;
    cqlsh> INSERT INTO users (key, full_name, birth_date, state) VALUES ('bsanderson', 'Brandon Sanderson', 1975, 'UT');
Consistency Model
- N: set per keyspace
- R: chosen per read request
- W: chosen per write request
Consistency Model
Cassandra defines:
    enum ConsistencyLevel {
        ONE,
        QUORUM,
        LOCAL_QUORUM,
        EACH_QUORUM,
        ALL,
        ANY,
        TWO,
        THREE,
    }
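To connect these levels back to R and W, the sketch below maps a level to the number of replica acknowledgements a request would block on for a replication factor N. It is an assumption-laden illustration: the `Level` enum is a hypothetical subset defined here, and the datacenter-aware levels (LOCAL_QUORUM, EACH_QUORUM) and ANY are left out because they depend on topology and hints.

```java
// Illustrative sketch only: Level is a local, hypothetical subset of the levels on the slide.
class ConsistencyLevelSketch {
    enum Level { ONE, TWO, THREE, QUORUM, ALL }

    /** Number of replica acknowledgements to wait for, given replication factor n. */
    static int blockFor(Level level, int n) {
        switch (level) {
            case ONE:    return 1;
            case TWO:    return 2;
            case THREE:  return 3;
            case QUORUM: return n / 2 + 1;
            case ALL:    return n;
            default:     throw new IllegalArgumentException("unsupported level: " + level);
        }
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + " with N=3 blocks for " + blockFor(level, 3));
        }
    }
}
```

Because the level is passed per request, one application can mix cheap ONE reads with QUORUM writes where that trade-off is acceptable.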
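Java Code
    // Open a raw Thrift connection to a local node
    TTransport tr = new TSocket("localhost", 9160);
    TProtocol proto = new TBinaryProtocol(tr);
    Cassandra.Client client = new Cassandra.Client(proto);
    tr.open();

    String key_user_id = "1";
    long timestamp = System.currentTimeMillis();

    // Insert column "name" into row "1" of column family Standard1,
    // blocking until one replica acknowledges the write
    client.insert("Keyspace1",
                  key_user_id,
                  new ColumnPath("Standard1",
                                 null,
                                 "name".getBytes("UTF-8")),
                  "Chris Goffinet".getBytes("UTF-8"),
                  timestamp,
                  ConsistencyLevel.ONE);

    // Release the socket when done
    tr.close();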
Java Client - Hector
http://github.com/rantav/hector
- The de facto Java client for Cassandra
- Encapsulates Thrift
- Adds JMX (monitoring)
- Connection pooling
- Failover
- Open-sourced on GitHub, with a growing community of developers and users
Java Client - Hector - cont
    /**
     * Insert a new value keyed by key
     *
     * @param key   Key for the value
     * @param value the String value to insert
     */
    public void insert(final String key, final String value) {
      Mutator m = createMutator(keyspaceOperator);
      m.insert(key,
               CF_NAME,
               createColumn(COLUMN_NAME, value));
    }
Java Client - Hector - cont
    /**
     * Get a string value.
     *
     * @return The string value; null if no value exists for the given key.
     */
    public String get(final String key) throws HectorException {
      ColumnQuery<String, String> q = createColumnQuery(keyspaceOperator, serializer, serializer);
      Result<HColumn<String, String>> r = q.setKey(key).
          setName(COLUMN_NAME).
          setColumnFamily(CF_NAME).
          execute();
      HColumn<String, String> c = r.get();
      return c == null ? null : c.getValue();
    }
Extra If you're not snoring yet...
Sorting
- Columns are sorted by name, according to the comparator type of the column family: BytesType, UTF8Type, AsciiType, LongType, LexicalUUIDType, TimeUUIDType
- Rows are sorted by their Partitioner: RandomPartitioner, OrderPreservingPartitioner, CollatingOrderPreservingPartitioner
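The effect of the column comparator is easiest to see with names that look like numbers. The sketch below is an analogy using plain Java comparators, not Cassandra's comparator classes: `UTF8_LIKE` and `LONG_LIKE` are hypothetical stand-ins (Java's natural string order matches UTF-8 byte order only for simple cases such as ASCII).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Analogy only: the comparators stand in for a text comparator and a numeric one.
class ColumnSortingSketch {
    static final Comparator<String> UTF8_LIKE = Comparator.naturalOrder();
    static final Comparator<String> LONG_LIKE =
            Comparator.comparingLong((String s) -> Long.parseLong(s));

    /** Column names of a row in the order the given comparator would keep them. */
    static List<String> sortedNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        System.out.println(sortedNames(UTF8_LIKE, "9", "10", "100")); // text order
        System.out.println(sortedNames(LONG_LIKE, "9", "10", "100")); // numeric order
    }
}
```

Since slices are read in comparator order, picking the comparator is part of the schema design: a time-ordered column family wants a numeric or time-based comparator, not a textual one.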
Thrift
- Cross-language protocol
- Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...
    struct UserProfile {
        1: i32    uid,
        2: string name,
        3: string blurb
    }
    service UserStorage {
        void        store(1: UserProfile user),
        UserProfile retrieve(1: i32 uid)
    }
Thrift
Generating sources:
    thrift --gen java cassandra.thrift
    thrift --gen py cassandra.thrift
Internals
Agenda
- Background and history
- Architectural layers
- Transport: Thrift
- Write path (and sstables, memtables)
- Read path
- Compactions
- Bloom filters
- Gossip
- Deletions
- More...
Required Reading ;-)
- BigTable: http://labs.google.com/papers/bigtable.html
- Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
From Dynamo
- Symmetric p2p architecture
- Gossip-based discovery and error detection
- Distributed key-value store
- Pluggable partitioning
- Pluggable topology discovery
- Eventually consistent, tunable per operation
From BigTable
- Sparse, column-oriented array
- SSTable disk storage
- Append-only commit log
- Memtable (buffering and sorting)
- Immutable sstable files
- Compactions
- High write performance
Architecture Layers
- Cluster management: messaging service, gossip, failure detection, cluster state, partitioner, replication
- Single host: commit log, memtable, SSTable, indexes, compaction
Write Path
Writing
Memtables
- In-memory representation of recently written data
- When the memtable is full, it is sorted and then flushed to disk as an sstable
SSTables
- Sorted Strings Tables
- Immutable
- On disk
- Sorted by a string key
- In-memory index of elements
- Binary search (in memory) to find an element's location
- Bloom filter to reduce the number of unneeded binary searches
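The memtable and SSTable roles described above can be sketched as a toy in-memory store. Everything here is a simplifying assumption rather than Cassandra's implementation: the "SSTables" are immutable in-memory lists instead of files, there is no commit log or Bloom filter, and the memtable is kept sorted continuously by a `TreeMap` instead of being sorted at flush time. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Toy model: writes go to a sorted in-memory table that is frozen into an immutable
// sorted table when full; reads check the memtable, then tables from newest to oldest.
class MemtableSketch {
    /** Immutable sorted table, searched with a binary search over its keys. */
    static final class SSTable {
        private final List<String> keys;
        private final List<String> values;

        SSTable(TreeMap<String, String> sorted) {
            keys = Collections.unmodifiableList(new ArrayList<>(sorted.keySet()));
            values = Collections.unmodifiableList(new ArrayList<>(sorted.values()));
        }

        String get(String key) {
            int i = Collections.binarySearch(keys, key);
            return i >= 0 ? values.get(i) : null;
        }
    }

    private final int maxEntries;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SSTable> sstables = new ArrayList<>(); // oldest first

    MemtableSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    void write(String key, String value) {
        memtable.put(key, value); // no read and no seek on the write path
        if (memtable.size() >= maxEntries) {
            flush();
        }
    }

    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(new SSTable(memtable));
        memtable = new TreeMap<>();
    }

    String read(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (int i = sstables.size() - 1; i >= 0; i--) { // newest table wins
            value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }
}
```

The sketch also shows why reads are costlier than writes: a write touches only the memtable, while a read may have to consult several tables before it finds the key.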
Write Properties
- No locks in the critical path
- Always available for writes, even if there are failures
- No reads
- No seeks
- Fast
- Atomic within a row
Read Path
Reads
Reading
Bloom Filters
- Space-efficient probabilistic data structure
- Tests whether an element is a member of a set
- Allows false positives, but not false negatives
- Uses k hash functions
- Union and intersection are implemented as bitwise OR and AND
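A minimal Bloom filter along these lines can be written with `java.util.BitSet`. This is a sketch under stated assumptions, not the filter Cassandra ships: the k bit positions are derived from `String.hashCode()` by a simple double-hashing scheme chosen for brevity, and the class name is hypothetical.

```java
import java.util.BitSet;

// Minimal sketch: k positions per element, derived from hashCode() by double hashing.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int k;

    BloomFilterSketch(int size, int k) {
        if (size <= 0 || k <= 0) {
            throw new IllegalArgumentException("size and k must be positive");
        }
        this.size = size;
        this.k = k;
        this.bits = new BitSet(size);
    }

    /** Position of the i-th hash of the element, always in [0, size). */
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    /** Union of two filters with the same parameters: a bitwise OR of their bits. */
    BloomFilterSketch union(BloomFilterSketch other) {
        if (other.size != size || other.k != k) {
            throw new IllegalArgumentException("filters must share size and k");
        }
        BloomFilterSketch result = new BloomFilterSketch(size, k);
        result.bits.or(bits);
        result.bits.or(other.bits);
        return result;
    }
}
```

An element that was added always has all of its k bits set, which is why false negatives cannot happen; other elements may collide on the same bits, which is where false positives come from.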
Read Properties
- Reads multiple SSTables
- Slower than writes (but still fast)
- Seeks can be mitigated with more RAM
- Uses probabilistic Bloom filters to reduce lookups
- Extensive optional caching: key cache, row cache
- Excellent monitoring
Compactions
- Merge keys
- Combine columns
- Discard tombstones
- Combine Bloom filters with a bitwise OR operation
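The merge step can be sketched as folding several sorted tables into one, keeping the newest version of each key and dropping deletion markers. This is a simplified illustration with hypothetical names: each key stands for one column, a `null` value stands for a tombstone, and tombstones are discarded immediately rather than after a configured delay.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified sketch: newest timestamp wins per key, then tombstones are dropped.
class CompactionSketch {
    static final class Cell {
        final String value; // null marks a tombstone
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    /** Merges the given tables into one sorted table. */
    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> sstables) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                // Keep whichever version carries the newer timestamp.
                merged.merge(entry.getKey(), entry.getValue(),
                        (current, candidate) ->
                                candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        merged.values().removeIf(Cell::isTombstone);
        return merged;
    }
}
```

Because the inputs are immutable, the merge writes a fresh table and the old ones can simply be deleted afterwards; nothing is updated in place.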
Gossip
- p2p
- Enables seamless addition of nodes
- Rebalancing of keys
- Fast detection of nodes that go down
- Every node knows about all others: no master
Deletions
- A deletion marker (tombstone) is necessary to suppress data in older SSTables until compaction
- Read repair complicates things a little
- Eventual consistency complicates things more
- Solution: a configurable delay before tombstone GC, after which tombstones are not repaired
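Why the marker has to outlive the delete can be sketched as follows. The code assumes a simple "newest timestamp wins" rule when replica answers are compared, and a single grace period before a tombstone may be garbage collected; the names (`TombstoneSketch`, `resolve`, `canPurge`) and the exact comparison are assumptions made for this illustration.

```java
import java.util.List;

// Illustration only: a tombstone is a version like any other, just without a value.
class TombstoneSketch {
    static final class Version {
        final String value;
        final long timestamp;
        final boolean deleted;

        private Version(String value, long timestamp, boolean deleted) {
            this.value = value;
            this.timestamp = timestamp;
            this.deleted = deleted;
        }

        static Version live(String value, long timestamp) {
            return new Version(value, timestamp, false);
        }

        static Version tombstone(long timestamp) {
            return new Version(null, timestamp, true);
        }
    }

    /** Picks the newest version among replica answers; a winning tombstone reads as null. */
    static String resolve(List<Version> replicaAnswers) {
        Version newest = null;
        for (Version v : replicaAnswers) {
            if (newest == null || v.timestamp > newest.timestamp) {
                newest = v;
            }
        }
        return (newest == null || newest.deleted) ? null : newest.value;
    }

    /** A tombstone may be dropped only once the grace period has fully elapsed. */
    static boolean canPurge(Version v, long now, long gcGrace) {
        return v.deleted && now - v.timestamp >= gcGrace;
    }
}
```

If the tombstone were purged while some replica still held the old value, that replica's answer would be the only one left and the deleted data would reappear; the grace period gives repair time to spread the tombstone first.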
Extra: long list of subjects
- SEDA (Staged Event-Driven Architecture)
- Anti-entropy and Merkle trees
- Hinted handoff
- Repair on read
SEDA
- Mutate
- Stream
- Gossip
- Response
- Anti Entropy
- Load Balance
- Migration
Anti Entropy and Merkle Trees
Hinted Handoff
References
- http://horicky.blogspot.com/2009/11/nosql-patterns.html
- http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
- http://labs.google.com/papers/bigtable.html
- http://bret.appspot.com/entry/how-friendfeed-uses-mysql
- http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
- http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
- http://wiki.apache.org/cassandra/DataModel
- http://incubator.apache.org/thrift/
- http://www.eecs.harvard.edu/~mdw/papers/quals-seda.pdf


Editor's Notes

  • #22: Columns may have dynamic names