NoSQL Databases
Eduard Tudenhöfner
Overview
● Why NoSQL?
● Classification
● CAP Theorem
● BASE vs ACID
● Cassandra in Action
● Summary
Why NoSQL?
● original intention: modern web-scale DBs
○ the amount of data has increased drastically
○ data on the web is less structured
● higher requirements regarding performance
● some problems are easier to solve without the relational approach
● scaling out & running on commodity HW is much cheaper than scaling up
Typical Characteristics
● non-relational
● horizontally scalable
● flexible schema
● easy replication support
● simple API
● eventually consistent -> BASE principle
Classification
source: http://blog.octo.com/wp-content/uploads/2012/07/QuadrantNoSQL.png
source: http://www.sics.se/~amir/files/download/dic/NoSQL%20Databases.pdf
Key/Value Stores
● data model: collection of key/value pairs
● keys and values can be complex compounds
● many are modeled on Amazon’s Dynamo paper
● designed to handle massive load
Key/Value Stores
● no complex query filters
● all joins must be done in application code
● easy to distribute across cluster
● very predictable performance -> O(1)
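A minimal sketch of what "joins in application code" means in practice. Plain Python dicts stand in for the store, and the bucket names, keys and the helper `orders_with_user_names` are hypothetical:

```python
# Two hypothetical key/value "buckets"; plain dicts stand in for the store.
users = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
orders = {
    "o1": {"user": "u1", "item": "book"},
    "o2": {"user": "u2", "item": "lamp"},
    "o3": {"user": "u1", "item": "pen"},
}


def get(bucket, key):
    """Single-key lookup: the only access path the store offers."""
    return bucket.get(key)


def orders_with_user_names(order_ids):
    """Application-side join: one extra lookup per order."""
    joined = []
    for order_id in order_ids:
        order = get(orders, order_id)
        user = get(users, order["user"])
        joined.append((order_id, user["name"], order["item"]))
    return joined


print(orders_with_user_names(["o1", "o2", "o3"]))
```

Each lookup is a single hash access, which is where the predictable cost per request comes from; the price is that the application has to know the keys and stitch the results together itself.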
Wide Column Stores
● Tables are similar to RDBMS, but semi-structured
● based on Google’s BigTable
● Rows can have arbitrary columns
Wide Column Stores -> BigTable
● <RowKey, ColumnKey, Timestamp> triple as key for lookups, inserts, deletes
● ColumnKey uses syntax family:qualifier
● arbitrary columns on a row-by-row basis
● does not support a relational model
○ no table-wide integrity constraints
○ no multi-row transactions
source: http://research.google.com/archive/bigtable.html
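The BigTable data model can be pictured as a sparse map keyed by the triple. The sketch below is an illustration only; the function names and sample rows are assumptions and not BigTable's API:

```python
# Sparse map keyed by (row_key, column_key, timestamp); names are illustrative.
table = {}


def put(row_key, column_key, value, timestamp):
    table[(row_key, column_key, timestamp)] = value


def get_latest(row_key, column_key):
    """Return the value with the highest timestamp, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row_key and c == column_key]
    return max(versions, key=lambda pair: pair[0])[1] if versions else None


def columns_of(row_key):
    """Each row has its own set of columns (family:qualifier)."""
    return sorted({c for (r, c, _ts) in table if r == row_key})


put("com.example/index", "contents:html", "<html>v1</html>", 1)
put("com.example/index", "contents:html", "<html>v2</html>", 2)
put("com.example/index", "anchor:cnn.com", "Example", 1)
put("org.sample/home", "contents:html", "<html>home</html>", 1)

print(get_latest("com.example/index", "contents:html"))
print(columns_of("com.example/index"), columns_of("org.sample/home"))
```

The two rows end up with different columns, and older versions of a cell stay addressable through their timestamp.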
Document Stores
● inspired by Lotus Notes
● central concept of a Document
● Documents encapsulate/encode data in some format/encoding
● Encodings:
○ XML, YAML, JSON, BSON, PDF
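As a small illustration of a document encoded as JSON, the sketch below uses only Python's standard `json` module. The blog-post fields are made up for the example and do not come from any particular document store:

```python
import json

# A hypothetical blog-post document; nested fields replace joined tables.
post = {
    "_id": "post-1",
    "title": "Why NoSQL?",
    "tags": ["nosql", "databases"],
    "comments": [{"author": "alice", "text": "Nice overview"}],
}

encoded = json.dumps(post)     # document -> JSON text
decoded = json.loads(encoded)  # JSON text -> document

# A second document in the same collection may have different fields.
collection = [decoded, {"_id": "post-2", "title": "CAP", "likes": 3}]
titles_tagged_nosql = [d["title"] for d in collection
                       if "nosql" in d.get("tags", [])]
print(titles_tagged_nosql)
```

The query has to tolerate missing fields (`d.get`), which is the flip side of not enforcing one schema for the whole collection.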
Document Stores
source: http://www.mongodb.org/
Graph Databases
● based on Graph Theory -> G = (V, E)
● designed for data that is well represented in a graph
○ social networks, public transport links, network topologies, road maps
● nodes, edges, properties are used to represent and store data
● graph relationships are queryable
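A rough sketch of nodes, edges and properties, together with a query over relationships ("who is reachable within two KNOWS hops?"). The data and the helper names are hypothetical, and a real graph database would offer its own query language for this:

```python
from collections import deque

# Vertices with properties, and directed edges labeled with a relationship.
nodes = {"alice": {"age": 30}, "bob": {"age": 25},
         "carol": {"age": 35}, "dave": {"age": 40}}
edges = [("alice", "KNOWS", "bob"),
         ("bob", "KNOWS", "carol"),
         ("carol", "KNOWS", "dave")]


def neighbors(node, label):
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


def within_hops(start, label, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops edges."""
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue
        for nxt in neighbors(node, label):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, dist + 1))
    return found


print(within_hops("alice", "KNOWS", 2))
```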
Graph Databases
source: http://www.neo4j.org/
Graph Databases
source: http://en.wikipedia.org/wiki/Graph_database
CAP Theorem
source: http://blog.nahurst.com/visual-guide-to-nosql-systems
ACID
● Atomicity
○ all-or-nothing approach
● Consistency
○ DB will be in a consistent state before & after a transaction
● Isolation
○ transaction will behave as if it’s the only operation being performed upon the DB
● Durability
○ once a transaction is committed, it is durably preserved
● CA-Systems are ACID-Systems
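The all-or-nothing behaviour can be tried out with SQLite from Python's standard library. The `account` table and the `transfer` helper are assumptions made for this sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('a', 100)")
conn.execute("INSERT INTO account VALUES ('b', 0)")
conn.commit()


def transfer(amount):
    """Move money from 'a' to 'b'; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'b'",
                (amount,))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'a'",
                (amount,))
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))


print(transfer(150), balances())  # second update violates the CHECK constraint
print(transfer(40), balances())
```

In the failed transfer the credit to 'b' had already been executed, yet it is rolled back together with the rejected debit, so the database moves from one consistent state to another or not at all.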
BASE
● an application that works basically all the time, does not have to be consistent all the time, but will be in some known state eventually
● Basically Available
○ achieved by using a highly distributed approach
● Soft State
○ state of the system is always “soft” due to eventual consistency
● Eventual Consistency
○ at some point in the future, the data will be consistent
○ no guarantees are made about when this will occur
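A toy model of soft state and eventual consistency with two replicas. It assumes last-write-wins conflict resolution by timestamp and an explicit `sync` step, neither of which is prescribed by BASE itself; the class and function names are hypothetical:

```python
class Replica:
    """Holds key -> (timestamp, value); the newest timestamp wins."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Anti-entropy pass: exchange entries so both replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], timestamp=1)         # only r1 has seen this write
stale = r2.read("cart")                         # soft state: r2 knows nothing yet
r2.write("cart", ["book", "pen"], timestamp=2)  # a newer write lands on r2
sync(r1, r2)
print(stale, r1.read("cart"), r2.read("cart"))
```

Between the first write and the sync, the two replicas give different answers; after the sync they agree, but nothing in the model says when that sync will happen.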
BASE vs ACID
source: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
Cassandra
● initially created by Facebook for Inbox Search
● distributed, horizontally scalable database
● high availability
● very flexible data model
○ data might be structured, semi-structured, unstructured
● commercial support through DataStax
Cassandra - Design
● all nodes are equally important
● no Single-Point-of-Failure
● no central controller
● no master/slave relationships
● every node knows how to route requests and where the data lives
source: http://cassandra.apache.org/
Scales Linearly
source: http://www.datastax.com
Uses Consistent Hashing
● the Murmur3Partitioner hashes the partition key to a token
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeHashing_c.html
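A minimal consistent-hashing ring, sketched under two assumptions: SHA-256 from the standard library stands in for MurmurHash3, and each node gets exactly one token derived from its (made-up) name. It is not how Cassandra assigns tokens, only an illustration of the ring lookup:

```python
import bisect
import hashlib


def token(key):
    """Stand-in hash (SHA-256); the Murmur3Partitioner uses MurmurHash3."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self.tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        """First node clockwise from the key's token, wrapping around."""
        idx = bisect.bisect_left(self.tokens, (token(key), ""))
        return self.tokens[idx % len(self.tokens)][1]


keys = [f"row-{i}" for i in range(1000)]
before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changed once node-d joined the ring.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
print(len(moved), "of", len(keys), "keys moved")
```

The point of the scheme is visible in `moved`: every key that changes owner goes to the new node, and the remaining keys stay where they were.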
Writes are very fast
● All writes are sequential
● no reading & seeking before a write
● Each of the N replica nodes performs the following upon receiving the RowMutation message:
○ Append the write to the commit log
○ Update the in-memory Memtable data structure
○ Write is done!
● When the Memtable is full, it is flushed to disk (SSTable)
source: http://www.roman10.net/how-apache-cassandra-write-works/
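The write path above can be sketched as follows. This is a simplified model with hypothetical names: Python lists stand in for the on-disk commit log and SSTables, and the flush threshold counts entries instead of bytes:

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only; stands in for the on-disk log
        self.memtable = {}    # in-memory, holds the most recent writes
        self.sstables = []    # immutable sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append
        self.memtable[key] = value            # 2. update in memory, done
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the Memtable out as a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest first
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = Node(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    node.write(k, v)
print(node.read("a"), node.read("b"), len(node.sstables))
```

Note that `write` never looks at existing data: the old value of "a" simply stays behind in the flushed SSTable, and `read` is the side that has to check several places.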
Write Requests
● Client requests can go to any node in the cluster because all nodes are peers
● write consistency level is configurable
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html
Write Requests
● Cassandra chooses one coordinator per remote data center to handle requests to replicas
● the original coordinator only needs to forward the write request to one node in each remote data center
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html
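How a configurable write consistency level might translate into "enough acknowledgements" can be sketched like this. ONE, QUORUM and ALL are Cassandra consistency levels, with a quorum being more than half of the replicas; the function names and the boolean model of replica responses are assumptions of the sketch:

```python
def required_acks(level, replication_factor):
    """Replica acknowledgements needed before the write counts as successful."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def coordinate_write(replica_acks, level):
    """replica_acks: one bool per replica, True if it acknowledged the write."""
    acks = sum(replica_acks)
    return acks >= required_acks(level, len(replica_acks))


# Three replicas, one of them is down.
print(coordinate_write([True, True, False], "QUORUM"))
print(coordinate_write([True, True, False], "ALL"))
```

A lower level keeps writes available when replicas are down; a higher level trades that availability for stronger guarantees.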
Read Requests
● Two different types of Read Requests
○ direct read request (RR)
○ background read repair request (RRR)
● number of replicas contacted by a RR is determined by Consistency Level
● RRR are sent to any additional nodes that did not get a direct RR
● RRR ensure consistency
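A simplified sketch of a read followed by read repair. It assumes that each replica stores a (timestamp, value) pair per key, that the newest timestamp wins, and that the repair always runs; the function name and the `direct_count` parameter are hypothetical:

```python
def read_with_repair(replicas, key, direct_count):
    """replicas: list of dicts mapping key -> (timestamp, value)."""
    # Direct read requests (RR): answer from the newest version they return.
    direct = replicas[:direct_count]
    answers = [r[key] for r in direct if key in r]
    result = max(answers, key=lambda tv: tv[0])[1] if answers else None

    # Read repair (RRR): compare all replicas and update the stale ones.
    versions = [r[key] for r in replicas if key in r]
    if versions:
        latest = max(versions, key=lambda tv: tv[0])
        for r in replicas:
            if r.get(key, (float("-inf"), None))[0] < latest[0]:
                r[key] = latest
    return result


replicas = [{"k": (1, "old")}, {"k": (3, "newest")}, {}]
first = read_with_repair(replicas, "k", direct_count=1)
second = read_with_repair(replicas, "k", direct_count=1)
print(first, second)
```

With only one replica contacted directly, the first read may return a stale value; the repair then brings all replicas up to date, so the second read sees the newest one.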
Read Requests
● example: 2 of the 3 replicas for the given row must respond to fulfill the read request
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsRead_c.html
CQL
● very similar to SQL
● does not support JOINs or subqueries
● no referential integrity
● no cascading operations
● We denormalize the data because joins are not performant in a distributed system
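Denormalization can be illustrated without any database: the same facts are written once per query pattern, so each read is a single lookup. The table names and the `add_video` helper are made up for this sketch:

```python
# Two hypothetical query-specific tables, each keyed for exactly one lookup.
videos_by_id = {}
videos_by_user = {}


def add_video(video_id, user, title):
    """Denormalized write: the same facts are stored once per query pattern."""
    videos_by_id[video_id] = {"user": user, "title": title}
    videos_by_user.setdefault(user, []).append(
        {"video_id": video_id, "title": title})


add_video("v1", "alice", "Intro to CQL")
add_video("v2", "alice", "Data modeling")
add_video("v3", "bob", "Compaction")

# One lookup by key, no join across tables.
alice_titles = [v["title"] for v in videos_by_user["alice"]]
print(alice_titles)
```

The cost moves to the write side: every insert touches both tables, and the application is responsible for keeping them in step.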
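CQL
● no index, no service :) -> by default, a query that filters on a column which is neither part of the primary key nor indexed is rejected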
CQL - Collections
● CQL supports collection types for columns
○ list
○ map
○ set
● Add new collections to the previous example
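The three collection types behave much like their Python counterparts. The row below is a hypothetical Python analogy, not CQL, and the column names are invented:

```python
# A hypothetical "users" row with one column of each collection type.
row = {
    "user_id": "alice",
    "emails": {"alice@example.com"},         # set: unique elements
    "top_places": ["Berlin", "Paris"],       # list: ordered, duplicates allowed
    "todo": {"2014-01-01": "write slides"},  # map: key -> value
}

row["emails"].add("alice@example.com")   # adding a duplicate leaves the set unchanged
row["top_places"].append("Berlin")       # a list keeps the duplicate
row["todo"]["2014-01-02"] = "give talk"  # a map adds or overwrites by key

print(len(row["emails"]), row["top_places"], len(row["todo"]))
```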
Cassandra vs MySQL (50GB)
● MySQL
○ writes avg: ~300ms
○ reads avg: ~350ms
● Cassandra
○ writes avg: ~0.12ms
○ reads avg: ~15ms
source: http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf
Summary
● elastic scaling (scaling out instead of up)
● huge amounts of data can be handled while maintaining high throughput rates
● requires fewer DBAs and fewer management resources
○ automatic repairs/data distribution
○ simpler data models
● better economics
○ cost per GB is much lower than for RDBMS due to clusters of commodity hardware
○ we handle more data with less money
● flexible data models
○ very relaxed or even non-existent data model restrictions
○ changes to data model are much cheaper
Summary
● might not be mature enough for enterprises
● compatibility issues regarding standards
○ each DB has its own API
○ not easy to switch to another NoSQL DB
● search support is not the same as in RDBMS
● easier to find experienced RDBMS experts than NoSQL experts
Which DB for which purpose?
● NoSQL is an alternative
○ addresses certain limitations of the relational DB world
● depends on characteristics of data
○ if data is well structured -> relational DB might be better
○ if data is very complex -> might be difficult to map it to the relational model
● depends on volatility of the data model
○ what if schema changes daily?
● relational DBs still have their pluses
○ relational model / transactions / query language
○ should be used when multi-row transactions and strict consistency are required
Thank you! - Questions?