05 No SQL Sudarshan.ppt

Distributed Data Stores and
No SQL Databases
S. Sudarshan
IIT Bombay

Parallel Databases and Data Stores
 Relational Databases – mainstay of business
 Web-based applications caused spikes
 Especially true for public-facing e-Commerce sites
 Many application servers, one database
 Easy to parallelize application servers to 1000s of
servers, harder to parallelize databases to same scale
 First solution: memcache or other caching
mechanisms to reduce database access

Scaling Up
 What if the dataset is huge, and very high
number of transactions per second
 Use multiple servers to host database
 Parallel databases have been around for a
while
 But expensive, and designed for decision
support not OLTP

Scaling RDBMS – Master/Slave
 Master-Slave
 All writes are written to the master. All reads
performed against the replicated slave
databases
 Good for mostly read, very few update
applications
 Critical reads may be incorrect as writes may
not have been propagated down
 Large data sets can pose problems as
master needs to duplicate data to slaves

Scaling RDBMS - Partitioning
 Partitioning
 Divide the database across many machines
 E.g. hash or range partitioning
 Handled transparently by parallel databases
 but they are expensive
 “Sharding”
 Divide data amongst many cheap databases
(MySQL/PostgreSQL)
 Manage parallel access in the application
 Scales well for both reads and writes
 Not transparent, application needs to be partition-aware

What is NoSQL?
 Stands for Not Only SQL
 Class of non-relational data storage systems
 E.g. BigTable, Dynamo, PNUTS/Sherpa, ..
 Usually do not require a fixed table schema nor
do they use the concept of joins
 All NoSQL offerings relax one or more of the
ACID properties (will talk about the CAP
theorem)
 Not a backlash/rebellion against RDBMS
 SQL is a rich query language that cannot be
rivaled by the current list of NoSQL offerings

Why Now?
 Explosion of social media sites (Facebook,
Twitter) with large data needs
 Explosion of storage needs in large web
sites such as Google, Yahoo
 Much of the data is not files
 Rise of cloud-based solutions such as
Amazon S3 (simple storage solution)
 Shift to dynamically-typed data with frequent
schema changes
 Open-source community

Distributed Key-Value Data Stores
 Distributed key-value data storage systems allow
key-value pairs to be stored (and retrieved on key)
in a massively parallel system
 E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon
Dynamo, ..
 Partitioning, high availability etc completely
transparent to application
 Sharding systems and key-value stores don’t
support many relational features
 No join operations (except within partition)
 No referential integrity constraints across partitions
 etc.

Typical NoSQL API
 Basic API access:
 get(key) -- Extract the value given a key
 put(key, value) -- Create or update the value
given its key
 delete(key) -- Remove the key and its
associated value
 execute(key, operation, parameters) --
Invoke an operation to the value (given its
key) which is a special data structure (e.g.
List, Set, Map .... etc).

Flexible Data Model
ColumnFamily: Rockets
Key Value
1
2
3
Name Value
toon
inventoryQty
brakes
Rocket-Powered Roller Skates
Ready, Set, Zoom
5
false
name
Name Value
toon
inventoryQty
brakes
Little Giant Do-It-Yourself Rocket-Sled Kit
Beep Prepared
4
false
Name Value
toon
inventoryQty
wheels
Acme Jet Propelled Unicycle
Hot Rod and Reel
1
1
name
name

NoSQL Data Storage: Classification
 Uninterpreted key/value or ‘the big hash
table’.
 Amazon S3 (Dynamo)
 Flexible schema
 BigTable, Cassandra, HBase (ordered keys,
semi-structured data),
 Sherpa/PNuts (unordered keys, JSON)
 MongoDB (based on JSON)
 CouchDB (name/value in text)

PNUTS Data Storage Architecture

CAP Theorem
 Three properties of a system
 Consistency (all copies have same value)
 Availability (system can run even if parts have failed)
 Partitions (network can break into two or more parts,
each with active systems that can’t talk to other parts)
 Brewer’s CAP “Theorem”: You can have at most
two of these three properties for any system
 Very large systems will partition at some point
 Choose one of consistency or availablity
 Traditional database choose consistency
 Most Web applications choose availability
 Except for specific parts such as order processing

Availability
 Traditionally, thought of as the
server/process available five 9’s (99.999
%).
 However, for large node system, at almost
any point in time there’s a good chance that
a node is either down or there is a network
disruption among the nodes.
 Want a system that is resilient in the face of
network disruption

Eventual Consistency
 When no updates occur for a long period of time, eventually
all updates will propagate through the system and all the
nodes will be consistent
 For a given accepted update and a given node, eventually
either the update reaches the node or the node is removed
from service
 Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
 Soft state: copies of a data item may be inconsistent
 Eventually Consistent – copies becomes consistent at
some later time if there are no more updates to that data
item

Common Advantages
 Cheap, easy to implement (open source)
 Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be
partitioned
 When data is written, the latest version is on at least
one node and then replicated to other nodes
 Down nodes easily replaced
 No single point of failure
 Easy to distribute
 Don't require a schema

What does NoSQL Not Provide?
 Joins
 Group by
 But PNUTS provides interesting
materialized view approach to
joins/aggregation.
 ACID transactions
 SQL
 Integration with applications that are based
on SQL

Should I be using NoSQL Databases?
 NoSQL Data storage systems makes sense for
applications that need to deal with very very large
semi-structured data
 Log Analysis
 Social Networking Feeds
 Most of us work on organizational databases,
which are not that large and have low
update/query rates
 regular relational databases are THE correct
solution for such applications

Further Reading
 Lots of material on the Web
 E.g. nice presentation on NoSQL by Perry Hoekstra
(Perficient)
 Some material in this talk is from above
presentation
 Use a search engine to find information on data
storage systems such as
 BigTable (Google), Dynamo (Amazon), Cassandra
(Facebook/Apache), Pnuts/Sherpa (Yahoo),
CouchDB, MongoDB, …
 Several of above are open source

05 No SQL Sudarshan.ppt

More Related Content

Similar to 05 No SQL Sudarshan.ppt (20)

Recently uploaded (20)

05 No SQL Sudarshan.ppt