Distributed Data Stores and
NoSQL Databases
S. Sudarshan
IIT Bombay
Parallel Databases and Data Stores
 Relational Databases – mainstay of business
 Web-based applications caused spikes in load
 Especially true for public-facing e-Commerce sites
 Many application servers, one database
 Easy to parallelize application servers to 1000s of
servers, harder to parallelize databases to same scale
 First solution: memcache or other caching
mechanisms to reduce database access
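The caching idea above can be sketched as a cache-aside read path. This is only an illustration under stated assumptions: the `Database` and `CacheAside` classes are hypothetical, and a plain dict stands in for the cache rather than any real memcache client.

```python
class Database:
    """Hypothetical stand-in for the single relational database."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0  # counts how often the database is actually hit

    def query(self, key):
        self.reads += 1
        return self.rows.get(key)


class CacheAside:
    """Look in the cache first; fall back to the database on a miss."""

    def __init__(self, db):
        self.db = db
        self.cache = {}  # assumption: a plain dict stands in for the cache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.query(key)
        if value is not None:
            self.cache[key] = value
        return value

    def update(self, key, value):
        # Write to the database, then invalidate the cached copy so the
        # next read fetches the fresh value.
        self.db.rows[key] = value
        self.cache.pop(key, None)


db = Database({"item:1": "roller skates"})
store = CacheAside(db)
store.get("item:1")
store.get("item:1")
print(db.reads)  # 1: the second read was served from the cache
```

Invalidating on update, rather than overwriting the cached value, is one common design choice; it keeps the write path simple at the cost of one extra database read afterwards.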
Scaling Up
 What if the dataset is huge and the number of
transactions per second is very high?
 Use multiple servers to host database
 Parallel databases have been around for a
while
 But expensive, and designed for decision
support not OLTP
Scaling RDBMS – Master/Slave
 Master-Slave
 All writes go to the master; all reads are
performed against the replicated slave
databases
 Good for read-mostly applications with very
few updates
 Critical reads may return stale data, as writes may
not yet have been propagated to the slaves
 Large data sets can pose problems as the
master needs to duplicate data to slaves
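The stale-read problem can be made concrete with a small simulation. This is a sketch under assumptions, not how any particular RDBMS replicates: the `MasterSlave` class is hypothetical, and asynchronous propagation is modelled as an explicit `replicate()` step.

```python
class Node:
    def __init__(self):
        self.data = {}


class MasterSlave:
    """Writes go to the master; reads go to a slave (hypothetical sketch)."""

    def __init__(self, num_slaves=2):
        self.master = Node()
        self.slaves = [Node() for _ in range(num_slaves)]
        self.pending = []  # writes not yet propagated to the slaves
        self.next_slave = 0

    def write(self, key, value):
        self.master.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Round-robin over the slaves; the master is never read.
        slave = self.slaves[self.next_slave]
        self.next_slave = (self.next_slave + 1) % len(self.slaves)
        return slave.data.get(key)

    def replicate(self):
        # Asynchronous propagation, modelled here as an explicit step.
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


cluster = MasterSlave()
cluster.write("balance", 100)
stale = cluster.read("balance")  # the write has not propagated yet
cluster.replicate()
fresh = cluster.read("balance")  # visible once replication has run
print(stale, fresh)  # None 100
```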
Scaling RDBMS - Partitioning
 Partitioning
 Divide the database across many machines
 E.g. hash or range partitioning
 Handled transparently by parallel databases
 but they are expensive
 “Sharding”
 Divide data amongst many cheap databases
(MySQL/PostgreSQL)
 Manage parallel access in the application
 Scales well for both reads and writes
 Not transparent, application needs to be partition-aware
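Hash and range partitioning can each be written as a small routing function that a partition-aware application would call before touching a shard. The function names and the split points below are illustrative assumptions.

```python
import bisect
import hashlib


def hash_partition(key, num_shards):
    """Map a key to a shard by hashing it (stable across processes)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_partition(key, boundaries):
    """Map a key to a shard by comparing against sorted split points.

    With boundaries ["h", "p"], keys below "h" go to shard 0, keys from
    "h" up to (not including) "p" go to shard 1, and the rest to shard 2.
    """
    return bisect.bisect_right(boundaries, key)


print(range_partition("alice", ["h", "p"]))    # 0
print(range_partition("mallory", ["h", "p"]))  # 1
print(range_partition("zoe", ["h", "p"]))      # 2
print(hash_partition("alice", 4) in range(4))  # True
```

Hash partitioning spreads keys evenly but scatters adjacent keys; range partitioning keeps adjacent keys together, which helps range scans but can create hot shards.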
What is NoSQL?
 Stands for Not Only SQL
 Class of non-relational data storage systems
 E.g. BigTable, Dynamo, PNUTS/Sherpa, ..
 Usually do not require a fixed table schema nor
do they use the concept of joins
 All NoSQL offerings relax one or more of the
ACID properties (will talk about the CAP
theorem)
 Not a backlash/rebellion against RDBMS
 SQL is a rich query language that cannot be
rivaled by the current list of NoSQL offerings
Why Now?
 Explosion of social media sites (Facebook,
Twitter) with large data needs
 Explosion of storage needs in large web
sites such as Google, Yahoo
 Much of the data is not files
 Rise of cloud-based solutions such as
Amazon S3 (Simple Storage Service)
 Shift to dynamically-typed data with frequent
schema changes
 Open-source community
Distributed Key-Value Data Stores
 Distributed key-value data storage systems allow
key-value pairs to be stored (and retrieved on key)
in a massively parallel system
 E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon
Dynamo, ..
 Partitioning, high availability, etc. are completely
transparent to the application
 Sharding systems and key-value stores don’t
support many relational features
 No join operations (except within partition)
 No referential integrity constraints across partitions
 etc.
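Without cross-partition joins or referential integrity, the application has to do that work itself. A minimal sketch, assuming two hypothetical collections (`users` and `orders`) that live on different partitions and are reachable only by key:

```python
# Hypothetical data: each dict stands in for a separately partitioned store.
users = {1: {"name": "Wile E. Coyote"}, 2: {"name": "Road Runner"}}
orders = {
    10: {"user_id": 1, "item": "Rocket Sled"},
    11: {"user_id": 1, "item": "Jet Unicycle"},
    12: {"user_id": 3, "item": "Roller Skates"},  # user 3 does not exist
}


def orders_with_user_names(orders, users):
    """Application-side join: one extra key lookup per order."""
    result = []
    for order_id in sorted(orders):
        order = orders[order_id]
        user = users.get(order["user_id"])
        # Nothing in the store prevents a dangling user_id, so the
        # application must handle the missing row itself.
        name = user["name"] if user is not None else None
        result.append((order_id, order["item"], name))
    return result


joined = orders_with_user_names(orders, users)
for row in joined:
    print(row)
```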
Typical NoSQL API
 Basic API access:
 get(key) -- Extract the value given a key
 put(key, value) -- Create or update the value
given its key
 delete(key) -- Remove the key and its
associated value
 execute(key, operation, parameters) --
Invoke an operation on the value (given its
key), where the value is a special data structure (e.g.
List, Set, Map, etc.).
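The four calls above can be sketched as a single-process, in-memory class. The `KeyValueStore` name is hypothetical, and a real store would route each call to the partition that owns the key; here `execute` is interpreted as invoking a named method on the stored Python value, which is an assumption made for illustration.

```python
class KeyValueStore:
    """In-memory sketch of the four-call API (hypothetical class)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke a named method on the stored value, e.g. list.append.
        value = self._data[key]
        return getattr(value, operation)(*parameters)


store = KeyValueStore()
store.put("cart:42", [])
store.execute("cart:42", "append", "roller skates")
store.execute("cart:42", "append", "rocket sled")
cart = store.get("cart:42")
print(cart)  # ['roller skates', 'rocket sled']
store.delete("cart:42")
print(store.get("cart:42"))  # None
```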
Flexible Data Model
ColumnFamily: Rockets

Key  Name          Value
1    name          Rocket-Powered Roller Skates
     toon          Ready, Set, Zoom
     inventoryQty  5
     brakes        false
2    name          Little Giant Do-It-Yourself Rocket-Sled Kit
     toon          Beep Prepared
     inventoryQty  4
     brakes        false
3    name          Acme Jet Propelled Unicycle
     toon          Hot Rod and Reel
     inventoryQty  1
     wheels        1
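The Rockets column family can be written out as nested dictionaries to show what "flexible" means in practice: each row carries its own set of column names. This is only a modelling sketch in plain Python, not the API of any particular store.

```python
# Row key -> {column name -> value}; the rows do not share a fixed schema.
rockets = {
    1: {
        "name": "Rocket-Powered Roller Skates",
        "toon": "Ready, Set, Zoom",
        "inventoryQty": 5,
        "brakes": False,
    },
    2: {
        "name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared",
        "inventoryQty": 4,
        "brakes": False,
    },
    3: {
        "name": "Acme Jet Propelled Unicycle",
        "toon": "Hot Rod and Reel",
        "inventoryQty": 1,
        "wheels": 1,
    },
}

# Row 3 has a "wheels" column and no "brakes" column.
columns_of_row_3 = sorted(rockets[3])
print(columns_of_row_3)  # ['inventoryQty', 'name', 'toon', 'wheels']

# Readers must tolerate missing columns instead of relying on a schema.
rows_with_brakes = [key for key, row in rockets.items() if "brakes" in row]
print(rows_with_brakes)  # [1, 2]
```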
NoSQL Data Storage: Classification
 Uninterpreted key/value or ‘the big hash
table’.
 Amazon S3, Amazon Dynamo
 Flexible schema
 BigTable, Cassandra, HBase (ordered keys,
semi-structured data),
 Sherpa/PNUTS (unordered keys, JSON)
 MongoDB (based on JSON)
 CouchDB (name/value in text)
PNUTS Data Storage Architecture
CAP Theorem
 Three properties of a system
 Consistency (all copies have same value)
 Availability (system can run even if parts have failed)
 Partition tolerance (system keeps working even if the network breaks into two or more parts,
each with active systems that can’t talk to other parts)
 Brewer’s CAP “Theorem”: You can have at most
two of these three properties for any system
 Very large systems will partition at some point
 Choose one of consistency or availability
 Traditional databases choose consistency
 Most Web applications choose availability
 Except for specific parts such as order processing
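The choice between consistency and availability during a partition can be illustrated with a toy two-replica register. The `ReplicatedRegister` class and its "CP"/"AP" modes are hypothetical names for this sketch; real systems use quorums and more elaborate protocols.

```python
class Replica:
    def __init__(self):
        self.value = None


class ReplicatedRegister:
    """Two replicas of one value; mode is "CP" or "AP" (hypothetical)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, replica_index, value):
        if not self.partitioned:
            for replica in self.replicas:
                replica.value = value
            return True
        if self.mode == "CP":
            # Consistency chosen: refuse the write rather than diverge.
            return False
        # Availability chosen: accept locally; the copies now differ.
        self.replicas[replica_index].value = value
        return True

    def consistent(self):
        return self.replicas[0].value == self.replicas[1].value


cp = ReplicatedRegister("CP")
ap = ReplicatedRegister("AP")
for system in (cp, ap):
    system.write(0, "v1")      # network healthy: both copies updated
    system.partitioned = True  # the network now splits

cp_accepted = cp.write(0, "v2")
ap_accepted = ap.write(0, "v2")
print(cp_accepted, cp.consistent())  # False True
print(ap_accepted, ap.consistent())  # True False
```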
Availability
 Traditionally thought of as the
server/process being available “five 9s” (99.999%)
of the time.
 However, in a system with a large number of nodes, at almost
any point in time there’s a good chance that
a node is down or that there is a network
disruption among the nodes.
 Want a system that is resilient in the face of
network disruption
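A quick calculation shows why per-node availability does not carry over to a large cluster. The sketch assumes that every node is independently up 99.999% of the time; both the independence and the uniform per-node figure are simplifying assumptions, not measurements.

```python
def prob_some_node_down(num_nodes, node_availability=0.99999):
    """Probability that at least one node is down at a given instant,
    assuming nodes fail independently (a simplifying assumption)."""
    return 1.0 - node_availability ** num_nodes


for n in (1, 1_000, 100_000):
    print(n, prob_some_node_down(n))
```

Under these assumptions the chance of some node being down is about 1% with a thousand nodes and well over half with a hundred thousand, and that is before counting network disruptions.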
Eventual Consistency
 When no updates occur for a long period of time, eventually
all updates will propagate through the system and all the
nodes will be consistent
 For a given accepted update and a given node, eventually
either the update reaches the node or the node is removed
from service
 Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
 Soft state: copies of a data item may be inconsistent
 Eventually Consistent – copies become consistent at
some later time if there are no more updates to that data
item
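Convergence after updates stop can be sketched with replicas that exchange state and keep the newest version of each item. The last-writer-wins rule with caller-supplied timestamps is an assumption chosen for brevity; the slides do not prescribe a reconciliation rule, and real systems may use version vectors or other schemes.

```python
class Replica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        # Anti-entropy: pull every entry; the newer timestamp wins.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


def sync_all(replicas):
    """One full round of pairwise exchange between all replicas."""
    for a in replicas:
        for b in replicas:
            if a is not b:
                a.merge_from(b)


replicas = [Replica() for _ in range(3)]
replicas[0].write("status", "shipped", timestamp=1)
replicas[2].write("status", "delivered", timestamp=2)

before = [r.read("status") for r in replicas]  # soft state: copies differ
sync_all(replicas)
after = [r.read("status") for r in replicas]   # no new updates: converged
print(before)  # ['shipped', None, 'delivered']
print(after)   # ['delivered', 'delivered', 'delivered']
```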
Common Advantages
 Cheap, easy to implement (open source)
 Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be
partitioned
 When data is written, the latest version is on at least
one node and then replicated to other nodes
 Down nodes easily replaced
 No single point of failure
 Easy to distribute
 Don't require a schema
What does NoSQL Not Provide?
 Joins
 Group by
 But PNUTS provides an interesting
materialized view approach to
joins/aggregation.
 ACID transactions
 SQL
 Integration with applications that are based
on SQL
Should I be using NoSQL Databases?
 NoSQL data storage systems make sense for
applications that need to deal with very large
semi-structured data sets
 Log Analysis
 Social Networking Feeds
 Most of us work on organizational databases,
which are not that large and have low
update/query rates
 regular relational databases are THE correct
solution for such applications
Further Reading
 Lots of material on the Web
 E.g. nice presentation on NoSQL by Perry Hoekstra
(Perficient)
 Some material in this talk is from the above
presentation
 Use a search engine to find information on data
storage systems such as
 BigTable (Google), Dynamo (Amazon), Cassandra
(Facebook/Apache), PNUTS/Sherpa (Yahoo),
CouchDB, MongoDB, …
 Several of the above are open source

