Database Applications (15-415)
Part I- NoSQL Databases
Lecture 26, April 21, 2015
Mohammad Hammoud
Today…
• Last Session:
  • Recovery Management
• Today’s Session:
  • NoSQL databases
• Announcements:
  • PS4 grades are out
  • On Thursday, April 23rd, we will practice on Hive (during recitation)
  • PS5 (the “last” assignment) is due on Thursday, April 23rd by midnight
  • P4: Write a survey on SQL vs. NoSQL databases (optional), due on Friday, April 24th by midnight
  • The final exam is on Monday, April 27th, from 8:30 AM to 11:30 AM in room 1190 (all materials are included; open book, open notes)
Outline
Types of Data
Scaling Databases & the 2PC Protocol
The CAP Theorem and the BASE Properties
NoSQL Databases
Types of Data
• Data can be broadly classified into four types:
  1. Structured Data:
     • Has a predefined model, which organizes data into a form that is relatively easy to store, process, retrieve and manage
     • E.g., relational data
  2. Unstructured Data:
     • Opposite of structured data
     • E.g., flat binary files containing text, video or audio
     • Note: such data is not completely devoid of structure (e.g., an audio file may still have an encoding structure and some metadata associated with it)
  3. Dynamic Data:
     • Data that changes relatively frequently
     • E.g., office documents and transactional entries in a financial database
  4. Static Data:
     • Opposite of dynamic data
     • E.g., medical imaging data from MRI or CT scans
Why Classify Data?
• Placing data in one of the following four quadrants can help in designing and developing a suitable storage solution
  • Relational databases are usually used for structured data
  • File systems or NoSQL databases can be used for (static) unstructured data (more on these later)

               Dynamic                                      Static
Unstructured   Media Production, eCAD, mCAD, Office Docs    Media Archive, Broadcast, Medical Imaging
Structured     Transaction Systems, ERP, CRM                BI, Data Warehousing
Scaling Traditional Databases
• Traditional RDBMSs can be scaled either:
  • Vertically (or Up)
    • Can be achieved by hardware upgrades (e.g., faster CPU, more memory, or larger disk)
    • Limited by the amount of CPU, RAM and disk that can be configured on a single machine
  • Horizontally (or Out)
    • Can be achieved by adding more machines
    • Requires database sharding and, in most cases, replication
    • Limited by the Read-to-Write ratio and communication overhead
Why Shard Data?
• Data is typically sharded (or striped) to allow for concurrent/parallel accesses
[Figure: input data (a large file) is split into chunks that are spread across Machines 1, 2 and 3.]
E.g., Chunks 1, 3 and 5 sit on different machines and can be accessed in parallel
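The idea of spreading chunks over machines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "machines" are simulated with threads, chunks are dealt round-robin (the slide does not say which placement policy is used), and the names shard and count_bytes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(data: bytes, chunk_size: int, num_machines: int) -> list[list[bytes]]:
    """Cut data into fixed-size chunks and deal them round-robin to machines."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    machines: list[list[bytes]] = [[] for _ in range(num_machines)]
    for index, chunk in enumerate(chunks):
        machines[index % num_machines].append(chunk)
    return machines


def count_bytes(machine_chunks: list[bytes]) -> int:
    """Stand-in for per-machine work: each 'machine' scans only its own chunks."""
    return sum(len(chunk) for chunk in machine_chunks)


data = bytes(range(60))                      # assumed toy input: 60 bytes
machines = shard(data, chunk_size=10, num_machines=3)

# Each simulated machine processes its chunks concurrently with the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_bytes, machines))
total = sum(partial_counts)
```

Because no chunk lives on two machines, the per-machine scans do not contend for the same data, which is what makes the parallel access possible.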
Amdahl’s Law
• How much faster will a parallel program run?
  • Suppose that the sequential execution of a program takes $T_1$ time units and the parallel execution on $p$ processors/machines takes $T_p$ time units
  • Suppose that, out of the entire execution of the program, a fraction $s$ is not parallelizable while a fraction $1-s$ is parallelizable
  • Then the speedup (Amdahl’s formula) is:
    $$\text{Speedup} = \frac{T_1}{T_p} = \frac{1}{s + \frac{1-s}{p}}$$
Amdahl’s Law: An Example
• Suppose that:
  • 80% of your program can be parallelized
  • 4 machines are used to run your parallel version of the program
• The speedup you can get according to Amdahl’s law is:
  $$\text{Speedup} = \frac{1}{0.2 + \frac{0.8}{4}} = \frac{1}{0.4} = 2.5$$
Although you use 4 processors, you cannot get a speedup of more than 2.5 times!
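The arithmetic of this example can be checked with a small Python helper. It is a minimal sketch of Amdahl's formula, speedup = 1 / (s + (1 - s) / p); the function name amdahl_speedup is an assumption made for illustration.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Speedup T1/Tp when a fraction s of the run is serial and p processors are used."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (s + (1.0 - s) / p)


# The slide's example: 80% parallelizable (s = 0.2) on 4 machines.
example = amdahl_speedup(s=0.2, p=4)

# Adding machines helps less and less; the speedup can never exceed 1/s.
upper_bound = 1.0 / 0.2
many_machines = amdahl_speedup(s=0.2, p=1000)
```

With s = 0.2 the speedup stays below 5 no matter how many machines are added, which is why the first guideline for parallel programs is to shrink the serial fraction.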
Ideal vs. Actual Cases
• Amdahl’s argument is too simplified
• In reality, communication overhead and potential workload imbalance exist when running parallel programs
[Figure, panel 1: Parallel Speed-up, An Ideal Case. A serial run of 20 time units that cannot be parallelized plus 80 that can; in the parallel run, the 80 units are split evenly into 20 units on each of Processes 1 to 4.]
[Figure, panel 2: Parallel Speed-up, An Actual Case. The same split, but with load imbalance across Processes 1 to 4 and added communication overhead.]
Some Guidelines
• Here are some guidelines to effectively benefit from parallelization:
  1. Maximize the fraction of your program that can be parallelized
  2. Balance the workload of parallel processes
  3. Minimize the time spent on communication
Why Replicate Data?
• Replicating data across servers helps in:
  • Avoiding performance bottlenecks
  • Avoiding single points of failure
  • And, hence, enhancing scalability and availability
[Figure: a main server and its replicated servers.]
But, Consistency Becomes a Challenge
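• An example:
  • In an e-commerce application, the bank database has been replicated across two servers
  • Maintaining consistency of replicated data is a challenge
[Figure: Replicated Database. Both replicas start with Bal=1000. Event 1 = Add $1000 and Event 2 = Add interest of 5% reach the replicas in different orders:
  1. Replica 1 applies Event 1: Bal=2000
  2. Replica 2 applies Event 2: Bal=1050
  3. Replica 2 applies Event 1: Bal=2050
  4. Replica 1 applies Event 2: Bal=2100
The two replicas end with different balances.]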
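The divergence in this example comes purely from the order in which the two events are applied. The following Python sketch replays both orders; integer arithmetic is used to avoid rounding noise, and the function names are hypothetical.

```python
def add_deposit(balance: int) -> int:
    """Event 1: add $1000."""
    return balance + 1000


def add_interest(balance: int) -> int:
    """Event 2: add interest of 5% (integer arithmetic, whole dollars)."""
    return balance + balance * 5 // 100


def apply_events(balance: int, events) -> int:
    """Apply the events to a replica in the order in which they arrive."""
    for event in events:
        balance = event(balance)
    return balance


# Same starting balance, same events, different arrival order.
replica_1 = apply_events(1000, [add_deposit, add_interest])   # 1000 -> 2000 -> 2100
replica_2 = apply_events(1000, [add_interest, add_deposit])   # 1000 -> 1050 -> 2050
```

The two operations do not commute, so replicas that see them in different orders disagree unless something (such as a commit protocol) forces a single agreed order.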
The Two-Phase Commit Protocol
• The two-phase commit protocol (2PC) can be used to ensure atomicity and consistency
  • Phase I (Voting): the Coordinator sends VOTE_REQUEST to every participant (Database Servers 1, 2 and 3); in the figure, each participant replies with VOTE_COMMIT
  • Phase II (Commit): the Coordinator sends GLOBAL_COMMIT to every participant; each participant then performs a LOCAL_COMMIT
• This gives “strict” consistency, which limits scalability!
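The message flow of the two phases can be sketched as a small in-process simulation. This is a simplified illustration, not a full implementation: it ignores crashes, timeouts and logging, and the abort messages (VOTE_ABORT, GLOBAL_ABORT) as well as the class and state names are assumptions that do not appear on the slides.

```python
class Participant:
    """A database server taking part in the commit decision."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self) -> str:
        # Phase I: promise to commit, or refuse (refusal is an assumed extension).
        if self.can_commit:
            self.state = "READY"
            return "VOTE_COMMIT"
        self.state = "ABORTED"
        return "VOTE_ABORT"

    def on_decision(self, decision: str) -> None:
        # Phase II: apply the coordinator's global decision locally.
        self.state = "COMMITTED" if decision == "GLOBAL_COMMIT" else "ABORTED"


def two_phase_commit(participants: list[Participant]) -> str:
    """Coordinator logic: commit only if every participant voted to commit."""
    votes = [p.on_vote_request() for p in participants]            # Phase I
    unanimous = all(vote == "VOTE_COMMIT" for vote in votes)
    decision = "GLOBAL_COMMIT" if unanimous else "GLOBAL_ABORT"
    for p in participants:                                          # Phase II
        p.on_decision(decision)
    return decision


servers = [Participant("Database Server 1"),
           Participant("Database Server 2"),
           Participant("Database Server 3")]
outcome = two_phase_commit(servers)
```

Every transaction waits for a full round trip to all participants in each phase, which hints at why this kind of strict consistency limits scalability.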
The CAP Theorem
• The limitations of distributed databases can be described by the so-called CAP theorem
  • Consistency: every node sees the same data at any given instant (i.e., strict consistency)
  • Availability: the system continues to operate, even if nodes in a cluster crash, or some hardware or software parts are down due to upgrades
  • Partition Tolerance: the system continues to operate in the presence of network partitions
CAP theorem: any distributed database with shared data can have at most two of the three desirable properties: C, A or P
The CAP Theorem (Cont’d)
• Let us assume two nodes on opposite sides of a network partition:
  • Availability + Partition Tolerance forfeits Consistency
  • Consistency + Partition Tolerance entails that one side of the partition must act as if it is unavailable, thus forfeiting Availability
  • Consistency + Availability is only possible if there is no network partition, thereby forfeiting Partition Tolerance
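The trade-off between the first two choices can be made concrete with a toy two-node store. The sketch below is an illustration under simplifying assumptions: the class TwoNodeStore and its "CP"/"AP" modes are hypothetical, and in CP mode it refuses writes on both sides of the partition rather than on just one.

```python
class TwoNodeStore:
    """Two replicas of one shared value, with a switchable network partition."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [0, 0]       # one copy of the shared value per node
        self.partitioned = False   # True when the nodes cannot reach each other

    def write(self, node: int, value: int) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            self.values = [value, value]   # replicate to both nodes
            return True
        if self.mode == "CP":
            return False                   # stay consistent, give up availability
        self.values[node] = value          # stay available, give up consistency
        return True

    def read(self, node: int) -> int:
        return self.values[node]


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write(0, 1)          # no partition yet: both replicas hold 1
    store.partitioned = True
cp_accepted = cp.write(0, 2)   # refused during the partition
ap_accepted = ap.write(0, 2)   # accepted, but only node 0 sees it
```

After the partition, the CP store still returns the same value from both nodes, while the AP store answers every request but returns different values depending on which node is asked.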
Large-Scale Databases
• When companies such as Google and Amazon were designing large-scale databases, 24/7 Availability was a key requirement
  • A few minutes of downtime means lost revenue
• When horizontally scaling databases to thousands of machines, the likelihood of a node or a network failure increases tremendously
• Therefore, in order to have strong guarantees on Availability and Partition Tolerance, they had to sacrifice “strict” Consistency (as implied by the CAP theorem)
Trading-Off Consistency
• Maintaining consistency should balance the strictness of consistency against availability/scalability
  • Good-enough consistency depends on your application
[Figure: a spectrum from Strict Consistency (generally hard to implement, and inefficient) to Loose Consistency (easier to implement, and efficient).]
The BASE Properties
• The CAP theorem states that it is impossible to guarantee strict Consistency and Availability while being able to tolerate network partitions
• This resulted in databases with relaxed ACID guarantees
• In particular, such databases apply the BASE properties:
  • Basically Available: the system guarantees Availability
  • Soft-State: the state of the system may change over time (even without new input, as replicas converge)
  • Eventual Consistency: the system will eventually become consistent
Eventual Consistency
• A database is termed Eventually Consistent if:
  • All replicas gradually become consistent in the absence of updates
[Figure: an update to Webpage-A is applied at one replica and gradually propagates to the other replicas of Webpage-A.]
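One way to picture replicas gradually becoming consistent is a round-based propagation in which each replica copies a newer version from a neighbour. The sketch below is a deliberately simple model, not a description of any particular system: the ring topology, the version counter and the function name gossip_round are all assumptions.

```python
def gossip_round(replicas: list[dict]) -> None:
    """Each replica pulls from its left neighbour in a ring and keeps the newer version."""
    snapshot = [dict(replica) for replica in replicas]   # state at the start of the round
    for i, replica in enumerate(replicas):
        peer = snapshot[i - 1]                           # index -1 wraps around the ring
        if peer["version"] > replica["version"]:
            replica.update(peer)


# Six replicas of Webpage-A; the update initially lands on only one of them.
replicas = [{"version": 1, "content": "Webpage-A v1"} for _ in range(6)]
replicas[0] = {"version": 2, "content": "Webpage-A v2"}

rounds = 0
while len({replica["version"] for replica in replicas}) > 1:
    gossip_round(replicas)
    rounds += 1
```

While the loop runs, different replicas hold different versions; once no further updates arrive, the newest version reaches every replica and the loop ends.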
Eventual Consistency: A Main Challenge
• But what if the client accesses the data from different replicas?
[Figure: Webpage-A is updated at one replica while a client reads Webpage-A from other replicas that may not have received the update yet.]
Protocols like Read Your Own Writes (RYOW) can be applied!
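A read-your-own-writes guarantee can be approximated by having the client session remember the version of its latest write and accept only replicas that are at least that fresh. The sketch below assumes a single writer and a simple per-replica version counter; the Replica and Session classes are hypothetical.

```python
class Replica:
    """One copy of the data, tagged with the version it has received so far."""

    def __init__(self) -> None:
        self.version = 0
        self.content = ""


class Session:
    """Client session that remembers the version of its own latest write."""

    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = replicas
        self.last_written_version = 0

    def write(self, replica_index: int, content: str) -> None:
        replica = self.replicas[replica_index]
        replica.version += 1
        replica.content = content
        self.last_written_version = replica.version

    def read(self, replica_index: int) -> str:
        """Use the requested replica if it is fresh enough, else fall back to one that is."""
        candidates = [self.replicas[replica_index]] + self.replicas
        for replica in candidates:
            if replica.version >= self.last_written_version:
                return replica.content
        raise RuntimeError("no replica has caught up with this session's writes")


replicas = [Replica(), Replica()]
session = Session(replicas)
session.write(0, "Webpage-A (updated)")   # the update reaches replica 0 only
seen_by_writer = session.read(1)          # replica 1 is stale, so the read is redirected
```

Other clients may still see the stale copy on replica 1 for a while; the guarantee only covers the session that performed the write.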
NoSQL Databases
• To this end, a new class of databases emerged, which mainly follow the BASE properties
  • These were dubbed NoSQL databases
  • E.g., Amazon’s Dynamo and Google’s Bigtable
• Main characteristics of NoSQL databases include:
  • No strict schema requirements
  • No strict adherence to ACID properties
  • Consistency is traded in favor of Availability
Types of NoSQL Databases
• Here is a limited taxonomy of NoSQL databases:
  • Document Stores
  • Graph Databases
  • Key-Value Stores
  • Columnar Databases
Document Stores
• Documents are stored in some standard format or encoding (e.g., XML, JSON, PDF or Office Documents)
  • These are typically referred to as Binary Large Objects (BLOBs)
• Documents can be indexed
  • This allows document stores to outperform traditional file systems
• E.g., MongoDB and CouchDB (both can be queried using MapReduce)
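To illustrate why indexing helps, here is a toy in-memory document store that keeps JSON documents and a secondary index on one field. It is a sketch only and does not reflect the API of MongoDB or CouchDB; the class DocumentStore, its methods and the sample documents are hypothetical.

```python
import json
from collections import defaultdict


class DocumentStore:
    """Stores schema-free JSON documents and indexes them on a single field."""

    def __init__(self, indexed_field: str) -> None:
        self.indexed_field = indexed_field
        self.documents: dict[str, str] = {}      # document id -> JSON text
        self.index = defaultdict(set)            # field value -> set of document ids

    def put(self, doc_id: str, document: dict) -> None:
        old_text = self.documents.get(doc_id)
        if old_text is not None:                 # drop the stale index entry on overwrite
            old_value = json.loads(old_text).get(self.indexed_field)
            self.index[old_value].discard(doc_id)
        self.documents[doc_id] = json.dumps(document)
        if self.indexed_field in document:
            self.index[document[self.indexed_field]].add(doc_id)

    def find(self, value) -> list[dict]:
        """Look up documents through the index instead of scanning every document."""
        doc_ids = sorted(self.index.get(value, ()))
        return [json.loads(self.documents[doc_id]) for doc_id in doc_ids]


store = DocumentStore(indexed_field="tag")
store.put("1", {"name": "Alice", "tag": "report"})
store.put("2", {"name": "Bob", "tag": "report", "pages": 12})   # extra field: no fixed schema
store.put("3", {"name": "Carol", "tag": "invoice"})
reports = [doc["name"] for doc in store.find("report")]
```

A plain file system would have to open and parse every file to answer the same query, whereas the index narrows the search to the matching documents directly.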
Graph Databases
• Data are represented as vertices and edges
• Graph databases are powerful for graph-like queries (e.g., find the shortest path between two elements)
• E.g., Neo4j and VertexDB
[Figure: example graph with three vertices: (Id: 1, Name: Alice, Age: 18), (Id: 2, Name: Bob, Age: 22) and (Id: 3, Name: Chess, Type: Group); the edge labels were not preserved.]
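A shortest-path query of the kind mentioned above can be sketched with a breadth-first search over an adjacency list. The vertices come from the slide's example; the edges (Alice and Bob each linked to the Chess group) are an assumption, since the figure's edges did not survive, and this is plain Python rather than the query language of any graph database.

```python
from collections import deque

vertices = {
    1: {"Name": "Alice", "Age": 18},
    2: {"Name": "Bob", "Age": 22},
    3: {"Name": "Chess", "Type": "Group"},
}
# Assumed edges: both people are connected to the Chess group.
edges = {1: [3], 2: [3], 3: [1, 2]}


def shortest_path(edges: dict, start: int, goal: int):
    """Breadth-first search; returns the list of vertex ids on a shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:          # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in edges.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(edges, start=1, goal=2)
names = [vertices[v]["Name"] for v in path]
```

In a relational schema the same question would need repeated self-joins, which is one reason graph-shaped queries are easier to express over vertices and edges.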
Key-Value Stores
• Keys are mapped to (possibly) more complex values (e.g., lists)
• Keys can be stored in a hash table and can be distributed easily
• Such stores typically support regular CRUD (create, read, update, and delete) operations
  • That is, no joins or aggregate functions
• E.g., Amazon DynamoDB and Apache Cassandra
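The combination of hashing and CRUD operations can be sketched as follows. This toy store partitions keys over simulated nodes by hashing and is not modelled on the API of DynamoDB or Cassandra; the class KeyValueStore and the key "cart:alice" are hypothetical.

```python
import hashlib


class KeyValueStore:
    """Toy key-value store whose keys are hash-partitioned over several nodes."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes = [{} for _ in range(num_nodes)]   # one dict per simulated node

    def _node_for(self, key: str) -> dict:
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def create(self, key: str, value) -> None:
        node = self._node_for(key)
        if key in node:
            raise KeyError(f"{key!r} already exists")
        node[key] = value

    def read(self, key: str):
        return self._node_for(key)[key]               # raises KeyError if absent

    def update(self, key: str, value) -> None:
        node = self._node_for(key)
        if key not in node:
            raise KeyError(f"{key!r} does not exist")
        node[key] = value

    def delete(self, key: str) -> None:
        del self._node_for(key)[key]                  # raises KeyError if absent


store = KeyValueStore(num_nodes=3)
store.create("cart:alice", ["book", "pen"])           # the value is a list, not a scalar
store.update("cart:alice", ["book", "pen", "lamp"])
cart = store.read("cart:alice")
store.delete("cart:alice")
```

Every operation touches exactly one node, chosen from the key alone, which is what makes distribution easy and also why joins and aggregates across keys are not part of the interface.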
Columnar Databases
• Columnar databases are a hybrid of RDBMSs and Key-Value stores
  • Values are stored in groups of zero or more columns, but in Column-Order (as opposed to Row-Order)
  • Values are queried by matching keys
• E.g., HBase and Vertica
[Figure: three records with columns A, B and C, (Alice, 3, 25), (Bob, 4, 19) and (Carol, 0, 45), shown in three storage layouts]
  Row-Order:                       Alice 3 25 | Bob 4 19 | Carol 0 45
  Columnar (or Column-Order):      Alice Bob Carol | 3 4 0 | 25 19 45
  Columnar with Locality Groups:   Column A = Group A: Alice Bob Carol | Column Family {B, C}: 3 25, 4 19, 0 45
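The three layouts in this example can be reproduced with plain Python lists. This is only a sketch of how the same records are ordered in storage; the variable names are hypothetical and no real columnar engine is involved.

```python
# The three records from the example, with columns A, B and C.
rows = [("Alice", 3, 25), ("Bob", 4, 19), ("Carol", 0, 45)]

# Row-order: all values of record 1, then record 2, and so on.
row_order = [value for row in rows for value in row]

# Column-order: all values of column A, then column B, then column C.
column_order = [value for column in zip(*rows) for value in column]

# Locality groups: column A stored alone, columns B and C stored together.
group_a = [a for a, _, _ in rows]
family_bc = [(b, c) for _, b, c in rows]

# A query over column C alone reads one contiguous slice in column-order.
column_c_total = sum(column_order[6:9])
```

In row-order the values of column C are interleaved with the other columns, so the same query would have to skip over every record's A and B values.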
Summary
• Data can be classified into 4 types: structured, unstructured, dynamic and static
• Different data types usually entail different database designs
• Databases can be scaled up or out
• The 2PC protocol can be used to ensure strict consistency
• Strict consistency limits scalability
Summary (Cont’d)
• The CAP theorem states that any distributed database with shared data can have at most two of the three desirable properties:
  • Consistency
  • Availability
  • Partition Tolerance
• The CAP theorem led to various designs of databases with relaxed ACID guarantees
Summary (Cont’d)
• NoSQL (or Not-Only-SQL) databases follow the BASE properties:
  • Basically Available
  • Soft-State
  • Eventual Consistency
• NoSQL databases have different types:
  • Document Stores
  • Graph Databases
  • Key-Value Stores
  • Columnar Databases