Chapter 21: Parallel and Distributed Storage
Database System Concepts, 7th Ed.
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Parallel Storage
Database System Concepts - 7th Edition 21.2 ©Silberschatz, Korth and Sudarshan
Data Partitioning (1)
In its simplest form, I/O parallelism refers to reducing the time required to
retrieve relations from disk by partitioning the relations on multiple disks, on
multiple nodes (servers).
• We focus on parallelism across nodes.
• Same techniques can be used across disks on a node.
Two Main Approaches
• Horizontal partitioning: tuples of a relation are divided among many nodes
such that some subset of the tuples resides on each node.
• Vertical partitioning: attributes of a relation are split across relations,
e.g. r(A,B,C,D) with primary key A is split into r1(A,B) and
r2(A,C,D) (discussed in Chapter 13).
• By default, the word partitioning refers to horizontal partitioning.
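The vertical-partitioning example can be illustrated with a minimal sketch. It assumes tuples of r(A,B,C,D) are held as Python dicts; the function names vertical_partition and rejoin are hypothetical and chosen only for this illustration.

```python
def vertical_partition(rows):
    """Split r(A,B,C,D) into r1(A,B) and r2(A,C,D); the key A is kept in both."""
    r1 = [{"A": t["A"], "B": t["B"]} for t in rows]
    r2 = [{"A": t["A"], "C": t["C"], "D": t["D"]} for t in rows]
    return r1, r2


def rejoin(r1, r2):
    """Reconstruct the original tuples by joining r1 and r2 on the key A."""
    by_key = {t["A"]: t for t in r2}
    return [{**t, **by_key[t["A"]]} for t in r1]


rows = [
    {"A": 1, "B": "x", "C": 2.0, "D": True},
    {"A": 2, "B": "y", "C": 3.5, "D": False},
]
r1, r2 = vertical_partition(rows)
restored = rejoin(r1, r2)
```

Keeping the primary key A in both fragments is what makes the split lossless: the join on A recovers each original tuple.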
Data Partitioning (2)
Partitioning techniques (number of nodes = $n$):
Round-robin:
Send the $i$-th tuple inserted in the relation
to node $i \bmod n$.
Hash partitioning:
• Choose one or more attributes as the
partitioning attributes.
• Choose hash function $h$ with range $0 \ldots n-1$.
• Let $i$ denote result of hash function $h$ applied
to the partitioning attribute value of a
tuple. Send tuple to node $i$.
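Round-robin and hash partitioning can be sketched as follows. This is a minimal illustration, not a definitive implementation: nodes are assumed to be numbered from 0, tuples are assumed to be dicts, and the names round_robin_node, hash_node and partition are hypothetical. SHA-256 stands in for the hash function only because Python's built-in hash of strings varies between processes.

```python
import hashlib


def round_robin_node(i, n):
    """Node for the i-th inserted tuple (i counted from 0 in this sketch)."""
    return i % n


def hash_node(key, n):
    """Node for a tuple whose partitioning-attribute value is `key`."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def partition(tuples, n, attr=None):
    """Split tuples over n nodes: round-robin if attr is None, else hash on attr."""
    nodes = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        target = round_robin_node(i, n) if attr is None else hash_node(t[attr], n)
        nodes[target].append(t)
    return nodes


rows = [{"A": k} for k in range(10)]
rr_parts = partition(rows, 3)              # placement by insertion order
hash_parts = partition(rows, 3, attr="A")  # placement by value of attribute A
```

Round-robin placement depends only on insertion order, so partition sizes differ by at most one tuple. Hash placement depends only on the attribute value, so equal values always land on the same node.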
Data Partitioning (3)
Range partitioning:
• Choose an attribute as the partitioning
attribute.
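• A partitioning vector $[v_0, v_1, \ldots, v_{n-2}]$ is chosen.
• If $i < j$, then $v_i < v_j$.
• Consider a tuple $t$ with $t[A] = x$, where $A$ is the
partitioning attribute.
If $x < v_0$, then $t$ goes to node $0$.
If $v_i \leq x < v_{i+1}$, then $t$ goes to node $i+1$.
If $x \geq v_{n-2}$, then $t$ goes to node $n-1$.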
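The range-partitioning rule maps naturally onto a binary search over the sorted partitioning vector. The sketch below assumes nodes are numbered 0 to n-1 and that the vector has n-1 strictly increasing entries; the name range_node and the sample vector are hypothetical.

```python
from bisect import bisect_right


def range_node(x, vector):
    """Node index for attribute value x under a sorted partitioning vector.

    bisect_right counts the vector entries that are <= x, so:
      x below the first entry          -> node 0
      vector[i] <= x < vector[i + 1]   -> node i + 1
      x at or above the last entry     -> node len(vector), i.e. n - 1
    """
    return bisect_right(vector, x)


vector = [10, 20, 30]  # n = 4 nodes
placements = {x: range_node(x, vector) for x in (5, 10, 19, 20, 30, 99)}
```

Using bisect_right rather than bisect_left sends a value equal to a vector entry to the higher-numbered node, matching the "greater than or equal" side of each boundary.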
Comparison of Partitioning Techniques (1)
Evaluate how well partitioning techniques (round robin,
hash partitioning, range partitioning) support the following
types of data access:
1. Scanning the entire relation.
SQL: select * from r
2. Locating a tuple associatively (point queries).
E.g., $r.A = 25$.
SQL: select * from r where r.A = 25
3. Locating all tuples such that the value of a given
attribute lies within a specified range (range queries).
E.g., $10 \leq r.A < 25$.
SQL: select * from r where 10 <= r.A and r.A < 25
Comparison of Partitioning Techniques (2)
Round robin:
Best suited for sequential scan of entire relation
on each query.
• All nodes have almost an equal number of
tuples; retrieval work is thus well balanced
between nodes.
All queries, including point and range queries, must be
processed at all nodes, since placement ignores attribute values.
Hash partitioning:
Good for sequential access
• Assuming hash function is good, and
partitioning attributes form a key, tuples will
be equally distributed between nodes
Good for point queries on partitioning attribute
• Can look up a single node, leaving others
available for answering other queries.
Range queries are inefficient: they must be
processed at all nodes.
Comparison of Partitioning Techniques (3)
Range partitioning:
Provides data clustering by partitioning attribute
value.
• Good for sequential access
• Good for point queries on partitioning
attribute: only one node needs to be
accessed.
For range queries on partitioning attribute, one
to a few nodes may need to be accessed
• Remaining nodes are available for other
queries.
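The comparison above can be summarized by computing which nodes a query must visit under each technique. This is a sketch under stated assumptions: the query is on the partitioning attribute A, nodes are numbered from 0, the function names are hypothetical, and Python's built-in hash stands in for the hash function (for small integers it returns the integer itself).

```python
from bisect import bisect_left, bisect_right


def point_query_nodes(scheme, n, value, vector=None):
    """Nodes to search for r.A = value."""
    if scheme == "round_robin":
        return list(range(n))      # placement ignores A, so search everywhere
    if scheme == "hash":
        return [hash(value) % n]   # one node holds every tuple with this value
    if scheme == "range":
        return [bisect_right(vector, value)]
    raise ValueError(scheme)


def range_query_nodes(scheme, n, low, high, vector=None):
    """Nodes to search for low <= r.A < high."""
    if scheme in ("round_robin", "hash"):
        return list(range(n))      # matching tuples may sit on any node
    if scheme == "range":
        if low >= high:
            return []
        first = bisect_right(vector, low)   # node holding the value `low`
        last = bisect_left(vector, high)    # node holding values just below `high`
        return list(range(first, last + 1))
    raise ValueError(scheme)


vector = [10, 20, 30]  # n = 4 nodes
point_hash = point_query_nodes("hash", 4, 25)
point_range = point_query_nodes("range", 4, 25, vector)
span_hash = range_query_nodes("hash", 4, 10, 25)
span_range = range_query_nodes("range", 4, 10, 25, vector)
```

For the query 10 <= r.A < 25 with this vector, range partitioning touches only the two nodes covering [10, 20) and [20, 30), while hash partitioning and round-robin must touch all four.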
Types of Skew
Data-distribution skew: some nodes have many tuples,
while others may have fewer tuples. Could occur due to
• Attribute-value skew.
Some partitioning-attribute values appear in many
tuples.
All the tuples with the same value for the partitioning
attribute end up in the same partition.
Can occur with range partitioning and hash partitioning.
• Partition skew.
Imbalance, even without attribute-value skew
Badly chosen range-partition vector may assign too
many tuples to some partitions and too few to others.
Less likely with hash-partitioning
Execution skew can occur even without data distribution
skew
• E.g. relation range-partitioned on date, and most queries
access tuples with recent dates
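Data-distribution skew can be made concrete by counting tuples per node. The sketch below uses made-up data and hypothetical helper names (tuples_per_node, skew_ratio); the ratio of the largest partition to the mean partition size is one simple measure chosen for this illustration, not a metric defined in the slides.

```python
from bisect import bisect_right
from collections import Counter


def tuples_per_node(values, n, node_of):
    """Count how many attribute values each of the n nodes receives."""
    counts = Counter(node_of(v) for v in values)
    return [counts.get(i, 0) for i in range(n)]


def skew_ratio(counts):
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    return max(counts) / (sum(counts) / len(counts))


n = 4
uniform = list(range(100))

# Partition skew: a badly chosen range-partition vector on uniform data.
bad = tuples_per_node(uniform, n, lambda v: bisect_right([80, 90, 95], v))
good = tuples_per_node(uniform, n, lambda v: bisect_right([25, 50, 75], v))

# Attribute-value skew: the value 7 appears in 70 of 100 tuples, and all
# of them land on one node under hash partitioning (v % n as the hash).
skewed = [7] * 70 + list(range(30))
hashed = tuples_per_node(skewed, n, lambda v: v % n)
```

Here the bad vector yields partitions of 80, 10, 5 and 5 tuples, while the balanced vector yields 25 each. In the attribute-value case no choice of hash function helps, because every tuple with value 7 must go to the same node.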
Virtual Node Partitioning
Key idea: pretend there are several times (10x to 20x) as many virtual nodes as
real nodes
• Virtual nodes are mapped to real nodes.
• Tuples are partitioned across virtual nodes.
• It can be used to support any of the partitioning techniques discussed before.
Mapping of virtual nodes to real nodes
• Round-robin: virtual node $i$ mapped to real node $((i - 1) \bmod n) + 1$ (with $n$ real nodes, nodes numbered from 1).
• Mapping table: mapping table virtual_to_real_map[] tracks which virtual node
is on which real node
Allows skew to be handled by moving virtual nodes from more loaded
nodes to less loaded nodes.
Both data distribution skew and execution skew can be handled.
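A mapping table and a simple rebalancing step can be sketched as follows. The assumptions are loud ones: nodes are numbered from 0 in this sketch, the per-virtual-node load figures are made up, and the greedy policy in rebalance_once (move the virtual node whose load is closest to half the gap between the most and least loaded real nodes) is one hypothetical choice, not a policy prescribed by the slides. The load can stand for tuple counts (data distribution skew) or access counts (execution skew).

```python
def round_robin_map(num_virtual, num_real):
    """virtual_to_real_map[v] = real node hosting virtual node v."""
    return [v % num_real for v in range(num_virtual)]


def real_node_for_key(key, virtual_to_real_map):
    """Route a key: hash to a virtual node, then look up its real node."""
    virtual = hash(key) % len(virtual_to_real_map)
    return virtual_to_real_map[virtual]


def real_node_load(virtual_load, virtual_to_real_map, num_real):
    """Sum the load of the virtual nodes hosted on each real node."""
    load = [0] * num_real
    for v, r in enumerate(virtual_to_real_map):
        load[r] += virtual_load[v]
    return load


def rebalance_once(virtual_load, virtual_to_real_map, num_real):
    """Move one virtual node from the most to the least loaded real node.

    Returns False when no single move would narrow the gap.
    """
    load = real_node_load(virtual_load, virtual_to_real_map, num_real)
    src = max(range(num_real), key=lambda r: load[r])
    dst = min(range(num_real), key=lambda r: load[r])
    gap = load[src] - load[dst]
    candidates = [v for v, r in enumerate(virtual_to_real_map)
                  if r == src and 0 < virtual_load[v] < gap]
    if not candidates:
        return False
    best = min(candidates, key=lambda v: abs(virtual_load[v] - gap / 2))
    virtual_to_real_map[best] = dst
    return True


virtual_load = [40, 5, 25, 5, 10, 5, 5, 5]   # made-up load per virtual node
virtual_to_real_map = round_robin_map(8, 2)
before = real_node_load(virtual_load, virtual_to_real_map, 2)
while rebalance_once(virtual_load, virtual_to_real_map, 2):
    pass
after = real_node_load(virtual_load, virtual_to_real_map, 2)
```

Only the mapping table changes during rebalancing; the assignment of tuples to virtual nodes stays fixed, which is why a virtual node can be moved as a unit.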