Distributed Database Management Systems
Chapter 6
Distributed Database Management Systems
→ Governs the storage and processing of
logically related data over
interconnected computer systems in
which both data and processing are
distributed among several sites.
→ Software system that manages a
distributed database while making the
distribution transparent to the user.
Centralized Database Management
Systems
A CDBMS presented structured information as regularly issued formal reports in a standard format.
Centralized database stored corporate data in
a single central site, usually a mainframe
computer. Data access was provided through
dumb terminals.
But it fell short when fast-moving events required faster response times.
Centralized Database Management Systems
We need quick, unstructured access to the database, using ad hoc queries to generate on-the-spot information.
Need for DDBMS
Centralized database management is subject to problems such as:
1. Performance degradation because of a growing number of remote locations over greater distances.
2. High costs associated with maintaining and operating large central (mainframe) database systems.
3. Reliability problems created by dependence on a central site (the single-point-of-failure syndrome) and the need for data replication.
4. Scalability problems associated with the physical limits imposed by a single location (temperature conditioning, power consumption, etc.).
Demand for applications based on accessing data from different sources at multiple locations led to the development of the DDBMS.
A multiple-source, multiple-location database environment is best managed by a DDBMS.
DDBMS ADVANTAGES AND DISADVANTAGES
ADVANTAGES:
Data are located near the greatest-demand site.
Faster data access.
Faster data processing.
Growth facilitation.
Improved communication.
Reduced operating costs.
User-friendly interface.
Less danger of a single-point failure.
DISADVANTAGES:
Complexity of management and control.
Technological difficulty.
Security.
Lack of standards.
Increased storage and infrastructure requirements.
Increased training costs.
Costs.
Basic Components and Concepts of Distributed Databases
Distributed Processing & Distributed Databases
DISTRIBUTED PROCESSING
In distributed processing, a database’s logical processing is
shared among two or more physically independent sites that are
connected through a network.
For example, the data input/output (I/O), data selection, and data
validation might be performed on one computer, and a report
based on that data might be created on another computer.
A distributed processing system uses only a single-site database but shares the processing chores among several sites.
DISTRIBUTED DATABASES
A distributed database stores a logically related
database over two or more physically independent
sites.
The sites are connected via a computer network.
The database is composed of several parts known as database fragments.
The database fragments are located at different sites and
can be replicated among various sites. Each database
fragment is, in turn, managed by its local database process.
Note:
Distributed processing does not require a distributed
database, but a distributed database requires
distributed processing (each database fragment is
managed by its own local database process).
Distributed processing may be based on a single
database located on a single computer.
Both distributed processing and distributed
databases require a network to connect all
components.
DDBMS COMPONENTS
1. Computer workstations or remote devices.
2. Network hardware and software components that reside
in each workstation or device. It is best to ensure that
distributed database functions can be run on multiple
platforms.
3. Communications media. The DDBMS must be communication-media independent.
4. The transaction processor (TP), software in each computer or device that requests data. The transaction processor receives and processes the application's data requests (remote and local). Also known as the application processor (AP) or the transaction manager (TM).
5. The data processor (DP), software in each computer or device that stores and retrieves data located at the site. Also known as the data manager (DM). A data processor may even be a centralized DBMS.
DDBMS COMPONENTS
The communication among TPs and DPs is based on a
set of protocols.
The protocols determine how the distributed database
system will:
Interface with the network to transport data and
commands between data processors (DPs) and
transaction processors (TPs).
Synchronize all data received from DPs (TP side) and
route retrieved data to the appropriate TPs (DP side).
Ensure common database functions in a distributed
system. Such functions include security, concurrency
control, backup, and recovery.
LEVELS OF DATA AND PROCESS
DISTRIBUTION
On the basis of how process distribution and data
distribution are supported, current database systems
can be classified as: SPSD, MPSD, MPMD
Single-Site Processing, Single-Site Data (SPSD)
1. All processing is done on a single host computer
(single-processor server, multiprocessor server,
mainframe system). Processing cannot be done on
the end user’s side of the system.
2. All data are stored on the host computer’s local
disk system.
3. DBMS is located on the host computer.
Such a scenario is typical of most mainframe and midrange server DBMSs and of the first generation of single-user microcomputer databases.
Multiple-Site Processing, Single-Site Data (MPSD)
Multiple processes run on different computers sharing a
single data repository.
Requires a network file server running conventional
applications that are accessed through a network.
The TP on each workstation acts only as a redirector
to route all network data requests to the file server.
The end user must make a direct reference to the file server
in order to access remote data.
All record- and file-locking activities, data selection,
search, and update functions take place at the
workstation, thus requiring that entire files travel
through the network for processing at the
workstation.
Such a requirement increases network traffic, slows response time, and increases communication costs.
Multiple-Site Processing, Multiple-Site Data
(MPMD)
Describes a fully distributed DBMS with support
for multiple data processors and transaction
processors at multiple sites.
Depending on the level of support for various
types of centralized DBMSs, DDBMSs are classified
as either homogeneous or heterogeneous.
Homogeneous DDBMSs integrate only one type
of centralized DBMS over a network. Thus, the
same DBMS will be running on different server
platforms (single processor server, multiprocessor
server).
Heterogeneous DDBMSs integrate different
types of centralized DBMSs over a network.
Multiple-Site Processing, Multiple-Site Data
(MPMD)
A fully heterogeneous DDBMS will support
different DBMSs that may even support different
data models (relational, hierarchical, or network)
running under different computer systems, such as
mainframes and PCs.
Some DDBMS implementations support several
platforms, operating systems, and networks and
allow remote data access to another DBMS.
DISTRIBUTED DATABASE TRANSPARENCY FEATURES
1. Distribution Transparency
2. Transaction Transparency
3. Failure Transparency
4. Performance Transparency
5. Heterogeneity Transparency
DISTRIBUTED DATABASE TRANSPARENCY FEATURES
1. Distribution transparency
Allows a distributed database to be treated as a
single logical database.
If a DDBMS exhibits distribution transparency, the
user does not need to know:
That the data are partitioned, meaning the table's rows (horizontal fragmentation) or columns (vertical fragmentation) are split and stored among multiple sites.
That the data can be replicated at several sites.
The data location.
DISTRIBUTED DATABASE TRANSPARENCY FEATURES
2. Transaction transparency
Allows a transaction to update data at more
than one network site.
Transaction transparency ensures that the
transaction will be either entirely completed or
aborted, thus maintaining database integrity.
3. Failure transparency
Ensures that the system will continue to
operate in the event of a node failure.
Functions that were lost because of the failure
will be picked up by another network node.
DISTRIBUTED DATABASE TRANSPARENCY FEATURES
4. Performance transparency
Allows the system to perform as if it were a centralized
DBMS.
No performance degradation due to its use on a network
or due to the network’s platform differences.
Ensures that the system will find the most cost-effective
path to access remote data.
5. Heterogeneity transparency
o Allows the integration of several different local DBMSs
(relational, network, and hierarchical) under a common,
or global, schema.
o The DDBMS is responsible for translating the data
requests from the global schema to the local DBMS
schema.
DISTRIBUTION TRANSPARENCY
Physically dispersed database managed as though
it were a centralized database.
Three levels of distribution transparency:
1. Fragmentation transparency is the highest level of
transparency. The end user or programmer does not
need to know that a database is partitioned. Therefore,
neither fragment names nor fragment locations are
specified prior to data access.
2. Location transparency exists when the end user or
programmer must specify the database fragment
names but does not need to specify where those
fragments are located.
3. Local mapping transparency exists when the end
user or programmer must specify both the fragment
names and their locations.
DISTRIBUTION TRANSPARENCY
EMPLOYEE (EMP_NAME, EMP_DOB, EMP_ADDRESS,
EMP_DEPARTMENT, EMP_SALARY).
The EMPLOYEE data are distributed over three different
locations: New York, Atlanta, and Miami.
New York employee data are stored in fragment E1, Atlanta
employee data are stored in fragment E2, and Miami
employee data are stored in fragment E3.
Each fragment is unique. The unique fragment condition
indicates that each row is unique, regardless of the fragment
in which it is located. Assume that no portion of the database
is replicated at any other site on the network.
DISTRIBUTION TRANSPARENCY
Example query: select all employees born before 1 January 1960 (EMP_DOB < '01-JAN-1960').
Database Supports Fragmentation
Transparency
SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1960';
The query conforms to a nondistributed database
query format; that is, it does not specify fragment
names or locations.
DISTRIBUTION TRANSPARENCY
The Database Supports Location Transparency
SELECT * FROM E1 WHERE EMP_DOB < '01-JAN-1960';
UNION
SELECT * FROM E2 WHERE EMP_DOB < '01-JAN-1960';
UNION
SELECT * FROM E3 WHERE EMP_DOB < '01-JAN-1960';
Fragment names must be specified in the query,
but the fragment’s location is not specified.
DISTRIBUTION TRANSPARENCY
Database Supports Local Mapping
Transparency
Both the fragment name and its location must be
specified in the query. Using pseudo-SQL:
SELECT * FROM E1 NODE NY WHERE EMP_DOB < '01-JAN-1960';
UNION
SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '01-JAN-1960';
UNION
SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '01-JAN-1960';
DISTRIBUTION TRANSPARENCY
Distribution transparency is supported by a distributed data dictionary (DDD), also known as a distributed data catalog (DDC).
Contains the description of the entire database (the distributed global schema) as seen by the database administrator.
Is the common database schema used by local TPs to translate user requests into subqueries (remote requests) that will be processed by different DPs.
The DDC is itself distributed, and it is replicated at
the network nodes. Therefore, the DDC must
maintain consistency through updating at all sites.
TRANSACTION TRANSPARENCY
Ensures that database transactions will maintain
the distributed database’s integrity and
consistency.
DDBMS database transaction can update data
stored in many different computers connected in
a network.
Transaction transparency ensures that the
transaction will be completed only when all
database sites involved in the transaction
complete their part of the transaction.
TRANSACTION TRANSPARENCY
Remote Request
Remote Transaction
Distributed Transaction
Distributed Request
REMOTE REQUEST
A remote request lets a single SQL
statement access the data that are to be
processed by a single remote database
processor.
The SQL statement (or request) can
reference data at only one remote site.
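For example, a minimal sketch, assuming a CUSTOMER table (with a CUS_STATE attribute) stored in its entirety at remote site B:
SELECT * FROM CUSTOMER WHERE CUS_STATE = 'AL';
The statement is sent to site B, processed entirely by site B's DP, and the result set is returned to the requesting TP.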
REMOTE TRANSACTION
A remote transaction, composed of several
requests, accesses data at a single remote site.
Each SQL statement (or request) can reference
only one (the same) remote DP at a time, and
the entire transaction can reference and be
executed at only one remote DP.
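For example, a sketch assuming PRODUCT and INVOICE tables are both stored at the same remote site B (the column names are illustrative):
BEGIN WORK;
UPDATE PRODUCT SET PROD_QTY = PROD_QTY - 1 WHERE PROD_NUM = '231785';
INSERT INTO INVOICE (INV_NUM, CUS_NUM, INV_TOTAL) VALUES ('1001', '10015', 120.00);
COMMIT WORK;
Both requests are processed by site B's DP, and the COMMIT applies to that single remote site.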
DISTRIBUTED TRANSACTION
A distributed transaction allows a transaction
to reference several different local or remote DP
sites.
Each single request can reference only one local or remote DP site, but the transaction as a whole can reference multiple DP sites because each request can reference a different site.
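For example, a sketch assuming CUSTOMER is stored at site A and INVOICE at site B (the column names are illustrative):
BEGIN WORK;
UPDATE CUSTOMER SET CUS_BALANCE = CUS_BALANCE + 120.00 WHERE CUS_NUM = '10015';
INSERT INTO INVOICE (INV_NUM, CUS_NUM, INV_TOTAL) VALUES ('1001', '10015', 120.00);
COMMIT WORK;
Each statement references a single site, but the transaction as a whole spans sites A and B.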
DISTRIBUTED TRANSACTION
Each request can access only one remote site at a time.
Suppose the table PRODUCT is divided into
two fragments, PRODl and PROD2, located
at sites B and C, respectively.
A request such as SELECT * FROM PRODUCT WHERE PROD_NUM = '231785'; cannot be processed, because it would need to access data (fragments PROD1 and PROD2) at more than one remote site.
DISTRIBUTED REQUEST
A distributed request lets a single SQL statement
reference data located at several different local or
remote DP sites.
Because each request (SQL statement) can access
data from more than one local or remote DP site, a
transaction can access several sites.
The ability to execute a distributed request provides
fully distributed database processing
capabilities because of the ability to:
Partition a database table into several fragments.
Reference one or more of those fragments with only
one request. In other words, there is fragmentation
transparency.
Full fragmentation transparency support is
provided only by a DDBMS that supports
distributed requests.
DISTRIBUTED REQUEST
A distributed request allows a single SELECT statement to reference two tables, CUSTOMER and INVOICE, even though the tables are located at two different sites, B and C.
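A sketch of such a request (the join column CUS_NUM and the INV_TOTAL column are assumed for illustration):
SELECT CUSTOMER.CUS_NUM, INVOICE.INV_TOTAL
FROM CUSTOMER, INVOICE
WHERE CUSTOMER.CUS_NUM = INVOICE.CUS_NUM;
The single statement joins CUSTOMER (site B) with INVOICE (site C); the DDBMS decomposes it into subqueries for each site and assembles the answer.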
DISTRIBUTED REQUEST
The distributed request feature also allows a single request to reference a physically partitioned table.
Suppose a CUSTOMER table is divided into two fragments, C1 and C2, located at sites B and C, respectively.
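With fragmentation transparency, a single request such as the following (the CUS_BALANCE column is assumed for illustration) is resolved against fragment C1 at site B and fragment C2 at site C, and the results are merged:
SELECT * FROM CUSTOMER WHERE CUS_BALANCE > 1000;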
DISTRIBUTED CONCURRENCY
CONTROL
Multisite, multiple-process operations are more likely to
create data inconsistencies and deadlocked transactions
than single-site systems are.
For example, if a transaction's operations were processed by two local DPs but one of the DPs could not commit its portion of the transaction's results, the transaction would yield an inconsistent database, with its inevitable integrity problems.
The solution to this problem is the two-phase commit protocol.
Two-Phase Commit
Protocol
Distributed databases make it possible for a transaction
to access data at several sites.
The two-phase commit protocol guarantees that if a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state.
Each DP maintains its own transaction log.
Two-Phase Commit
Protocol
Uses a DO-UNDO-REDO protocol and a write-ahead protocol.
The write-ahead protocol requires that the transaction log for each DP be written before the database fragment is actually updated.
The DO-UNDO-REDO protocol is used by
the DP to roll back and/or roll forward
transactions with the help of the system’s
transaction log entries.
The DO-UNDO-REDO protocol defines three
types of operations: DO , UNDO, REDO
Two-Phase Commit
Protocol
DO performs the operation and records the “before”
and “after” values in the transaction log.
UNDO reverses an operation, using the log entries
written by the DO portion of the sequence.
REDO redoes an operation, using the log entries written
by the DO portion of the sequence.
To ensure that the DO, UNDO, and REDO operations can
survive a system crash while they are being executed, a
write-ahead protocol is used.
The write-ahead-log protocol ensures that
transaction logs are always written before any database
data are actually updated. This protocol ensures that, in
case of a failure, the database can later be recovered to
a consistent state, using the data in the transaction log.
Two-Phase Commit
Protocol
The two-phase commit protocol defines the
operations between two types of nodes:
The coordinator.
One or more subordinates.
The participating nodes agree on a coordinator.
The coordinator role is assigned to the node that
initiates the transaction.
The protocol is implemented in two phases.
Two-Phase Commit
Protocol
Phase 1: Preparation
The coordinator sends a PREPARE TO COMMIT
message to all subordinates.
The subordinates receive the message; write the
transaction log, using the write-ahead protocol; and
send an acknowledgment (YES/PREPARED TO
COMMIT or NO/NOT PREPARED) message to the
coordinator.
The coordinator makes sure that all nodes are ready
to commit, or it aborts the action.
If all nodes are PREPARED TO COMMIT, the
transaction goes to Phase 2. If one or more nodes
reply NO or NOT PREPARED, the coordinator
broadcasts an ABORT message to all subordinates.
Two-Phase Commit
Protocol
Phase 2: The Final COMMIT
The coordinator broadcasts a COMMIT message
to all subordinates and waits for the replies.
Each subordinate receives the COMMIT message,
and then updates the database using the DO
protocol.
The subordinates reply with a COMMITTED or
NOT COMMITTED message to the coordinator.
If one or more subordinates did not commit, the
coordinator sends an ABORT message, thereby
forcing them to UNDO all changes.
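An illustrative trace (hypothetical transaction T1, coordinator at site A, subordinates at sites B and C):
1. A sends PREPARE TO COMMIT for T1 to B and C.
2. B and C write their transaction logs (write-ahead) and reply PREPARED TO COMMIT.
3. A broadcasts COMMIT; B and C apply the changes using DO and reply COMMITTED.
Had either subordinate replied NOT PREPARED in step 2, A would have broadcast ABORT and all subordinates would UNDO any logged changes.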
PERFORMANCE TRANSPARENCY
Data fragmentation makes query translation more complicated, because the DDBMS must decide which fragment of the database to access.
Data replication makes the access problem even more complex, because the DDBMS must also decide which copy of the data to access.
The DDBMS uses query optimization techniques to
deal with such problems and to ensure acceptable
database performance.
PERFORMANCE
TRANSPARENCY
The objective of a query optimization routine is to
minimize the total cost associated with the
execution of a request.
The costs associated with a request are a function of:
Access time (I/O) cost.
Communication cost.
CPU time cost.
Some algorithms minimize total time; others
minimize the communication time, and still others
do not factor in the CPU time, considering it
insignificant relative to other cost sources.
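For instance (illustrative numbers only): a join of tables stored at sites B and C could be executed either by shipping a 10,000-row table from B to C or by shipping a 50-row filtered intermediate result; an optimizer that weighs communication cost will choose the second plan.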
PERFORMANCE
TRANSPARENCY
Query optimization in distributed database systems must provide both distribution transparency and replica transparency.
Replica transparency refers to the DDBMS’s ability
to hide the existence of multiple copies of data from
the user.
Most of the algorithms proposed for query
optimization are based on two principles:
The selection of the optimum execution order.
The selection of sites to be accessed to minimize
communication costs.
Performance
Transparency
A query optimization algorithm can be evaluated on the basis of:
Its operation mode.
The timing of its optimization.
The type of information used.
Classification based on Operation
Modes
Operation modes can be classified as manual or
automatic.
Automatic query optimization means that the DDBMS
finds the most cost-effective access path without user
intervention.
Manual query optimization requires that the
optimization be selected and scheduled by the end user or
programmer.
Automatic query optimization is clearly more desirable
from the end user’s point of view, but the cost of such
convenience is the increased overhead that it
imposes on the DDBMS.
Classification based on Timing
Based on when the optimization is done, algorithms are classified as static or dynamic.
1. Static query optimization takes place at compilation time.
When the program is submitted to the DBMS for
compilation, it creates the plan necessary to access the
database. When the program is executed, the DBMS uses
that plan to access the database.
2. Dynamic query optimization takes place at execution time.
Access strategy is dynamically determined by the DBMS at
run time, using the most up-to-date information about the
database.
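For example, a query embedded in a precompiled application program is typically optimized statically, once, at compile time, whereas an ad hoc interactive query must be optimized dynamically, at run time.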
Classification based on Type of
information
Optimization techniques can also be classified by the type of information that is used to optimize the query.
1. A statistically based query optimization algorithm uses
statistical information about the database to determine the
best access strategy.
The statistical information is generated in one of two modes: dynamic or manual. In the dynamic statistical generation mode, the DDBMS automatically evaluates and updates the statistics after each access. In the manual statistical generation mode, the statistics must be updated periodically through a user-selected utility.
Classification based on Type of
information
2. A rule-based query optimization
algorithm is based on a set of user-defined
rules to determine the best query access
strategy. The rules are entered by the end
user or database administrator, and they are
typically very general in nature.
DISTRIBUTED DATABASE
DESIGN
Issue → Managed by
1. How to partition the database into fragments. → Data fragmentation
2. Which fragments to replicate. → Data replication
3. Where to locate those fragments and replicas. → Data allocation
DATA FRAGMENTATION
Data fragmentation allows you to break a single
object into two or more segments, or fragments.
The object might be a user’s database, a system
database, or a table. Each fragment can be stored
at any site over a computer network. Information
about data fragmentation is stored in the
distributed data catalog (DDC), from which it is
accessed by the TP to process user requests.
Three types of data fragmentation at table level
are horizontal, vertical, and mixed. (A
fragmented table can always be re-created from
its fragmented parts by a combination of unions
and joins.)
Horizontal
Fragmentation
Refers to the division of a relation into subsets
(fragments) of tuples (rows).
Each fragment is stored at a different node, and
each fragment has unique rows.
Each fragment represents the equivalent of a
SELECT statement, with the WHERE clause on a
single attribute.
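Using the earlier EMPLOYEE example, the three fragments could be defined as follows (a sketch only; the EMP_OFFICE attribute is assumed, since the sample schema does not show a location column):
Fragment E1 (New York): SELECT * FROM EMPLOYEE WHERE EMP_OFFICE = 'NY';
Fragment E2 (Atlanta): SELECT * FROM EMPLOYEE WHERE EMP_OFFICE = 'ATL';
Fragment E3 (Miami): SELECT * FROM EMPLOYEE WHERE EMP_OFFICE = 'MIA';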
Vertical fragmentation
Refers to the division of a relation into attribute (column) subsets.
Each subset (fragment) is stored at a different node.
Each fragment has unique columns, with the exception of the key column, which is common to all fragments.
This is the equivalent of the PROJECT operation in relational algebra (a SELECT of specific columns in SQL).
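A minimal sketch, assuming a hypothetical key attribute EMP_NUM that is repeated in every fragment:
Fragment V1: SELECT EMP_NUM, EMP_NAME, EMP_DOB, EMP_ADDRESS FROM EMPLOYEE;
Fragment V2: SELECT EMP_NUM, EMP_DEPARTMENT, EMP_SALARY FROM EMPLOYEE;
The original table is re-created by joining V1 and V2 on EMP_NUM.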
Mixed Fragmentation
Refers to a combination of horizontal
and vertical strategies.
A table may be divided into several
horizontal subsets (rows), each one
having a subset of the attributes
(columns).
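Continuing the sketch above (EMP_OFFICE and EMP_NUM remain assumed attributes), one mixed fragment might hold only the payroll columns for the New York rows:
Fragment M1: SELECT EMP_NUM, EMP_DEPARTMENT, EMP_SALARY FROM EMPLOYEE WHERE EMP_OFFICE = 'NY';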
DATA REPLICATION
Data replication refers to the storage of data copies
at multiple sites served by a computer network.
Fragment copies can be stored at several sites to
serve specific information requirements.
Suppose database A is divided into two fragments, A1
and A2. Within a replicated distributed database,
fragment A1 is stored at sites S1 and S2, while
fragment A2 is stored at sites S2 and S3.
Replicated data are subject to the mutual consistency
rule. The mutual consistency rule requires that all
copies of data fragments be identical. Therefore, to
maintain data consistency among the replicas, the
DDBMS must ensure that a database update is
performed at all sites where replicas exist.
DATA REPLICATION
Three replication scenarios exist.
A fully replicated database stores multiple
copies of each database fragment at multiple
sites. In this case, all database fragments
are replicated. (A fully replicated database can be
impractical due to the amount of overhead it imposes on the
system.)
A partially replicated database stores
multiple copies of some database fragments
at multiple sites. Most DDBMSs are able to
handle the partially replicated database well.
An unreplicated database stores each
database fragment at a single site. Therefore,
there are no duplicated database fragments.
DATA REPLICATION
Advantages:
Improved data availability, better load
distribution, improved data failure-tolerance, and
reduced query costs.
Disadvantages:
Additional DDBMS processing overhead—
because each data copy must be maintained by
the system.
Because the data are replicated at another site,
there are associated storage costs and increased
transaction times (as data must be updated at
several sites concurrently to comply with the
mutual consistency rule).
DATA ALLOCATION
Data allocation describes the process of
deciding where to locate data. Data allocation
strategies are as follows:
With centralized data allocation, the entire
database is stored at one site.
With partitioned data allocation, the database is divided into two or more disjoint parts (fragments) and stored at two or more sites.
With replicated data allocation, copies of one
or more database fragments are stored at
several sites. Most data allocation studies focus
on one issue: which data to locate where.
DATA ALLOCATION
Data allocation algorithms take into
consideration a variety of factors, including:
1. Performance and data availability goals.
2. Size, number of rows, and number of
relations that an entity maintains with other
entities.
3. Types of transactions to be applied to the
database and the attributes accessed by
each of those transactions.