
Distributed Database Management Systems
Chapter 6
Pravicha.M.T
Distributed Database Management Systems

→ Governs the storage and processing of logically related data over interconnected computer systems in which both data and processing are distributed among several sites.

→ Software system that manages a distributed database while making the distribution transparent to the user.
Centralized Database Management
Systems
CDBMS presented Structured information as
regularly issued formal reports in a
standard format.

Centralized database stored corporate data in


a single central site, usually a mainframe
computer. Data access was provided through
dumb terminals.

But it fell short when quickly moving events


required faster response times.
Centralized Database Management System

We need quick, unstructured access to the database, using ad hoc queries to generate on-the-spot information.
Need for DDBMS

Centralized database management is subject to problems such as:

1. Performance degradation because of a growing number of remote locations over greater distances.

2. High costs associated with maintaining and operating large central (mainframe) database systems.

3. Reliability problems created by dependence on a central site (single-point-of-failure syndrome) and the need for data replication.

4. Scalability problems associated with the physical limits imposed by a single location (temperature conditioning, power consumption, etc.).
Demand for applications based on accessing data from different sources at multiple locations led to the development of the DDBMS.

A multiple-source/multiple-location database environment is best managed by a DDBMS.
DDBMS ADVANTAGES AND DISADVANTAGES

ADVANTAGES:
→ Data are located near the site of greatest demand
→ Faster data access
→ Faster data processing
→ Growth facilitation
→ Improved communication
→ Reduced operating costs
→ User-friendly interface
→ Less danger of a single-point failure

DISADVANTAGES:
→ Complexity of management and control
→ Technological difficulty
→ Security
→ Lack of standards
→ Increased storage and infrastructure requirements
→ Increased training costs
→ Costs
Basic Components and Concepts of Distributed Database

Distributed Processing & Distributed Database
DISTRIBUTED PROCESSING

In distributed processing, a database's logical processing is shared among two or more physically independent sites that are connected through a network.

For example, the data input/output (I/O), data selection, and data validation might be performed on one computer, and a report based on that data might be created on another computer.

A distributed processing system uses only a single-site database but shares the processing chores.
DISTRIBUTED DATABASES

A distributed database stores a logically related database over two or more physically independent sites. The sites are connected via a computer network.

The database is composed of several parts known as database fragments.

The database fragments are located at different sites and can be replicated among various sites. Each database fragment is, in turn, managed by its local database process.
Note:

Distributed processing does not require a distributed database, but a distributed database requires distributed processing (each database fragment is managed by its own local database process).

Distributed processing may be based on a single database located on a single computer.

Both distributed processing and distributed databases require a network to connect all components.
DDBMS COMPONENTS

1. Computer workstations or remote devices.

2. Network hardware and software components that reside in each workstation or device. It is best to ensure that distributed database functions can be run on multiple platforms.

3. Communications media. The DDBMS must be communications media-independent.

4. The transaction processor (TP), software in each computer or device that requests data. The transaction processor receives and processes the application's data requests (remote and local). Also known as the application processor (AP) or the transaction manager (TM).

5. The data processor (DP), software in each computer or device that stores and retrieves data located at the site. Also known as the data manager (DM). A data processor may even be a centralized DBMS.
DDBMS COMPONENTS

The communication among TPs and DPs is based on a set of protocols. The protocols determine how the distributed database system will:

→ Interface with the network to transport data and commands between data processors (DPs) and transaction processors (TPs).

→ Synchronize all data received from DPs (TP side) and route retrieved data to the appropriate TPs (DP side).

→ Ensure common database functions in a distributed system. Such functions include security, concurrency control, backup, and recovery.
LEVELS OF DATA AND PROCESS DISTRIBUTION

On the basis of how process distribution and data distribution are supported, current database systems can be classified as SPSD, MPSD, or MPMD.

Single-Site Processing, Single-Site Data (SPSD)
1. All processing is done on a single host computer (single-processor server, multiprocessor server, or mainframe system). Processing cannot be done on the end user's side of the system.

2. All data are stored on the host computer's local disk system.

3. The DBMS is located on the host computer.

Such a scenario is typical of most mainframe and midrange server computer DBMSs and the first generation of single-user microcomputer databases.
Multiple-Site Processing, Single-Site Data (MPSD)

Multiple processes run on different computers sharing a single data repository.

Requires a network file server running conventional applications that are accessed through a network.

The TP on each workstation acts only as a redirector to route all network data requests to the file server. The end user must make a direct reference to the file server in order to access remote data.

All record- and file-locking activities, data selection, search, and update functions take place at the workstation, thus requiring that entire files travel through the network for processing at the workstation.

Such a requirement increases network traffic, slows response time, and increases communication costs.
Multiple-Site Processing, Multiple-Site Data (MPMD)

Describes a fully distributed DBMS with support for multiple data processors and transaction processors at multiple sites.

Depending on the level of support for various types of centralized DBMSs, DDBMSs are classified as either homogeneous or heterogeneous.

Homogeneous DDBMSs integrate only one type of centralized DBMS over a network. Thus, the same DBMS will be running on different server platforms (single-processor server, multiprocessor server).

Heterogeneous DDBMSs integrate different types of centralized DBMSs over a network.
Multiple-Site Processing, Multiple-Site Data (MPMD)

A fully heterogeneous DDBMS will support different DBMSs that may even support different data models (relational, hierarchical, or network) running under different computer systems, such as mainframes and PCs.

Some DDBMS implementations support several platforms, operating systems, and networks and allow remote data access to another DBMS.
DISTRIBUTED DATABASE TRANSPARENCY FEATURES

1. Distribution Transparency

2. Transaction Transparency

3. Failure Transparency

4. Performance Transparency

5. Heterogeneity Transparency
DISTRIBUTED DATABASE TRANSPARENCY FEATURES

1. Distribution transparency

Allows a distributed database to be treated as a single logical database.

If a DDBMS exhibits distribution transparency, the user does not need to know:

→ That the data are partitioned, meaning the table's rows and columns are split vertically or horizontally and stored among multiple sites.

→ That the data can be replicated at several sites.

→ The data location.
DISTRIBUTED DATABASE TRANSPARENCY FEATURES

2. Transaction transparency

Allows a transaction to update data at more than one network site.

Transaction transparency ensures that the transaction will be either entirely completed or aborted, thus maintaining database integrity.

3. Failure transparency

Ensures that the system will continue to operate in the event of a node failure.

Functions that were lost because of the failure will be picked up by another network node.
DISTRIBUTED DATABASE TRANSPARENCY FEATURES

4. Performance transparency

Allows the system to perform as if it were a centralized DBMS.

No performance degradation due to its use on a network or due to the network's platform differences.

Ensures that the system will find the most cost-effective path to access remote data.

5. Heterogeneity transparency

Allows the integration of several different local DBMSs (relational, network, and hierarchical) under a common, or global, schema.

The DDBMS is responsible for translating the data requests from the global schema to the local DBMS schema.
DISTRIBUTION TRANSPARENCY

A physically dispersed database is managed as though it were a centralized database.

Three levels of distribution transparency:

1. Fragmentation transparency is the highest level of transparency. The end user or programmer does not need to know that a database is partitioned. Therefore, neither fragment names nor fragment locations are specified prior to data access.

2. Location transparency exists when the end user or programmer must specify the database fragment names but does not need to specify where those fragments are located.

3. Local mapping transparency exists when the end user or programmer must specify both the fragment names and their locations.
DISTRIBUTION TRANSPARENCY
EMPLOYEE (EMP_NAME, EMP_DOB, EMP_ADDRESS, EMP_DEPARTMENT, EMP_SALARY)

The EMPLOYEE data are distributed over three different locations: New York, Atlanta, and Miami.

New York employee data are stored in fragment E1, Atlanta employee data are stored in fragment E2, and Miami employee data are stored in fragment E3.

Each fragment is unique. The unique fragment condition indicates that each row is unique, regardless of the fragment in which it is located. Assume that no portion of the database is replicated at any other site on the network.
DISTRIBUTION TRANSPARENCY

Case: select all employees whose EMP_DOB < '01-JAN-1960'.

The Database Supports Fragmentation Transparency

SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1960';

The query conforms to a nondistributed database query format; that is, it does not specify fragment names or locations.
DISTRIBUTION TRANSPARENCY

The Database Supports Location Transparency

SELECT * FROM E1 WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT * FROM E2 WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT * FROM E3 WHERE EMP_DOB < '01-JAN-1960';

Fragment names must be specified in the query, but the fragment's location is not specified.
DISTRIBUTION TRANSPARENCY

The Database Supports Local Mapping Transparency

Both the fragment name and its location must be specified in the query. Using pseudo-SQL:

SELECT * FROM E1 NODE NY WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '01-JAN-1960';
DISTRIBUTION TRANSPARENCY

Distribution transparency is supported by a distributed data dictionary (DDD), or distributed data catalog (DDC).

The DDC contains the description of the entire database (the distributed global schema) as seen by the database administrator.

The distributed global schema is the common database schema used by local TPs to translate user requests into subqueries (remote requests) that will be processed by different DPs.

The DDC is itself distributed, and it is replicated at the network nodes. Therefore, the DDC must maintain consistency through updating at all sites.
TRANSACTION TRANSPARENCY

Ensures that database transactions will maintain the distributed database's integrity and consistency.

A DDBMS database transaction can update data stored in many different computers connected in a network.

Transaction transparency ensures that the transaction will be completed only when all database sites involved in the transaction complete their part of the transaction.
TRANSACTION TRANSPARENCY

→ Remote Request
→ Remote Transaction
→ Distributed Request
→ Distributed Transaction
REMOTE REQUEST

A remote request lets a single SQL statement access the data that are to be processed by a single remote database processor. The SQL statement (or request) can reference data at only one remote site.
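
A minimal sketch of a remote request, assuming a PRODUCT table stored entirely at a single remote site B (the table, column, and value are illustrative):

-- One SQL statement, processed by a single remote DP (site B)
SELECT * FROM PRODUCT WHERE PROD_NUM = '231785';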
REMOTE TRANSACTION

A remote transaction, composed of several requests, accesses data at a single remote site.

Each SQL statement (or request) can reference only one (the same) remote DP at a time, and the entire transaction can reference and be executed at only one remote DP.
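
A hedged sketch of a remote transaction, assuming PRODUCT and INVOICE tables both stored at the same remote site B (table and column names are illustrative, and transaction-delimiter syntax varies by DBMS):

BEGIN WORK;
-- Request 1: references the remote DP at site B
UPDATE PRODUCT SET PROD_QTY = PROD_QTY - 1 WHERE PROD_NUM = '231785';
-- Request 2: references the same remote DP at site B
INSERT INTO INVOICE (INV_NUM, PROD_NUM, INV_QTY)
VALUES ('INV-1001', '231785', 1);
COMMIT WORK;

Both requests execute at site B, so the entire transaction touches only one remote DP.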
DISTRIBUTED TRANSACTION

A distributed transaction allows a transaction to reference several different local or remote DP sites.

Although each single request can reference only one local or remote DP site, the transaction as a whole can reference multiple DP sites because each request can reference a different site.
DISTRIBUTED TRANSACTION

Each request can access only one remote site at a time.

Suppose the table PRODUCT is divided into two fragments, PROD1 and PROD2, located at sites B and C, respectively. Within a distributed transaction, the single statement

SELECT * FROM PRODUCT WHERE PROD_NUM = '231785';

cannot be used, because one request cannot access data from more than one remote site.
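
A hedged sketch of a distributed transaction over these fragments (column names and values are illustrative). Each request references a single DP site, but the transaction as a whole spans sites B and C:

BEGIN WORK;
-- Request 1: references only PROD1 at site B
UPDATE PROD1 SET PROD_QTY = PROD_QTY - 1 WHERE PROD_NUM = '231785';
-- Request 2: references only PROD2 at site C
UPDATE PROD2 SET PROD_QTY = PROD_QTY + 10 WHERE PROD_NUM = '231790';
COMMIT WORK;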
DISTRIBUTED REQUEST

A distributed request lets a single SQL statement reference data located at several different local or remote DP sites.

Because each request (SQL statement) can access data from more than one local or remote DP site, a transaction can access several sites.

The ability to execute a distributed request provides fully distributed database processing capabilities because of the ability to:

→ Partition a database table into several fragments.

→ Reference one or more of those fragments with only one request. In other words, there is fragmentation transparency.

Full fragmentation transparency support is provided only by a DDBMS that supports distributed requests.
DISTRIBUTED REQUEST

A single SELECT statement can reference two tables, CUSTOMER and INVOICE, even though the two tables are located at two different sites, B and C.
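
A hedged sketch of such a distributed request (only the table names and sites come from the slide; the join column and predicate are illustrative):

SELECT CUSTOMER.CUS_NUM, INVOICE.INV_TOTAL
FROM CUSTOMER, INVOICE    -- CUSTOMER at site B, INVOICE at site C
WHERE CUSTOMER.CUS_NUM = INVOICE.CUS_NUM
AND INVOICE.INV_TOTAL > 1000;

A single statement pulls data from both sites; the DDBMS decomposes it into subqueries for each DP and assembles the result.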
DISTRIBUTED REQUEST

Suppose the CUSTOMER table is divided into two fragments, C1 and C2, located at sites B and C, respectively.

The distributed request feature also allows a single request to reference a physically partitioned table.
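
A hedged sketch, assuming fragmentation transparency (the CUS_BALANCE column is illustrative):

-- The user references one logical CUSTOMER table...
SELECT * FROM CUSTOMER WHERE CUS_BALANCE > 250;
-- ...while the DDBMS resolves the request against fragment C1
-- at site B and fragment C2 at site C.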
DISTRIBUTED CONCURRENCY CONTROL

Multisite, multiple-process operations are more likely to create data inconsistencies and deadlocked transactions than single-site systems are.

For example, suppose a transaction updated data at two local DPs and one of the DPs could not commit its part of the transaction's results. The transaction would yield an inconsistent database, with its inevitable integrity problems.

The solution to this problem is the two-phase commit protocol.
Two-Phase Commit Protocol

Distributed databases make it possible for a transaction to access data at several sites.

The two-phase commit protocol guarantees that if a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state.

Each DP maintains its own transaction log.
Two-Phase Commit Protocol

The protocol uses a DO-UNDO-REDO protocol and a write-ahead protocol.

It requires that the transaction log for each DP be written before the database fragment is actually updated.

The DO-UNDO-REDO protocol is used by the DP to roll transactions back and/or forward with the help of the system's transaction log entries.

The DO-UNDO-REDO protocol defines three types of operations: DO, UNDO, and REDO.
Two-Phase Commit Protocol

→ DO performs the operation and records the "before" and "after" values in the transaction log.

→ UNDO reverses an operation, using the log entries written by the DO portion of the sequence.

→ REDO redoes an operation, using the log entries written by the DO portion of the sequence.

To ensure that the DO, UNDO, and REDO operations can survive a system crash while they are being executed, a write-ahead protocol is used.

The write-ahead-log protocol ensures that transaction logs are always written before any database data are actually updated. This protocol ensures that, in case of a failure, the database can later be recovered to a consistent state, using the data in the transaction log.
Two-Phase Commit Protocol

The two-phase commit protocol defines the operations between two types of nodes:

→ The coordinator.
→ One or more subordinates.

The participating nodes agree on a coordinator. The coordinator role is assigned to the node that initiates the transaction.

The protocol is implemented in two phases.
Two-Phase Commit Protocol

Phase 1: Preparation

1. The coordinator sends a PREPARE TO COMMIT message to all subordinates.

2. The subordinates receive the message, write the transaction log using the write-ahead protocol, and send an acknowledgment message (YES/PREPARED TO COMMIT or NO/NOT PREPARED) to the coordinator.

3. The coordinator makes sure that all nodes are ready to commit, or it aborts the action.

If all nodes are PREPARED TO COMMIT, the transaction goes to Phase 2. If one or more nodes reply NO or NOT PREPARED, the coordinator broadcasts an ABORT message to all subordinates.
Two-Phase Commit Protocol

Phase 2: The Final COMMIT

1. The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.

2. Each subordinate receives the COMMIT message and then updates the database using the DO protocol.

3. The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.

If one or more subordinates did not commit, the coordinator sends an ABORT message, thereby forcing them to UNDO all changes.
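
Some DBMSs expose the subordinate side of this protocol in SQL. A hedged sketch using PostgreSQL-style two-phase commit commands (the transaction identifier and the update are illustrative; an external coordinator would drive these steps at every participating site):

-- Phase 1, at each subordinate: do the work, then vote
BEGIN;
UPDATE PRODUCT SET PROD_QTY = PROD_QTY - 1 WHERE PROD_NUM = '231785';
PREPARE TRANSACTION 'txn_231785';  -- log forced to disk; YES/PREPARED TO COMMIT

-- Phase 2, at each subordinate, once every site has voted YES:
COMMIT PREPARED 'txn_231785';
-- If any site voted NO, the coordinator instead requests:
-- ROLLBACK PREPARED 'txn_231785';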
PERFORMANCE TRANSPARENCY

A fragmented database makes query translation more complicated, because the DDBMS must decide which fragment of the database to access.

Data replication makes the access problem even more complex, because the DDBMS must also decide which copy of the data to access.

The DDBMS uses query optimization techniques to deal with such problems and to ensure acceptable database performance.
PERFORMANCE TRANSPARENCY

The objective of a query optimization routine is to minimize the total cost associated with the execution of a request.

The costs associated with a request are a function of the:
→ Access time (I/O) cost
→ Communication cost
→ CPU time cost

Some algorithms minimize total time; others minimize the communication time; and still others do not factor in the CPU time, considering it insignificant relative to other cost sources.
PERFORMANCE TRANSPARENCY

Query optimization in distributed database systems must provide distribution transparency plus replica transparency.

Replica transparency refers to the DDBMS's ability to hide the existence of multiple copies of data from the user.

Most of the algorithms proposed for query optimization are based on two principles:
→ The selection of the optimum execution order.
→ The selection of the sites to be accessed to minimize communication costs.
PERFORMANCE TRANSPARENCY

A query optimization algorithm can be evaluated on the basis of:
→ Its operation mode.
→ The timing of its optimization.
→ The type of information used.
Classification based on Operation Modes

Operation modes can be classified as manual or automatic.

Automatic query optimization means that the DDBMS finds the most cost-effective access path without user intervention.

Manual query optimization requires that the optimization be selected and scheduled by the end user or programmer.

Automatic query optimization is clearly more desirable from the end user's point of view, but the cost of such convenience is the increased overhead that it imposes on the DDBMS.
Classification based on Timing

Based on when the optimization is done, algorithms are classified as static or dynamic.

1. Static query optimization takes place at compilation time. When the program is submitted to the DBMS for compilation, the DBMS creates the plan necessary to access the database. When the program is executed, the DBMS uses that plan to access the database.

2. Dynamic query optimization takes place at execution time. The access strategy is dynamically determined by the DBMS at run time, using the most up-to-date information about the database.
Classification based on Type of Information

Optimization techniques can also be classified by the type of information that is used to optimize the query.

1. A statistically based query optimization algorithm uses statistical information about the database to determine the best access strategy.

The statistical information is generated in one of two different modes: dynamic or manual. In the dynamic statistical generation mode, the DDBMS automatically evaluates and updates the statistics after each access. In the manual statistical generation mode, the statistics must be updated periodically through a user-selected utility.
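
The utility is DBMS-specific; a hedged example using PostgreSQL-style syntax, run against the EMPLOYEE table from the earlier example:

-- Recompute the optimizer's statistics for one table
ANALYZE EMPLOYEE;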
Classification based on Type of Information

2. A rule-based query optimization algorithm is based on a set of user-defined rules to determine the best query access strategy. The rules are entered by the end user or database administrator, and they are typically very general in nature.
DISTRIBUTED DATABASE DESIGN

Issue → Managed by:

1. How to partition the database into fragments → Data fragmentation
2. Which fragments to replicate → Data replication
3. Where to locate those fragments and replicas → Data allocation
DATA FRAGMENTATION

Data fragmentation allows you to break a single object into two or more segments, or fragments. The object might be a user's database, a system database, or a table. Each fragment can be stored at any site over a computer network. Information about data fragmentation is stored in the distributed data catalog (DDC), from which it is accessed by the TP to process user requests.

Three types of data fragmentation exist at the table level: horizontal, vertical, and mixed. (A fragmented table can always be re-created from its fragments by a combination of unions and joins.)
Horizontal Fragmentation

Refers to the division of a relation into subsets (fragments) of tuples (rows).

Each fragment is stored at a different node, and each fragment has unique rows.

Each fragment represents the equivalent of a SELECT statement, with the WHERE clause on a single attribute.
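
A hedged sketch, reusing the EMPLOYEE table from the distribution transparency example. The fragmentation attribute is assumed (the slides fragment by city but do not name the column), so EMP_LOCATION is illustrative:

-- Each horizontal fragment keeps all columns but only some rows
CREATE TABLE E1 AS SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'New York';
CREATE TABLE E2 AS SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'Atlanta';
CREATE TABLE E3 AS SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'Miami';
-- The original table is re-created by a union of the fragments:
-- SELECT * FROM E1 UNION SELECT * FROM E2 UNION SELECT * FROM E3;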
Vertical Fragmentation

Refers to the division of a relation into attribute (column) subsets.

Each subset (fragment) is stored at a different node.

Each fragment has unique columns, with the exception of the key column, which is common to all fragments.

This is the equivalent of the PROJECT statement in SQL.
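
A hedged sketch against the same EMPLOYEE table. The slides list no primary key, so an EMP_NUM key column is assumed for illustration:

-- Each vertical fragment keeps all rows but only some columns,
-- plus the shared key
CREATE TABLE EMP_V1 AS
  SELECT EMP_NUM, EMP_NAME, EMP_DOB, EMP_ADDRESS FROM EMPLOYEE;
CREATE TABLE EMP_V2 AS
  SELECT EMP_NUM, EMP_DEPARTMENT, EMP_SALARY FROM EMPLOYEE;
-- The original table is re-created by a join on the key:
-- SELECT * FROM EMP_V1 NATURAL JOIN EMP_V2;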
Mixed Fragmentation

Refers to a combination of the horizontal and vertical strategies.

A table may be divided into several horizontal subsets (rows), each one having a subset of the attributes (columns).
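
A hedged sketch combining the two previous examples (EMP_LOCATION and EMP_NUM remain illustrative): a row subset is selected first, and only some of its columns are kept:

-- New York rows only, payroll columns only
CREATE TABLE E1_PAY AS
  SELECT EMP_NUM, EMP_SALARY
  FROM EMPLOYEE
  WHERE EMP_LOCATION = 'New York';

Re-creating the original table then requires both unions (across the horizontal subsets) and joins (across the vertical subsets).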
DATA REPLICATION

Data replication refers to the storage of data copies at multiple sites served by a computer network.

Fragment copies can be stored at several sites to serve specific information requirements.

Suppose database A is divided into two fragments, A1 and A2. Within a replicated distributed database, fragment A1 is stored at sites S1 and S2, while fragment A2 is stored at sites S2 and S3.

Replicated data are subject to the mutual consistency rule. The mutual consistency rule requires that all copies of data fragments be identical. Therefore, to maintain data consistency among the replicas, the DDBMS must ensure that a database update is performed at all sites where replicas exist.
DATA REPLICATION

Three replication scenarios exist:

→ A fully replicated database stores multiple copies of each database fragment at multiple sites. In this case, all database fragments are replicated. (A fully replicated database can be impractical due to the amount of overhead it imposes on the system.)

→ A partially replicated database stores multiple copies of some database fragments at multiple sites. Most DDBMSs are able to handle the partially replicated database well.

→ An unreplicated database stores each database fragment at a single site. Therefore, there are no duplicated database fragments.
DATA REPLICATION

Advantages:
→ Improved data availability, better load distribution, improved data failure-tolerance, and reduced query costs.

Disadvantages:
→ Additional DDBMS processing overhead, because each data copy must be maintained by the system.
→ Because the data are replicated at other sites, there are associated storage costs and increased transaction times (as data must be updated at several sites concurrently to comply with the mutual consistency rule).
DATA ALLOCATION

Data allocation describes the process of deciding where to locate data. Data allocation strategies are as follows:

→ With centralized data allocation, the entire database is stored at one site.

→ With partitioned data allocation, the database is divided into two or more disjoint parts (fragments) and stored at two or more sites.

→ With replicated data allocation, copies of one or more database fragments are stored at several sites.

Most data allocation studies focus on one issue: which data to locate where.
DATA ALLOCATION

Data allocation algorithms take into consideration a variety of factors, including:

1. Performance and data availability goals.

2. Size, number of rows, and number of relations that an entity maintains with other entities.

3. Types of transactions to be applied to the database and the attributes accessed by each of those transactions.