Distributed DBMS © M. T. Özsu & P. Valduriez Ch.4/1
Outline
• Introduction
• Background
• Distributed Database Design
• Database Integration
➡ Schema Matching
➡ Schema Mapping
• Semantic Data Control
• Distributed Query Processing
• Multimedia Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Problem Definition
• Given existing databases with their Local Conceptual Schemas
(LCSs), how to integrate the LCSs into a Global Conceptual Schema (GCS)
➡ GCS is also called mediated schema
• Bottom-up design process
Integration Alternatives
• Physical integration
➡ Source databases integrated and the integrated database is materialized
➡ Data warehouses
• Logical integration
➡ Global conceptual schema is virtual and not materialized
➡ Enterprise Information Integration (EII)
Data Warehouse Approach
Bottom-up Design
• GCS (also called mediated schema) is defined first
➡ Map LCSs to this schema
➡ As in data warehouses
• GCS is defined as an integration of parts of LCSs
➡ Generate GCS and map LCSs to this GCS
GCS/LCS Relationship
• Local-as-view
➡ The GCS definition is assumed to exist, and each LCS is treated as a view
definition over it
• Global-as-view
➡ The GCS is defined as a set of views over the LCSs
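A minimal sketch of the global-as-view idea, under stated assumptions: the two local relations, their attribute names, and the global relation EMPLOYEE(ID, NAME) are hypothetical and chosen only for illustration. The global relation stores nothing itself; it is computed as a query over the local sources.

```python
# Hypothetical local sources; relation and attribute names are illustrative only.
lcs1_emp = [
    {"ENO": 1, "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": 2, "ENAME": "M. Smith", "TITLE": "Analyst"},
]
lcs2_worker = [
    {"NUMBER": 7, "NAME": "A. Lee"},
]


def global_employee():
    """GAV: the global relation EMPLOYEE(ID, NAME) is a view over the LCSs."""
    rows = [{"ID": r["ENO"], "NAME": r["ENAME"]} for r in lcs1_emp]
    rows += [{"ID": r["NUMBER"], "NAME": r["NAME"]} for r in lcs2_worker]
    return rows
```

Under local-as-view the direction is reversed: each local relation would instead be described as a query over the (already defined) global schema.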
Database Integration Process
Recall Access Architecture
Database Integration Issues
• Schema translation
➡ Component database schemas translated to a common intermediate canonical
representation
• Schema generation
➡ Intermediate schemas are used to create a global conceptual schema
Schema Translation
• What is the canonical data model?
➡ Relational
➡ Entity-relationship
✦ DIKE
➡ Object-oriented
✦ ARTEMIS
➡ Graph-oriented
✦ DIPE, TranScm, COMA, Cupid
✦ Preferable with emergence of XML
✦ No common graph formalism
• Mapping algorithms
➡ These are well-known
Schema Generation
• Schema matching
➡ Finding the correspondences between multiple schemas
• Schema integration
➡ Creation of the GCS (or mediated schema) using the correspondences
• Schema mapping
➡ How to map data from local databases to the GCS
• Important: sometimes the GCS is defined first, and schema matching and
schema mapping are done against this target GCS
Running Example
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET, LOC, CNAME)
ASG(ENO, PNO, RESP, DUR)
PAY(TITLE, SAL)
Relational
E-R Model
Schema Matching
• Schema heterogeneity
➡ Structural heterogeneity
✦ Type conflicts
✦ Dependency conflicts
✦ Key conflicts
✦ Behavioral conflicts
➡ Semantic heterogeneity
✦ More important and harder to deal with
✦ Synonyms, homonyms, hypernyms
✦ Different ontology
✦ Imprecise wording
Schema Matching (cont’d)
• Other complications
➡ Insufficient schema and instance information
➡ Unavailability of schema documentation
➡ Subjectivity of matching
• Issues that affect schema matching
➡ Schema versus instance matching
➡ Element versus structure level matching
➡ Matching cardinality
Schema Matching Approaches
Linguistic Schema Matching
• Use element names and other textual information (textual
descriptions, annotations)
• May use external sources (e.g., thesauri)
• 〈SC1.element-1 ≈ SC2.element-2, p, s〉
➡ Element-1 in schema SC1 is similar to element-2 in schema SC2 if predicate p
holds with a similarity value of s
• Schema level
➡ Deal with names of schema elements
➡ Handle cases such as synonyms, homonyms, hypernyms, data type
similarities
• Instance level
➡ Focus on information retrieval techniques (e.g., word frequencies, key terms)
➡ “Deduce” similarities from these
Linguistic Matchers
• Use a set of linguistic (terminological) rules
• Basic rules can be hand-crafted or may be discovered from outside sources
(e.g., WordNet)
• Predicate p and similarity value s
➡ hand-crafted ⇒ specified,
➡ discovered ⇒ may be computed or specified by an expert after discovery
• Examples
➡ 〈uppercase names ≈ lower case names, true, 1.0〉
➡ 〈uppercase names ≈ capitalized names, true, 1.0〉
➡ 〈capitalized names ≈ lower case names, true, 1.0〉
➡ 〈DB1.ASG ≈ DB2.WORKS_IN, true, 0.8〉
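The rules above can be sketched as data plus a lookup. This is only an illustration: the `Correspondence` type, the `case_rule` and `match` helpers, and the `HAND_CRAFTED` table are hypothetical names, and the 0.8 value is simply the one given in the example.

```python
from typing import NamedTuple, Optional


class Correspondence(NamedTuple):
    """One rule instance: <left ≈ right, p, s>."""
    left: str
    right: str
    predicate: bool
    similarity: float


# Hand-crafted rule from the example: ASG ≈ WORKS_IN with similarity 0.8.
HAND_CRAFTED = {("ASG", "WORKS_IN"): 0.8}


def case_rule(name1: str, name2: str) -> Optional[Correspondence]:
    """Names that differ only in letter case match with similarity 1.0."""
    if name1.lower() == name2.lower():
        return Correspondence(name1, name2, True, 1.0)
    return None


def match(name1: str, name2: str) -> Optional[Correspondence]:
    """Try the case rule first, then the hand-crafted table."""
    rule = case_rule(name1, name2)
    if rule is not None:
        return rule
    s = HAND_CRAFTED.get((name1.upper(), name2.upper()))
    if s is not None:
        return Correspondence(name1, name2, True, s)
    return None
```

For instance, `match("ENO", "eno")` yields a correspondence with similarity 1.0, while `match("ENO", "PNO")` yields no correspondence because no rule applies.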
Automatic Discovery of Name Similarities
• Affixes
➡ Common prefixes and suffixes between two element name strings
• N-grams
➡ Comparing how many substrings of length n are common between the two
name strings
• Edit distance
➡ Number of character modifications (insertions, deletions, substitutions) that
need to be performed to convert one string into the other
• Soundex code
➡ Phonetic similarity between names based on their soundex codes
• Also look at data types
➡ Data type similarity may suggest stronger relationship than the computed
similarity using these methods or to differentiate between multiple strings
with same value
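As an illustration of the affix technique listed above, the following sketch measures the longest common prefix or suffix of two names. The normalization (dividing by the length of the shorter name) and the function names are assumptions made for this example, not something the slides prescribe.

```python
import os


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix, ignoring letter case."""
    return len(os.path.commonprefix([a.lower(), b.lower()]))


def common_suffix_len(a: str, b: str) -> int:
    """Length of the longest common suffix (prefix of the reversed strings)."""
    return common_prefix_len(a[::-1], b[::-1])


def affix_similarity(a: str, b: str) -> float:
    """Longest shared affix relative to the shorter name (assumed normalization)."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return max(common_prefix_len(a, b), common_suffix_len(a, b)) / shortest
```

With this normalization, RESP is a full prefix of RESPONSIBILITY and scores 1.0, whereas ENO and PNO share only the suffix NO and score 2/3.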
N-gram Example
• 3-grams of string “Responsibility” are the following:
➡ Res, esp, spo, pon, ons, nsi
➡ sib, ibi, bil, ili, lit, ity
• 3-grams of string “Resp” are
➡ Res
➡ esp
• 3-gram similarity: 2/12 = 0.17
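The computation can be reproduced with a short sketch. It assumes that names are compared case-insensitively, that repeated n-grams are counted once, and that the shared count is divided by the larger of the two n-gram sets, which is the normalization that yields 2/12 here; other normalizations are possible.

```python
def ngrams(s: str, n: int = 3) -> set:
    """Set of all substrings of length n (case-insensitive, duplicates merged)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by the size of the larger n-gram set."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / max(len(ga), len(gb))
```

Here `ngrams("Responsibility")` contains 12 distinct 3-grams and `ngrams("Resp")` contains 2, both of which are shared, giving 2/12.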
Edit Distance Example
• Again consider “Responsibility” and “Resp”
• To convert “Responsibility” to “Resp”
➡ Delete characters “o”, “n”, “s”, “i”, “b”, “i”, “l”, “i”, “t”, “y”
• To convert “Resp” to “Responsibility”
➡ Add characters “o”, “n”, “s”, “i”, “b”, “i”, “l”, “i”, “t”, “y”
• The number of edit operations required is 10
• Similarity is 1 − (10/14) = 0.29
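A sketch of the same calculation, assuming the standard Levenshtein distance (which counts substitutions as well as insertions and deletions) and normalizing by the length of the longer string, as in the example. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """1 minus the distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest
```

For “Responsibility” and “Resp” the distance is 10 and the similarity is 1 − 10/14, about 0.29.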
Constraint-based Matchers
• Data always have constraints – use them
➡ Data type information
➡ Value ranges
➡ …
• Examples
➡ RESP and RESPONSIBILITY: n-gram similarity = 0.17, edit distance similarity
= 0.29 (low)
➡ If they come from the same domain, this may increase their similarity value
➡ ENO in relational, WORKER.NUMBER and PROJECT.NUMBER in E-R
➡ ENO and WORKER.NUMBER may have type INTEGER while
PROJECT.NUMBER may have STRING
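One way to use such constraints is to adjust a name-based score by data type agreement. The sketch below is an assumption-laden illustration: the `bonus` and `penalty` values are arbitrary, and the type table simply encodes the "may have" types from the example.

```python
# Assumed declared types, following the example above.
TYPES = {
    "ENO": "INTEGER",
    "WORKER.NUMBER": "INTEGER",
    "PROJECT.NUMBER": "STRING",
}


def adjust_for_type(name_similarity: float, type1: str, type2: str,
                    bonus: float = 0.2, penalty: float = 0.2) -> float:
    """Raise the score when declared types agree, lower it when they differ."""
    if type1 == type2:
        return min(1.0, name_similarity + bonus)
    return max(0.0, name_similarity - penalty)


def constrained_similarity(elem1: str, elem2: str, name_similarity: float) -> float:
    """Look up both element types and apply the adjustment."""
    return adjust_for_type(name_similarity, TYPES[elem1], TYPES[elem2])
```

Given the same name-based score for both candidates, ENO then ranks closer to WORKER.NUMBER than to PROJECT.NUMBER because only the former shares its type.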
Constraint-based Structural Matching
• If two schema elements are structurally similar, then there is a higher
likelihood that they represent the same concept
• Structural similarity:
➡ Same properties (attributes)
➡ “Neighborhood” similarity
✦ Using graph representation
✦ The set of nodes that can be reached within a particular path length from a node
are the neighbors of that node
✦ If two concepts (nodes) have similar set of neighbors, they are likely to represent
the same concept
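Neighborhood similarity can be sketched with a breadth-first search over a schema graph. The adjacency-dictionary representation, the Jaccard overlap measure, and the assumption that neighbors can be compared by label are all choices made for this illustration; in practice the neighbors would themselves be compared with other matchers.

```python
from collections import deque


def neighbors(graph: dict, start: str, max_len: int) -> set:
    """Nodes reachable from start within max_len edges (start excluded)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_len:
            continue  # do not expand beyond the allowed path length
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    seen.discard(start)
    return seen


def neighborhood_similarity(g1: dict, n1: str, g2: dict, n2: str,
                            max_len: int = 1) -> float:
    """Jaccard overlap of the two neighbor sets (labels compared directly)."""
    a, b = neighbors(g1, n1, max_len), neighbors(g2, n2, max_len)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, with hypothetical graphs `{"EMP": ["NUMBER", "NAME", "TITLE"]}` and `{"WORKER": ["NUMBER", "NAME", "WORKS_IN"]}`, the nodes EMP and WORKER share two of four distinct neighbors at path length 1.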
Learning-based Schema Matching
• Use machine learning techniques to determine schema matches
• Classification problem: classify concepts from various schemas into classes
according to their similarity. Those that fall into the same class represent
similar concepts
• Similarity is defined according to features of data instances
• Classification is “learned” from a training set
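As a toy illustration of learning from data instances, the sketch below trains a 1-nearest-neighbor classifier on two simple instance features. The features (mean value length and fraction of digit characters), the class labels, and the choice of nearest-neighbor are assumptions for this example; the slides do not name a specific learner. Columns are assumed to be non-empty.

```python
def features(values):
    """Two instance features: mean value length and fraction of digit characters."""
    text = [str(v) for v in values]
    total_chars = sum(len(t) for t in text) or 1  # avoid division by zero
    mean_len = sum(len(t) for t in text) / len(text)
    digit_frac = sum(ch.isdigit() for t in text for ch in t) / total_chars
    return (mean_len, digit_frac)


def train(labelled_columns):
    """The 'model' is just the feature vector of every labelled training column."""
    return [(features(values), label) for values, label in labelled_columns]


def classify(model, values):
    """Return the label of the training column closest in feature space."""
    f = features(values)

    def dist(g):
        return sum((x - y) ** 2 for x, y in zip(f, g))

    return min(model, key=lambda item: dist(item[0]))[1]


# Hypothetical training set: columns whose concept is already known.
MODEL = train([
    ([1, 2, 3, 10], "employee number"),
    (["J. Doe", "M. Smith"], "employee name"),
])
```

A column from another schema, such as `[7, 42, 105]`, is then assigned to the class of the most similar training column. The features are left unscaled here for brevity.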
Learning-based Schema Matching (cont’d)
Combined Schema Matching Approaches
• Use multiple matchers
➡ Each matcher focuses on one area (name, etc)
• Meta-matcher integrates these into one prediction
• Integration may be simple (take average of similarity values) or more
complex (see Fagin’s work)
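The simple case (averaging) can be sketched as follows. The `combine` function and the optional per-matcher weights are assumptions for illustration; this is not the more complex aggregation the slide refers to.

```python
def combine(scores: dict, weights: dict = None) -> float:
    """Weighted average of per-matcher similarity values (plain average by default)."""
    if not scores:
        raise ValueError("need at least one matcher score")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

For instance, `combine({"ngram": 0.2, "edit": 0.4})` averages the two matcher outputs, and passing `weights` lets one matcher count more than another.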
Schema Integration
• Use the correspondences to create a GCS
• Mainly a manual process, although rules can help
Binary Integration Methods
N-ary Integration Methods
Schema Mapping
• Mapping data from each local database (source) to GCS (target) while
preserving semantic consistency as defined in both source and target.
• Data warehouses ⇒ actual translation
• Data integration systems ⇒ discover mappings that can be used in the
query processing phase
• Mapping creation
• Mapping maintenance
Mapping Creation
Given
➡ A source LCS, S = {Si}
➡ A target GCS, T = {Ti}
➡ A set of value correspondences, V = {Vi}, discovered
during the schema matching phase
Produce a set of queries that, when executed, will create GCS data instances
from the source data.
For each Tk, we are looking for a query Qk that is defined on a (possibly proper)
subset of the relations in S such that, when executed, it will generate data for
Tk from the source relations
Mapping Creation Algorithm
General idea:
• Consider each Tk in turn. Divide Vk into subsets such that
each subset specifies one possible way that values of Tk can be computed.
• Each subset can be mapped to a query that, when executed, would generate
some of Tk’s data.
• The union of these queries gives Qk
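The general idea can be sketched with queries modeled as functions over the source relations. Everything named here is hypothetical: the source relations EMP and CONTRACTOR, the target tuple layout (number, name), and the helper `q_k`. Each partial query produces some of the target relation's tuples, and the mapping query is their union.

```python
# Hypothetical source relations, represented as lists of dicts.
SOURCE = {
    "EMP": [{"ENO": 1, "ENAME": "J. Doe"}],
    "CONTRACTOR": [
        {"NUMBER": 7, "NAME": "A. Lee"},
        {"NUMBER": 1, "NAME": "J. Doe"},
    ],
}


def q_from_emp(src):
    """One way to compute target tuples: project EMP onto (ENO, ENAME)."""
    return {(r["ENO"], r["ENAME"]) for r in src["EMP"]}


def q_from_contractor(src):
    """Another way: project CONTRACTOR onto (NUMBER, NAME)."""
    return {(r["NUMBER"], r["NAME"]) for r in src["CONTRACTOR"]}


def q_k(src, partial_queries):
    """The mapping query for one target relation: union of its partial queries."""
    result = set()
    for q in partial_queries:
        result |= q(src)
    return result
```

Because the result is a set, a tuple produced by more than one partial query appears only once in the target relation.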
