INDEXING TECHNIQUES FOR
ADVANCED DATABASE SYSTEMS
The Kluwer International Series on
ADVANCES IN DATABASE SYSTEMS
Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907
Other books in the Series:
DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis
by Alexander Thomasian
ISBN: 0-7923-9741-X
TIME-CONSTRAINED TRANSACTION MANAGEMENT:
Real-Time Constraints in Database Transaction Systems
by Nandit R. Soparkar, Henry F. Korth, Abraham Silberschatz
ISBN: 0-7923-9752-5
SEARCHING MULTIMEDIA DATABASES BY CONTENT
by Christos Faloutsos
ISBN: 0-7923-9777-0
REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS
by Abdelsalam A. Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava
ISBN: 0-7923-9800-9
VIDEO DATABASE SYSTEMS: Issues, Products, and Applications
by Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed
ISBN: 0-7923-9872-6
DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS
by Nabil R. Adam and Aryya Gangopadhyay
ISBN: 0-7923-9924-2
INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES
by Thomas A. Mueck and Martin L. Polaschek
ISBN: 0-7923-9971-4
INDEXING TECHNIQUES FOR
ADVANCED DATABASE SYSTEMS
by
Elisa Bertino
University of Milano, Italy
Beng Chin Ooi
National University of Singapore, Singapore
Ron Sacks-Davis
RMIT, Australia
Kian-Lee Tan
National University of Singapore, Singapore
Justin Zobel
RMIT, Australia
Boris Shidlovsky
Grenoble Laboratory, France
Barbara Catania
University of Milano, Italy
SPRINGER SCIENCE+BUSINESS MEDIA, LLC
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available
from the Library of Congress.
ISBN 978-1-4613-7856-3 ISBN 978-1-4615-6227-6 (eBook)
DOI 10.1007/978-1-4615-6227-6
Copyright © 1997 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 1997
Softcover reprint of the hardcover 1st edition 1997
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photo-
copying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.
Printed on acid-free paper.
Contents
Preface VII
1. OBJECT-ORIENTED DATABASES 1
1.1 Object-oriented data model and query language 3
1.2 Index organizations for aggregation graphs 7
1.3 Index organizations for inheritance hierarchies 20
1.4 Integrated organizations 29
1.5 Caching and pointer swizzling 36
1.6 Summary 38
2. SPATIAL DATABASES 39
2.1 Query processing using approximations 40
2.2 A taxonomy of spatial indexes 42
2.3 Binary-tree based indexing techniques 46
2.4 B-tree based indexing techniques 56
2.5 Cell methods based on dynamic hashing 64
2.6 Spatial objects ordering 70
2.7 Comparative evaluation 71
2.8 Summary 73
3. IMAGE DATABASES 77
3.1 Image database systems 78
3.2 Indexing issues and basic mechanisms 80
3.3 A taxonomy on image indexes 84
3.4 Color-spatial hierarchical indexes 91
3.5 Signature-based color-spatial retrieval 105
3.6 Summary 109
4. TEMPORAL DATABASES 113
4.1 Temporal databases 114
4.2 Temporal queries 119
4.3 Temporal indexes 121
4.4 Experimental study 142
4.5 Summary 148
5. TEXT DATABASES 151
5.1 Querying text databases 152
5.2 Indexing 157
5.3 Query evaluation 169
5.4 Refinements to text databases 175
5.5 Summary 181
6. EMERGING APPLICATIONS 185
6.1 Indexing techniques for parallel and distributed databases 186
6.2 Indexing issues in mobile computing 194
6.3 Indexing techniques for data warehousing systems 203
6.4 Indexing techniques for the Web 210
6.5 Indexing techniques for constraint databases 214
References 225
Index 247
Preface
Database management systems are widely accepted as a standard tool for ma-
nipulating large volumes of data on secondary storage. To enable fast access
to stored data according to its content, databases use structures known as in-
dexes. While indexes are optional, as data can always be located by exhaustive
search, they are the primary means of reducing the volume of data that must
be fetched and processed in response to a query. In practice large database files
must be indexed to meet performance requirements.
Recent years have seen explosive growth in use of new database applications
such as CAD/CAM systems, spatial information systems, and multimedia in-
formation systems. The needs of these applications are far more complex than
traditional business applications. They call for support of objects with complex
data types, such as images and spatial objects, and for support of objects with
wildly varying numbers of index terms, such as documents. Traditional index-
ing techniques such as the B-tree and its variants do not efficiently support
these applications, and so new indexing mechanisms have been developed. As
a result of the demand for database support for new applications, there has
been a proliferation of new indexing techniques.
The need for a book addressing indexing problems in advanced applications
is evident. For practitioners and database and application developers, this
book explains best practice, guiding selection of appropriate indexes for each
application. For researchers, this book provides a foundation for development
of new and more robust indexes. For newcomers, this book is an overview of
the wide range of advanced indexing techniques.
The book consists of six self-contained chapters, each handled by area ex-
perts: Chapters 1 and 6 by Bertino, Catania, and Shidlovsky, Chapters 2, 3
and 4 by Ooi and Tan, and Chapter 5 by Sacks-Davis and Zobel. Each of the
first five chapters discusses indexing problems and techniques for a different
database application; the last chapter discusses indexing problems in emerging
applications.
In Chapter 1 we discuss indexes and query evaluation for object-oriented
databases. Complex objects, variable-length objects, large objects, versions,
and long transactions cannot be supported efficiently by relational database
systems. The inadequacy of relational databases for these applications has pro-
vided the impetus for database researchers to develop object-oriented database
systems, which capture sophisticated semantics and provide a close model of
real-world applications. Object-oriented databases are a confluence of two tech-
nologies: databases and object-oriented programming languages. However, the
concepts of object, method, message, aggregation and generalization introduce
new problems to query evaluation. For example, aggregation allows an object
to be retrieved through its composite objects or based on the attribute values
of its component objects, while generalization allows an object to be retrieved
as an instance of its superclass.
Spatial data is large in volume and rich in structures and relationships.
Queries that involve the use of spatial operators (such as spatial intersection
and containment) are common. Operations involving these operators are ex-
pensive to compute, compared to operations such as join, and indexes are
essential to reduction of query processing costs. Indexing in a spatial database
is problematic because spatial objects can have non-zero extent and are asso-
ciated with spatial coordinates, and many-to-many spatial relationships exist
between spatial objects. Search is based, not only on attribute values, but on
spatial properties. In Chapter 2, we address issues related to spatial indexing
and analyze several promising indexing methods.
Conventional databases only store the current facts of the organization they
model. Changes in the real world are reflected by overwriting out-of-date data
with new facts. Monitoring these changes and past values of the data is, how-
ever, useful for tracking historical trends and time-varying events. In temporal
databases, facts are not deleted but instead are associated with times, which
are stored with the data to allow retrieval based on temporal relationships. To
support efficient retrieval based on time, temporal indexes have been proposed.
In Chapter 4, we describe and review temporal indexing mechanisms.
In large collections of images, a natural and useful way to retrieve image
data is by queries based on the contents of images. Such image-based queries
can be specified symbolically by describing their contents in terms of image
features such as color, shape, texture, objects, and spatial relationship between
them; or pictorially using sketches or example images. Supporting content-
based retrieval of image data is a difficult problem and embraces technologies
including image processing, user interface design, and database management.
To provide efficient content-based retrieval, indexes based on image features
are required. We consider feature-based indexing techniques in Chapter 3.
Text data without uniform structure forms the main bulk of data in corpo-
rate repositories, digital libraries, legal and court databases, and document
archives such as newspaper databases. Retrieval of documents is achieved
through matching words and phrases in document and query, but for docu-
ments Boolean-style matching is not usually effective. Instead, approximate
querying techniques are used to identify the documents that are most likely to
be relevant to the query. Effectiveness can be enhanced by use of transforma-
tions such as stemming and methodologies such as feedback. To support fast
text searching, however, indexing techniques such as special-purpose inverted
files are required. In Chapter 5, we examine indexes and query evaluation for
document databases.
In the first five chapters we cover the indexing topics of greatest importance
today. There are however many database applications that make use of indexing
but do not fall into one of the above five areas, such as data warehousing, which
has recently become an active research topic due to both its complexity and
its commercial potential. Queries against warehouses require large numbers of joins and the calculation of aggregate functions. Another example is the use
of indexes to minimize energy consumption in portable equipment used in a
highly mobile environment. In Chapter 6 we discuss indexing mechanisms for
several such emerging database applications.
We are grateful to the many people and organizations who helped with
this book, and with the research that made it possible. In particular we thank
Timothy Arnold-Moore, Tat Seng Chua, Winston Chua, Cheng Hian Goh, Peng
Jiang, Marcin Kaszkiel, Alan Kent, Ramamohanarao Kotagiri, Wan-Meng Lee,
Alistair Moffat, Michael Persin, Yong Tai Tan, and Ross Wilkinson. Dave Abel,
Jiawei Han and Jürg Nievergelt read earlier drafts of several chapters, and
provided helpful comments. We are also grateful to the Multimedia Database
Systems group at RMIT, the RMIT Department of Computer Science, the
Australian Research Council and the Department of Information Systems and
Computer Science at the National University of Singapore.
Elisa Bertino
Barbara Catania
Beng Chin Ooi
Ron Sacks-Davis
Boris Shidlovsky
Kian-Lee Tan
Justin Zobel
1 OBJECT-ORIENTED DATABASES
There has been a growing acceptance of the object-oriented data model as
the basis of next generation database management systems (DBMSs). Both
pure object-oriented DBMS (OODBMSs) and object-relational DBMS (OR-
DBMSs) have been developed based on object-oriented concepts. Object-
relational DBMS, in particular, extend the SQL language by incorporating
all the concepts of the object-oriented data model. A large number of products
for both categories of DBMS are available today. In particular, all major vendors
of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
The widespread adoption of the object-oriented data model in the database
area has been driven by the requirements posed by advanced applications, such
as CAD/CAM, software engineering, workflow systems, geographic information
systems, telecommunications, and multimedia information systems, to name just a few. These applications require effective support for the management of complex objects. For example, a typical advanced application requires handling
text, graphics, bitmap pictures, sounds and animation files. Other crucial re-
quirements derive from the evolutionary nature of applications and include
multiple versions of the same data and long-lived transactions. The use of
an object-oriented data model satisfies many of the above requirements. For
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
example, an application's complex objects can be directly represented by the
model, and therefore there is no need to flatten them into tuples, as when re-
lational DBMSs are used. Moreover, the encapsulation property supports the
integration of packages for handling complex objects. However, because of the
increased complexity of the data model, and of the additional operational re-
quirements, such as versions or long transactions, the design of an OODBMS
or an ORDBMS poses several issues, both on the data model and languages,
and on the architecture [Kim et al., 1989, Nori, 1996, Zdonik and Maier, 1989].
An important issue is related to the efficient support of both navigational
and set-oriented accesses. Both types of accesses occur in applications typical
of OODBMS and ORDBMS and both must be efficiently supported. Navigational
access is based on traversing object references; a typical example is represented
by graph traversal. Set-oriented access is based on the use of a high-level,
declarative query language. Object query languages have by now reached a certain degree of maturity. A standard query language, known as OQL
(Object Query Language), has been proposed as part of the ODMG standard-
ization effort [Bartels, 1996, Cattell, 1993], whereas the SQL-3 standard, still
under development, is expected to include all major object modeling concepts
[Melton, 1996]. The two means of access are often complementary. A query
selects a set of objects. The retrieved objects and their components are then
accessed by using navigational capabilities [Bertino and Martino, 1993]. A brief
summary of query languages is presented in Section 1.1.
Different strategies and techniques are required to support the two above ac-
cess modalities. Efficient navigational access is based on caching techniques and
transformation of navigation pointers into main-memory addresses (swizzling),
whereas efficient execution of queries is achieved by the allocation of suitable
access structure and the use of sophisticated query optimizers. Access struc-
tures typically used in relational DBMSs are based on variations of the B-tree
structure [Comer, 1979] or on hashing techniques. An index is maintained on an
attribute or combination of attributes of a relation. Since an object-oriented
data model has many differences from the relational model, suitable index-
ing techniques must be developed to efficiently support object-oriented query
languages. In this chapter we survey some of the issues associated with indexing techniques and describe proposed approaches. We also briefly discuss caching and pointer swizzling techniques; for more details on these, we refer the reader to [Kemper and Kossmann, 1995]. In the remainder of this
chapter, we cast our discussion in terms of the object-oriented data model typical of OODBMSs, because most of the work on indexing techniques has been developed in the framework of OODBMSs. However, most of the discussion
applies to ORDBMSs as well.
The remainder of the chapter is organized as follows. Section 1.1 presents
an overview of the basic concepts of object-oriented data models, query lan-
guages, and query processing. For the purpose of the discussion, we consider
an object-oriented database organized along two dimensions: aggregation, and
inheritance. Indexing techniques for each of those dimensions are discussed in
Sections 1.2 and 1.3, respectively. Section 1.4 presents integrated organizations,
supporting queries along both aggregation and inheritance graphs. Section 1.5
briefly discusses method precomputation, caching and swizzling. Finally, Sec-
tion 1.6 presents some concluding remarks.
1.1 Object-oriented data model and query language
An object-oriented data model is based on a number of concepts [Bertino and
Martino, 1993, Cattell, 1993, Zdonik and Maier, 1989]:
• Each real-world entity is modeled by an object. Each object is associated
with a unique identifier (called an OID) that makes the object distinguishable from any other object in the database. OODBMSs provide objects with
persistent and immutable identifiers: an object's identifier does not change
even if the object modifies its state.
• Each object has a set of instance attributes and methods (operations). The
value of an attribute can be an object or a set of objects. The set of at-
tributes of an object and the set of methods represent the object structure
and behavior, respectively.
• The attribute values represent the object's state. This state is accessed
or modified by sending messages to the object to invoke the corresponding
methods.
• Objects sharing the same structure and behavior are grouped into classes.
A class represents a template for a set of similar objects. Each object is an
instance of some class. A class definition consists of a set of instance attributes
(or simply attributes) and methods. The domain of an attribute may be an
arbitrary class. The definition of a class C results in a directed graph (called an aggregation graph) of classes rooted at C. An attribute of any class in an aggregation graph is a nested attribute of the class at the root of the graph.
Objects, instances of a given class, have a value for each attribute defined
by the class. All methods defined in a class can be invoked on the objects,
instances of the class.
• A class can be defined as a specialization of one or more classes. A class
defined as a specialization is called a subclass and inherits attributes and methods from its superclasses. The specialization relationship among classes organizes them in an inheritance graph, which is orthogonal to the aggregation graph.

Figure 1.1. An object-oriented database schema.
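To make these concepts concrete, the following Python sketch models objects with immutable OIDs, attributes whose values may be other objects, and a subclass inheriting from its superclass. All class and attribute names here are illustrative only, not drawn from any particular OODBMS.

```python
import itertools

# Counter used to hand out persistent, immutable OIDs.
_oid_counter = itertools.count(1)

class DBObject:
    """An object with an immutable identifier and a mutable state."""
    def __init__(self, **attrs):
        self._oid = next(_oid_counter)   # assigned once, never changes
        self.__dict__.update(attrs)      # instance attributes = object state

    @property
    def oid(self):
        return self._oid

class Publisher(DBObject):
    pass

class Book(DBObject):
    pass

class Manual(Book):   # subclass: inherits structure and behavior of Book
    pass

kluwer = Publisher(name="Kluwer")
b = Book(title="Searching Multimedia Databases by Content", publisher=kluwer)
m = Manual(title="C++ Reference Manual", publisher=kluwer)

# Modifying the state does not change the identity.
old_oid = b.oid
b.title = "A new title"
assert b.oid == old_oid

# Navigating an aggregation link: Book.publisher.name.
assert b.publisher.name == "Kluwer"
# A Manual is also an instance of its superclass Book.
assert isinstance(m, Book)
```

The sketch illustrates why OIDs, not attribute values, identify objects: state changes leave identity untouched, and references between objects directly encode the aggregation graph.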
An example of an object-oriented database schema, which will be used as
running example, is graphically represented in Figure 1.1. In the graphical
representation, a box represents a class. Within each box are the names of the attributes of the class. Names labeled with a star denote multi-valued attributes. Two types of arcs are used in the representation. A simple arc from a class C to a class C' denotes that C' is the domain of an attribute of C. A bold arc from a class C to a class C' indicates that C is a superclass of C'.
In the remainder of the discussion, we make the following assumptions. First,
we consider classes as having the extensional notion of the set of their instances.
Second, we make the assumption that the extent of a class does not include the
instances of its subclasses. Queries are therefore made against classes. Note
that in several systems, such as GemStone [Bretl et al., 1989], O2 [Deux, 1990], and ObjectStore [ObjectStore, 1995], classes do not have mandatory associated extensions. Therefore, applications have to use collections, or
sets, to group instances of the same class. Different collections may be defined
on the same class. Therefore, increased flexibility is achieved, even if the data
model becomes more complex. When collections are the basis for queries, in-
dexes are allocated on collections and not on classes [Maier and Stein, 1986].
In some cases, even though indexes are on collections, the definitions of the classes of the indexed objects must satisfy certain constraints for the index to be allocated on the collections. For example, in GemStone an attribute on which an index is allocated must be defined as a constrained attribute in the class definition, that is, a domain must be specified for the attribute¹. Similarly, ObjectStore requires that an attribute on which an index has to be allocated be declared as indexable in the class definition.
As we discussed earlier, most OODBMSs provide an associative query lan-
guage [Bancilhon and Ferran, 1994, Cluet et al., 1989, Kim, 1989, Shaw and
Zdonik, 1989]. Here we summarize those features that most influence indexing
techniques:
• Nested predicates
Because of objects' nested structures, most object-oriented query languages
allow objects to be restricted by predicates on both nested and non-nested
attributes of objects. An example of a query against the database schema
of Figure 1.1 is:
Retrieve the authors of books published by Kluwer. (Q1)
This query contains the nested predicate "published by Kluwer". Nested
predicates are usually expressed using path-expressions. For example, the
nested predicate in the above query can be expressed as
Author.books.publisher.name = "Kluwer".
• Inheritance
A query may apply to just a class, or to a class and to all its subclasses. An
example of a query against the database schema of Figure 1.1 is:
Retrieve all instances of class Book and all its subclasses published in
1991. (Q2)
The above query applies to all the classes in the hierarchy rooted at class
Book.
• Methods
A method can be used in a query as a derived attribute method or a predicate
method. A derived attribute method has a function comparable to that of
an attribute, in that it returns an object (or a value) to which comparisons
can be applied. A predicate method returns the logical constants True or
False. The value returned by a predicate method can then participate in
the evaluation of the Boolean expression that determines whether the object
satisfies the query.
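As an illustration, query Q1's nested predicate can be evaluated without any index by following the path expression object by object. The sketch below uses a hypothetical in-memory layout of the schema of Figure 1.1; data values are invented for the example.

```python
class Obj:
    """Minimal stand-in for a database object: attributes via keywords."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

# Hypothetical instances: publishers, books, and authors linked by references.
kluwer = Obj(name="Kluwer")
aw = Obj(name="Addison-Wesley")
b1 = Obj(title="B1", publisher=aw)
b2 = Obj(title="B2", publisher=kluwer)
a1 = Obj(name="A1", books=[b1])
a2 = Obj(name="A2", books=[b1, b2])
authors = [a1, a2]

def satisfies(author):
    # Nested predicate Author.books.publisher.name = "Kluwer":
    # books is multi-valued, so the predicate holds if ANY book qualifies.
    return any(book.publisher.name == "Kluwer" for book in author.books)

result = [a.name for a in authors if satisfies(a)]
assert result == ["A2"]
```

Each dot in the path expression becomes one object dereference; the indexing techniques discussed below exist precisely to avoid this per-object traversal on large extents.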
A distinction often made in object-oriented query languages is between implicit join (also called functional join), deriving from the hierarchical nesting
of objects, and explicit join, similar to the relational join, where two objects are
explicitly compared on the values of their attributes. Note that some query lan-
guages only support implicit joins. The motivation for this limitation is based
on the argument that in relational systems joins are mostly used to recompose entities that were decomposed for normalization [Bretl et al., 1989] and
to support relationships among entities. In object-oriented data models there
is no need to normalize objects, since these models directly support complex
objects and multivalued attributes. Moreover, relationships among entities are
supported through object references; thus the same function that joins provide
in the relational model to support relationships is provided more naturally by
path-expressions. It therefore appears that in OODBMSs there is no strong
need for explicit joins, especially if path-expressions are provided. An example
of a path-expression (or simply path) is "Book.publisher.name" denoting the
nested attribute "publisher.name" of class Book. The evaluation of a query
with nested predicates may require the traversal of objects along aggregation
graphs [Bertino, 1990, Jenq et al., 1990, Kim et al., 1988, Graefe, 1993, Straube
and Ozsu, 1995]. Because in OODBMSs most joins are implicit joins along ag-
gregation graphs, it is possible to exploit this fact by defining techniques that
precompute implicit joins. We discuss these techniques in Section 1.2.
In order to discuss the various index organizations, we need to summarize
some topics concerning query processing and execution strategies. A query can
be conveniently represented by a query graph [Kim et al., 1989]. The query execution strategies vary along two dimensions. The first dimension concerns the
strategy used to traverse the query graph. Two basic class traversal strategies
can be devised:
• Forward traversal: the first class visited is the target class of the query (root
of the query graph). The remaining classes are traversed starting from the
target class in any depth-first order. The forward traversal strategy for query
Ql is (Author Book Publisher).
• Reverse traversal: the traversal of the query graph begins at the leaves and
proceeds bottom-up along the graph. The reverse traversal strategy for
query Ql is (Publisher Book Author).
The second dimension concerns the technique used to retrieve instances of
the classes that are traversed for evaluating the query. There are two ba-
sic strategies for retrieving data from a visited class. The first strategy, called
nested-loop, consists of instantiating separately each qualified instance of a class.
The instance attributes are examined for qualification, if there are simple pred-
icates on the instance attributes. If the instance qualifies, it is passed to its
parent node (in the case of reverse traversal) or to its child node (in case of
forward traversal). The second strategy, called sort-domain, consists of instan-
tiating all qualified instances of a class at once. Then all qualifying instances
are passed to their parent or child node (depending on the traversal strategy
used). The combination of the graph traversal strategies with instance retrieval
strategies results in different query execution strategies. We refer the reader
to [Bertino, 1990, Graefe, 1993, Jenq et al., 1990, Kim et al., 1988, Straube
and Ozsu, 1995] for details on query processing strategies for object-oriented
databases.
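The two graph-traversal strategies can be contrasted on query Q1. The sketch below, over a hypothetical in-memory layout, evaluates the same predicate forward (starting from the target class Author) and in reverse (starting from the qualifying publishers and climbing back through Book to Author via precomputed reverse maps).

```python
class Obj:
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

# Hypothetical instances mirroring the schema of Figure 1.1.
kluwer = Obj(name="Kluwer")
aw = Obj(name="Addison-Wesley")
b1 = Obj(publisher=aw)
b2 = Obj(publisher=kluwer)
a1 = Obj(name="A1", books=[b1])
a2 = Obj(name="A2", books=[b1, b2])
authors = [a1, a2]

def forward(authors):
    # Forward traversal: Author -> Book -> Publisher, descending from
    # the target class and testing the leaf predicate at the end.
    return {a.name for a in authors
            if any(b.publisher.name == "Kluwer" for b in a.books)}

def reverse(authors):
    # Reverse traversal: evaluate the leaf predicate (Publisher.name)
    # first, then climb back up Book -> Author using reverse maps.
    books_of = {}                        # book id -> authors referencing it
    for a in authors:
        for b in a.books:
            books_of.setdefault(id(b), set()).add(a.name)
    qualifying = [b for a in authors for b in a.books
                  if b.publisher.name == "Kluwer"]
    return {name for b in qualifying for name in books_of[id(b)]}

assert forward(authors) == reverse(authors) == {"A2"}
```

Both strategies return the same answer; they differ in which classes are instantiated first and hence in cost, which is exactly what the index organizations below aim to exploit.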
1.2 Index organizations for aggregation graphs
In this section, we first present some preliminary definitions. We then present
a number of indexing techniques that support efficient executions of implicit
joins along aggregation graphs. Therefore, these indexing techniques can be
used to efficiently implement class traversal strategies.
Definition. Given an aggregation graph H, a path P is defined as C1.A1.A2.....An (n ≥ 1) where:
• C1 is a class in H;
• A1 is an attribute of class C1;
• Ai is an attribute of a class Ci in H, such that Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n;
len(P) = n denotes the length of the path;
class(P) = {C1} ∪ {Ci | Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n} denotes the set of the classes along the path;
dom(P) denotes the class domain of attribute An of class Cn;
two classes Ci and Ci+1, 1 ≤ i ≤ n − 1, are called neighbor classes in the path.
□
A path is simply a branch in a given aggregation graph. Examples of paths
in the database schema in Figure 1.1 are:
• P1: Author.books.publisher.name
len(P1)=3, class(P1)={Author, Book, Publisher}, dom(P1)=string
• P2: Book.year
len(P2)=1, class(P2)={Book}, dom(P2)=integer
• P3: Organization.staff.books.publisher.name
len(P3)=4, class(P3)={Organization, Author, Book, Publisher}, dom(P3)=string
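Under this definition, len(P), class(P), and dom(P) can be computed mechanically from a schema description. A small sketch follows; the dictionary encoding of the schema of Figure 1.1 is a hypothetical convenience, not a structure from the text.

```python
# Hypothetical encoding of (part of) the schema in Figure 1.1:
# class -> {attribute: domain class or primitive type name}.
schema = {
    "Organization": {"staff": "Author"},
    "Author": {"name": "string", "books": "Book"},
    "Book": {"title": "string", "year": "integer", "publisher": "Publisher"},
    "Publisher": {"name": "string"},
}

def path_properties(path):
    """Return (len(P), class(P), dom(P)) for a dotted path string."""
    c1, *attrs = path.split(".")
    classes, current = [c1], c1
    for attr in attrs:
        current = schema[current][attr]   # domain of this attribute
        if current in schema:             # only classes belong to class(P)
            classes.append(current)
    return len(attrs), set(classes), current

assert path_properties("Author.books.publisher.name") == \
    (3, {"Author", "Book", "Publisher"}, "string")
assert path_properties("Book.year") == (1, {"Book"}, "integer")
```

The three example paths P1, P2, and P3 above all check out against this function, which is a direct transcription of the definition.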
The concept of path is closely associated with that of path instantiation. A
path instantiation is a sequence of objects found by instantiating a given path.
The objects in Figure 1.2 are instances of the classes shown in Figure 1.1. The
following are example instantiations of the path P3:
• PI1 = O[1].A[4].B[1].P[2].Addison-Wesley
(PI1 is shown in Figure 1.2 by arrows connecting the instances in PI1)
• PI2 = O[2].A[3].B[2].P[4].Kluwer
• PI3 = O[2].A[3].B[3].P[4].Kluwer
Figure 1.2. Instances of classes of the database schema in Figure 1.1.
The above path instantiations are all complete, that is, they start with an instance belonging to the first class of path P3 (that is, Organization), contain an instance for each class found along the path, and end with an instance of the class domain of the path (Publisher.name). Besides the complete instantiations, a path may also have partial instantiations. For example,
A[2].B[4].P[2].Addison-Wesley is a left-partial instantiation, that is, its first
component is not an instance of the first class of the path (Organization in the
example), but rather an instance of a class following the first class along the
path (Author in the example).
Similarly, a right-partial instantiation of a path ends with an object which
is not an instance of the class domain of the path. In other words, a right-
partial instantiation is such that the last object in the instantiation contains
a null value for the attribute referenced in the path. O[4] is a right-partial instantiation of path P3.
The last relevant concept we introduce here is the concept of indexing graph.
The concept of indexing graphs (IG) was introduced in [Shidlovsky and Bertino,
1996] as an abstract representation of a set of indexes allocated along a path
P. Given a path P = C1.A1.A2.....An, an indexing graph contains n + 1 vertices, one for each class Ci in the path plus an additional vertex denoting the class domain Cn.An² of the path, and a set of directed arcs. A directed arc
from vertex Ci to vertex Cj indicates that the indexing organization supports
a direct association between each instance of Ci and instances of Cj obtained
by traversing the path from the instance of Ci to class Cj. Note that if Ci and
Cj are neighbor classes, the indexing organization materializes an implicit join
between the classes.
1.2.1 Basic techniques
Multi-index
This organization was the first proposed for indexing aggregation graphs. It is
based on allocating a B+-tree index on each class traversed by the path. Therefore, given a path P = C1.A1.A2.....An, a multi-index [Maier and Stein, 1986] is defined as a set of n simple indexes (called index components) I1, I2, ..., In, where Ii is an index defined on Ci.Ai, 1 ≤ i ≤ n. All indexes I1, I2, ..., In-1 are identity indexes, that is, they have OIDs as key values. Only the comparison operators == (identical to) and ∼∼ (not identical to) are supported on an identity index. The last index In can be either an identity index or an equality index, depending on the domain of An. An equality index is a regular index, like the ones used in relational DBMSs, whose key values are primitive objects, such as numbers or characters. An equality index supports comparison operators such as = (equal to), ∼ (different from), <, ≤, >, ≥.
As an example consider path P1=Author.books.publisher.name. There will
be three indexes allocated for this path, as illustrated in Figure 1.3. In the
figure, each index is represented in a tabular form. An index entry is represented
as a row in the table. The first element of such a row is a key-value (given
in boldface), and the second element is the set of OIDs of objects holding this key-value for the indexed attribute. The first index, I1, is allocated on Author.books; similarly, indexes I2 and I3 are allocated on Book.publisher and Publisher.name, respectively.
Note that in the first index (I1) the special key-value Null is used to record
a right-partial instantiation. Therefore, the multi-index allows determining all
path instantiations having null values for some attributes along the path. By
contrast, determining left-partial instantiations does not require any special
key-value.
I1:
B[1]  A[4]
B[2]  A[3]
B[3]  A[3]
B[4]  A[2]
Null  A[4]

I2:
P[1]  Null
P[2]  B[1], B[4]
P[4]  B[2], B[3]

I3:
Academic Press  P[1]
Addison-Wesley  P[2]
Elsevier  P[3]
Kluwer  P[4]
Microsoft  P[5]

Figure 1.3. Multi-index for path P1 = Author.books.publisher.name.
Under this organization, solving a nested predicate requires scanning a num-
ber of indexes equal to the path length. For example, to select all authors whose
books were published by Kluwer (query Q1), the following steps are executed:
1. A look-up of index I3 with key-value "Kluwer"; the result is {P[4]}.
2. A look-up of index I2 with key-value P[4]; the result is {B[2], B[3]}.
3. A look-up of index I1 with key-values B[2] and B[3]; the result is {A[3]},
which is the result of the query.
Under this organization, the retrieval operation thus starts by scanning the
last index allocated on the path. The results of this index lookup are then
used as keys for a search on the index preceding the last one in the path, and
so forth until the first index is scanned. Therefore, this organization only
supports reverse traversal strategies. Its major advantage, compared to others
we describe later on, is the low update cost.
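The reverse-traversal look-up steps above can be sketched with ordinary dictionaries standing in for the three B+-tree components, using the data of Figure 1.3 (the function name is ours, not the authors'):

```python
# Each index component maps a key value to the set of OIDs holding
# that value (cf. Figure 1.3); dicts stand in for B+-trees.
I3 = {"Academic Press": {"P[1]"}, "Addison-Wesley": {"P[2]"},
      "Elsevier": {"P[3]"}, "Kluwer": {"P[4]"}, "Microsoft": {"P[5]"}}
I2 = {"P[2]": {"B[1]", "B[4]"}, "P[4]": {"B[2]", "B[3]"}}
I1 = {"B[1]": {"A[4]"}, "B[2]": {"A[3]"}, "B[3]": {"A[3]"},
      "B[4]": {"A[2]"}}

def reverse_traverse(key, indexes):
    """Scan the components from the last to the first, feeding each
    result set in as the key set of the preceding index."""
    oids = {key}
    for index in reversed(indexes):
        oids = set().union(*(index.get(k, set()) for k in oids))
    return oids

# Query Q1: authors of books published by Kluwer.
print(reverse_traverse("Kluwer", [I1, I2, I3]))  # {'A[3]'}
```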
The indexing graph for the multi-index is as follows. Let P be a path of
length n. The graph contains an arc from class Ci+1 to class Ci, for i =
1, ..., n. The IG for P3 = Organization.staff.books.publisher.name is shown in
Figure 1.4.a.
Join index
The notion of join index was introduced to efficiently perform joins in relational
databases [Valduriez, 1987]. However, the join index has also been used to
efficiently implement complex objects. A binary equijoin index is defined as
follows:
Given two relations R and S and attributes A and B, respectively from R
and S, a binary equijoin index is the set

BJI = { (ri, sk) | tuple ri.A = tuple sk.B }

where

• ri (sk) denotes the surrogate of a tuple of R (S);
• tuple ri (tuple sk) refers to the tuple having ri (sk) as surrogate.

OBJECT-ORIENTED DATABASES 11

Figure 1.4. Indexing graphs: a) multi-index; b) join indexes; c) nested index; d) path index; e) access support relation.
A BJI is implemented as a binary relation and two copies may be kept,
one clustered on ri and the other on sk; each copy is implemented as a B+-
tree. In aggregation graphs, a sequence of BJIs can be used in a multi-index
organization to implement the various index components along a given path.
We refer to such a sequence of join indexes as a JI organization. Consider path
P1 = Author.books.publisher.name. The join indexes allocated for this path are
listed below. They are illustrated together with some example index entries in
Figure 1.5.
• The first join index BJI1 is on Author.books. The copy denoted as BJI1(a)
in Figure 1.5 is clustered on OIDs of instances of Author, whereas the copy
denoted as BJI1(b) is clustered on OIDs of instances of Book.
• The second join index BJI2 is on Book.publisher. The copy denoted as BJI2(a)
in Figure 1.5 is clustered on OIDs of instances of Book, whereas the copy
denoted as BJI2(b) is clustered on OIDs of instances of Publisher.
• The third join index BJI3 is on the attribute Publisher.name. The copy de-
noted as BJI3(a) in Figure 1.5 is clustered on OIDs of instances of Publisher,
BJI1(a)
A[2]  B[4]
A[3]  B[2]
A[3]  B[3]
A[4]  B[1]

BJI2(a)
B[1]  P[2]
B[2]  P[4]
B[3]  P[4]
B[4]  P[2]

BJI3(a)
P[1]  Academic Press
P[2]  Addison-Wesley
P[3]  Elsevier
P[4]  Kluwer
P[5]  Microsoft

BJI1(b)
B[1]  A[4]
B[2]  A[3]
B[3]  A[3]
B[4]  A[2]

BJI2(b)
P[2]  B[1]
P[2]  B[4]
P[4]  B[2]
P[4]  B[3]

BJI3(b)
Academic Press  P[1]
Addison-Wesley  P[2]
Elsevier        P[3]
Kluwer          P[4]
Microsoft       P[5]

Figure 1.5. JI organization for path P1 = Author.books.publisher.name.
whereas the copy denoted as BJI3(b) is clustered on values of attribute
"name".
A JI organization supports both forward and reverse traversal strategies
when both copies are allocated for each join index. Reverse traversal is suitable
for solving queries such as query Q1 ("Retrieve the authors of books published
by Kluwer."). Forward traversal arises when given an object, all objects must be
determined that are referenced directly or indirectly by this object. An example
is the query "Determine the publishers of the books written by author A[3]".
Reverse traversal is already supported by the multi-index. However, that
technique does not support forward traversal, which must, therefore, be executed
by directly accessing the objects. The use of a sequence of JIs may make forward
traversal faster when object accesses are expensive (for example, very large
objects or non-optimal clustering). Moreover, forward traversal supported by a
sequence of JIs may be useful in complex queries when objects at the beginning
of the path have already been selected as the effect of another predicate in the
query. An example of a more complex query is "Select all books written by an
author from AT&T Lab". Suppose that an index is allocated on attribute
"Organization.name" and moreover a JI organization is allocated on the path
P=Organization.staff.books. A possible query strategy could be to first select
the OID of the organization named "AT&T Lab" using the index on attribute
"Organization.name", and then use the JI organization in forward traversal to
determine the books written by authors of the organization O[1] selected by
the first index scan.
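As a sketch of how the two clustered copies serve both directions, the forward and reverse traversals can be modeled with the entries of Figure 1.5 (dictionaries stand in for the B+-tree copies; the helper name is ours):

```python
# Forward copies (clustered on the first class of each pair).
BJI1_a = {"A[2]": {"B[4]"}, "A[3]": {"B[2]", "B[3]"}, "A[4]": {"B[1]"}}
BJI2_a = {"B[1]": {"P[2]"}, "B[2]": {"P[4]"}, "B[3]": {"P[4]"}, "B[4]": {"P[2]"}}
# Reverse copies (clustered on the second class).
BJI2_b = {"P[2]": {"B[1]", "B[4]"}, "P[4]": {"B[2]", "B[3]"}}
BJI1_b = {"B[1]": {"A[4]"}, "B[2]": {"A[3]"}, "B[3]": {"A[3]"}, "B[4]": {"A[2]"}}

def traverse(start_oids, copies):
    """Follow a chain of join-index copies in the given order."""
    oids = set(start_oids)
    for copy in copies:
        oids = set().union(*(copy.get(k, set()) for k in oids))
    return oids

# Forward: publishers of the books written by author A[3].
print(traverse({"A[3]"}, [BJI1_a, BJI2_a]))   # {'P[4]'}
# Reverse: authors of the books published by P[4] (Kluwer).
print(traverse({"P[4]"}, [BJI2_b, BJI1_b]))   # {'A[3]'}
```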
The IG for a JI organization along a path P is constructed as follows. For
each pair of neighbor classes Ci and Ci+1 along path P, the graph contains two
arcs (Ci, Ci+1) and (Ci+1, Ci). The former arc corresponds to the copy of the
binary join index between Ci and Ci+1 clustered on class Ci, while the latter
arc corresponds to the copy clustered on class Ci+1. The IG for the path P3 is
presented in Figure 1.4.b.
Note that when the JI organization is used for forward traversal, the se-
quence of B+-trees searched in the traversal corresponds to a chain of arcs in
the IG. Moreover, such chain consists of left-to-right directed arcs only. By
contrast, the use of the JI organization in a reverse traversal corresponds to a
chain of arcs in the IG containing only right-to-left directed arcs.
The usage of join indexes in optimizing complex queries has been discussed
in [Valduriez, 1986]. A major conclusion is that the most complex part (that is,
the joins) of a query can be executed through join indexes, without accessing
the base data. However, there are cases when traditional indexing (selection
indexes on join attributes) is more efficient than the usage of a join index. For
example, a traditional index is more efficient than a join index when the query
simply consists of a join preceded by a highly selective selection. The major
conclusion is that join indexes are more suitable for complex queries, that is,
queries involving several joins.
The update costs for the JI organization are in general double the costs
for the multi-index organization, since in the JI organization there are
two copies of each join index. The update costs of the JI organization can,
however, be reduced by allocating a single copy for one or more join indexes in
the organization, rather than two copies. Allocating a single copy, however,
makes forward or reverse traversal more expensive, depending on which copy
is allocated, and therefore the correct allocation decision must be based on the
expected query and update patterns and frequencies.
Nested index
Both the previous organizations require accessing, when solving a nested
predicate, a number of indexes proportional to the path length. Different orga-
nizations have been proposed to reduce the number of indexes accessed. The
first of these organizations is the nested index [Bertino and Kim, 1989] provid-
ing a direct association between an object of a class at the end of a path and
the corresponding instances of the class at the beginning of the path. Consider
path P1 = Author.books.publisher.name. A nested index allocated on this path
contains as key-values names of publishers. It associates with each publisher
name the OIDs of authors that have written a book published by this publisher.
Figure 1.6 shows some example entries for a nested index allocated on path P1.
Academic Press  Null
Addison-Wesley  A[2], A[4]
Elsevier        Null
Kluwer          A[3]
Microsoft       Null

Figure 1.6. Nested index for path P1 = Author.books.publisher.name.
Retrieval under this organization is quite efficient. A query such as Q1
is solved with only one index lookup. The major problem of this indexing
technique is update operations that require access to several objects in order
to determine the index entries to be updated. For example, suppose that book
B[4] is removed from the database. To update the index, the following steps
must be executed:
1. Access object B[4] and determine the value of nested attribute "Book.pub-
lisher.name"; result: "Addison-Wesley".
2. Determine all instances of class Author having B[4] in the list of authored
books; result: {A[2]}.
3. Remove A[2] from the index entry with key-value "Addison-Wesley";
after the removal the index entry for "Addison-Wesley" is {A[4]}.
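The three update steps can be sketched as follows, with dictionaries standing in for the nested index and a toy object store (all structures and names here are illustrative, not the authors' implementation):

```python
# Nested index on P1 (cf. Figure 1.6) and a toy object store.
nested = {"Addison-Wesley": {"A[2]", "A[4]"}, "Kluwer": {"A[3]"}}
books = {"B[4]": {"publisher": "P[2]"}}
publishers = {"P[2]": {"name": "Addison-Wesley"}}
authors = {"A[2]": {"books": {"B[4]"}}, "A[4]": {"books": {"B[1]"}}}

def remove_book(bid):
    # 1. Forward traversal: value of the nested attribute for the book.
    key = publishers[books[bid]["publisher"]]["name"]
    # 2. Reverse traversal: authors holding the book in their books list.
    owners = {a for a, obj in authors.items() if bid in obj["books"]}
    # 3. Remove those OIDs from the entry for the key value.
    nested[key] -= owners

remove_book("B[4]")
print(nested["Addison-Wesley"])  # {'A[4]'}
```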
As this example shows, update operations in general require both forward
and backward traversals of objects. Forward traversal is required to determine
the value of the indexed attribute (that is, the value of the attribute at the end
of the path) for the modified object. Reverse traversal is required to determine
the instances at the beginning of the path. The OIDs of those instances will
be removed (added) to the entry associated with the key value determined
by the forward traversal. Note that reverse traversal is very expensive when
there are no reverse references among objects. In such case, the nested index
organization may not be usable.
Note that a nested index as defined above can only be used for reverse traver-
sal. However, it would be possible, as for the JI organization, to allocate two
copies of a nested index: the first having as key-values the values of attribute
An at the end of the path (examples of entries of this copy for path P1 are the
ones we have shown earlier); the second having as key-values the OIDs of the
instances of the class at the beginning of the path. Therefore, for path P1 this
second copy would have the entries illustrated in Figure 1.7.
A[l] Null
A[2] Addison-Wesley
A[3] Kluwer
A[4] Addison-Wesley
A[5] Null
Figure 1.7. A nested index for path P1 = Author.books.publisher.name clustered on OIDs
of instances of the class at the beginning of the path.
The use of the above nested index would be more efficient than forward
traversal using the objects themselves.
The IG for a nested index allocated on a path P contains only two arcs,
namely (C1, Cn+1) and (Cn+1, C1). The former arc, however, is only inserted
in the IG if the second copy of the nested index, supporting forward retrieval,
is allocated. The IG for a nested index allocated on path P3 is shown in
Figure 1.4.c.
Path index
A path index [Bertino and Kim, 1989] is based on a single index, like the nested
index. The difference is that a path index provides an association between an
object O at the end of a path and all instantiations ending with O. For a path
of length n, the leaf-node records of a path index contain the instantiations,
implemented as records of n components. Example index entries for path P3
are given in Figure 1.8.
Note that a path index records, in addition to complete instantiations, left-
partial and right-partial instantiations. Unlike the nested index, a path index
can be used to solve nested predicates against all classes along the path. For
example, the path index on P1 can be used to determine all authors of books
published by Kluwer, or simply to find the books published by Kluwer.
Publisher.name   Path instantiations
Academic Press   Null
Addison-Wesley   O[1].A[4].B[1].P[2], A[2].B[4].P[2]
Elsevier         Null
Kluwer           O[2].A[3].B[2].P[4], O[2].A[3].B[3].P[4]
Null             O[4]

Figure 1.8. Path index for path P3 = Organization.staff.books.publisher.name.
This feature is also very useful when dealing with complex queries. It
supports a special kind of projection, called projection on path instantiation
[Bertino and Guglielmina, 1991, Bertino and Guglielmina, 1993]. This oper-
ation allows retrieving OIDs of several classes along the path with a single
index lookup. For example, suppose we wish to determine all authors who
have their books published by Kluwer in 1991. This query can be solved by
first performing an index lookup with key-value "Kluwer" and then per-
forming a projection on positions of classes Author (pos=1) and Book (pos=2)
on the selected index entries. That is, the first and second elements of each
path instantiation verifying the nested predicate are extracted from the index.
Therefore, the results of this projection in the above example are: {(A[3], B[2]),
(A[3], B[3])}. Then the second element of each pair is extracted. The corre-
sponding object is accessed and the predicate on attribute "year" is evaluated.
If this predicate is satisfied, the first element of the pair is returned as query
result. For example, given the two pairs above, instances B[2] and B[3] of
class Book would be accessed to verify whether the value of attribute "year" is
1991. Since only B[3] verifies the predicate, A[3] is returned as the query result.
An analysis of query processing strategies using this operation is presented in
[Bertino and Guglielmina, 1993].
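Projection on path instantiations for this query can be sketched with a path index on P1, whose instantiations have the form Author.Book.Publisher (the "year" values of the Book objects are assumed here so that, as in the text, only B[3] satisfies the predicate):

```python
# Path index on P1: key value -> path instantiations Author.Book.Publisher.
path_index = {
    "Kluwer": [("A[3]", "B[2]", "P[4]"), ("A[3]", "B[3]", "P[4]")],
}
# Assumed "year" attribute of the Book objects (only B[3] is from 1991).
book_year = {"B[2]": 1989, "B[3]": 1991}

def authors_published_by_in(publisher_name, year):
    # Index lookup, then projection on Author (pos=1) and Book (pos=2).
    pairs = {(inst[0], inst[1]) for inst in path_index.get(publisher_name, [])}
    # Access each projected book and evaluate the residual predicate on "year".
    return {author for author, book in pairs if book_year.get(book) == year}

print(authors_published_by_in("Kluwer", 1991))  # {'A[3]'}
```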
Updates on a path index are expensive, since forward traversals are required,
as in the case of the nested index. However, no reverse traversals are required.
Therefore, the path index organization can be used even when no reverse ref-
erences among objects on the path are present.
The IG for a path index allocated on a path P contains n arcs, namely
(Cn+1, Ci) for all i in the range 1, ..., n. The IG for a path index allocated
on path P3 is shown in Figure 1.4.d.
Access support relation (ASR)
This approach is very similar to the path index in that it involves calculating
all instantiations along a path and storing them in a relation. Given a path
P = C1.A1.A2. ... .An, all path instantiations are stored as records in an (n+1)-
ary relation. The ith attribute of that relation corresponds to the class Ci. Also,
both complete and partial instantiations are represented in the table. Example
index entries for path P3 are given in Figure 1.9. Two B+-trees are allocated
on the first and last attributes (classes C1 and Cn+1) of the access relation for
accelerating forward and reverse traversals. Like the path index, the ASR has
a low retrieval cost and a quite high update cost.
Org    Author  Book   Publisher  Publisher.name
O[1]   A[4]    B[1]   P[2]       Addison-Wesley
O[2]   A[3]    B[2]   P[4]       Kluwer
O[2]   A[3]    B[3]   P[4]       Kluwer
O[4]   Null    Null   Null       Null
Null   A[2]    B[4]   P[2]       Addison-Wesley
Null   Null    Null   P[1]       Academic Press
Null   Null    Null   P[3]       Elsevier

Figure 1.9. Access support relation for path P3 = Organization.staff.books.publisher.name.
In the IG for an ASR allocated on a path P, any vertex for class Ci, i =
2, ..., n − 1, has two incoming arcs (C1, Ci) and (Cn.An, Ci). Figure 1.4.e
presents the indexing graph for the ASR for path P3. It contains arcs outgoing
from the first and last classes in the path, on which the two B+-trees are allo-
cated.
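A minimal sketch of the ASR of Figure 1.9 as an (n+1)-ary relation, with simple scans standing in for the two B+-tree lookups (None plays the role of Null; the variable names are ours):

```python
# Access support relation for P3: one tuple per (possibly partial)
# path instantiation, as in Figure 1.9.
asr = [
    ("O[1]", "A[4]", "B[1]", "P[2]", "Addison-Wesley"),
    ("O[2]", "A[3]", "B[2]", "P[4]", "Kluwer"),
    ("O[2]", "A[3]", "B[3]", "P[4]", "Kluwer"),
    ("O[4]", None, None, None, None),
    (None, "A[2]", "B[4]", "P[2]", "Addison-Wesley"),
]

# Reverse traversal (B+-tree on the last attribute): organizations
# with staff whose books were published by Kluwer.
orgs = {t[0] for t in asr if t[4] == "Kluwer" and t[0] is not None}
print(orgs)  # {'O[2]'}

# Forward traversal (B+-tree on the first attribute): publisher names
# reachable from organization O[1].
names = {t[4] for t in asr if t[0] == "O[1]"}
print(names)  # {'Addison-Wesley'}
```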
Comparison
A comparison among three of the basic indexing techniques, namely multi-
index, nested index and path index, has been presented in [Bertino and Kim,
1989]. An important parameter in the evaluations is represented by the degree
of reference sharing. Two objects share a reference if they reference the same
object as value of an attribute. Therefore, this degree models the topology of
references among objects. A more accurate model of reference topology was
developed in [Bertino and Foscoli, 1995].
The main results of the comparison can be summarized as follows. For re-
trieval the nested index has the lowest cost as expected, and the path index
has lower cost than the multi-index. The nested index has a better perfor-
mance than the path index for retrieval, because a path index contains OIDs
of instances of all classes along the path, while the nested index contains OIDs
of instances of only the first class in the path. However, a single path index
allows predicates to be solved for all classes along the path, while the nested
index does not. For updates, the multi-index has the lowest cost. The nested
index has a slightly lower cost than the path index for path length 2. For paths
longer than 2, the nested index has a slightly lower cost than the path index if
updates are on the first two classes of the path; otherwise the nested index has
a significantly higher cost than the path index. Note, however, that the update
costs for the nested index are computed under the hypothesis that there are
reverse references among objects. When there are no reverse references, update
operations for the nested index become much more expensive.
1.2.2 Advanced index organizations
Each of the basic organizations described in the previous subsection is biased
towards a specific kind of operation (retrieval or update). No organization
supports equally well retrieval and update operations. In this subsection, we
present some advanced approaches which are characterized by a customization
component. Such a component allows tailoring the organizations with respect to
specific query and update patterns and frequencies. The customization requires
determining an index configuration which is optimal for a given set of operations
along the indexed path.
Path splitting
The path splitting approach [Bertino, 1994, Choenni et al., 1994] overcomes the
problem of biased performance of the three basic techniques, namely high update
costs in the nested and path index and high retrieval costs in the multi-index.
The approach is based on splitting a path into several shorter subpaths, and
allocating on each subpath one among the following basic organizations: multi-
index, nested index, path index. For example, path P3=Organization.staff.
books.publisher.name could be split into two subpaths:
• P31=Organization.staff.books with a multi-index allocated
• P32=Book.publisher.name with a path index allocated.
An algorithm determining optimal configurations for paths has been devel-
oped [Bertino, 1994]. The algorithm takes as input the frequency of retrieval,
insert, and delete operations for classes along the path. Moreover, it takes into
account whether reverse references exist among objects as well as all data logi-
cal and physical characteristics. The algorithm determines the optimal splitting
of a path into subpaths, and the organization to use for each subpath. The al-
gorithm also considers, for each subpath, the choice of allocating no index. An
interesting result obtained by running the algorithm is that when the degrees
of reference sharing along a path are very low (that is, close to 1) and reverse
references are allocated among objects, the best index configuration consists of
allocating no index on the path.
The overall index configuration obtained according to the path splitting
approach can be simply represented by an IG. As an example, consider the IG
for the configuration of path P3 consisting of subpaths P31 with a multi-index
allocated, and P32 with a path index allocated, shown in Figure 1.10.a.
Figure 1.10. Indexing graphs for advanced techniques: a) path splitting; b) ASR decomposition; c) join index hierarchy.
ASR decomposition
Under the ASR organization one table is maintained for all instantiations along
the path. Similarly to the path splitting approach, a path may be decomposed
and different access relations allocated for each subpath. Even though [Kemper
and Moerkotte, 1992] proves some properties of the ASR decomposition, it does
not provide any criteria or algorithm for "optimal" partitioning.
Figure 1.10.b shows the IG corresponding to a case where the ASR allocated
on path P3 is decomposed into two partitions.
Join index hierarchy
This is another approach based on the join index [Valduriez, 1987]. A complete
join index hierarchy (JIH) consists of basic join indexes and derived join indexes
[Xie and Han, 1994]. Basic indexes, which form the base of the JI hierarchy, are
supported for pairs of neighbor classes in a path P, whereas derived indexes are
supported for pairs of non-neighbor classes. Derived join indexes are built from
basic join indexes and, possibly, other derived join indexes. For the path P3,
Figure 1.11 shows the derived join between class Author (pos=2) and attribute
Publisher.name (pos=5).
Maintenance of the complete JI hierarchy is expensive in terms of both
storage and update costs. Therefore, a partial JI hierarchy which contains
all basic JIs and only several derived indexes seems to be more efficient for
Author Publisher.name
A[2] Addison-Wesley
A[3] Kluwer
A[4] Addison-Wesley
Figure 1.11. Derived join index between Author and Publisher.name.
most real cases. In the partial hierarchy, any derived join index needed for
executing a query but not included in the partial JI hierarchy is derived from
the indexes in the partial JI hierarchy through a sequence of join operations.
The selection of the derived JIs to be included in the partial JI hierarchy
is driven by some heuristics and metrics. As the performance tests reported
in [Xie and Han, 1994] show, a partial JI hierarchy behaves better than the
complete JI hierarchy and the ASR organization.
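Deriving a missing join index by joining existing ones can be sketched as a relational equijoin over the shared OIDs; composing the three basic indexes for the tail of P3 reproduces the derived index of Figure 1.11 (the helper names are ours):

```python
# Basic join indexes modeled as sets of surrogate pairs (cf. Figure 1.5).
bji_author_book = {("A[2]", "B[4]"), ("A[3]", "B[2]"),
                   ("A[3]", "B[3]"), ("A[4]", "B[1]")}
bji_book_pub = {("B[1]", "P[2]"), ("B[2]", "P[4]"),
                ("B[3]", "P[4]"), ("B[4]", "P[2]")}
bji_pub_name = {("P[1]", "Academic Press"), ("P[2]", "Addison-Wesley"),
                ("P[3]", "Elsevier"), ("P[4]", "Kluwer"),
                ("P[5]", "Microsoft")}

def compose(ji1, ji2):
    """Derive a join index by an equijoin on the shared middle OID."""
    return {(a, c) for a, b1 in ji1 for b2, c in ji2 if b1 == b2}

derived = compose(compose(bji_author_book, bji_book_pub), bji_pub_name)
print(sorted(derived))
# [('A[2]', 'Addison-Wesley'), ('A[3]', 'Kluwer'), ('A[4]', 'Addison-Wesley')]
```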
An IG corresponding to a partial JI hierarchy is characterized by the fol-
lowing property: if it contains an arc from class Ci to class Cj, then it contains
the arc from Cj to Ci as well. Figure 1.10.c shows the IG of a partial JI
hierarchy for path P3. Such a partial JI hierarchy supports basic join indexes
for the following pairs of neighbor classes: (Organization, Author), (Author,
Book), (Book, Publisher), (Publisher, Publisher.name). It moreover supports
an additional derived join index for the pair (Author, Publisher.name).
1.3 Index organizations for inheritance hierarchies
As we discussed in Section 1.1, an object-oriented query may apply to a class
only or to a class and all its direct and indirect subclasses. Since an attribute
of a class C is inherited by all its subclasses, a relevant issue concerns how
to efficiently evaluate a predicate against such an attribute when the scope of
the query is the inheritance hierarchy rooted at C. In this section we discuss
indexing techniques addressing such an issue. The various approaches are an-
alyzed with respect to storage overhead, update and retrieval costs. Retrieval
costs, in particular, depend on whether the query is a point query or a range
query. In a B+-tree index, a point query retrieves one leaf node only; the query
predicate is usually an equality predicate. By contrast, a range query specifies
an interval (or a set) of values for the search key and may require retrieving
several leaf nodes.
Consider an attribute A defined in a class C and inherited by all its sub-
classes. A query against attribute A is a single-class query (SC-query) if the
query scope consists of only one class from the inheritance hierarchy rooted at
C. Otherwise, the query is a class-hierarchy query (CH-query) and its scope
Book
1986  B[2]
1990  B[4]
1991  B[1], B[3]

Manual
1990  M[1]
1993  M[2]

Handbook
1990  H[1]

Figure 1.12. SC-index organization for the inheritance hierarchy rooted at class Book.
includes a subhierarchy of the inheritance hierarchy, that is, some class in the
hierarchy with all its subclasses. A CH-query is a rooted CH-query if the root
of the subhierarchy in the scope coincides with the root class C. Otherwise,
the query is a partial CH-query.
Consider the database schema shown in Figure 1.1. Consider the inheritance
hierarchy rooted at class Book and queries against its attribute "year" which
is inherited by classes Manual and Handbook. An example of SC-query is the
query which retrieves instances of one of the classes in the hierarchy (Book,
Manual or Handbook). The query against the attribute "year" which retrieves
instances of all the three classes is a rooted CH-query. If the class Manual
had a subclass called Manual_on_CD, then a query with classes Manual and
Manual_on_CD in the scope would be a partial CH-query.
SC-index and CH-tree
The inheritance hierarchy indexing problem was first addressed in [Kim et al.,
1989] where two possible approaches are proposed. The first approach, called
single-class index (SC-index), is based on maintaining a separate B+-tree on
the indexed attribute for each class in the inheritance hierarchy. Therefore, if
the inheritance hierarchy has m classes, the SC-index requires m B+-trees.
As an example, consider the inheritance hierarchy rooted at class Book in
Figure 1.1. If the attribute "year" is frequently referred to in queries against this
hierarchy, the SC-index approach requires building three indexes, one for each
class in the hierarchy, namely Book, Manual and Handbook. The evaluation of
a predicate against the attribute "year" would then require scanning the three
indexes and performing the union of the results. The three indexes against
the attribute "year" for the classes in the inheritance hierarchy rooted at class
Book are shown in Figure 1.12.
This approach is very efficient for SC-queries. However, it is not optimal for
CH-queries, because it requires scanning all the indexes allocated on the classes
in the queried inheritance hierarchy.
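The SC-index evaluation of a CH-query can be sketched with one dictionary per class standing in for the per-class B+-trees of Figure 1.12 (the function name is ours; the Manual entries follow Figure 1.13):

```python
# One B+-tree per class, modeled as a dictionary (cf. Figure 1.12).
sc_index = {
    "Book":     {1986: {"B[2]"}, 1990: {"B[4]"}, 1991: {"B[1]", "B[3]"}},
    "Manual":   {1990: {"M[1]"}, 1993: {"M[2]"}},
    "Handbook": {1990: {"H[1]"}},
}

def ch_query(scope, year):
    """Scan the index of every class in the scope and union the results."""
    return set().union(*(sc_index[c].get(year, set()) for c in scope))

print(ch_query(["Book"], 1990))                       # SC-query: {'B[4]'}
print(sorted(ch_query(["Book", "Manual", "Handbook"], 1990)))
# rooted CH-query: ['B[4]', 'H[1]', 'M[1]']
```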
The second approach, called class-hierarchy index (CH-tree), is based on
maintaining a unique B+-tree for all classes in the hierarchy. An index entry
in a leaf node may thus contain the OIDs of instances of any class in the
Book, Manual, Handbook
1986  (Book, {B[2]})
1990  (Book, {B[4]}), (Manual, {M[1]}), (Handbook, {H[1]})
1991  (Book, {B[1], B[3]})
1993  (Manual, {M[2]})

Figure 1.13. Entries of the CH-tree for the inheritance hierarchy rooted at class Book.
indexed inheritance hierarchy. A CH-tree allocated on the attribute "year" for
the inheritance hierarchy rooted at class Book is shown in Figure 1.13. Note,
from the figure, that the entry with key value 1990 contains three sets
of OIDs. The first set contains the OIDs of the instances of Book (B[4] in the
example), whereas the second and third sets contain OIDs of manuals (M[1])
and handbooks (H[1]), respectively. Generally, a leaf node in a CH-tree consists
of a key-value, a key-directory, and, for each class in the inheritance hierarchy,
the number of elements in the list of OIDs for instances of this class that hold
the key-value in the indexed attribute, and the list of OIDs. The key-directory
contains an entry for each class that has instances with the key-value in the
indexed attribute. An entry for a class consists of the class identifier and the
offset in the index record where the list of OIDs for the class is located.
Under the CH-tree organization, a SC-query is evaluated as follows. Let C
be the class against which the query is issued. The index is scanned to find the
leaf-node record with the key-value satisfying the query predicate. Then the
key-directory is accessed to determine the offset in the index record where the
list of OIDs of instances of C is located. If there is no entry for class C, then
there are no instances of C satisfying the predicate. A CH-query is processed
in the same way, except that the lookup in the key-directory is executed for
each class involved in the query.
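Leaf-record lookup in a CH-tree can be sketched with the key-directory modeled as a per-key dictionary over the entries of Figure 1.13 (offsets into the index record are abstracted away; names are ours):

```python
# One CH-tree leaf record per key value; its key-directory maps each
# class holding the value to that class's OID list (cf. Figure 1.13).
ch_tree = {
    1986: {"Book": {"B[2]"}},
    1990: {"Book": {"B[4]"}, "Manual": {"M[1]"}, "Handbook": {"H[1]"}},
    1991: {"Book": {"B[1]", "B[3]"}},
    1993: {"Manual": {"M[2]"}},
}

def lookup(year, scope):
    """SC- and CH-queries differ only in how many classes are probed
    in the key-directory of the matching leaf record."""
    directory = ch_tree.get(year, {})
    return set().union(*(directory.get(c, set()) for c in scope))

print(lookup(1990, ["Manual"]))                   # SC-query: {'M[1]'}
print(sorted(lookup(1990, ["Book", "Manual", "Handbook"])))
# CH-query: ['B[4]', 'H[1]', 'M[1]']
```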
In general, the performance of the CH-tree has an inverse trend with respect
to the SC-index. The CH-tree is more efficient for queries whose access scope
involves all classes (or a significant subset of the classes) in the indexed in-
heritance hierarchy, whereas a SC-index is effective for queries against a single
class. By contrast, the CH-tree retrieves many unnecessary leaf node pages
when the query applies to a single class only.
Results of an extensive evaluation of the two indexing techniques have been
reported in [Kim et al., 1989]. An important parameter in the evaluation is
the distribution of key values across the classes in the inheritance hierarchy. In
general, if each key value is taken by instances of only one class C (that is, dis-
joint distribution), the CH-tree is less efficient than the SC-index. Conversely,
if each key value is taken by instances of several classes, the CH-tree performs
better. Also, the update cost for the CH-tree is higher than for the SC-index,
because a B+-tree on a single class is expected to be much smaller than a
single index on the entire hierarchy.
H-tree
The skewed performance of the SC-index and CH-tree for SC- and CH-queries
led to more attempts to overcome the problem. The H-tree [Low et al., 1992]
is a variant of the SC-index which aims at improving the performance of the
SC-index for CH-queries. Like the SC-index, a separate B+-tree is maintained
on the indexed attribute for each class in the inheritance hierarchy. However,
unlike the SC-index, in the H-tree the B+-trees are linked based on their class-
subclass relationships by pointers in the internal nodes of the B+-tree. For
each pair of classes C and C' in the inheritance hierarchy, such that class C'
is a direct subclass of C, a set of additional pointers are maintained from the
internal nodes of the B+-tree allocated on class C to internal nodes in the B+-
tree allocated on class C'. The pointers connect internal-node separators for
the same values of the indexed attribute. Figure 1.14 shows a fragment of an H-tree
allocated on the inheritance hierarchy rooted at class Book which indexes the
"year" attribute.
Figure 1.14. Fragment of the H-tree organization for the inheritance hierarchy rooted at class Book.
To execute a CH-query, the H-tree performs a complete scan on the B+-
tree allocated on the query class, followed by a partial search on each of the
B+-trees allocated on the other classes in the subhierarchy rooted at the query
class. The partial search is performed by following the additional pointers from
the B+-tree allocated on the root class of the queried inheritance hierarchy
to the B+-trees of its subclasses. Unfortunately, the usage of
those additional pointers solves the problem of low performance only partially.
Although the H-tree reduces the number of accesses to the B+-tree internal
nodes, it still requires accessing more leaf node pages than those accessed under
the SC-index organization. Moreover, the reduced query cost is achieved at the
expense of the additional storage overhead for the pointers between B+-trees.
As a consequence, the update cost in the H-tree is higher than in the SC-index.
CG-tree
The CG-tree [Kilger and Moerkotte, 1994] enhances the H-tree by collecting
all pointers between different classes' indexes in special nodes which form one
additional level located just above the leaf-node level of the B+-trees.
Given an inheritance hierarchy of m classes, the CG-tree maintains m B+-
trees, one for each class. In each B+-tree, an additional level between the
internal and leaf nodes is included. Each node at this level contains a vector
of m elements (called class directory) of leaf node references. There is one
element in the array for each class in the indexed inheritance hierarchy. The ith
component of the class directory contains a reference to the leaf node containing
those elements of the class Ci whose keys have the same key values. The position
i of class Ci is given by the preorder traverse of the inheritance hierarchy.
The CG-tree has better performance than the H-tree, as it avoids reading
unnecessary internal nodes. However, it may still require reading unnecessary
leaf nodes. Moreover, the CG-tree has a high storage overhead and update cost
because of the class directories.
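As a toy illustration of the class-directory idea, the following sketch shows how a directory node lets a query touch only the leaf references of the classes in its scope; the preorder numbering and all names are our own, not the CG-tree paper's format:

```python
# Preorder positions for the Book hierarchy (assumed numbering).
PREORDER = {"Book": 0, "Manual": 1, "Handbook": 2}

class ClassDirectory:
    def __init__(self, num_classes):
        # One slot per class; None means no objects of that class
        # carry keys in this directory node's key range.
        self.slots = [None] * num_classes

    def leaves_for(self, query_classes):
        """Return leaf references for the given classes only,
        skipping the leaves of classes outside the query scope."""
        return [self.slots[PREORDER[c]]
                for c in query_classes
                if self.slots[PREORDER[c]] is not None]

d = ClassDirectory(3)
d.slots[PREORDER["Book"]] = "leaf-7"
d.slots[PREORDER["Manual"]] = "leaf-9"
# A query against Book and Manual never touches Handbook's leaves.
print(d.leaves_for(["Book", "Manual"]))   # ['leaf-7', 'leaf-9']
```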
hcC-tree
The hcC-tree [Sreenath and Seshadri, 1994] is another organization attempting
to combine the advantages of the SC-index and CH-tree. Like the CH-tree, it
is based on maintaining a single B+-tree-like data structure to index the entire
inheritance hierarchy. In addition to the usual internal and leaf nodes of a
standard B+-tree used for indexing the attribute values, it includes a new type
of node, the so-called OID nodes. The OID nodes lie one level below the leaf nodes
and contain the lists of OIDs related to the attribute values.
Given an inheritance hierarchy with m classes, the hcC-tree maintains m + 1
chains of OID nodes: m class chains (one chain for each class) and one
chain of OID nodes corresponding to the entire inheritance hierarchy. The
class chain for a class C groups the OIDs belonging to C, and the hierarchy
chain groups the OIDs of all instances of all the classes in the inheritance
hierarchy. Practically, a class chain looks like the chain of leaf nodes in an SC-
index, whereas the hierarchy chain is similar to the chain of leaf nodes in a
CH-tree. The OID nodes are referenced by entries in the leaf nodes. Each leaf-
node entry, in addition to key values, contains a bitmap with m bits and a set
P of (m + 1) pointers. Each bit in the bitmap corresponds to a class in the
OBJECT-ORIENTED DATABASES 25
inheritance hierarchy, such that if the ith bit is set, the ith pointer in P points to
the first node in the class chain for the class Ci containing OIDs with the key
value. Each internal-node entry consists of a key value, a node pointer and an
m-bit bitmap.
For SC-queries, the performance of the hcC-tree is comparable to that of the
SC-index, as it requires searching only one class chain. For rooted range
CH-queries, the hcC-tree's performance is comparable to that of the CH-tree, as
it requires searching only the hierarchy chain. However, for partial range CH-
queries, the hcC-tree behaves like the SC-index, because it requires searching a
number of class chains equal to the number of classes in the query class scope.
Furthermore, as the hcC-tree stores each OID twice (once in a class chain and once
in the hierarchy chain), it incurs a high storage overhead and update cost.
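The leaf-entry layout can be sketched as follows; this is a deliberate simplification under our own naming, not the exact hcC-tree node format:

```python
M = 3  # classes in the hierarchy: 0=Book, 1=Manual, 2=Handbook

class LeafEntry:
    def __init__(self, key, class_chain_heads, hierarchy_chain_head):
        self.key = key
        # bit i set <=> some instance of class i has this key value
        self.bitmap = sum(1 << i
                          for i, head in enumerate(class_chain_heads)
                          if head is not None)
        # m class-chain pointers plus the hierarchy-chain pointer
        self.pointers = list(class_chain_heads) + [hierarchy_chain_head]

    def chain_for_class(self, i):
        """Follow a class chain only if the bitmap says it is non-empty."""
        return self.pointers[i] if self.bitmap & (1 << i) else None

    def chain_for_hierarchy(self):
        return self.pointers[M]

e = LeafEntry(1991, ["book-chain", None, "handbook-chain"], "hier-chain")
print(e.chain_for_class(0))      # book-chain
print(e.chain_for_class(1))      # None -- no Manual with this key value
print(e.chain_for_hierarchy())   # hier-chain
```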
x-tree
All the above approaches basically use one of two mutually exclusive grouping
methods. The SC-index, the H-tree and the CG-tree group attribute values
in the leaf nodes of a B+-tree on the basis of the class in which the instances
with the value appear. By contrast, the CH-tree and the hcC-tree group on the
values of the indexed attribute, regardless of the class to which the instances
with the value belong. Because of this dichotomy, the various indexing techniques
behave differently for different queries. Indexing techniques based on the first
grouping method are always more efficient for SC-queries, whereas techniques
based on the second grouping method are always more efficient for CH-queries.
The above considerations have led researchers to the insight that the search
space for class-hierarchy indexing is actually 2-dimensional, with the index-
ing attribute values extended along one dimension (the attribute dimension) and
the classes in the hierarchy extended along the second dimension (the class dimension).
As a result, the grouping of the indexed values should extend in both directions. In
such a case, the several techniques supporting multi-dimensional indexing, such as the
R-tree, the quadtree and the grid file [Ooi, 1990], can be used for indexing inheri-
tance hierarchies. Figure 1.15 represents the data from the inheritance hierarchy
rooted at class Book as a 2-dimensional search space. Using such a representa-
tion, the query Q2 "Retrieve all instances of class Book and all its subclasses
printed in 1991" becomes a rectangular domain in the data plane.
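The 2-dimensional view can be made concrete with a small sketch; the point set below loosely mirrors Figure 1.15 (the exact years are illustrative), and a rooted CH-query becomes a rectangle over a contiguous preorder range of class positions:

```python
POS = {"Book": 0, "Manual": 1, "Handbook": 2}   # preorder positions

points = [  # (class, year, oid) -- sample data in the spirit of Figure 1.15
    ("Book", 1986, "B[1]"), ("Book", 1991, "B[2]"),
    ("Manual", 1991, "M[1]"), ("Handbook", 1991, "H[1]"),
    ("Manual", 1993, "M[2]"),
]

def ch_query(root, lo, hi, subtree_size):
    """Rooted CH-query as a 2-D range query: classes in the preorder
    range [pos(root), pos(root)+subtree_size), attribute values in [lo, hi]."""
    c0 = POS[root]
    return [oid for cls, year, oid in points
            if c0 <= POS[cls] < c0 + subtree_size and lo <= year <= hi]

# Q2: all instances of Book and its subclasses printed in 1991.
print(ch_query("Book", 1991, 1991, 3))   # ['B[2]', 'M[1]', 'H[1]']
```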
The x-tree [Chan et al., 1997] is a dynamic indexing technique similar to the
R-tree [Guttman, 1984] and the R*-tree [Beckmann et al., 1990]. Data are stored
in the leaf nodes, which all appear at the same level of the tree. Each leaf-node
entry consists of the key value K, the object identifier oid and the identifier
cid of the class the object belongs to. If all entries with the same key value K
do not fit in one leaf node, two or more nodes are allocated and all node entries
with the same class identifier are grouped together.
[Figure: objects B[1]-B[4], M[1], M[2] and H[1] plotted with years 1986-1993 along the attribute dimension and classes Book, Manual and Handbook along the class dimension; query Q2 appears as a rectangle.]
Figure 1.15. Objects from hierarchy rooted at Book as a 2-dimensional search plane.
The internal nodes contain entries of the form (cidSet, Kmin, Kmax, P),
where cidSet is a subset of the classes in the indexed inheritance hierarchy,
[Kmin, Kmax] is a subrange of the attribute domain, and P is a pointer to a
child node on the next level. In the internal nodes of the x-tree, all node entries
with the same set of classes are clustered together into the same record.
As the node-splitting strategy in the R-tree is more complicated than in
a B+-tree and often depends on the data shape and distribution, the x-tree uses
heuristics for node splitting based on a special proximity cost metric.
The heuristic generates a list of candidate node splits along both the class
dimension and the attribute dimension. The candidates are generated on the
basis of a low proximity cost of the split. After the generation step, the best
candidate is selected as the final node split.
As performance tests show, the x-tree outperforms the CH-tree for most
types of queries. As can be expected, the only exception is queries against
all the classes in the indexed inheritance hierarchy. In such a case, the x-tree
fetches about 80% more pages than the CH-tree. Also, like the R-tree, which
has a lower space utilization than the B+-tree, the x-tree is taller and requires
more storage space than the CH-tree.
Good worst case indexing techniques
The x-tree is more efficient than all the previous index organizations for a wide
range of queries and data distributions. Yet, it does not have good worst-
case performance, because it uses the R-tree as its underlying data structure and
relies on heuristics for node splitting.
An approach with a proven good worst-case performance was proposed in
[Kanellakis and Ramaswamy, 1996, Ramaswamy and Kanellakis, 1995]. A key
assumption is that the class-dimension in the 2-dimensional data space is static,
[Figure: a) a hierarchy of six classes A-F; b) a binary tree on the class dimension whose leaves, left to right, are {A}, {B}, {C}, {D}, {E}, {F}; c) a CH-query against class C shown as a range over the classes in the 2-dimensional data space.]
Figure 1.16. Class-division: a) Example hierarchy; b) Binary tree on the class-dimension;
c) A CH-query against class C in the 2-dimensional data space.
that is, no classes in the hierarchy may be removed or inserted, even though
objects of the classes may be updated. This reduces indexing the inheritance
hierarchy to a special case of external dynamic 2-dimensional range search-
ing, in which the data in the 2-dimensional space are points whose y-coordinates
belong to a static set corresponding to the set of classes.
A given class hierarchy H is preprocessed as follows. We create a family G
where each member is a set of classes from H. After the preprocessing, B+-tree
indexes are maintained for the union of the classes in each member of G. If a
CH-query is against class C in the hierarchy H, a subset of the indexes is queried,
which exactly covers C and its subclasses and which involves at most q indexes, where
q is a small integer. On the other hand, a class is allowed to appear in at most
a small number r of members of G, so an object can have at most r replicas.
Updates are processed by changing all replicas.
In other words, the preprocessing solves the following combinatorial problem,
which is named class-division of H according to maximal replication factor r
and maximal query factor q:
Input: Class hierarchy H with m classes, and positive integers r and q.
Output: A family G, whose members are sets of classes from H, such that
(1) No class appears in more than r members of G.
(2) For any class C in H, with C' its set of subclasses in H including C itself,
there are at most q members of G that exactly cover C' (the union of those at most
q members of G is C').
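A brute-force checker for these two conditions can be sketched over the six classes of Figure 1.16; the child lists and the example family G below are our own plausible reading, not taken from the paper:

```python
from itertools import combinations

# Assumed hierarchy over classes A-F (F is taken as an isolated class).
H = {"A": ["B", "D"], "B": ["C"], "C": [], "D": ["E"], "E": [], "F": []}

def subtree(c):
    """C': the set of C's subclasses in H, including C itself."""
    out = {c}
    for child in H[c]:
        out |= subtree(child)
    return out

def replication_ok(G, r):
    """Condition (1): no class appears in more than r members of G."""
    return all(sum(c in g for g in G) <= r for c in H)

def exact_cover(G, target, q):
    """Condition (2): at most q members of G whose union is exactly target."""
    for k in range(1, q + 1):
        for combo in combinations(G, k):
            if set().union(*combo) == target:
                return combo
    return None

# An example family G (ours, for illustration only).
G = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}, {"F"},
     {"B", "C"}, {"D", "E"}, {"A", "B", "C", "D", "E"}]
print(replication_ok(G, 3))                          # True
print(sorted(exact_cover(G, subtree("B"), 2)[0]))    # ['B', 'C']
```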
The SC-index is an example of class-division with q = m and r = 1. Similarly,
class-division is possible for q = 1 and r = m, when B+-tree indexes
are maintained for all subhierarchies in H and each object can have up to
m replicas. In the general case, there exists the following efficient space-time
tradeoff:
For any class hierarchy H with m classes, it is possible to perform class-division
of H according to r = ⌈log2 m⌉ + 1 and q = 2⌈log2 m⌉.
To prove this, we recall that every CH-query can be represented as a 2-
dimensional range query, with two ranges extended along the attribute dimen-
sion and the class dimension. To make all classes from a subhierarchy con-
tiguous along the class dimension, the classes of H are sorted according to the
preorder hierarchy traversal. When performing such a traversal of H, we build a
binary tree on the class dimension. The leaves of the tree, scanned from left to
right, contain the classes in the preorder of H, while an internal node
contains the union of the classes in all leaves of its subtree. Therefore, the tree has
m leaves and ⌈log2 m⌉ + 1 levels. In Figure 1.16.a the class hierarchy consists
of six classes and the preorder traversal of the hierarchy is given by {A, B, C,
D, E, F}. The binary tree built for the hierarchy is given in Figure 1.16.b.
Once the tree is built, the family G is obtained by generating family members
for all nodes of the tree. Because each class is present in at most one node on
each level and the binary tree has ⌈log2 m⌉ + 1 levels, no object has more than
⌈log2 m⌉ + 1 replicas in G.
A CH-query corresponds to a range along the class dimension in the preorder
sort. To minimize the number of members of G (or nodes of the tree) covering
the query class range, we select those nodes vi of the binary tree which are
completely contained in the query range while their parents are not. The query
issued against class C (see Figure 1.16.c) gives the class range {A, B, C}, and
the minimal cover for the range is given by the nodes {A, B} and {C} (see the
shaded nodes in Figure 1.16.b). In the worst case, the query class range has two such
nodes vi on each level of the tree, and 2⌈log2 m⌉ nodes in total. That is, one can
answer class-indexing queries on any class by looking at no more than 2⌈log2 m⌉
indexes. This gives the time-space tradeoff previously stated.
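The construction and cover selection above can be sketched as a segment-tree-style recursion over the preorder list; ranges are given as index pairs, and the balanced split is our own simplification of the binary tree:

```python
import math

PREORDER = ["A", "B", "C", "D", "E", "F"]   # classes in preorder

def cover(lo, hi, node_lo=0, node_hi=None):
    """Maximal binary-tree nodes (as index ranges) lying inside [lo, hi]."""
    if node_hi is None:
        node_hi = len(PREORDER) - 1
    if lo > node_hi or hi < node_lo:
        return []                            # node disjoint from the query
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]          # node fully inside: take it whole
    mid = (node_lo + node_hi) // 2
    return cover(lo, hi, node_lo, mid) + cover(lo, hi, mid + 1, node_hi)

# A query range covering the first five classes in preorder:
nodes = cover(0, 4)
print(nodes)                                 # [(0, 2), (3, 4)]
# The cover never exceeds the 2*ceil(log2 m) bound.
assert len(nodes) <= 2 * math.ceil(math.log2(len(PREORDER)))
```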
As a B+-tree is maintained for each member of G, this tradeoff allows
the construction of an efficient data structure in external storage which occupies
O(log2 m · (N/B)) pages and has worst-case I/O query time O(log2 m · logB N +
T/B), where B is the size of the external memory page, m is the number of
classes in the inheritance hierarchy, N is the number of objects in the inheri-
tance hierarchy, and T is the number of objects the query retrieves. The update
time in such a structure is O(log2 m · logB N).
The above scheme provides the worst-case complexities for any class hierar-
chy. However, for many hierarchies, the values of r and q may be further improved
by using heuristics, some of which are discussed in [Ramaswamy and Kanel-
lakis, 1995]. Also, an improvement of the data structure that reduces the query
time from O(log2 m · logB N + T/B) to O(logB N + log2 B + T/B) was proposed in
[Kanellakis and Ramaswamy, 1996].
1.4 Integrated organizations
Even though we have addressed indexing techniques separately for each dimen-
sion along which an object database is organized (namely, aggregation and in-
heritance), most object-oriented queries involve classes along both dimensions.
Such queries typically contain nested predicates and have as a target any num-
ber of classes in a given inheritance hierarchy. The query that retrieves all
books and manuals written by authors from AT&T Lab. is an example of
such queries. Developing integrated indexing techniques able to support such
queries is crucial. In principle, every indexing technique defined for one dimen-
sion could be combined with any technique defined for the other dimension.
However, no integrated indexing technique has been proposed, with the excep-
tion of the nested-inherited index [Bertino and Foscoli, 1995], that we describe
in the remainder of this section.
The nested-inherited index is defined as a combination of concepts from the
nested index, the join index and the CH-tree techniques. In order to present
this indexing technique, we need some additional definitions. To simplify the
following discussion, we make the assumption that a class occurs only once in
a path.
First we recall that, given a class C, C' denotes the set of classes in the
inheritance hierarchy rooted at C. As an example, consider the object-oriented
schema in Figure 1.1:
Book' = {Book, Manual, Handbook}.
Given a path P = C1.A1.A2...An (n ≥ 1), the scope of P is defined
as the set ∪Ci∈class(P) Ci'. Class C1 is the root of the scope. Given a class
C in the scope of a path, the position of C is given by an integer i, such
that C belongs to the inheritance hierarchy rooted at class Ci, where Ci ∈
class(P). The scope of a path simply represents the set of all classes along
the path and all their subclasses. For example, consider the path P = Orga-
nization.staff.books.publisher.name; scope(P) = {Organization, Author, Book,
Manual, Handbook, Publisher}. Class Organization is the root of P. Class Organization has posi-
tion one, class Author has position two, classes Book, Manual and Handbook
have position three, and class Publisher has position four. In the remainder of
the discussion, given an object O, we will use the term parent object to denote
an object that references O. For example, the parents of the instance M[1] of
class Manual are objects A[1] and A[4], instances of class Author.
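A minimal sketch of scope and positions for the example path, assuming per-class subclass lists (our own toy encoding of the schema):

```python
SUBCLASSES = {"Organization": [], "Author": [], "Publisher": [],
              "Book": ["Manual", "Handbook"], "Manual": [], "Handbook": []}

def star(c):
    """C': the classes of the inheritance hierarchy rooted at C."""
    out = [c]
    for s in SUBCLASSES[c]:
        out += star(s)
    return out

def scope_and_positions(path_classes):
    """Scope of the path and position of each class in the scope."""
    scope, pos = [], {}
    for i, c in enumerate(path_classes, start=1):
        for d in star(c):
            scope.append(d)
            pos[d] = i          # a subclass inherits its root's position
    return scope, pos

scope, pos = scope_and_positions(
    ["Organization", "Author", "Book", "Publisher"])
print(scope)           # all classes along the path plus their subclasses
print(pos["Manual"])   # 3
```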
Given a path P = C1.A1.A2...An, the nested-inherited index associates
with a value v of attribute An the OIDs of the instances of each class in the scope of P
having v as value of the (nested) attribute An. A nested-inherited index on path
P=Organization.staff.books.publisher.name associates with a given publisher
name all organizations having in their staff authors of books or manuals or
handbooks published by the publisher. Similarly for all the other classes in the
scope. Logically, the index will contain the following entries
Academic Press   (Publisher, {P[1]})
Addison-Wesley   (Organization, {O[1]}), (Author, {A[1], A[2], A[4]}),
                 (Book, {B[1], B[4]}), (Manual, {M[1]}),
                 (Publisher, {P[2]})
Elsevier         (Organization, {O[3]}), (Author, {A[5]}),
                 (Handbook, {H[1]}), (Publisher, {P[3]})
Kluwer           (Organization, {O[2]}), (Author, {A[3]}),
                 (Book, {B[2], B[3]}), (Publisher, {P[4]})
Microsoft        (Manual, {M[2]}), (Publisher, {P[5]})
Figure 1.17. Nested-inherited index for path P=Organization.Author.Book.Publisher.
The nested-inherited index, like the nested index and the path index, supports
efficient retrieval operations. However, unlike those two organizations, the
nested-inherited index does not require object traversals for update operations,
because of some additional information that is stored in the index. The format
of a non-leaf node has a structure similar to that of traditional indexes based
on B+-trees. The record in a leaf node, called the primary record, has a different
structure. It contains the following information:
• record-length
• key-length
• key-value
• class-directory
• for each class in the path scope, the number of elements in the list of OIDs
for the objects that hold the key-value in the indexed attribute, and the list
of OIDs.
The class-directory contains a number of entries equal to the number of
classes having instances with the key-value in the indexed attribute. For each
such class Ci, an entry in the directory contains:
• the class identifier
• the offset in the primary record where the list of OIDs of Ci instances are
stored
• the pointer to an auxiliary record where the list of parents is stored for each
instance of Ci. An auxiliary record is allocated for each class, except for the
root class of the path and for its subclasses. An auxiliary record consists of
a sequence of 4-tuples. A 4-tuple has the form:
(oidi, pointer to primary record, no-oids, {p-oidi1, ..., p-oidij}).
There are as many 4-tuples as the number of instances of Ci having the
key-value in the indexed attribute. For an object Oi, the tuple contains the
identifier of Oi, the pointer to the primary record, the number of parent
objects of Oi, and the list of parent objects. In the 4-tuple definition above,
no-oids denotes the number of parent objects, and p-oidij denotes the j-th
parent of Oi.
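The record layout just described can be sketched with in-memory structures; this is a toy stand-in for the on-page format, and the field names are ours:

```python
from dataclasses import dataclass, field

@dataclass
class AuxTuple:            # one 4-tuple per instance with the key value
    oid: str
    primary_ref: str       # back-pointer to the primary record
    parents: list          # parent OIDs; no-oids is len(parents)

@dataclass
class DirEntry:            # one class-directory entry
    class_id: str
    oids: list             # OIDs of this class holding the key value
    aux: list = field(default_factory=list)  # AuxTuples; empty for the
                                             # root class and its subclasses

@dataclass
class PrimaryRecord:
    key: str
    directory: dict        # class_id -> DirEntry

rec = PrimaryRecord("Addison-Wesley", {
    "Organization": DirEntry("Organization", ["O[1]"]),
    "Author": DirEntry("Author", ["A[1]", "A[2]", "A[4]"],
                       [AuxTuple("A[1]", "rec", ["O[1]"])]),
    "Book": DirEntry("Book", ["B[1]", "B[4]"]),
})
print(rec.directory["Book"].oids)   # ['B[1]', 'B[4]']
```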
Auxiliary records are stored in different pages than primary records. Given
a primary record, there are several auxiliary records that are connected to it. A
second B+-tree is superimposed on the auxiliary records. The second B+-tree
indexes the 4-tuples based on the OIDs that appear as the first elements of
4-tuples. Therefore, the index organization actually consists of two indexes.
The first, called the primary index, is keyed on the values of attribute An.
It associates with a value v of An the set of OIDs of instances of all classes
relative to the path that have v as value of the (nested) attribute. The second
index, called the auxiliary index, has OIDs as indexing keys. It associates
with the OID of an object O the list of OIDs of the parents of O. Leaf-
node records in the primary index contain pointers to the leaf-node records
in the auxiliary index, and vice versa. The purpose of the auxiliary index is
to provide all the information needed for updating the primary index without accessing
the objects themselves. Recall that when updates are executed, the nested
index may require object forward and reverse traversals, while the path index
only requires forward traversals. By contrast, the nested-inherited index does
not require any access to the objects. The rationale for this organization will
become clearer when discussing the operations.
Figure 1.18 provides an example of the partial index contents for the objects
shown in Figure 1.2.
The IG for a nested-inherited index contains three sets of arcs. First, because
the primary index associates each value of attribute Cn.An with the instances
of all classes in the scope of the indexed path, the IG contains arcs from vertex
Cn.An to classes Ci, where i = 1, ..., n. Second, it contains arcs from Ci to
Cn.An, i = 2, ..., n. Finally, the IG contains arcs from Ci+1 to Ci, i = 1, ..., n - 1.
The IG for the path P=Organization.staff.books.publisher.name is shown in
Figure 1.19.
We now discuss how retrieval, insert, and delete operations are performed
on the nested-inherited index. For ease of presentation, we will use examples
[Figure: a non-leaf node record in the primary B+-tree; a primary record for key "Addison-Wesley" with a class directory over Organization, Author, Book, Manual, Handbook and Publisher and the OID lists {O[1]}, {A[1], A[2], A[4]}, {B[1], B[4]}, {M[1]}, {P[2]}; an auxiliary record for class Author; a non-leaf node record in the auxiliary B+-tree.]
Figure 1.18. Example of index contents in a nested-inherited index.
Figure 1.19. Indexing graph of the nested-inherited index for path
P=Organization.Author.Book.Publisher.name.
to describe the operations. Formal algorithms are presented in [Bertino and
Foscoli, 1995].
Retrieval
The nested inherited index supports a fast evaluation of predicates on the
indexed attribute for queries having as target any class, or class hierarchy,
in the scope of the path ending with the indexed attribute. As an example,
consider a query that retrieves the organizations whose staff members have
published books with Addison-Wesley. This query is executed by first executing
a lookup on the primary index with key value equal to "Addison-Wesley". The
primary record is then accessed. A lookup in the class directory is executed
to determine the offset where the OIDs of Organization instances are stored.
Then those OIDs are fetched and returned as result of the query. For our query,
the result is {O[1]}.
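The retrieval pattern can be sketched over a toy in-memory stand-in for the primary index, where a dictionary replaces the B+-tree lookup and nested dictionaries replace the class directory and offsets:

```python
primary_index = {   # key value -> {class: OID list} (toy stand-in)
    "Addison-Wesley": {
        "Organization": ["O[1]"],
        "Author": ["A[1]", "A[2]", "A[4]"],
        "Book": ["B[1]", "B[4]"],
        "Manual": ["M[1]"],
        "Publisher": ["P[2]"],
    },
}

def retrieve(key, query_classes):
    """One primary-index lookup, then one class-directory lookup
    per class in the query scope; empty entries are skipped."""
    record = primary_index.get(key, {})
    result = []
    for cls in query_classes:
        result += record.get(cls, [])
    return result

# Organizations whose staff published with Addison-Wesley:
print(retrieve("Addison-Wesley", ["Organization"]))          # ['O[1]']
# Books (and subclasses) published by Addison-Wesley:
print(retrieve("Addison-Wesley", ["Book", "Manual", "Handbook"]))
# ['B[1]', 'B[4]', 'M[1]']
```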
We now consider a query that retrieves the books published by Addison-
Wesley. The same steps as before are executed. The only difference is that the
class-directory lookup is executed for classes Book, Manual, and Handbook.
Since the entry for class Handbook is empty, only the record portions for classes
Book and Manual are accessed, with offsets obtained from the class-directory.
The query result, {B[1], B[4], M[1]}, is generated by merging the lists of OIDs
returned for classes Book and Manual. Therefore, the retrieval operation is
similar to retrieval in a CH-tree [Kim et al., 1989]. The main difference,
however, is that a nested-inherited index can be used for queries on all class
hierarchies found along a given path. By contrast, the CH-tree is allocated on
a single inheritance hierarchy. Therefore, if a path has length n, the number of
CH-trees allocated would be n.
Insert
Suppose that a new manual B[5] with author A[4] is created with P[2] as value
of attribute "publisher". B[5] is therefore a new parent of P[2]. The overall
effect of the insertion in the index must be that B[5] is added to the primary
record with key-value equal to "Addison-Wesley", and to the parent list of P[2].
The following steps are executed:
1. The auxiliary index is accessed with key-value equal to P[2].
2. The 4-tuple of P[2] is retrieved and modified by adding B[5] to the list of
P[2] parents.
3. From the 4-tuple of P[2] the pointer to the primary record is determined.
4. The primary record is accessed.
5. A look-up of the class directory in the primary record is executed to deter-
mine the offset where the OIDs of the class Book are stored.
6. B[5] is added to the list of OIDs stored at the offset determined at the
previous step.
7. A 4-tuple for B[5] is inserted in the auxiliary index with {A[4]} as the author
list.
Note that there is no need to execute a look-up of the primary index, since the
address of the primary record can be directly determined from the auxiliary
record.
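The seven steps can be sketched over toy dictionaries standing in for the two B+-trees; note in particular that no primary-index lookup is needed, since the auxiliary 4-tuple carries the back-pointer (all names here are our own simplification):

```python
auxiliary_index = {   # oid -> {"primary": key, "parents": [...]}
    "P[2]": {"primary": "Addison-Wesley",
             "parents": ["B[1]", "B[4]", "M[1]"]},
}
primary_records = {   # key -> class directory: class -> OID list
    "Addison-Wesley": {"Book": ["B[1]", "B[4]"], "Manual": ["M[1]"]},
}

def insert(new_oid, new_cls, target_oid, parents):
    aux = auxiliary_index[target_oid]               # steps 1-2: aux lookup,
    aux["parents"].append(new_oid)                  # add to parent list
    record = primary_records[aux["primary"]]        # steps 3-4: back-pointer
    record.setdefault(new_cls, []).append(new_oid)  # steps 5-6: class slot
    auxiliary_index[new_oid] = {"primary": aux["primary"],  # step 7: new
                                "parents": parents}         # 4-tuple

insert("B[5]", "Book", "P[2]", ["A[4]"])
print(primary_records["Addison-Wesley"]["Book"])   # ['B[1]', 'B[4]', 'B[5]']
```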
Delete
Suppose now that manual M[1] is removed. The overall effect of this operation
on the index must be that M[1] and all instances referencing M[1] (that is, O[1],
A[1] and A[4]) be eliminated from the primary record with key-value equal to
"Addison-Wesley". Moreover, the 4-tuples for instances M[1], O[1], A[1] and
A[4] must be eliminated. Finally, M[1] must be eliminated from the parent list
of P[2]. Note that the update to the parent list of P[2] may not be needed
if P[2] is removed as well; in this case it may be better to accumulate several
delete operations on the same index. However, we will include that update to
exemplify the algorithm.
1. The value of attribute "publisher" of M[1] is determined. This value is the
OID P[2].
2. The auxiliary index is accessed with key-value equal to P[2].
3. The 4-tuple of P[2] is retrieved and modified by removing M[1] from the list
of parents of P[2].
4. From the 4-tuple of P[2] the pointer to the primary record is determined.
5. The primary record is accessed.
6. A look-up is executed on the class-directory in the primary record to de-
termine the offset where the OIDs of the class Manual are stored and the
pointer to the auxiliary record for class Manual.
7. M[1] is removed from the list of OIDs stored at the offset determined at the
previous step.
8. The auxiliary record of class Manual is accessed and the 4-tuple containing
as first element the OID M[1] is determined. From this tuple, the OIDs of
the M[1] parents are determined. Those are A[1] and A[4]. Then the 4-tuple
of M[1] is removed.
9. The 4-tuples of A[1] and A[4] are accessed to retrieve the parent lists.
10. A lookup is executed on the class-directory in the primary record to deter-
mine the offset where the OIDs of the class Author are stored.
11. A[1] and A[4] are removed from the list of OIDs stored at the offset deter-
mined at the previous step.
12. A lookup is executed on the class-directory in the primary record to deter-
mine the offset where the OIDs of class Organization are stored.
13. O[1] is removed from the list of OIDs stored at the offset determined at the
previous step.
The delete operation may appear rather costly. However, note that the
primary record is accessed only once from secondary storage. Several modifi-
cations may be required on this record. However, the record can be kept in
memory and written back after all modifications have been executed. Also note
that the algorithm may require accessing several auxiliary records. However,
they are all connected to the same primary record. Therefore, they are likely
to be in the same page.
A preliminary comparison among the nested-inherited index and two other
organizations has been presented in [Bertino, 1991a, Bertino and Foscoli, 1995].
The first of the two organizations is a multi-index organization and simply con-
sists of allocating an index on each class in the scope of the path. In the exam-
ple of path P=Organization.staff.books.publisher.name, seven indexes would
be allocated. The second organization, called the inherited-multi-index, consists
of allocating an inherited index on each inheritance hierarchy found along the
path. Therefore, the inherited-multi-index is a combination of the CH-tree
organization (defined for inheritance hierarchies) with the multi-index organi-
zation (defined for aggregation hierarchies). For the same path P, there would be
a CH-tree rooted at class Book (thus indexing Book, Manual and Handbook),
and three B+-tree indexes on classes Organization, Author and Publisher. Ma-
jor results from the comparison are the following:
jor results from the comparison are the following:
• The nested-inherited index has the best retrieval performance.
• The nested-inherited index has quite good performance for the insert oper-
ation, since it requires an additional cost of at most three I/O operations
with respect to the other two organizations.
• The delete operation for the nested-inherited index has in the worst case an
additional cost of 4 x i (where i is the position of the class in the path) with
respect to the other organizations.
An accurate model of those costs has been recently developed in [Bertino
and Foscoli, 1995].
The nested-inherited index does not support any customization with respect
to the operation profile (see Subsection 1.2.1). Nevertheless, it may be success-
fully used in the path splitting approach together with other basic techniques
as an index allocated on some subpath which contains one or more inheritance
hierarchies.
1.5 Caching and pointer swizzling
The indexing techniques we discussed so far are based on object structures, that
is, on object attributes. Another possibility is to provide indexing based on ob-
ject behavior, that is, on method results [Bretl et al., 1989]. Techniques based
on this approach have been proposed in [Bertino, 1991b, Bertino and Quarati,
1991, Jhingran, 1991, Kemper et al., 1994]. Most techniques are based on
precomputing or caching the results of method invocations. Moreover, precom-
puted results can be stored in an index, or other access structures, so that it is
possible to efficiently evaluate queries containing the invocation of the method.
A major issue of this approach is how to detect when the computed method
results are no longer valid. In most approaches some dependency information is
kept. This dependency information keeps track of which objects (and possibly
which attributes of each object) have been used to compute a given method.
When an object is modified, all method precomputed results that have used
that object are invalidated. Different solutions can be devised to the problem
of dependencies, also depending on the characteristics of the method. In the
approach proposed in [Kemper et al., 1994], a special structure (implemented
as a relation) keeps track of these dependencies. A dependency has the format
(oidj, method_name, <oid1, oid2, ..., oidk>).
This dependency records the fact that the object whose identifier is oidj
has been used in computing the method named method_name with input
parameters <oid1, oid2, ..., oidk>. Note that the input parameters also include
the identifier of the object to which the message invoking the method has
been sent.
A more sophisticated approach has been proposed in [Bertino and Quarati,
1991]. If a method is local, that is, uses only the attributes of the object
upon which it has been invoked, all dependencies are kept within the object
itself. Those dependencies are coded as bit-strings, therefore they require a
minimal space overhead. If a method is not local, that is, uses attributes of
other objects, all dependencies are stored in a special object. All objects whose
attributes have been used in the precomputation of a method, have a reference
to this special object. This approach is similar to the one proposed in [Kemper
et al., 1994]. The main difference is that, in the approach proposed by Bertino
and Quarati, dependencies are not stored in a single data structure but are
distributed among several "special objects". The main advantage of this
approach is that it provides greater flexibility with respect to object allocation
and clustering. For example, a "special object" may be clustered together with
one of the objects used in the precomputation of the method, depending on the
expected update frequencies.
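A minimal sketch of relation-style dependency records and invalidation, in the spirit of the approach above (all names are hypothetical):

```python
cache = {}     # (method_name, args) -> precomputed result
deps = []      # (used_oid, method_name, args) dependency records

def remember(method, args, result, used_oids):
    """Store a precomputed result and record which objects it read."""
    cache[(method, args)] = result
    deps.extend((oid, method, args) for oid in used_oids)

def invalidate(modified_oid):
    """Drop every cached result that depends on the modified object."""
    for oid, method, args in deps:
        if oid == modified_oid:
            cache.pop((method, args), None)

remember("total_pages", ("B[1]",), 320, used_oids=["B[1]", "P[2]"])
invalidate("P[2]")                            # the publisher was modified
print(("total_pages", ("B[1]",)) in cache)    # False
```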
To further reduce the need for invalidation, it is important to determine the
actual attributes used in the precomputation of a method. As noted in [Kemper
et al., 1994], not all attributes are used in executing all methods. Rather, each
method is likely to require only a small fraction of an object's attributes. Two basic
approaches can be devised to exploit this observation. The first approach is
called static and it is based on inspecting the method implementation. There-
fore, for each method the system keeps the list of attributes used in the method.
In this way, when an attribute is modified, the system has only to invalidate
a method if the method uses the modified attribute. Note, however, that an
inspection of method implementations actually determines all attributes that
can be possibly used when the method is executed. Depending on the method
execution flow, some attributes may never be used in computing a method
on a given object. This problem is solved by the dynamic approach. Under
this approach, the attributes used by a method are actually determined only
when the method is precomputed. Upon precomputation of the method, the
system keeps track of all attributes actually accessed during the method exe-
cution. Therefore, the same method precomputed on different objects may use
different sets of attributes for each one of these objects. Performance studies
of method precomputation have been carried out in [Jhingran, 1991, Kemper
et al., 1994].
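The dynamic approach can be sketched in a few lines: attribute reads are intercepted while the method result is computed, so only the attributes actually touched are recorded (the class and attribute names below are our own):

```python
class TrackedObject:
    def __init__(self, **attrs):
        object.__setattr__(self, "_attrs", attrs)
        object.__setattr__(self, "_used", set())

    def __getattr__(self, name):
        # Called for attributes not in the instance dict: record the read.
        attrs = object.__getattribute__(self, "_attrs")
        if name in attrs:
            object.__getattribute__(self, "_used").add(name)
            return attrs[name]
        raise AttributeError(name)

book = TrackedObject(title="DB Indexing", year=1997, pages=300, price=80)

def shelf_label(b):            # a "method" touching only two attributes
    return f"{b.title} ({b.year})"

label = shelf_label(book)
print(sorted(book._used))      # ['title', 'year'] -- only these reads can
                               # invalidate the precomputed result
```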
Besides caching and precomputing, a close class of techniques, commonly
referred to as "pointer swizzling" [Kemper and Kossmann, 1995, Moss, 1992],
was investigated for managing references among main-memory resident per-
sistent objects. Pointer swizzling is a technique to optimize accesses through
such references to objects residing in main-memory. Generally, each time an
object is referenced through its OID, the system has to determine whether the
object is already in main memory by performing a table lookup. If the object
is not already in main memory, it must be loaded from secondary storage. The
basic idea of pointer swizzling is to materialize the address of a main-memory
resident persistent object in order to avoid the table lookup. Thus, pointer
swizzling converts database objects from an external (persistent) format con-
taining OIDs into an internal (main-memory) format, replacing the OIDs by the
main-memory addresses of the referenced objects. Though the choice of a specific
swizzling strategy is strongly influenced by the characteristics of the underlying
object lookup mechanism, a systematic classification of pointer swizzling
techniques, quite independent from system characteristics, has been developed
[Moss, 1992]. Later, this classification was extended and a new dimension of
swizzling techniques, when swizzling objects can be replaced from the main-
memory buffer, was proposed [Kemper and Kossmann, 1995].
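The basic idea can be sketched as follows. `ObjectManager`, `fault`, and `deref` are hypothetical names, the "disk" is a plain dictionary, and the sketch shows in-place swizzling on first dereference, not the API of any system cited above.

```python
# Sketch of pointer swizzling: references start out as OIDs and are replaced
# ("swizzled") by direct in-memory references on first dereference, so that
# subsequent traversals skip the resident-object-table lookup.

class OID:
    def __init__(self, value):
        self.value = value

class DBObject:
    def __init__(self, oid, data, ref=None):
        self.oid, self.data, self.ref = oid, data, ref   # ref: an OID or a DBObject

class ObjectManager:
    def __init__(self, disk):
        self.disk = disk      # simulated secondary storage: oid value -> (data, ref)
        self.table = {}       # resident object table: oid value -> DBObject
        self.lookups = 0      # counts table lookups

    def fault(self, oid):
        """Unswizzled access path: table lookup, loading from 'disk' if needed."""
        self.lookups += 1
        if oid.value not in self.table:
            data, ref = self.disk[oid.value]
            self.table[oid.value] = DBObject(oid, data, ref)
        return self.table[oid.value]

    def deref(self, obj):
        """Dereference obj.ref, swizzling the OID into a direct reference."""
        if isinstance(obj.ref, OID):
            obj.ref = self.fault(obj.ref)    # one lookup, then the OID is replaced
        return obj.ref                       # later calls: no table lookup at all
```

The first `deref` pays for one table lookup; every later traversal of the same reference follows the materialized main-memory pointer directly.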
1.6 Summary
In this chapter, we have discussed a number of indexing techniques specifi-
cally tailored for object-oriented databases. We have first presented indexing
techniques supporting an efficient evaluation of implicit joins among objects.
Several techniques have been developed. None of them, however, is optimal
with respect to both retrieval and update costs. Techniques providing lower
retrieval costs, such as path indexes or access relations, have greater update
costs than techniques, such as the multi-index, which in turn have greater
retrieval costs.
Then we have discussed indexing techniques for inheritance hierarchies. Fi-
nally, we have presented an indexing technique that provides integrated support
for queries on both aggregation and inheritance hierarchies [Bertino and Foscoli,
1995].
Overall, an open problem is to determine how all those indexing techniques
perform for different types of queries. Studies along that direction have been
carried out in [Bertino, 1990, Kemper and Moerkotte, 1992, Valduriez, 1986].
Similar studies should be undertaken for all the other techniques. Another
open problem concerns optimal index allocation.
In the chapter we have also briefly discussed techniques for an efficient exe-
cution of queries containing method invocations. This is an interesting problem
that is peculiar to object-oriented databases (and in general, to DBMSs sup-
porting procedures or functions as part of the data model). However, few
solutions have been proposed so far and there is, moreover, the need for com-
prehensive analytical models.
Notes
1. Note that in GemStone, unlike other OODBMSs, attributes do not necessarily have
a domain.
2. For the sake of homogeneity, we will denote the class domain Cn.An as class Cn+1.
3. The set containing class C itself and all classes in the inheritance hierarchy rooted at C
is denoted as C'.
4. Note that if a class occurs at several points in a path, the class has a set of positions.
2 SPATIAL DATABASES
Many applications (such as computer-aided design (CAD), geographic infor-
mation systems (GIS), computational geometry and computer vision) operate
on spatial data. Generally speaking, spatial data are associated with spatial
coordinates and extents, and include points, lines, polygons and volumetric
objects.
While it appears that spatial data can be modeled as a record with multiple
attributes (each corresponding to a dimension of the spatial data), conven-
tional database systems are unable to support spatial data processing effec-
tively. First, spatial data are large in quantity, complex in structures and
relationships, and often represent non-zero sized objects. Take GIS, a popular
type of spatial database systems, as an example. In such a system, the database
is a collection of data objects over a particular multi-dimensional space. The
spatial description of objects is typically extensive, ranging from a few hun-
dred bytes in land information system (commonly known as LIS) applications
to megabytes in natural resource applications. Moreover, the number of data
objects ranges from tens of thousands to millions.
Second, the retrieval process is typically based on spatial proximity, and em-
ploys complex spatial operators like intersection, adjacency, and containment.
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
Such spatial operators are much more expensive to compute compared to the
conventional relational join and select operators. This is due to irregularity in
the shape of the spatial objects. For example, consider the intersection of two
polyhedra. Besides the need to test all points of one polyhedron against the
other, the result of the operation is not always a polyhedron but may sometimes
consist of a set of polyhedra.
Third, it is difficult to define a spatial ordering for spatial objects. The con-
sequence of this is that conventional techniques (such as sort-merge techniques)
that exploit ordering can no longer be employed for spatial operations.
Efficient processing of queries manipulating spatial relationships relies upon
auxiliary indexing structures. Due to the volume of the set of spatial data
objects, it is highly inefficient to precompute and store spatial relationships
among all the data objects (although there are some proposals that store pre-
computed spatial relationships [Lu and Han, 1992, Rotem, 1991]). Instead,
spatial relationships are materialized dynamically during query processing. In
order to find spatial objects efficiently based on proximity, it is essential to have
an index over spatial locations. The underlying data structure must support
efficient spatial operations, such as locating the neighbors of an object and
identifying objects in a defined query region.
In this chapter, we review some of the more promising spatial data struc-
tures that have been proposed in the literature. In particular, we focus on
indexing structures designed for non-zero sized objects. The review of these
indexes is organized in two steps: first, the structures are described; second,
their strengths and weaknesses are highlighted. The readers are referred to
[Nievergelt and Widmayer, 1997, Ooi et al., 1993] for a comprehensive survey
on spatial indexing structures.
The rest of this chapter is organized as follows. In Section 2.1, we briefly
discuss various issues related to spatial processing. Section 2.2 presents a tax-
onomy of spatial indexing structures. In Section 2.3 to Section 2.6, we present
representative indexing techniques that are based on binary tree structure, B-
tree structure, hashing and space-filling techniques. Section 2.7 discusses the
issues in evaluating the performance of spatial indexes and reviews the
approaches adopted in the literature. Finally, we summarize in Section 2.8.
2.1 Query processing using approximations
Spatial data such as objects in spatial database systems, and roads and lakes
in GIS, do not conform to any fixed shape. Furthermore, it is expensive to
perform spatial operations (for example, intersection and containment) on their
exact location and extent. Thus, a simpler structure (such as a bounding
rectangle) that approximates an object is usually coupled with a spatial index.
Such bounding structures allow efficient proximity query processing by
preserving the spatial identification and dynamically eliminating many poten-
tial tests efficiently. Consider the intersection operation. If two objects
intersect, then their bounding structures also intersect. Conversely, if the bounding
structures of two objects are disjoint, then the two objects do not intersect.
This property reduces the testing cost since the test on the intersection of two
polygons or a polygon and a sequence of line segments is much more expensive
than the test on the intersection of two bounding structures.
By far, the most commonly used approximation is the container approach. In
the container approach, the minimum bounding rectangle/circle (box/sphere)
- the smallest rectangle/circle (box/sphere) that encloses the object - is
used to represent an object, and the actual object is examined only when the
test on the container succeeds. The bounding box (rectangle) is used
throughout this chapter as the approximation technique for discussion purposes.
The k-dimensional bounding box can be easily defined as a single-dimensional
array of k entries: (I0, I1, ..., Ik-1), where Ii is a closed bounded interval [a, b]
describing the extent of the spatial object along dimension i. Alternatively, the
bounding box of an object can be represented by its centroid and extensions
on each of the k directions.
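A sketch of this interval representation and the predicates built on it follows; the helper names are hypothetical, chosen for illustration.

```python
# A k-dimensional bounding box as a list of closed intervals (lo, hi), one
# per dimension, with the intersection and containment predicates used for
# filtering, and the alternative centroid-plus-extents representation.

def intersects(a, b):
    """Two boxes intersect iff their intervals overlap along EVERY dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def contains(outer, inner):
    """True iff `inner` lies entirely within `outer` in every dimension."""
    return all(lo1 <= lo2 and hi2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(outer, inner))

def from_centroid(centroid, extents):
    """Build the interval array from a centroid and per-dimension extensions."""
    return [(c - e, c + e) for c, e in zip(centroid, extents)]
```

Both predicates cost only 2k comparisons, which is why the box test is so much cheaper than an exact geometric test on the objects themselves.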
Objects extended diagonally may be badly approximated by bounding boxes,
and false matches may result. A false match occurs when the bounding boxes
match but the actual objects do not match. If the approximation technique is
very inefficient, yielding very rough approximations, additional page accesses
will be incurred. More effective approximation methods include convex hull
[Preparata and Shamos, 1985] and minimum bounding m-corner. The covering
polygons produced by these two methods are however not axis-parallel and
hence incur more expensive testing. The construction cost of approximations
and storage requirement are higher too.
Decomposition of regions into convex cells has been proposed to improve ob-
ject approximation [Gunther, 1988]. Likewise, an object may be approximated
by a set of smaller rectangles/boxes. In the quad-tree tessellation approach
[Abel and Smith, 1984], an object is decomposed into multiple sub-objects
based on the quad-tree quadrants that contain them. A drawback of this
decomposition is that the object identity has to be stored in multiple locations in an index.
The problems of the redundancy of object identifiers and the cost of object-
reconstruction can be very severe if the decomposition process is not carefully
controlled. They can be controlled to a certain extent by limiting the num-
ber of elements generated or by limiting the accuracy of the decomposition
[Orenstein, 1990].
The object approximation and spatial indexes supporting such concepts are
used to eliminate objects that could not possibly contribute to the answer of
queries. This results in a multi-step spatial query processing strategy [Brinkhoff
et al., 1994]:
1. The indexing structure is used to prune the search space to a set of candidate
objects. This set is usually a superset of the answer.
2. Based on the approximations of the candidate objects, some of the false hits
can be further filtered away. The effectiveness of this step depends on the
approximation techniques.
3. Finally, the actual objects are examined to identify those that match the
query.
Clearly, the multi-step strategy can effectively reduce the number of pages
accessed and the amount of redundant data to be fetched and tested through the
index mechanism, and reduce the computation time through the approximation
mechanism.
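The three steps above can be sketched as follows. The index step is simulated here by a linear scan over bounding boxes (a real system would use a spatial index), and `exact_test` stands in for an exact geometry predicate; all names are illustrative.

```python
# Sketch of the multi-step (filter-and-refine) spatial selection strategy
# [Brinkhoff et al., 1994]. Objects are (oid, bounding_box, geometry) triples;
# boxes are lists of (lo, hi) intervals, one per dimension.

def box_intersects(a, b):
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def spatial_select(objects, query_box, exact_test):
    # Step 1: prune the search space to a candidate set via the (simulated)
    # index; this set is usually a superset of the answer.
    candidates = [o for o in objects if box_intersects(o[1], query_box)]
    # Step 2: with plain bounding boxes the approximation test coincides with
    # step 1; a finer approximation (convex hull, m-corner) would filter more
    # false hits here.
    # Step 3: examine the exact geometry of the surviving candidates.
    return [oid for oid, box, geom in candidates if exact_test(geom, query_box)]
```

An object whose box intersects the query but whose exact geometry does not (a false hit) survives step 1 and is eliminated only in step 3.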
The commonly used conventional key-based range (associative) search, which
retrieves all the data falling within the range of two specified values, is general-
ized to an intersection search. In other words, given a query region, the search
finds all objects that intersect it. The intersection search can be easily used to
implement point search and containment search. For point search, the query
region is a point, and is used to find all objects that contain it. Containment
search is a search for all objects that are strictly contained in a given query
region and it can be implemented by ignoring objects that fail such a condition
in intersection search.
The search operation supported by an index can be used to facilitate a spatial
selection or spatial join operation. While a spatial selection retrieves all objects
of the same entity based on a spatial predicate, a spatial join is an operation
that relates objects of two different entities based on a spatial predicate.
2.2 A taxonomy of spatial indexes
Various types of data structures, such as B-trees [Bayer and McCreight, 1972,
Comer, 1979], ISAM indexes, hashing and binary trees [Knuth, 1973], have
been used as a means for efficient access, insertion and deletion of data in large
databases. All these techniques are designed for indexing data based on pri-
mary keys. To use them for indexing data based on secondary keys, inverted
indexes are introduced. However, this technique is not adequate for a database
where range searching on secondary keys is a common operation. For this
type of applications, multi-dimensional structures, such as grid-files [Nievergelt
et al., 1984], multi-dimensional B-trees [Kriegel, 1984, Ouksel and Scheuer-
mann, 1981, Scheuermann and Ouksel, 1982], kd-trees [Bentley, 1975] and
quad-trees [Finkel and Bentley, 1974] were proposed to index multi-attribute
data. Such indexing structures are known as point indexing structures as they
are designed to index data objects which are points in a multi-dimensional
space.
Spatial search is similar to non-spatial multi-key search in that coordinates
may be mapped onto key attributes and the key values of each object represent
a point in a k-dimensional space. However, spatial objects often cover irregular
areas in multi-dimensional spaces and thus cannot be solely represented by
point locations. Although techniques such as mapping regular regions to points
in higher dimensional spaces enable point indexing structures to index regions,
such representations do not help support spatial operators such as intersection
and containment.
Based on existing classification techniques [Lomet, 1992, Seeger and Kriegel,
1988], the techniques used for adapting existing indexes into spatial indexes can
be generally classified as follows:
The transformation approach. There are two categories of transformation
approach:
• Parameter space indexing. Objects with n vertices in a k-dimensional space
are mapped into points in an nk-dimensional space. For example, a two-
dimensional rectangle described by the bottom left corner (x1, y1) and the
top right corner (x2, y2) is represented as a point in a four-dimensional
space, where each attribute is taken from a different dimension. After the
transformation, points can be stored directly in existing point indexes. An
advantage of such an approach is that there is no major alteration of the
multi-dimensional base structure. The problem with the mapping scheme is
that the spatial proximity between the k-dimensional objects may no longer
be preserved when represented as points in an nk-dimensional space. Con-
sequently, intersection search can be inefficient. Also, the complexity of the
insertion operation typically increases with higher dimensionality.
• Mapping to single attribute space. The data space is partitioned into grid
cells of the same size, which are then numbered according to some curve-
filling methods. A spatial object is then represented by a set of numbers
or one-dimensional objects. These one-dimensional objects can be indexed
using conventional indexes such as B+-trees.
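Both transformations can be sketched for the two-dimensional case. The cell numbering below is plain row-major for readability; actual systems typically number cells along a space-filling curve (for example, z-order) to better preserve proximity. All names are illustrative.

```python
# Sketches of the two transformation approaches for 2-D rectangles given as
# ((x1, y1), (x2, y2)), bottom-left and top-right corners.

def to_parameter_space(rect):
    """Corner transformation: a 2-D rectangle becomes a point in 4-D, which
    can then be stored in any point indexing structure."""
    (x1, y1), (x2, y2) = rect
    return (x1, y1, x2, y2)

def to_cell_numbers(rect, cell_size, cells_per_row):
    """Mapping to a single-attribute space: the data space is divided into
    equal-sized grid cells, numbered here in row-major order. A rectangle is
    represented by the set of numbers of the cells it overlaps; these
    one-dimensional values can be indexed with a B+-tree."""
    (x1, y1), (x2, y2) = rect
    cells = set()
    for cy in range(int(y1 // cell_size), int(y2 // cell_size) + 1):
        for cx in range(int(x1 // cell_size), int(x2 // cell_size) + 1):
            cells.add(cy * cells_per_row + cx)
    return cells
```

Note how the second mapping already exhibits the duplication inherent in this approach: a rectangle spanning several cells is represented by several one-dimensional values.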
The non-overlapping native space indexing approach. This category
comprises two classes of techniques:
• Object duplication. A k-dimensional data space is partitioned into pairwise
disjoint subspaces. These subspaces are then indexed. An object identifier
is duplicated and stored in all the subspaces it intersects.
• Object clipping. This technique is similar to the object duplication approach.
Instead of duplicating the identifier, an object is decomposed into several
disjoint smaller objects so that each smaller sub-object is totally included in
a subspace.
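The two techniques can be contrasted with a one-dimensional sketch, where the space is partitioned into disjoint, equal-width subspaces; this is an illustrative simplification, not any particular index's algorithm.

```python
# Object duplication vs. object clipping over a one-dimensional space
# partitioned into disjoint subspaces of width `width`. An object is a
# closed interval (lo, hi).

def duplicate(obj_id, interval, width):
    """Duplication: store the identifier in every subspace the object
    intersects (the object itself is kept whole)."""
    lo, hi = interval
    return {s: obj_id
            for s in range(int(lo // width), int(hi // width) + 1)}

def clip(obj_id, interval, width):
    """Clipping: decompose the object into disjoint pieces, each totally
    included in one subspace."""
    lo, hi = interval
    pieces = []
    for s in range(int(lo // width), int(hi // width) + 1):
        pieces.append((obj_id, (max(lo, s * width), min(hi, (s + 1) * width))))
    return pieces
```

Either way, an object spanning several subspaces produces several index entries, which is exactly the extra storage and update cost noted below.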
The most important property of object duplication or clipping is that the data
structures used are straightforward extensions of the underlying point indexing
structures. Also, both points and multi-dimensional non-zero sized objects
can be stored together in one file without having to modify the structure.
However, an obvious drawback is the duplication of objects which requires extra
storage and hence more expensive insertion and deletion procedures. Another
limitation is that the density (the number of objects that contain a point) in
a map space must be less than the page capacity (the maximum number of
objects that can be stored in a page).
The overlapping native space indexing approach. The basic idea of
this approach to indexing a spatial database is to hierarchically partition the
data space into a manageable number of smaller subspaces. While a point
object is totally included in an unpartitioned subspace, a non-zero sized object
may extend over more than one subspace. Rather than supporting disjoint
subspaces as in the non-overlapping space indexing approach, the overlapping
native space indexing approach allows overlapping subspaces such that objects
are totally included in only one of the subspaces. These subspaces are organized
as a hierarchical index and spatial objects are indexed in their native space. A
major design criterion for indexes using such an approach is the minimization
of both the overlap between bounding subspaces and the coverage of subspaces.
A poorly designed partitioning strategy may lead to unnecessary traversal of
multiple paths. Further, dynamic maintenance of effective bounding subspaces
incurs high overhead during updates.
A number of indexing structures use more than one extending technique.
Since each extending method has its own weaknesses, the combination of two or
more methods may help to compensate for each other's weaknesses. However,
an often overlooked fact is that the use of more than one extending method may
also produce a counter effect: inheriting the weaknesses from each method.
Figure 2.1 shows the evolution of spatial indexing structures, adapted
from [Lu and Ooi, 1993]. A solid arrow indicates a relationship between a new
structure and the original structures that it is based upon. A dashed arrow
indicates a relationship between a new structure and the structures from which
the techniques used in the new structure originated, even though some were
proposed independently of the others.

Figure 2.1. Evolution of spatial index structures.

quent sections, the indexes are classified into four groups based on their base
structures: namely, binary trees, B-trees, hashing, and space filling methods.
Most spatial indexing structures (such as R-trees, R*-trees, skd-trees) are
nondeterministic in that different sequences of insertions result in different tree
structures and hence different performance even though they have the same set
of data. The insertion algorithm must be dynamic so that the performance of
an index will not be dependent on the sequence of data insertion. During the
design of a spatial index, issues that need to be minimized are:
• The area of covering rectangles maintained in internal nodes.
• The overlaps between covering rectangles for indexes developed based on the
overlapping native space indexing approach.
• The number of objects being duplicated for indexes developed based on the
non-overlapping native space indexing approach.
• The directory size and its height.
There is no straightforward solution to fulfill all the above conditions. The
fulfillment of the above conditions by an index can generally ensure its efficiency,
but this may not be true for all applications. The design of an index also needs
to take computational complexity into consideration, although this is a less
dominant factor given the increasing computational power of today's
systems. Other factors that affect the performance of information retrieval as a
whole include buffer design, buffer replacement strategies, space allocation on
disks, and concurrency control methods.
2.3 Binary-tree based indexing techniques
The binary search tree is a basic data structure for representing data items
whose index values are ordered by some linear order. The idea of repetitively
partitioning a data space has been adopted and generalized in many sophisti-
cated indexes. In this section, we will examine spatial indexes originated from
the basic structure and concept of binary search trees.
2.3.1 The kd-tree
The kd-tree [Bentley, 1975], a k-dimensional binary search tree, was proposed
by Bentley to index multi-attribute data. A node in the tree (see Figure 2.2)
serves two purposes: representation of an actual data point and direction of a
search. A discriminator, whose value is between 0 and k-1 inclusive, is used to
indicate the key on which the branching decision depends. A node P has two
children, a left son LOSON(P) and a right son HISON(P). If the discriminator
value of node P is the jth attribute (key), then the jth attribute of any node in
the LOSON(P) is less than the jth attribute of node P, and the jth attribute
of any node in the HISON(P) is greater than or equal to that of node P. This
property enables the range along each dimension to be defined during a tree
traversal such that the ranges are smaller in the lower levels of the tree.
(a) The planar representation. (b) The structure of a kd-tree.
Figure 2.2. The organization of data in a kd-tree.
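A minimal sketch of the homogeneous kd-tree just described, with insertion and a range search that prunes subtrees using the discriminator. Cycling the discriminator through the k dimensions by level is one common convention; this is an illustrative sketch, not Bentley's original code.

```python
# Homogeneous kd-tree sketch: each node stores an actual data point and
# directs the search via its discriminator (depth mod k).

class KDNode:
    def __init__(self, point):
        self.point, self.loson, self.hison = point, None, None

def insert(root, point, k, depth=0):
    if root is None:
        return KDNode(point)
    d = depth % k                        # discriminator for this level
    if point[d] < root.point[d]:
        root.loson = insert(root.loson, point, k, depth + 1)
    else:                                # equal-or-greater values go to HISON
        root.hison = insert(root.hison, point, k, depth + 1)
    return root

def range_search(root, lo, hi, k, depth=0, out=None):
    """Report points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if out is None:
        out = []
    if root is None:
        return out
    d = depth % k
    if all(l <= c <= h for c, l, h in zip(root.point, lo, hi)):
        out.append(root.point)
    if lo[d] < root.point[d]:            # LOSON may still contain matches
        range_search(root.loson, lo, hi, k, depth + 1, out)
    if hi[d] >= root.point[d]:           # HISON may still contain matches
        range_search(root.hison, lo, hi, k, depth + 1, out)
    return out
```

The pruning in `range_search` is exactly the property noted above: the discriminator bounds the range along one dimension at each level, so whole subtrees outside the query range are never visited.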
Complications arise when an internal node is deleted. When an internal
node is deleted, say Q, one of the nodes in the subtree whose root is Q must
be obtained to replace Q. Suppose i is the discriminator of node Q, then
the replacement must be either a node in the right subtree with the smallest
ith attribute value in that subtree, or a node in the left subtree with the
biggest ith attribute value. The replacement of a node may also cause successive
replacements.
To reduce the cost of deletion, a non-homogeneous kd-tree [Bentley, 1979b]
was proposed. Unlike a homogeneous index, a non-homogeneous index does
not store data in the internal nodes and its internal nodes are used merely as
directory. When splitting an internal node, instead of selecting a data point,
the non-homogeneous kd-tree selects an arbitrary hyperplane (a line in the
two-dimensional space) to partition the data points into two groups having
almost the same number of data points; all data points reside in the leaf
nodes.
The kd-tree has been the subject of intensive research over the past decade
[Banerjee and Kim, 1986, Beckley et al., 1985a, Beckley et al., 1985b, Beckley
et al., 1985c, Bentley and Friedman, 1979, Bentley, 1979a, Chang and Fu,
1979, Eastman and Zemankova, 1982, Friedman et al., 1987, Lee and Wong,
1977, Matsuyama et al., 1984, Ohsawa and Sakauchi, 1983, Orenstein, 1982,
Overmars and Leeuwen, 1982, Robinson, 1981, Rosenberg, 1985, Shamos and
Bentley, 1978, Sharma and Rani, 1985]. Many variants have been proposed
in the literature to improve its performance with respect to issues such as
clustering, searching, storage efficiency and balancing.
2.3.2 The K-D-B-tree
To improve the paging capability of the kd-tree, the K-D-B-tree was proposed
[Robinson, 1981]. K-D-B-tree is essentially a combination of a kd-tree and a
B-tree [Bayer and McCreight, 1972, Comer, 1979], and consists of two basic
structures: region pages and point pages (see Figure 2.3). While point pages
contain object identifiers, region pages store the descriptions of subspaces in
which the data points are stored and the pointers to descendant pages. Note
that in a non-homogeneous kd-tree [Bentley, 1979b], a space is associated with
each node: a global space for the root node, and an unpartitioned subspace
for each leaf node. In the K-D-B-tree, these subspaces are explicitly stored in
a region page. These subspaces (for example, 811, 812 and 813) are pairwise
disjoint and together they span the rectangular subspace of the current region
page (for example, 81), a subspace in the parent region page.
During insertion of a new point into a full point page, a split will occur. The
point page is split such that the two resultant point pages will contain almost
the same number of data points. Note that a split of a point page requires an
extra entry for the new point page; this entry will be inserted into the parent
region page. Therefore, the split of a point page may cause the parent region
page to split as well, which may further ripple all the way to the root; thus the
tree is always perfectly height-balanced.
When a region page is split, the entries are partitioned into two groups
such that both have almost the same number of entries. A hyperplane is used
to split the space of a region page into two subspaces and this hyperplane
may cut across the subspaces of some entries. Consequently, the subspaces
that intersect with the splitting hyperplane must also be split so that the new
subspaces are totally contained in the resultant region pages. Therefore, the
split may propagate downward as well. If the constraint of splitting a region
page into two region pages containing about the same number of entries is not
enforced, then downward propagation of split may be avoided. The dimension
for splitting and the splitting point are chosen such that both the resultant
pages have almost the same number of entries and the number of splittings is
minimized. However, there is no discussion on the selection of splitting points.
(a) Planar partition. (b) A hierarchical K-D-B-tree structure.
Figure 2.3. The K-D-B-tree structure.
The upward propagation of a split will not cause the underflow of pages
but the downward propagation is detrimental to storage efficiency because a
page may contain fewer entries than the usual page threshold, typically half of the page
capacity. To avoid unacceptably low storage utilization, local reorganization
can be performed. For example, two or more pages whose data space forms a
rectangular space and who have the same parent can be merged followed by a
resplit if the resultant page overflows.
The K-D-B-tree has incorporated the pagination of the B-tree and the tree
is height-balanced as a result. Nevertheless, poorer storage efficiency is the
trade-off.
2.3.3 The hE-tree
In the K-D-B-tree, a region node is split by cutting the region with a plane,
possibly cutting through some subregions as well. The child nodes with their
space being cut must also invoke the splitting process, causing sparse nodes at
lower levels. To overcome such a problem, a new multi-attribute index structure
called the holey brick B-tree (the hB-tree) [Lomet and Salzberg, 1990a] allows
the data space to be holey, enabling removal of any data subspace from a
data space. The concept of holey bricks is not new - it has been used to
improve the clustering of data in a kd-tree known as the BD-tree [Ohsawa and
Sakauchi, 1983]. The hB-tree structure is based on the K-D-B-tree structure
and hence preserves the height-balanced property. However, it allows the data
space associated with a node to be non-rectangular and it uses kd-trees for
space representation in its internal nodes. In an hB-tree, the leaf nodes are
known as data nodes and the internal nodes as index nodes. The data space
of an index node is the union of its child node subspaces which are obtained
through kd-tree recursive partitioning.
(a) Internal structure of an hB-tree index node. (b) The resultant pages after a split.
Figure 2.4. The hB-tree structure.
A k-dimensional data space represented by its boundaries requires 2k co-
ordinates. To obtain a data space of interest to the search, half of the data
subspaces in a node have to be searched on average and for each data space,
2k comparisons are required. For m data spaces, we need on average m · k
comparisons. The m data subspaces derived through kd-tree recursive parti-
tioning can be represented by a kd-tree with m - 1 kd-tree nodes. It requires
one comparison at each internal node and 2k comparisons for the unpartitioned
subspace. The average number of comparisons is much smaller than that of the
boundary representation. The use of kd-trees therefore reduces the search time
as well as the storage space requirement.
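The comparison counts above can be checked with a small calculation. This mirrors the chapter's rough accounting; the function names are illustrative, and `kdtree_comparisons` counts one comparison per internal kd-tree node plus 2k for the final unpartitioned subspace.

```python
# Rough comparison counts for locating a data subspace within an index node
# holding m subspaces in a k-dimensional space.

def boundary_comparisons(m, k):
    """Boundary representation: on average half of the m subspaces are
    examined, at 2k comparisons each."""
    return (m / 2) * 2 * k           # = m * k on average

def kdtree_comparisons(m, k):
    """kd-tree representation: at most one comparison per internal node
    (m - 1 nodes for m subspaces) plus 2k for the final subspace."""
    return (m - 1) + 2 * k
```

For example, with m = 16 subspaces in two dimensions, the boundary representation costs about 32 comparisons on average while the kd-tree representation costs at most 19, and the gap widens as m grows.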
Like conventional kd-trees, internal nodes of the kd-tree structure in an hB-
tree index node partition the search space recursively. Its leaf nodes reference
some index nodes of the hB-tree. However, multiple leaves of a kd-tree structure
may refer to the same hB-tree index node (see Figure 2.4a), giving rise to the
"holey brick" representation. As such, the hB-tree is not truly a tree. During
a split, the kd-tree is split into two subtrees, with each having between 1/3 and
2/3 of the nodes. In order to achieve this, a subtree may have to be extracted
from the original tree structure. This causes duplication of a portion of the
tree close to the root in the parent index node. A leaf node of such a kd-
tree references either an hB-tree data node, an index node, or a marker (ext
in Figure 2.4b) indicating that a subtree has previously been extracted and
is referenced from a higher level index node. The deletion algorithm is not
addressed in the paper.
The hB-tree overcomes the problem of sparse nodes in the K-D-B-tree. How-
ever, this is achieved at the expense of more expensive node splitting and node
deletion. The multiple references of an hB-tree node may cause a path to be
traversed more than once. Of course, this can be avoided by checking the list
of traversed hB-tree nodes. Deletion may result in the kd-tree being collapsed
to remove the duplicated portion of kd-trees, followed by a resplit if necessary.
2.3.4 The skd-tree
Ooi et al. [Ooi et al., 1987, Ooi et al., 1991] developed an indexing structure
called the spatial kd-tree (the skd-tree) in an attempt to avoid object duplica-
tion and object mapping. At each node of a kd-tree, a value (the discriminator
value) is chosen in one of the dimensions to partition a k-dimensional space
into two subspaces. The two resultant subspaces, HISON and LOSON, nor-
mally have almost the same number of data objects. Point objects are totally
included in one of the two resultant subspaces, but non-zero sized objects may
extend over to the other subspace. To avoid the division of objects and the
duplication of identifiers in several subspaces, and yet to be able to retrieve
all the wanted objects, a virtual subspace for each original subspace was in-
troduced such that all objects are totally included in one of the two virtual
subspaces [Ooi et al., 1987]. With this method, the placement of an object in
a subspace is based solely upon the value of its centroid.
Since a space is always divided into two, an additional value for each subspace
is required: the maximum of the objects in the LOSON subspace (maxLOSON),
and the minimum of the objects in the HISON subspace (minHISON), along
the dimension defined by the discriminator. Thus, the structure of an internal
node of the skd-tree consists of two child pointers, a discriminator (0 to k-1 for
a k-dimensional space), a discriminator-value, maxLOSON and minHISON
along the dimension specified by the discriminator. The maximum range value
52 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
of LOSON (maxLOSON) is the nearest virtual line that bounds the data objects
whose centroids are in the LOSON subspace, and the minimum range value of
HISON (minHISON) is the nearest virtual line that bounds the data objects
whose centroids are in the HISON subspace.
Leaf nodes contain min-range and max-range (in place of maxLOSON and
minHISON of an internal node), respectively describing the minimum and max-
imum values of objects in the data page along the dimension specified by bound,
and a pointer to the secondary page which contains the object bounding rect-
angles and identifiers. The minimum and maximum values could be kept for k
dimensions. However, for storage efficiency, the range along one dimension that
results in the smallest bounding rectangle is chosen. It has been shown [Ooi,
1990] that keeping the ranges for all k dimensions increases the height of the
tree when it is stored as a multiway tree, and hence the improvement would be
fairly marginal. Figure 2.5
shows the structure of a two-dimensional skd-tree and illustrates the virtual
boundary (dotted line), minHISON or maxLOSON of each resultant subspace.
An implicit rectangular space is associated with each node and it is ma-
terialized during traversal. This rectangle is tested against the query region,
and the subtree is examined if they intersect. Since the virtual boundary may
sometimes bound the objects tighter than the partitioning line, the intersec-
tion search takes advantage of the existing virtual boundary to prune the search
space efficiently. To further exploit the virtual boundaries, a containment search,
which retrieves all spatial objects contained in a given query rectangle, was pro-
posed. During tree traversal, the algorithm always selects the boundaries that
yield smaller search space. The direct support of containment search is useful
to operators like within and contain. The search rapidly eliminates all objects
that are not totally contained in the query region.
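The pruning rule described above can be sketched in Python. This is a minimal illustration under our own assumptions, not code from the original paper: the names SkdNode and intersection_search are ours, rectangles are modeled as per-dimension (lo, hi) intervals, and leaves as plain lists of (rectangle, identifier) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]  # ((lo, hi), ...) per dimension

@dataclass
class SkdNode:
    disc: int            # discriminator dimension, 0..k-1
    disc_value: float    # partitioning value along dimension `disc`
    max_loson: float     # virtual boundary: max extent of LOSON objects along `disc`
    min_hison: float     # virtual boundary: min extent of HISON objects along `disc`
    loson: object = None # SkdNode, or a list of (rect, oid) pairs at a leaf bucket
    hison: object = None

def intersects(r: Rect, q: Rect) -> bool:
    return all(lo <= qhi and qlo <= hi
               for (lo, hi), (qlo, qhi) in zip(r, q))

def intersection_search(node, query: Rect, out: List):
    if isinstance(node, list):               # leaf bucket: test each object
        out.extend(oid for rect, oid in node if intersects(rect, query))
        return
    d = node.disc
    qlo, qhi = query[d]
    # LOSON objects lie within (-inf, max_loson] along d; the virtual
    # boundary may bound the objects tighter than the partition line itself.
    if qlo <= node.max_loson:
        intersection_search(node.loson, query, out)
    # HISON objects lie within [min_hison, +inf) along d.
    if qhi >= node.min_hison:
        intersection_search(node.hison, query, out)
```

Note how a query strip falling between max_loson and min_hison prunes both subtrees, which the partition line alone could not do.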
Inserting index records for new data objects is similar to insertion into a
point kd-tree. As new index records are added to a bucket, the bucket is split
if it overflows. At each node, the algorithm uses the centroid of the bounding
rectangle of the new object to determine in which subspace the object will be
placed, and updates the virtual boundary if necessary.
To delete an object, the centroid of its bounding rectangle is used to deter-
mine where the object resides. The removal of an object may cause a bucket
to underflow, and merging or reinsertion is then required. If the neighboring
node is a leaf-node, then the two buckets are merged and the resultant bucket
is resplit if overflow occurs. Otherwise, the records are required to be inserted
into the neighboring subtree, and the neighboring node is promoted to replace
the parent node. The merging follows the principle of the buddy system [Niev-
ergelt et al., 1984]; that is, the region of two merged nodes is rectangular and
is a proper subspace derivable from discriminator values in parent nodes. The
major problem with deletion occurs when an object that contributes to the bound-
ary of a virtual space is deleted. A new, tighter boundary then needs to replace
the old one, which may no longer bound the remaining objects effectively. The
operation can be expensive, as several pages whose space is adjacent to the deleted
boundary need to be searched. The cost can be reduced by periodically sweeping
the subtrees that are affected by deletions. It should be noted that delaying the
search for a replacement boundary does not result in any invalid answers.
SPATIAL DATABASES 53
[Figure 2.5. The structure of a spatial kd-tree: (a) a 2-d directory of the skd-tree;
(b) a 2-d space coordinate representation.]
The directory of the skd-tree is stored in secondary memory. The bottom-up
approach for binary tree paging [Cesarini and Soda, 1982] is modified to store
the skd-tree as a multiway-tree. When such a page splits, one of the subtrees
is migrated to an existing page that can accommodate the subtree or a new
page, and the root of the subtree is promoted to the parent page.
It was shown that the containment search is insensitive to the different sizes
of objects and distribution of objects, and it is always more efficient than the
intersection search due to a smaller search space [Ooi et al., 1991]. It can be
noticed that the leaf nodes of the skd-tree take up about half of the storage
requirement for the directory. The main objective of having such a layer of leaf
nodes is to reduce the fetching of data pages. Experiments have been conducted
to evaluate the performance of skd-trees with and without the leaf nodes, under
different data distributions [Ooi, 1990]. The experiments show that for uniform
distributions of spatial objects, the leaf nodes can reduce the page accesses.
However, when the distributions are skewed, the extra layer is not effective,
and the larger directory incurs more page reads than the modified skd-tree
(the variant without leaf nodes) does. The modified skd-tree, which has fewer
nodes, saves up to 40% of the directory storage space.
2.3.5 The BD- and GBD-trees
The BD-tree [Ohsawa and Sakauchi, 1983], a variant of kd-trees, allows a more
dynamic partitioning of space. Each non-leaf node in the BD-tree contains
a variable-length string, called the discriminator zone (DZ) expression, con-
sisting of 0's and 1's. A 0 means "<" and a 1 means "≥", with the leftmost digit
corresponding to the first binary division, and the nth bit corresponding to
the nth binary division. The string describes the left subspace while the right
subspace is its complement. Each string uniquely describes a space. A data
space whose DZ expression (for example, 0100) is an initial substring of a longer
DZ expression (for example, 010001) encloses the data space of the latter. A BD-
tree is different from a kd-tree in the following aspects. One, the data space
of a BD-tree node is not a hyper-rectangle. The use of complement makes the
space holey. Two, unlike the conventional kd-tree, the use of DZ expression
enables rotation, achieving a greater degree of balancing. Three, the partition
divides a space into two equal sized subspaces. Four, the discriminators are
used cyclically so that each bit of a DZ expression can be correctly associated
with a dimension.
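The prefix property of DZ expressions, and the cyclic halving that produces them, can be sketched as follows. This is an illustrative reconstruction under our own assumptions (a rectangular data space halved cyclically over the dimensions); the function names are ours, not from the BD-tree paper.

```python
def dz_contains(e1: str, e2: str) -> bool:
    """The data space of DZ expression e1 encloses that of e2
    iff e1 is an initial substring (prefix) of e2."""
    return e2.startswith(e1)

def dz_encode(point, bounds, depth: int) -> str:
    """Encode a k-d point as a DZ string of `depth` bits by halving the
    space cyclically over the dimensions (0 means "<", 1 means ">=")."""
    k = len(point)
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    bits = []
    for i in range(depth):
        d = i % k                       # discriminators are used cyclically
        mid = (lo[d] + hi[d]) / 2
        if point[d] < mid:
            bits.append("0")
            hi[d] = mid                 # descend into the lower half
        else:
            bits.append("1")
            lo[d] = mid                 # descend into the upper half
    return "".join(bits)
```

For example, dz_contains("0100", "010001") holds, matching the enclosure relation described in the text.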
The BD-tree is expanded to a balanced multi-way tree called the GBD-
tree (generalized BD-tree) [Ohsawa and Sakauchi, 1990]. In addition to a DZ
expression, a bounding rectangle is used to describe a data space that bounds
the objects whose centroids fall inside the region defined by the DZ expression.
Centroids of objects are used to determine placement of objects in the correct
bucket. While a DZ expression is used to determine the position in the tree
structure where an entity is located based on its centroid, a bounding rectangle
is used in intersection search.
In an internal node, each entry describes a data space obtained through
binary decomposition. The union of these data spaces forms the data space of
the node. While the data spaces described by the entries' DZ expressions do
not overlap, their associated bounding rectangles overlap. During point search
of an entity, an inclusion check of the DZ expression of the entity is performed
against the DZ expression of a node. For the data space that includes the entity,
its subtree is traversed. For the intersection search, the bounding rectangles
stored in a node are used instead to select subtrees for traversal.
When a leaf node overflows, it is split into two. A recursive binary decom-
position on alternating axes is performed on the overflowed data space until a
subspace contains at least 2(M+1)/3 entries, where M is the maximum number
of entries a node can contain. While the smaller space has a new DZ expression,
the other subspace takes the DZ expression of the space before splitting. We
call such a space a complementary subspace. A new entry is inserted into the
parent node and the affected bounding rectangles are re-adjusted accordingly.
In an internal node splitting, the subspaces are checked in decreasing order
of their sizes to find a data space that contains almost (M + 1)/2 entries. A
data space described by the DZ expression el contains the data space described
by the DZ expression e2, if el forms the initial substring of e2. In the testing, all
DZ expressions must be checked. The worst case is when a node is split into two
nodes respectively having M entries and one entry. The DZ expression obtained
is used as the DZ expression of a new node. The other new node, which re-
uses the original node, is assigned with the DZ expression of the original space.
When an entry is deleted, a node may underflow. Like B-trees, tree collapsing
is required.
Conceptually, the GBD-tree is similar to the BANG file [Freeston, 1987].
The use of bounding rectangles can be applied to the BANG file. The GBD-
tree has been shown to have better efficiency than the R-tree in terms of tree
construction time for a small set of data [Ohsawa and Sakauchi, 1990].
2.3.6 The LSD-tree
As an improvement to the fixed size space partitioning of the grid files, a binary
tree, called the Local Split Decision tree (LSD-tree), that supports arbitrary
split position was proposed [Henrich et al., 1989a]. A split position can be
chosen such that it is optimal with respect to the current cell. The directory
of an LSD-tree is similar to that maintained by the kd-tree [Bentley, 1975].
Each node of the LSD-tree represents one split and stores the split dimension
(cf: the discriminator of kd-trees) and position (cf: the discriminator value of
kd-trees), and each leaf node points to a data bucket.
In an LSD-tree, the nodes in a directory T are divided into two directories:
the internal directory and the external directory. The internal directory consists
of a subtree that contains the root and is stored in main memory. The external
directory consists of multiway-trees and is stored in secondary memory. In an
external directory page, the subtree is organized as a heap. When a directory
page is split, the root node of that directory page is inserted into the directory
T and the left and right subtrees are stored in two distinct directory pages.
The main objective of the paging algorithm [Henrich et al., 1989b] is to ensure
that the heights of multiway-trees differ by at most one directory page. The
proposed paging strategy is similar to binary paging strategy [Cesarini and
Soda, 1982], although the latter makes no distinction between the external
and internal directories. The major difference is that the internal directory is
restructured such that the heights of multi-way trees in the external directory
always differ by at most one page. To achieve this, nodes close to the boundary
that separates the internal and external directories must be moved around
between these two directories. Note that the size of the internal directory
depends on the allocated internal memory. As with kd-trees, rotation of the tree
is not possible; if the data is very skewed, the property that the heights differ
by at most one cannot be upheld.
The deletion algorithm is not presented. We believe that the deletion of
[Cesarini and Soda, 1982] can be applied here.
2.4 B-tree based indexing techniques
B+-trees have been widely used in data intensive systems to facilitate query
retrieval. The wide acceptance of the B+-tree is due to its elegant height-balanced
structure, which makes it ideal for disk I/O, where data is transferred in units
of pages. It has become an underlying structure for many new indexes. In this
section, we discuss indexes based on the concept of the hierarchical structure
of B+-trees.
2.4.1 The R-tree
The R-tree [Guttman, 1984] is a multi-dimensional generalization of the B-tree
that preserves height balance. Like the B-tree, node splitting and merging are
required for inserting and deleting objects. The R-tree has received a great
deal of attention due to its well defined structure and the fact that it is one of
the earliest proposed tree structures for non-zero sized spatial object indexing.
Many papers have used the R-tree as a model to measure the performance of
their structures.
An entry in a leaf node consists of the object-identifier of a data object
and a k-dimensional bounding rectangle which bounds the object. In a
non-leaf node, an entry contains a child-pointer pointing to a lower level node
in the R-tree and a bounding rectangle covering all the rectangles in the lower
nodes in the subtree. Figure 2.6 illustrates the structure of an R-tree.
[Figure 2.6. The structure of an R-tree: (a) a planar representation; (b) the
directory of an R-tree.]
In order to locate all objects which intersect a query rectangle, the search
algorithm descends the tree from the root. The algorithm recursively traverses
down the subtrees of bounding rectangles that intersect the query rectangle.
When a leaf node is reached, bounding rectangles are tested against the query
rectangle and their objects are fetched for testing if they intersect the query
rectangle.
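The recursive descent just described can be sketched as follows. This is an illustration of the algorithm rather than a faithful page-oriented implementation; nodes are modeled as plain dictionaries, and the names are ours.

```python
def overlaps(a, b):
    """Axis-aligned rectangles given as ((lo, hi), ...) per dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def search(node, query, out):
    """Collect identifiers of all objects whose bounding rectangles
    intersect the query rectangle, descending every intersecting subtree."""
    if node["leaf"]:
        out.extend(oid for rect, oid in node["entries"]
                   if overlaps(rect, query))
    else:
        for rect, child in node["entries"]:
            if overlaps(rect, query):   # several entries may qualify
                search(child, query, out)
```

Because covering rectangles in an internal node may overlap, several subtrees may be descended for one query, which is exactly why coverage and overlap minimization matter.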
To insert an object, the tree is traversed and all the rectangles in the current
non-leaf node are examined. The constraint of least coverage is employed to
insert an object: the rectangle that needs the least enlargement to enclose the
new object is selected; the one with the smallest area is chosen if more than
one rectangle meets the first criterion. The nodes in the subtree indexed by
the selected entry are examined recursively. Once a leaf node is obtained, a
straightforward insertion is made if the leaf node is not full. However, the leaf
node needs splitting if it overflows after the insertion is made. For each node
that is traversed, the covering rectangle in the parent is readjusted to tightly
bound the entries in the node. For a newly split node, an entry with a covering
rectangle that is large enough to cover all the entries in the new node is inserted
in the parent node if there is room in the parent node. Otherwise, the parent
node will be split and the process may propagate to the root.
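The least-coverage rule for choosing a subtree can be sketched as below; the helper names are ours, and rectangles are again per-dimension (lo, hi) intervals.

```python
def area(rect):
    p = 1.0
    for lo, hi in rect:
        p *= hi - lo
    return p

def enlarged(rect, new):
    """Smallest rectangle covering both `rect` and `new`."""
    return tuple((min(lo, nlo), max(hi, nhi))
                 for (lo, hi), (nlo, nhi) in zip(rect, new))

def choose_subtree(entries, new_rect):
    """Least-coverage rule: pick the entry whose covering rectangle needs
    the least area enlargement to enclose new_rect; ties are broken by
    choosing the rectangle with the smallest area."""
    def key(entry):
        rect, _child = entry
        return (area(enlarged(rect, new_rect)) - area(rect), area(rect))
    return min(entries, key=key)
```

The tuple key encodes both criteria at once: Python's min compares the enlargement first and falls back to the area only on a tie.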
To remove an object, the tree is traversed and each entry of a non-leaf node
is checked to determine if the object overlaps its covering rectangle. For each
such entry, the entries in the child node are examined recursively. The deletion
of an object may cause the leaf node to underflow. In this case, the node needs
to be deleted and all the remaining entries of that node are reinserted from
the root. The deletion of an entry may also cause further deletion of nodes
in the upper levels. Thus, entries belonging to a deleted ith level node must
be reinserted into the nodes in the ith level of the tree. Deletion of an object
may change the bounding rectangle of entries in the ancestor nodes. Hence
readjustment of these entries is required.
In searching, the decision to visit a subtree depends on whether the covering
rectangle overlaps the query region. It is quite common for several covering
rectangles in an internal node to overlap the query rectangle, resulting in the
traversal of several subtrees. Therefore, the minimization of overlaps of covering
rectangles as well as the coverage of these rectangles is of primary importance
in constructing the R-tree.
The heuristic optimization criterion used in the R-tree is the minimization
of the area of the covering rectangles of internal nodes. Two algorithms are
involved in this minimization: the insertion algorithm and the node splitting
algorithm.
Of the two, the splitting algorithm affects the index efficiency more. Guttman
[Guttman, 1984] presented and studied splitting algorithms with exponential,
quadratic and linear cost, and showed that the performance of the quadratic
and linear algorithms were comparatively similar. The quadratic algorithm
in a node splitting first locates the two entries that are furthest apart, that is,
the pair of entries that would waste the largest area if they were put in the same
group. These two rectangles are known as the seeds, and the pair chosen tends
to be small relative to the others. Two groups are formed, each with one seed.
For the remaining entries, each entry rectangle is used to calculate the area
enlargement required in the covering rectangle of each group to include the
entry. The difference of two area enlargements is calculated and the entry that
has the maximum difference is selected as the next entry to be included into the
group whose covering rectangle needs the least enlargement. As the selection
is mainly based on the minimal enlargement of covering rectangles, and a
rectangle that has been enlarged before requires less expansion to include the
next rectangle, it is quite common for a single covering rectangle to be enlarged
until its group has M - m + 1 rectangles (M is the maximum number of entries
per node). The two resultant groups will respectively contain M - m + 1 and
m rectangles. The linear algorithm chooses the first two objects based on the
separation between the objects in relation to the width of the entire group along
the same dimension. Greene proposed a slightly different splitting algorithm
[Greene, 1989]. In her splitting algorithm, the two most distant rectangles are
selected and, for each dimension, the separation is calculated. Each separation
is normalized by dividing it by the interval of the covering rectangle on the
same dimension, instead of by the total width of the entire group [Guttman,
1984]. Along the dimension with the largest normalized separation, rectangles
are ordered on the lower coordinate. The list is then divided into two groups,
with the first (M + 1)/2 rectangles into the first group and the rest into the
other.
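Guttman's quadratic split can be sketched as follows. This is a simplified illustration: it follows the seed selection and pick-next rules described above, but omits the safeguard that stops assignment once one group must absorb all remaining entries to reach the minimum fill m; the function names are ours.

```python
from itertools import combinations

def area(r):
    p = 1.0
    for lo, hi in r:
        p *= hi - lo
    return p

def cover(a, b):
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def quadratic_split(rects):
    # Seeds: the pair of rectangles that would waste the most area
    # if placed in the same group (dead area of their joint cover).
    i, j = max(combinations(range(len(rects)), 2),
               key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))
    groups, covers = [[i], [j]], [rects[i], rects[j]]
    rest = [x for x in range(len(rects)) if x not in (i, j)]
    while rest:
        # Pick-next: the entry whose two enlargement costs differ the most
        # is assigned first, to the group needing the lesser enlargement.
        def cost(x):
            return [area(cover(covers[g], rects[x])) - area(covers[g])
                    for g in (0, 1)]
        x = max(rest, key=lambda e: abs(cost(e)[0] - cost(e)[1]))
        d = cost(x)
        g = 0 if d[0] < d[1] else 1
        groups[g].append(x)
        covers[g] = cover(covers[g], rects[x])
        rest.remove(x)
    return groups
```

Its quadratic cost comes from the seed search over all pairs; the pick-next loop then re-evaluates the remaining entries at each step.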
2.4.2 The R*-tree
Minimization of both coverage and overlaps is crucial to the performance of
the R-tree. It is, however, impossible to minimize the two at the same time. A
balancing criterion must be found such that near-optimality of both minimiza-
tions can produce the best result. Beckmann et al. introduced an additional
optimization objective concerning the margin of the covering rectangles; squar-
ish covering rectangles are preferred [Beckmann et al., 1990]. Since clustering
rectangles with little variance of the lengths of the edges tend to reduce the area
of the cluster's covering rectangle, a criterion that favors squarish cov-
ering rectangles is used in the insertion and splitting algorithms. This variant
of R-tree is referred to as the R*-tree.
In the leaf nodes of the R*-tree, a new record is inserted into the page whose
covering rectangle, if enlarged, has the least overlap with the other covering
rectangles. A tie is resolved by choosing the entry whose rectangle needs the
least area enlargement. However, in the internal nodes, an entry whose covering
rectangle needs the least area enlargement is chosen to include the new record,
and a tie is resolved by choosing the entry with the smallest resultant area.
The improvement is particularly significant when both the query rectangles
and data rectangles are small, and when the data is non-uniformly distributed.
In the R*-tree splitting algorithm, along each axis, the entries are sorted by
the lower value, and also sorted by the upper value of the entry rectangles. For
each sort, M - 2m + 2 distributions of splits are considered; in the kth
distribution (1 ≤ k ≤ M - 2m + 2), the first group contains the first m - 1 + k
entries and the other group contains the remaining M - m - k + 2 entries. For
each split, the total area, the sum of edges and the overlap-area of the two new
covering rectangles are used to determine the split. Note that not all three can
be minimized at the same time. Three selection criteria were proposed based on
the minimum over one dimension, the minimum of the sum of the three values
over one dimension or one sort, and the overall minimum. In the algorithm,
the minimization of the edges is used.
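The enumeration of distributions for one sorted axis can be sketched as below. This is an illustrative reconstruction of the bookkeeping only (margin, overlap, and area per distribution), not the full R*-tree axis-selection algorithm; candidate_splits and the helpers are our names.

```python
def area(r):
    p = 1.0
    for lo, hi in r:
        p *= hi - lo
    return p

def margin(r):
    """Sum of edge lengths of a rectangle (the 'margin' criterion)."""
    return sum(hi - lo for lo, hi in r)

def cover_all(rs):
    return tuple((min(r[d][0] for r in rs), max(r[d][1] for r in rs))
                 for d in range(len(rs[0])))

def overlap(a, b):
    p = 1.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        w = min(ahi, bhi) - max(alo, blo)
        if w <= 0:
            return 0.0
        p *= w
    return p

def candidate_splits(sorted_rects, m):
    """For M+1 rectangles sorted along one axis, yield the M - 2m + 2
    distributions: the first group holds the first m - 1 + k entries,
    k = 1 .. M - 2m + 2, with the margin, overlap and area of each split."""
    M = len(sorted_rects) - 1
    for k in range(1, M - 2 * m + 3):
        g1 = sorted_rects[:m - 1 + k]
        g2 = sorted_rects[m - 1 + k:]
        c1, c2 = cover_all(g1), cover_all(g2)
        yield (k, margin(c1) + margin(c2),
               overlap(c1, c2), area(c1) + area(c2))
```

A caller would repeat this for both the lower-value and upper-value sorts on every axis, pick the split axis by the minimum total margin, and then choose the distribution with the minimum overlap.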
Dynamic hierarchical spatial indexes are sensitive to the order of the inser-
tion of data. A tree may behave differently for the same data set but with a
different sequence of insertions. Data rectangles inserted previously may result
in a bad split in the R-tree after some insertions. Hence it may be worthwhile
to do some local reorganization, which is, however, expensive. The R-tree deletion
algorithm provides reorganization of the tree to some extent, by forcing the
entries in underflowed nodes to be inserted from the root. The performance
study shows that the deletion and reinsertion can improve the R-tree perfor-
mance quite significantly [Beckmann et al., 1990]. Using the idea of reinsertion
of the R-tree, Beckmann et al. proposed a reinsertion algorithm when a node
overflows. The reinsertion algorithm sorts the entries in decreasing order of
the distance between the centroid of the entry rectangle and the centroid of the
covering rectangle, and reinserts the first p (a tunable parameter) entries. In some cases, the
entries are reinserted back into the same node and hence a split is eventually
necessary. The reinsertion increases the storage utilization, but it can be
expensive when the tree is large. The experimental study conducted indicates that
the R*-tree is more efficient than some other variants, and the R-tree using the
linear splitting algorithm is substantially less efficient than the one with the
quadratic splitting algorithm [Beckmann et al., 1990].
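The candidate selection for forced reinsertion can be sketched as follows; the function name and the (rectangle, identifier) entry representation are ours, and the surrounding reinsertion machinery is omitted.

```python
import math

def centroid(r):
    return tuple((lo + hi) / 2 for lo, hi in r)

def forced_reinsert_candidates(entries, node_cover, p):
    """Sort the entries of an overflowing node by the distance between
    their centroid and the centroid of the node's covering rectangle,
    in decreasing order, and take the first p for reinsertion."""
    c = centroid(node_cover)
    def dist(entry):
        rect, _oid = entry
        return math.dist(centroid(rect), c)   # Euclidean distance
    ranked = sorted(entries, key=dist, reverse=True)
    return ranked[:p], ranked[p:]             # (to reinsert, to keep)
```

The entries farthest from the node's center are the ones most likely to fit better elsewhere, which is why they are reinserted first.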
2.4.3 The R+-tree
The R+-tree [Sellis et al., 1987] is a compromise between the R-tree and the
K-D-B-tree [Robinson, 1981] and was proposed to overcome the problem of the
overlapping covering rectangles of internal nodes of the R-tree. The R+-tree
differs from the R-tree in the following constraints: nodes of an R+-tree are
not guaranteed to be at least half filled; the entries of any internal node do not
overlap; and an object identifier may be stored in more than one leaf node.
The duplication of object identifiers leads to the non-overlapping of entries.
In a search, the subtrees are examined only if the corresponding covering rect-
angles intersect the query region. The disjoint covering rectangles avoid the
multiple search paths of the R-tree for point queries. For the space in Fig-
ure 2.7, only one path is traversed to search for all objects that contain point
P7; whereas for the R-tree, two search paths exist. However, for certain query
rectangles, searching the R+-tree is more expensive than searching the R-tree.
For example, suppose the query region is the left half of object rs. To retrieve
all objects that intersect the query region using the R-tree, two leaf nodes have
to be searched, respectively through Rs and Rs, and it incurs five page ac-
cesses. To evaluate such a query, three leaf nodes of the R+-tree have to be
searched, respectively through R6, R9, and R10, and a total of six page accesses
is incurred.
[Figure 2.7. The structure of an R+-tree: (a) a planar representation; (b) the
directory of an R+-tree.]
To insert an object, multiple paths may be traversed. At a node, the subtrees
of all entries with covering rectangles that intersect with the object bounding
rectangle must be traversed. On reaching the leaf nodes, the object identifier
will be stored in the leaf nodes; multiple leaf nodes may store the same object
identifier.
Three cases of insertions need to be handled with care [Gunther, 1988, Ooi,
1990]. The first is when an object is inserted into a node where the covering
rectangles of all entries do not intersect with the object bounding rectangle.
The second is when the bounding rectangle of the new object only partially
intersects with the bounding rectangles of entries; this requires the bounding
rectangle to be updated to include the new object bounding rectangle. Both
cases must be handled properly so that the coverage of bounding rectangles
and the duplication of objects can be minimized.
The third case is more serious in that the covering rectangles of some entries
can prevent each other from expanding to include the new object. In other
words, some space ("dead space") within the current node cannot be covered
by any of the covering rectangles of the entries in the node. If the new object
occupies such a region, it cannot be fully covered by the entries. To avoid
this situation, it is necessary to look ahead to ensure that no dead space will
result when finding the entries to include an object. Alternatively, the crite-
rion proposed by Guttman [Guttman, 1984] can be used to select the covering
rectangles to include a new object. When a new object cannot be fully covered,
one or more of the covering rectangles are split. This means that the split may
cause the children of the entries to be split as well, which may further degrade
the storage efficiency.
During an insertion, if a leaf node is full and a split is necessary, the split
attempts to reduce the identifier duplications. Like the K-D-B-tree, the split
of a leaf node may propagate upwards to the root of the tree and the split
of a non-leaf node may propagate downwards to the leaves. The split of a
node involves finding a partitioning hyperplane to divide the original space
into two. The selection of a partitioning hyperplane was suggested to be based
on the following four criteria: the clustering of entry rectangles, minimal total
x- and y-displacement, minimal total space coverage of two new subspaces,
and minimal number of rectangle splits. While the first three criteria aim to
reduce search by tightening the coverage, the fourth criterion confines the height
expansion of the tree. The fourth criterion can only minimize the number of
covering rectangles of the next lower level that must be split as a consequence.
It cannot guarantee that the total number of rectangles being split is minimal.
Note that all four criteria cannot possibly be satisfied at the same time.
While the R+-tree overcomes the problem of overlapping rectangles of the R-
tree, it inherits some problems of the K-D-B-tree [Robinson, 1981]. Partitioning
a covering rectangle may cause the covering rectangles in the descendant sub-
tree to be partitioned as well. Frequent downward splits tend to partition the
already under-populated nodes, and hence the nodes in an R+-tree may contain
fewer than M/2 entries. Object identifiers are duplicated in the leaf nodes; the
extent of duplication depends on the spatial distribution and the size of
the objects. To delete an object, it is necessary to delete all identifiers that
refer to that object. Deletion may necessitate major reorganization of the tree.
2.4.4 The BV-tree
The BV-tree, proposed by Freeston, is a generalization of the B-tree to higher
dimensions [Freeston, 1995]. While the BV-tree guarantees that it can specialize
to (and hence preserves the properties of) a B-tree in the one-dimensional case,
at higher dimensions it may not be height-balanced, and its storage utilization
is reduced to no worse than 33% (instead of 50% in the B-tree). Despite foregoing
these two properties, it is able to maintain logarithmic access and update
times.
Based on the BANG file [Freeston, 1987], a subspace S is split into two
regions S1 and S2 such that the boundary of S1 encloses that of S2. Each
region is uniquely identified by a key, and the key is used to direct the search in
the BV-tree. Although the physical boundaries of regions may be recursively
nested, there is no correspondence between the level of nesting of a region and
the index tree hierarchy which represents it. In fact, whenever the boundary of
a region r1 directly encloses the boundary of a region r2 resulting from a
split, r1 is "promoted" closer to the root. To facilitate searching correctly,
the actual level to which r1 belongs (called a guard) is stored.
Figure 2.8 illustrates a BV-tree. As shown in the figure, the boundary of region
a0 encloses that of region b0, which in turn encloses the boundaries of regions
c0, d0 and e0. In this example, region b0 has been promoted to the root as it
serves as a guard for region b1.
[Figure 2.8. The structure of a BV-tree: (a) a planar representation; (b) the
BV-tree.]
The search begins at the root, and descends down the tree. At each node,
every entry is checked to identify a guard set that represents regions that best
match the search region. Two types of entries can be found in the guard set:
those that correspond to the guards of an unpromoted entry, and the best-
matching unpromoted entry that encloses the best-matching guard. As the tree is
descended from level h to level h - 1, the guard sets found at levels h - 1 and
h are merged in the process of which some may be pruned away. Once the leaf
node is reached, the guard set contains the regions where the search region may
be found. The data corresponding to the regions of the guard set are searched
to answer the query.
During insertion, a complication arises when a promoted region is to be split
into two such that one region encloses higher-level regions while the other does
not. In this case, the entry for the second region will have to be demoted to its
unpromoted position in the tree. Deletion may require merging and resplitting.
This requires finding a region to merge, and finding a way to split the merged
region again.
2.5 Cell methods based on dynamic hashing
Both extendible hashing [Fagin et al., 1979] and linear hashing [Kriegel and
Seeger, 1986, Larson, 1978] lend themselves to an adaptable cell method for
organizing k-dimensional objects. The grid file [Nievergelt et al., 1984] and the
EXtendible CELL (EXCELL) method [Tamminen, 1982] are extensions of dy-
namic hashed organizations incorporating a multi-dimensional file organization
for multi-attribute point data. We shall restrict our discussion to the grid file
and its variants.
2.5.1 The grid file
The grid file structure [Nievergelt et al., 1984] consists of two basic structures:
k linear scales and a k-dimensional directory (see Figure 2.9). The fundamental
idea is to partition a k-dimensional space according to an orthogonal grid. The
grid on a k-dimensional data space is defined as scales which are represented by
k one-dimensional arrays. Each boundary in a scale forms a (k-1)-dimensional
hyperplane that cuts the data space into two subspaces. Boundaries form k-
dimensional unpartitioned rectangular subspaces, which are represented by a
k-dimensional array known as the grid directory. The correspondence between
directory entries and grid cells (blocks) is one-to-one. Each grid cell in the grid
directory contains the address of a secondary page, the data page, where the
data objects that are within the grid cell are stored. As the structure does not
have the constraint that each grid cell must at least contain m objects, a data
page is allowed to store objects from several grid cells as long as the union of
these grid cells together form a rectangular rectangle, which is known as the
storage region. These regions are pairwise disjoint, and together they span the
SPATIAL DATABASES 65
data space. For most applications, the size of the directory dictates that it be
stored on secondary storage; the scales, however, are much smaller and may be
cached in main memory.
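The two-step lookup just described, consulting the in-memory scales to locate the grid cell and then fetching the directory entry for the data-page address, can be sketched as follows. The scale and directory representations (sorted boundary lists and nested arrays) and the page names are illustrative assumptions, not the original implementation.

```python
from bisect import bisect_right

def grid_lookup(point, scales, directory):
    """Exact-match search in a grid file (illustrative sketch).

    scales    -- k sorted lists of partition boundaries, one per dimension
    directory -- k-dimensional nested lists; each cell holds a data-page id
    """
    cell = directory
    for coord, scale in zip(point, scales):
        # The scale tells us which slice of this dimension the point falls in.
        cell = cell[bisect_right(scale, coord)]
    return cell  # address of the data page holding the point's storage region

# A 2-dimensional example: x is cut at 10, y is cut at 5 and 8.
scales = [[10], [5, 8]]
directory = [["P0", "P1", "P2"],   # x < 10
             ["P3", "P4", "P5"]]   # x >= 10
print(grid_lookup((12, 6), scales, directory))   # page "P4"
```

Since the scales are cached in main memory, only the directory access and the data-page access touch disk, which is the origin of the "two disk access" property discussed later.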
[Figure: grid directory cells referencing the data pages that hold the point data.]
Figure 2.9. The grid file layout.
Like other tree structures, splitting and merging of data pages are required
during insertion and deletion respectively. Insertion of an object entails
determining the correct grid cell and fetching the corresponding page, followed
by a simple insertion if the data page is not full. If the page
is full, a split is required. The split is simple if the storage region covers more
than one grid cell and not all the data in the region fall within the same cell:
the grid cells are divided between the existing data page and a new page, with the
data objects distributed accordingly. However, if the storage region covers only
one grid cell, or all the data of a region fall within one cell, then the grid
has to be extended by a (k-1)-dimensional hyperplane that partitions the stor-
age region into two subspaces. A new boundary is inserted into one of the
k grid scales and, to maintain the one-to-one correspondence between the grid and
the grid directory, a (k-1)-dimensional cross-section is added to the grid di-
rectory. The resulting two storage regions are disjoint and, to each region, a
corresponding data page is attached. The objects stored in the overflowing page
are distributed between the two pages, one new and one existing. Other
66 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
grid cells that are partitioned by the new hyperplane are unaffected since both
parts of the old grid cell will now be sharing the same data page.
Deletions may cause the occupancy of a storage region to fall below an
acceptable level, triggering merging operations. When the joint occupancy
of a storage region whose records have been deleted and an adjacent storage
region drops below a certain threshold, the data pages are merged into one. Based
on the average bucket occupancy obtained from simulation studies, Nievergelt
et al. [Nievergelt et al., 1984] suggested that 70% is an appropriate occupancy
level for the resulting bucket. Two different methods were proposed for merging: the
neighbor system and the buddy system. The neighbor system allows two data pages
whose storage regions are adjacent to merge so long as the new storage region
remains rectangular; this may lead to "dead space" where neighboring pages
prevent any merging for a particular under-populated page. A more restrictive
merging policy like the buddy system is required to prevent the dead space.
For the buddy system, two pages can be merged provided their storage regions
could have been obtained from the next larger storage region by the splitting
process. However, total elimination of dead space in a k-dimensional space is
not always possible. The merging process will also make the boundary along
the two old pages redundant, when there are no storage regions adjacent to
the boundary. In this case, the redundant boundary is removed from its scale
and the one-to-one correspondence is maintained by removing the redundant
entries from the grid directory.
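The neighbor-system admissibility test reduces to checking that the union of two storage regions is itself rectangular. A minimal sketch, assuming regions are represented as per-dimension (lo, hi) intervals (an illustrative encoding, not the original one):

```python
def can_merge_neighbors(r1, r2):
    """Neighbor-system test: two storage regions may merge only if their
    union is again rectangular. Regions are lists of (lo, hi) intervals,
    one interval per dimension."""
    diffs = [d for d in range(len(r1)) if r1[d] != r2[d]]
    if len(diffs) != 1:          # must agree in all but one dimension
        return False
    d = diffs[0]
    # In the single differing dimension the intervals must be adjacent.
    return r1[d][1] == r2[d][0] or r2[d][1] == r1[d][0]

# Two 2-d regions sharing the full y-extent and adjacent along x:
print(can_merge_neighbors([(0, 4), (0, 8)], [(4, 6), (0, 8)]))  # True
# Diagonal neighbors would produce an L-shaped (non-rectangular) union:
print(can_merge_neighbors([(0, 4), (0, 4)], [(4, 6), (4, 8)]))  # False
```

The buddy system adds a further restriction not captured here: the merged region must also be one that the splitting process could have produced.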
The grid file has also been proposed as a means for spatial indexing of non-
point objects [Nievergelt and Hinrichs, 1985]. To index k-dimensional data
objects, a mapping from the k-dimensional space to an nk-dimensional space, where
objects exist as points, is necessary. One disadvantage of the mapping scheme is
that it is harder to perform directory splitting in the higher-dimensional space
[Whang and Krishnamurthy, 1985]. To index a rectangle, it is represented as
(cx, cy, dx, dy), where (cx, cy) is the centroid of the object and (dx, dy) are the
extensions of the object from the centroid. The (cx, cy, dx, dy) representation
causes objects to cluster close to the cx-axis in the cx-dx plane, while objects
cluster along the diagonal x1 = x2 under the corner-based (x1, x2, y1, y2)
representation. For ease of grid partitioning, the former
representation is therefore preferred. For an object (cx, cy, dx, dy) to intersect
the query region (qcx, qcy, qdx, qdy), the following conditions must be
satisfied:
cx - dx < qcx + qdx and
cx + dx > qcx - qdx and
cy - dy < qcy + qdy and
cy + dy > qcy - qdy
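The four inequalities transcribe directly into a predicate. The centroid-extension tuple layout below is an illustrative assumption:

```python
def intersects(obj, query):
    """Intersection test in the centroid-extension parameter space,
    transcribing the four inequalities of the grid-file mapping scheme."""
    cx, cy, dx, dy = obj
    qcx, qcy, qdx, qdy = query
    return (cx - dx < qcx + qdx and
            cx + dx > qcx - qdx and
            cy - dy < qcy + qdy and
            cy + dy > qcy - qdy)

# A 4x2 rectangle centred at (5, 5) against a query window centred at (8, 5):
print(intersects((5, 5, 2, 1), (8, 5, 2, 2)))   # True: 5 + 2 = 7 > 8 - 2 = 6
print(intersects((5, 5, 2, 1), (12, 5, 2, 2)))  # False: 7 is not > 12 - 2 = 10
```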
Consider Figure 2.10a, where rectangle q is the query rectangle. The inter-
section search region on the cx-dx plane, the shaded region in Figure 2.10b,
is given by the first two inequalities of the above intersection con-
dition. Note that the search region can be very large if the global space is
large and the largest rectangle extension along the x-axis is not bounded. In
Figure 2.10, a known upper bound, udx, on any rectangle extension along
the x-axis reduces the search region to the enclosed shaded region. The same
argument applies to the other coordinate. Objects that fall in both search
regions satisfy the intersection condition.
[Figure panels: (a) Object distribution. (b) Search regions on the cx-dx plane. (c) Search regions on the cy-dy plane.]
Figure 2.10. Intersection search region in the grid file.
The mapping of regions from a k-dimensional space to points in an nk-
dimensional space undesirably changes the spatial neighborhood properties.
Regions that are spatially close in a k-dimensional space may be far apart when
they are represented as points in an nk-dimensional space. Consequently, the
intersection search may not be efficient.
2.5.2 The R-file
The grid file structure was originally designed to guarantee two disk accesses for
exact match queries, one to access the directory and the other to access the data
page. The "two disk access" property can only be ensured if the directory is
stored as an array and all grid cells are of the same size. However, with such an
implementation, the size of the directory is doubled whenever a new boundary
is introduced. Most of these directory entries correspond to empty grid cells
that do not contain any data objects. Simulation results [Nievergelt et al.,
1984] indicate that the size of the directory grows approximately linearly with
the size of the file. To alleviate this problem, multi-level directories [Blanken
et al., 1990, Hinrichs, 1985, Hutflesz et al., 1990, Freeston, 1987, Whang and
Krishnamurthy, 1985] where grid cells are organized in a hierarchical structure
have been suggested. We shall present the R-file approach, which is designed for
non-zero sized objects. In the R-file [Hutflesz et al., 1990], cells are partitioned
using the partitioning strategy of the grid file, and a cell is split when it overflows.
In order for cells to tightly contain the spatial objects, cells are partitioned
recursively by repeated halving until the smallest cell that encloses the spatial
objects is obtained. Spatial objects that are totally contained in a cell are
stored in its corresponding data page, and those that intersect the partitioning
line are stored in the original cell. If the number of spatial objects that intersect
a partitioning line is more than what can be stored in a data page, a partitioning
line along another dimension will be used. If all records lie on the crossing point of
the partitioning lines, they cannot be separated by any partitioning line, and in
such a case a chain of buckets is used.
After a split, the original cell and the two new cells overlap; to keep the
directory small, empty cells are not maintained. Both the original and the new
cells then hold almost the same number of spatial objects. Figure 2.11
illustrates a case in point. Even so, a high number of cells will be inspected
for intersection queries, especially the original large cells. The fact that
spatial objects stored in the original unpartitioned cells tend to intersect the
partitioning lines of those cells indicates the clustering property of these objects.
In order to make intersection search more efficient, two extra values that bound
the objects in the partitioning dimension are kept with the original cells. Due
to the overlapping cells, the directory is potentially large. To avoid storing the
cell boundaries, a z-ordering scheme [Orenstein, 1986] is used to number the
cells. With such a scheme, cells are partitioned cyclically. For each cell, the
directory stores the cell number, the bounding interval, and the data bucket
reference. Experiments conducted in [Hutflesz et al., 1990] strongly indicate that
the bounding information leads to substantial savings in page accesses.
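The z-ordering used to number cells can be illustrated by Morton bit interleaving. This generic sketch is not the R-file's exact cell-numbering scheme (which numbers the overlapping cells produced by cyclic halving), but it conveys how a cell number encodes a position without storing cell boundaries:

```python
def z_order(ix, iy, bits=8):
    """Interleave the bits of two cell coordinates (a Morton code).
    Generic sketch of the z-ordering idea [Orenstein, 1986], not the
    R-file's exact numbering scheme."""
    z = 0
    for b in range(bits):
        z |= ((ix >> b) & 1) << (2 * b)        # x bits in even positions
        z |= ((iy >> b) & 1) << (2 * b + 1)    # y bits in odd positions
    return z

# Cell numbers along a z-order curve over a 2x2 grid:
print([z_order(x, y, 2) for y in range(2) for x in range(2)])  # [0, 1, 2, 3]
```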
2.5.3 PLOP-hashing
In [Kriegel and Seeger, 1988], the grid file was extended for the storage of
non-zero sized objects. The method is a multi-dimensional dynamic hashing
scheme based on Piecewise Linear Order Preserving (PLOP) hashing. Like the
grid file, the data space is partitioned by an orthogonal grid. However, instead
of using k arrays to store scales that define partitioning hyperplanes, k binary
trees are used to represent the linear scales. Each internal node of a binary tree
stores a (k-1)-dimensional partitioning hyperplane. Each leaf node of a binary
tree is associated with a k-dimensional subspace (a slice), where the interval
along its associated axis is a sub-interval and the other k-1 intervals assume
the intervals of the global space. Each slice is addressed by an index i stored in
its leaf node. To each cell, a page is allocated to store all points that fall in the
[Figure panels: (a) Original space. (b) First bucket. (c) Second & third bucket. (d) Fourth bucket.]
Figure 2.11. The R-file.
unpartitioned subspace. From the indexes stored in the k binary trees, the address
of a page can be computed. Adopting a bounding scheme similar to that of the
skd-tree, two extra values are stored in a leaf node to bound the objects whose
centroids are in the corresponding slice along the axis with which the binary tree is
associated. Hence, an object is inserted into the grid cell that contains its
centroid. The regions defined by the two extra values may overlap, and they
are used for intersection search.
The file organizations based on hashing are generally designed for multi-
dimensional point data. To use them for spatial indexing, the mapping of
objects from a k-dimensional space to an nk-dimensional space or the duplication of
object identifiers is generally required. Indexing in a parameter space is
not efficient for general spatial query retrieval [Guttman, 1984, Whang and
Krishnamurthy, 1985].
2.6 Spatial objects ordering
Existing DBMSs support efficient one-dimensional indexes and provide fast ac-
cess to one-dimensional data. If multi-dimensional objects can be converted to
one-dimensional objects, such indexes can be used directly without alteration.
The mapping functions used must preserve the proximity between
data well enough to yield reasonably good spatial search. The idea is
to assign a number to each grid region in the space; these numbers
are then used to obtain a representative number for the spatial objects. Tech-
niques for ordering multi-dimensional objects using single-dimensional values
have been proposed. These include the Peano curve [Morton, 1966], locational
keys [Abel and Smith, 1983], Z-ordering [Orenstein and Merrett, 1984], the Hilbert
curve [Faloutsos and Roseman, 1989], and Gray ordering [Faloutsos, 1988]. We
discuss the method based on locational keys proposed by Abel and Smith [Abel
and Smith, 1983].
A space is recursively divided into four equal-sized subspaces, forming a
hierarchy of quadrants. For each subspace, a unique numeric key of base 5 is
attached. All objects falling within a given subspace are assigned the subspace's
key. The key k for a subspace at level h (> 1) can be derived from the key k'
of the ancestor subspace by the following formula:
k = k' + 5^(m-h)          if k is the SW son of k'
k = k' + 2 * 5^(m-h)      if k is the NW son of k'
k = k' + 3 * 5^(m-h)      if k is the SE son of k'
k = k' + 4 * 5^(m-h)      if k is the NE son of k'
Here m is an arbitrary maximum number of levels in decomposition, which
is greater than h. The global space has Sm as the key.
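The derivation can be sketched as follows; keys are kept as plain integers and printed as base-5 numerals. Taking the global space's key to be the m-digit base-5 numeral 1000 (an assumption consistent with the keys cited for Figure 2.12) reproduces, for example, the key 1300:

```python
def child_key(parent_key, h, m, quadrant):
    """Derive the level-h key from the parent's key per the formula above.
    Keys are integers; read as base-5 numerals they spell the quadrant path."""
    q = {"SW": 1, "NW": 2, "SE": 3, "NE": 4}[quadrant]
    return parent_key + q * 5 ** (m - h)

def base5(k):
    """Render an integer key as a base-5 numeral string."""
    digits = ""
    while k:
        digits = str(k % 5) + digits
        k //= 5
    return digits or "0"

m = 4                  # maximum decomposition depth, as in Figure 2.12
root = 5 ** (m - 1)    # global space: the base-5 numeral 1000
print(base5(child_key(root, 2, m, "SE")))  # "1300", a level-2 key cited in the text
```

The trailing zeros mark levels at which no further decomposition took place; the quadrant path behind a given key (here, an SE step at level 2) is inferred for illustration.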
Figure 2.12 illustrates an example of key assignment (base 5), where the
maximum level of decomposition is 4. One can notice that, when the locational
keys of the same level are traced, the ordering is a form of N- or Z-ordering.
To assign a key to a rectangle, the smallest block which completely covers
the rectangle is used. An inherent problem of such an assignment is that an
object's bounding rectangle may be very much smaller than the associated
quadrant, as a consequence of the bounding rectangle spanning one or more
subspace divisions. To alleviate this problem, a decomposition technique [Abel
and Smith, 1984] is used, where a rectangle may be represented by up to four
adjacent quadrants. Rectangles B and C in Figure 2.12b illustrate the cases
where one and two quadrants are used: key 1300 for rectangle B, and keys
1422 and 1424 for rectangle C. By associating each rectangle with a collection
of quadrants, a better approximation of a rectangle is achieved. This form of
representation requires an object identifier to be stored in multiple locations.
(a) Assignment of locational keys. (b) Assignment of covering nodes.
Figure 2.12. Ordering based on locational keys.
However, even if this approach is adopted, the size of the representative quad-
rant may still be much larger than the size of the object's bounding rectangle.
A B+-tree is used to index the objects based on their associated locational keys.
For an intersection search, all quadrants that intersect the query region have
to be scanned. The major advantage of the use of the locational key is that
B+-tree structures are widely supported by conventional DBMSs.
2.7 Comparative evaluation
In this section, we briefly summarize some comparative studies that have been
conducted in the literature.
Greene evaluated the performance of R-trees and R+-trees [Greene, 1989].
In the comparison, it was found that the R+-tree
requires many more splits, especially for large data objects, but fewer splits for
smaller data objects. For a uniform distribution of square rectangles that
fully covers the map space, 30% of the objects are duplicated. Interestingly, the
results show that for the case where the coverage is 100% and the objects are
long and narrow along the x-axis, the duplication decreases. This
is likely due to the better grouping achieved along the x-axis. In general, the
query efficiency tests show that R+-trees perform better for smaller objects and
slightly worse for larger objects. The study in fact exhibits a similar pattern
of results to that of the kd-trees extended using the overlapping approach and
the non-overlapping approach [Ooi, 1990].
Ooi et al. [Ooi et al., 1991] compared the performance of the skd-tree and
the R-tree. The results indicate that the skd-tree is a more efficient structure
than the R-tree with nearly the same storage requirement. The containment
search provided by the skd-tree is more efficient than its intersection search
and is less sensitive to skewed data.
In [Hoel and Samet, 1992], Hoel and Samet conducted a qualitative compar-
ative study of the performance of three spatial indexes, namely the R*-tree, the
R+-tree, and the PMR quadtree [Nelson and Samet, 1987], on large line seg-
ment databases. A number of spatial queries on line segments were tested:
finding all line segments incident at a given point; the other endpoint of the
line segment incident at a given point; the nearest line segments to a given point;
the line segments whose MBRs contain a given point; and all line segments
intersecting a given rectangular window. In their implementation, the execution time
of query retrieval is the prime objective, which is sometimes achieved at the
expense of somewhat higher storage cost. The difference in performance
is not very great, although the PMR quadtree has a slight edge over the other
two, and the R+-tree is slightly better than the R*-tree because of the disjoint
decomposition of line segments. The R+-tree required considerably more space
than the other two structures. However, the study did not result in claims of
convincing superiority for any of the three tested indexes. This could be due
to the use of line segments, which are much simpler than non-zero sized and
irregularly shaped objects.
In [Ooi, 1990], the efficiency of three extension methods was studied using
a family of kd-trees, namely the skd-tree [Ooi et al., 1987], the Matsuyama kd-tree
[Matsuyama et al., 1984], and the 4d-tree [Banerjee and Kim, 1986]. Databases
of 12,000 objects were generated with different distributions of object sizes and
object locations. The average data density used is 3; however, for very skewed
object placements, the data density at certain locations could be much higher. The
study shows that the Matsuyama kd-tree, which adopts the non-overlapping
native space indexing approach, performs efficiently in terms of page accesses
for small objects. As the object sizes become bigger, its performance degrades.
The 4d-tree is the least efficient structure. Its nodes store less information than
those of the skd-tree, which accounts for a smaller directory size, but intersection
search is not supported efficiently because of its inability to prune the search
space effectively.
In [Papadias et al., 1995], the topological relationships of meet, overlap,
inside, covered-by, covers, contains, and disjoint between MBRs were studied.
The efficiency of the R-tree, R+-tree, and R*-tree were then studied using three
databases of 10,000 objects, with different sizes of MBRs, and 100 queries. For
small MBRs (less than 0.02% of the map area) and medium MBRs (less than
0.1% of the map area), R*-trees and R+-trees outperform the R-tree, with the
R+-tree slightly more efficient than the R*-tree. However, for large MBRs (less
than 0.5% of the map area), the R+-tree becomes less efficient than the other
two due to additional levels caused by duplications. The R+-tree does not work
for high data density [Greene, 1989, Papadias et al., 1995].
We also set out to investigate the performance of the R-tree and R*-tree for
high-dimensional data. We implemented both structures using C on the Sun
SPARC workstation running SunOS 5.5. The size of a disk page used for both
trees is 4 KByte. The quadratic cost splitting algorithm [Guttman, 1984] is
adopted for the R-tree, and the quadratic cost version of evaluating the overlap
of a given node is also implemented for the R*-tree. To deal with paging, a
priority-based page replacement strategy that adopts a least-useful policy is
employed [Chan et al., 1992]. A page is useful if it will be referenced again in
the traversal; otherwise, it is useless. The strategy favors useless pages that
are at the higher levels of the tree, and useful pages that are at the lower levels
of the tree. We conducted our experimental study on a real data set consisting
of Fourier points in high-dimensional space (2, 4, 8 and 16 dimensions) derived
from the contours of industrial parts. The database used is the same one employed in
[Berchtold et al., 1996], except that we extracted a subset of 1 million objects.
Figure 2.13 shows some representative results, which are largely consistent
with previous work. First, as expected, the R*-tree is more space-efficient than the
R-tree (see Figure 2.13a). Second, the R*-tree's insertion cost is larger than that
of the R-tree, and as the number of dimensions increases, the relative difference
widens. This is consistent with the result in [Beckmann et al., 1990].
For point query retrieval, we performed 1000 queries and used the average
number of disk accesses as the metric. The 1000 points are randomly selected
from the respective test data of each dimensionality. We observe that when the
number of dimensions is small (see Figure 2.13c), both the R*-tree and the R-tree
perform equally well (with the R*-tree slightly better). This result is again consis-
tent with the findings in [Papadias et al., 1995] for large databases. However,
as the number of dimensions increases, the R*-tree requires more disk accesses
than the R-tree during retrieval. We also evaluated 1000 range queries, and
the result is shown in Figure 2.13d. The result confirms the observation that the
R*-tree outperforms the R-tree only at low dimensions, and is inferior to the
R-tree at higher dimensions. Finally, from the results, we note that neither the
R-tree nor the R*-tree scales well with the number of dimensions.
2.8 Summary
We have reviewed a number of indexes that are suitable for indexing non-zero
sized objects in spatial database systems. These have been categorized based
on their extending methods and the base structures. We have also discussed
[Figure panels, each plotting the R-tree against the R*-tree as the number of dimensions grows from 2 to 16: (a) Storage cost (average page utilization). (b) Insertion cost (average disk accesses). (c) Point query cost (average disk accesses). (d) Range query cost (average disk accesses).]
Figure 2.13. Comparison of R-tree and R*-tree.
the strengths and weaknesses of these techniques. Despite the large body of existing
work, we believe the area will remain a very fruitful and challenging one for the next
decade, with several promising research directions.
First, there is clearly a lack of benchmarks for evaluating spatial indexes.
This can be attributed to the many factors that need to be considered in eval-
uating a spatial index. Concerning the data, spatial data varies widely in
size; spatial objects come in irregular shapes; and objects are not uniformly
distributed in the data space. Furthermore, queries range from simple point
queries to complex spatial join operations that come in different flavors (inter-
section, containment and proximity). Designing a suite of benchmarks is an
important issue that cannot be ignored.
Second, as pointed out, the evaluation of spatial indexes has been rather
limited. Most of the performance studies used the R-tree as the basis for comparison.
Furthermore, most of the work used synthetic data. We believe that more
extensive and comprehensive performance studies using real data sets will be
necessary and useful for practitioners as well as developers.
Third, the scalability (in terms of the number of dimensions of the data space)
of existing indexes has not been adequately addressed. Most of the work is
restricted to two-dimensional space. Recent work by Berchtold et al. [Berchtold
et al., 1996] addressed the scalability of indexes with respect to the number of
dimensions, and showed that the R*-tree does not scale well; instead, it
degenerates drastically. The same paper also shows that the TV-tree [Lin
et al., 1995] can perform poorly as the number of dimensions increases. While
the X-tree [Berchtold et al., 1996] appears to be a promising scalable index, we
believe that designing scalable high-dimensional indexes will be highly exciting
and rewarding.
3 IMAGE DATABASES
Images have always been an essential and effective medium for presenting vi-
sual data. With advances in today's computer technologies, it is not surprising
that in many applications, much of the data consists of images. In medical applications,
images such as X-rays, magnetic resonance images and computer tomography
images are frequently generated and used to support clinical decision making.
In geographic information systems, maps, satellite images, demographics and
even tourist information are often processed, analyzed and archived. In police
department criminal databases, images like fingerprints and pictures of crimi-
nals are kept to facilitate identification of suspects. Even in offices, information
may arrive in many different forms (memos, documents, and faxes) that can
be digitized electronically and stored as images.
The traditional database management systems, which have been effective
in managing structured data, are unable to provide satisfactory performance
for images that are non-alphanumeric and unstructured. The growing need for
image information systems has led to the design and implementation of image
database systems [Chang and Fu, 1980, Chang and Hsu, 1992, Kunii, 1989,
Knuth and Wegner, 1992, Nagy, 1985, Ogle and Stonebraker, 1995, Tamura
and Yokoya, 1984].
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
In this chapter, we focus on content-based retrieval techniques, that is, tech-
niques that retrieve images based on their visual properties such as the texture,
color and shape of objects. In particular, we look at the critical issue of speed-
ily finding the correct images in a large image database system based on
image features. For a large collection of images, sequentially comparing
access methods that exploit the image features to narrow the search space are
necessary.
We begin our discussion by looking at what constitutes an image database
system. Following that, in Section 3.2, we shall discuss some of the issues
involved in the design of a content-based index. In the same section, we also
review indexing mechanisms that can be used to support content-based re-
trieval. In Section 3.3, we provide a taxonomy of existing image indexes. The
taxonomy is based on the image features used for indexing. Following that, we
present four indexes that facilitate speedy retrieval of images based on color-
spatial information. In Section 3.4, we examine three hierarchical indexes that
integrate multiple existing indexes into a single structure, and in Section 3.5,
we present a signature-based technique. Finally, we conclude with a specu-
lation on future trends in Section 3.6.
3.1 Image database systems
An image database system must deal with both structured and unstructured
data. Furthermore, an image database system also distinguishes itself by the
following additional functionalities:
• Feature extraction. In order to organize the images and their associated
information, it is necessary for the system to understand the contents of the
images. Thus, the system must be able to analyze an image to extract key
features such as the shape of objects in an image, its color components and
texture.
• Feature-based indexing. Traditional database systems index their data by
key attributes which are usually numeric or fixed-length text data. For
image database systems, the system must build indexes based on the features
extracted. Such feature-based indexes can then be used to facilitate efficient
search of a large collection of images and other related information based on
the features of the images.
• Content-based retrievals. Image database systems should support a wide
range of queries. In particular, queries that involve the contents of the
image, expressed in words/text or pictorial form, are important and crucial.
IMAGE DATABASES 79
• A measure of similarity. Since content-based queries are usually inexact, the
system requires a measure to capture what humans perceive as similarity
between two images. However, as the notion of similarity is inherently im-
precise, the similarity measure must be carefully designed not to exclude
relevant images, while at the same time minimizing the irrelevant images
in the results.
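One concrete example of such a measure is histogram intersection, used in early color-indexing work: two normalized color histograms are scored by summing their bin-wise minima, giving 1.0 for identical color distributions and 0.0 for disjoint ones. The three-bin histograms below are illustrative, not drawn from any particular system:

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized color histograms: the sum of the
    bin-wise minima. 1.0 = identical distributions, 0.0 = disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Three-bin histograms (fractions of pixels per color group):
query = [0.5, 0.3, 0.2]
print(round(histogram_intersection(query, [0.5, 0.3, 0.2]), 2))  # 1.0
print(round(histogram_intersection(query, [0.2, 0.2, 0.6]), 2))  # 0.6
```

Grouping perceptually similar colors into a small number of bins, as suggested above, keeps both the representation and this comparison cheap.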
[Figure: the preprocessing module passes an input image through image input/scanner and feature extraction, then updates the index/database. The query module supports interactive query formulation, runtime feature extraction, feature matching, and browsing & feedback, drawing on the feature/image database under a concurrency control & recovery manager, and outputs the retrieved images to the user.]
Figure 3.1. Architecture of an image database system.
Figure 3.1 shows the (generic) architecture of an image database system. Im-
ages are preprocessed to extract the key features used for searching. The images
and the feature indexes are then stored in the database. During retrieval, fea-
tures are extracted from the query image, and matched against those stored to
retrieve images that are similar to it. As a consequence of the need to retrieve
images based on similarity, the user interface will usually incorporate some
browsing and feedback mechanisms to facilitate reformulation of queries to im-
prove accuracy. Like traditional database systems, concurrency control and
recovery managers are also critical components of an image database system.
Supporting a fully functional image database system is a difficult problem
and embraces different technologies such as image processing, user interface
design, and database management. In fact, early systems are largely attribute-
based or free-text-based and hardly have any real content-based support. For
attribute-based systems, images are treated as binary large objects (BLOBs).
A conventional DBMS, extended with the capability to handle BLOBs, can
be used to manage the images. Access to the unstructured images is achieved
through the structured attributes of the images. Hence, no special effort is re-
quired to design the organization technique, indexing mechanisms (such as B+-
trees and inverted files) and query processing methods of these systems. However,
this approach is not capable of handling the more user-friendly content-based
queries.
The free-text-based approach applies the concepts of document retrieval
techniques to provide "content-based" functionalities by manual description
of the image and treating the image description as those of a document. Image
access is done through the accompanying image description. For example, for
the query "Retrieve all images that show a girl skating in an ice rink", the
description "a girl skating in an ice rink" is used to retrieve the images. The
system attempts to match this description against those of the images stored in the
database. Indexing methods that can be used include signature file access meth-
ods, inverted file access methods, and direct (or sequential) file access methods.
Besides being unable to facilitate true content-based queries, the free-text-based
approach has other limitations: a free-text description of an image is highly
variable, owing to the ambiguities of the natural language used to annotate
images and the different interpretations of an image; an image description is
usually incomplete, since an image is semantically richer than its text descrip-
tion; and the vocabularies of the person creating the index and the user (or
even of different users) may not match. As such, the effectiveness of this
approach is fairly limited. Readers are referred to Chapter 5 for an in-depth
discussion of text indexing techniques.
3.2 Indexing issues and basic mechanisms
3.2.1 Key issues in content-based index design
Designing an access method for an image database system is more complex
than for a traditional database system. This is because the features to be indexed
(hereafter referred to as indexing features) are usually unstructured.
Three key issues that must be addressed in designing an index structure for
content-based image retrieval are:
• Determine a representation for the indexing feature.
• Determine a similarity measure between two images based on their repre-
sentations.
• Determine an appropriate index organization.
IMAGE DATABASES 81
For the first issue, a suitable representation must be determined and used
to represent the indexing feature. Some of the desirable properties of a repre-
sentation include
• Exactness. For a representation to be useful, it has to capture the essential
details of the indexing feature;
• Space efficiency. The representation should keep the storage cost low. To
this end, approximate representations rather than exact representations are
often used. For example, instead of representing the shape of an object,
its bounding box can be used. As another example, grouping colors that
are perceptually similar can reduce the number of colors that need to be
maintained by the system without sacrificing retrieval accuracy.
• Computationally inexpensive similarity matching. It should be easier and
faster to compute the similarity between the representations than between
their features. In general, computing the degree of similarity between ap-
proximate representations is less computationally intensive. For example,
computing the intersection of two polygons is more costly than computing
the intersection of two rectangles that represent them.
• Preservation of the similarity between the features. Two features that are
similar should remain so under their representations.
• Automatic extraction. The representation should be automatically extracted,
rather than manually generated.
• Insensitivity to noise, distortion, rotation. Any noise or distortion should
not affect the representation drastically. In other words, two features of the
same image, one without noise, and the other distorted by some noise, should
be represented in a similar way (if not exactly). Similarly, the representation
of a feature, regardless of whether the image has been rotated or not, should
be the same.
It is hard to find an effective representation with all the desirable properties. In
fact, some of the above properties conflict. For example, representing the color
of an image as a vector (color histogram) which has all the above properties
has been shown to be less effective than one that also captures the spatial
information. However, the latter representation of color incurs more storage,
and is more sensitive to the orientation of the image.
Before moving on, we would like to look at two methods that can be used to
represent image features coarsely. These methods have the advantages of space
efficiency as well as reducing the dimensionality of the indexes (for vector-based
representations). They can be categorized as follows:
• Partitioning. This method partitions an image space into a fixed size grid.
Each such cell is assigned a label and can be used to approximate the size
of an object or the spatial location of a feature. For example, the set of cells
that contains an object serves as an indication of the size of the object. As
another example, the location of an object can be determined by the position
of the cell it is in.
• Grouping. This method combines several components of a feature into
groups, and represents the image feature in terms of the groups instead
of the large number of components. For example, the basic color feature can
have over 100 different colors, but can be grouped into a small number of
groups based on the fact that many colors are perceived to be similar by
humans. As another example, the shape of an object can be described by a
small number of primitives such as lines and arcs.
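As a concrete sketch of these two methods (a hypothetical illustration, not taken from any of the systems surveyed; the grid size and the per-channel color quantization are assumed parameters):

```python
def partition_cells(bbox, grid=8, extent=256):
    """Partitioning: return the set of fixed-grid cell labels (row, col)
    that a bounding box (x1, y1, x2, y2) overlaps in an extent x extent
    image space. The number of cells hints at the object's size; the
    cells themselves approximate its location."""
    x1, y1, x2, y2 = bbox
    cell = extent / grid
    return {(r, c)
            for r in range(int(y1 // cell), int(y2 // cell) + 1)
            for c in range(int(x1 // cell), int(x2 // cell) + 1)}

def group_color(rgb):
    """Grouping: quantize each RGB channel to 2 levels, collapsing the
    full color space into at most 8 perceptually coarse groups."""
    r, g, b = rgb
    return (r // 128, g // 128, b // 128)
```

A small object then maps to a few cells, and two near-identical shades of red fall into the same group.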
A coarse representation can be used as a quick means of pruning away irrelevant
images, and a finer representation is usually necessary in order to restrict the
set of potential candidate images to a manageable size.
The second issue follows from the first. The similarity measure between
the indexing features of two images, say S1, may no longer be appropriate on
the representations. Thus, an appropriate similarity measure on the represen-
tations, say S2, has to be derived. The main criterion for such a similarity
measure is that two features that are similar under S1 should remain so under
S2. In fact, since the representations may be approximate, we expect the num-
ber of images that are similar to a query image under S2 to be larger than that
under S1. There are several alternatives to determine the similarity between
two features through their representations:
• Exact match. In this approach, the representation of an image feature is
usually coarse, in the sense that images with similar features will be mapped
to the same representation. As a result, an exact match on the representation
can be used to search for similar features.
• Approximate match. Under this approach, the degree of similarity between
the image representations is computed based on some approximation tech-
niques. One advantage of this category is that the image representation can
be exact. Where approximate representations are used, we can expect more
irrelevant images to be retrieved as well.
Finally, an appropriate index organization should be determined to organize
the representations in a manner that the similarity measure can be supported
efficiently. Other important criteria for selection of an index structure include
storage efficiency and maintenance (update) overhead. To a certain extent, the
representation and similarity measure determine the index structure. For exam-
ple, if the image feature is represented as a vector, and the similarity measure
is the Euclidean distance, then a natural choice is the multi-dimension point
access method. Here, the vector is mapped to a point in a multi-dimensional
space, and a region search can be used to search for similar images in the multi-
dimensional space. On the other hand, if the image features are represented as
rectangles in the image space, then a spatial access method may be employed.
In fact, as we shall see in Section 3.3, most of the image indexes are based on
existing techniques. As such, we shall review some of these techniques before
proceeding to look at the taxonomy.
3.2.2 Basic indexing schemes
Spatial access methods. Spatial access methods are file structures used to
organize large collection of multi-dimensional points or geometric objects to
facilitate efficient range or nearest neighbor searches. It turns out that we can
easily exploit such techniques to speed up retrieval of images. The basic idea is
to extract k image features from each image, thus mapping images into points
in a k-dimensional feature space. Once this is done, any spatial access methods
can be used as the index, and similarity queries will then correspond to nearest
neighbor or range searches. As an example, let us consider the color feature.
In general, the color feature can be represented as a k-tuple for a system that
supports k colors, and the values of the tuple of an image are the percentages
of the colors in the image.
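For instance, the mapping of an image to a point in a k-dimensional color feature space can be sketched as follows (an illustrative fragment; the pixel and palette encodings are assumptions):

```python
from collections import Counter

def color_point(pixels, palette):
    """Map an image, given as a list of pixel colors, to a point in a
    k-dimensional feature space: one coordinate per supported color,
    holding the fraction of the image's pixels with that color."""
    counts = Counter(pixels)
    n = len(pixels)
    return tuple(counts[c] / n for c in palette)
```

A similarity query then becomes a nearest neighbor or range search around the query image's point.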
Many spatial access methods have been proposed in the literature. These
include methods that transform geometric objects into points in a higher di-
mensionality space such as the grid file [Hinrichs and Nievergelt, 1983]; meth-
ods that linearize spatial data such as quad-trees [Gargantini, 1982] and "z-
ordering" [Orenstein, 1986]; and methods that are based on trees such as the
family of R-trees [Guttman, 1984]. However, most of these methods suffer from
the so-called "high-dimensionality curse"; that is, these techniques perform no
better than sequential scanning as the number of dimensions becomes suffi-
ciently large [Faloutsos et al., 1994]. For example, for R-trees, performance
begins to degrade drastically as the dimensionality hits 20 and above. We refer
the readers to Chapter 2 for a survey on spatial access methods.
Inverted file. In an inverted file index, an inverted list is created for each
distinct key (indexed feature). The inverted list essentially consists of a list
of pointers to the objects that contain features that are similar to the indexed
feature. Given an image feature, the inverted file is scanned, and all images
with the features that are similar to it can thus be retrieved speedily. However,
the inverted file method incurs high storage overhead and is also expensive to
update. Some recent work has been done to address the storage problem [Witten
et al., 1994, Moffat and Zobel, 1996].
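A minimal sketch of an inverted file over discretized image features (names are illustrative; real systems compress the posting lists, as the work cited above discusses):

```python
from collections import defaultdict

class InvertedFile:
    """One posting list of image identifiers per distinct feature key."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, image_id, features):
        """Register an image under every feature it contains."""
        for f in features:
            self.postings[f].append(image_id)

    def lookup(self, feature):
        """Return the identifiers of all images indexed under `feature`."""
        return self.postings.get(feature, [])
```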
Signature file. The signature file access method is an efficient access method
for objects that can be characterized by a set of descriptors, making it suitable
for indexing unstructured data such as textual documents (characterized by
a set of keywords) and images (characterized by a set of semantic objects or
colors). Each descriptor of an image can be represented as a string of bits, and
an image signature can be obtained by superimposing (inclusive-OR) all the
descriptors of the image. The signatures of all images can then be maintained
in a file called the signature file. During query retrieval, the descriptors of the
query image can be coded into a signature, and the signature file is then used
as a filtering mechanism to eliminate most of the unqualifying data so that
only a portion of the data file needs to be accessed. The retrieval performance,
however, can be hampered by a high false drop probability (due to the signa-
tures of irrelevant images matching that of the query image). Variations of signature file
access methods have been proposed to improve on the retrieval efficiency of the
signature file. These include the single-level signature file [Roberts, 1979], the multi-
level signature file [Sacks-Davis et al., 1987], and the partitioning approach [Lee
and Leng, 1989].
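The superimposed-coding scheme described above can be sketched as follows (a hypothetical illustration: the signature width, the number of bits set per descriptor, and the hashing are all assumed parameters):

```python
def descriptor_signature(descriptor, bits=16, weight=3):
    """Hash one descriptor to a bit string with up to `weight` bits set."""
    sig = 0
    for i in range(weight):
        sig |= 1 << (hash((descriptor, i)) % bits)
    return sig

def image_signature(descriptors, bits=16):
    """Superimpose (inclusive-OR) all descriptor signatures of an image."""
    sig = 0
    for d in descriptors:
        sig |= descriptor_signature(d, bits)
    return sig

def may_match(image_sig, query_sig):
    """Signature filter: an image qualifies only if every query bit is
    set in its signature. False drops remain possible, so qualifying
    images must still be verified against the data file."""
    return image_sig & query_sig == query_sig
```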
3.3 A taxonomy of image indexes
Existing image indexing mechanisms can be classified based on the image fea-
tures used for indexing. For each image feature, further classifications can be
made with respect to the semantic representations used for the feature. A dif-
ferent type of semantic representation entails a different indexing method. In
this section, we provide a taxonomy of image indexing schemes based on such
classifications. All the schemes discussed have been reported in the literature;
for some features, other schemes that might also be applicable but have not
been reported are excluded from our discussion. The taxonomy is summarized
in Figure 3.2.
3.3.1 Shape feature
The shape feature is extremely useful for image database systems like an X-
ray system or a criminal picture identification system. In an X-ray system,
queries like "Retrieve all kidney X-rays with a kidney stone of this shape" are
very common. For a criminal picture system, we expect queries like "Retrieve
all criminals with a round face shape". The example shape, the shape of a
kidney stone in the first case, and round in the second, can be supplied using
an example image.
[Figure 3.2 (tree diagram): the content-based features color, color-spatial, texture, shape, spatial relationship and semantic objects, each branching into its representations (e.g. color histogram, multi-level histogram, Tamura features, geometric properties, rectangular cover, 2-D string, similarity against representative objects) and the index structures used (multi-dimensional index, inverted file, signature file, multi-level signature file, two-level color B+-tree, three-tier index, Sequenced Multi-Attribute Tree, sequential file).]

Figure 3.2. A taxonomy of image indexing schemes.
A shape boundary can be represented using any of 16 primitive shape features.
Each primitive feature is either a line or an arc, with a starting point, an
ending point, and so on, and can be denoted by a distinct character. Thus,
the boundary information can be compactly stored as a one-dimensional string
[Jea and Lee, 1990]. The
shape features of a shape boundary can then be represented by substrings of
the one-dimensional string. This simple representation allows the exploitation
of existing efficient string matching algorithms. Since objects with the same
shape will be encoded in the same manner, exact string matching is performed
instead. To index the string representation, an inverted file is used.
A closely related work by Mehrotra and Gary [Mehrotra and Gary, 1993]
used a set of structural components to represent shape boundary. These com-
ponents are modeled as an ordered set of interest points such as locally maximal
curvature points or vertices of the polygonal approximation. A shape feature
can be obtained by fixing the number of points to be used to represent the
shape feature. The feature is then mapped into a point in a multi-dimensional
space, where the dimension is given by the number of points used to repre-
sent the shape. The similarity measure can then be given by the Euclidean
distance between pairs of points in the multi-dimensional space. A multi-
dimensional point access method is used for indexing the shape feature.
In [Jagadish, 1991], a collection of rectangles that forms a rectangular cover
of the shape is used. Since shapes vary widely from object to object, the
number of rectangles can be very large. To reduce the storage requirement, at
most k rectangles in the cover are used to represent the shape. The k rectangles
picked must capture the most important features of the shape "sequentially",
that is, the k rectangles form a sequence. As each rectangle is represented by
two pairs of coordinates, and there are at most k rectangles, the shape feature
can be easily mapped into a point in a 4k-dimensional space. Thus, a multi-
dimensional point access method can be readily used for indexing the shape
feature. Similarity retrieval based on Euclidean distance is performed using a
region search query.
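The mapping from a rectangle sequence to a 4k-dimensional point might look like this (an illustrative sketch; the zero-padding convention for covers with fewer than k rectangles is an assumption):

```python
def cover_to_point(rectangles, k):
    """Map a sequence of at most k cover rectangles, each given as
    (x1, y1, x2, y2), to a point in 4k-dimensional space."""
    point = []
    for rect in rectangles[:k]:
        point.extend(rect)
    point.extend([0.0] * (4 * k - len(point)))  # pad short covers
    return tuple(point)

def euclidean(p, q):
    """Distance used for similarity retrieval via region search."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```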
Shape can also be represented based on the concept of mathematical mor-
phology [Korn et al., 1996, Maragos and Schafer, 1986, Zhou and Venetsanopou-
los, 1988], which employs a primitive shape to interact with an image to extract
useful information about its geometrical and topological structure. A (2M+1)
vector, called the size distribution of a shape [Serra, 1988], can be used to store
the measurements of the area of an image at different (2M+1 of them) scales.
The pattern spectrum [Maragos, 1989] turns out to be a compact representation
that captures the same information. The advantage of the scheme is that it is
essentially invariant to rotation and translation, and can highlight differences at
several scales. In [Korn et al., 1996], the pattern spectrum is first employed to
capture the shape information of an image (in the domain of a tumor database).
The information is then mapped into the (2M+1) vector of the size distribution
so that a multi-dimensional point index can be employed to index the shape
information. While similarity retrieval is essentially a nearest neighbor search,
the paper also presented a distance function, max-granulometric distance, that
guarantees no false dismissals.
Numerical vectors have also been employed to model shape. These include
using the coefficients of the 2-D Discrete Fourier Transform or Discrete Wavelet
Transform [Mallat, 1989], as well as the first few moments of inertia [Faloutsos et al.,
1994, Flickner et al., 1995]. These techniques usually map the shape feature
to a multi-dimensional point access method and use the Euclidean distance for
similarity retrieval. Alternatively, the shape features can be represented by the
geometric properties of the image such as shape factors (for example, ratio of
height to width), mesh features, the moment features and curved line features.
In this case, the inverted file has been used for indexing.
For a system that is based on the shape feature, unless the images have very
distinct shapes, the performance may suffer. As such, shape is usually employed
in specialized domains.
3.3.2 Semantic objects
If objects within an image are prominent and can be easily recognized, retrieval
can be achieved based on the objects. Queries can be evaluated by matching
the list of objects of a query image against the list of objects of images in the
database. Two methods have been adopted in the literature:
• An object in an image may be analyzed to determine its degree of similarity
against a set of distinct objects. This degree of similarity is represented as
a belief interval (bi) [Rabitti and Stanchev, 1989] that indicates how closely
an image object is compared to the represented object used in the system.
An inverted file is used to maintain for each distinct object a list of (bi,
ptr) pairs where ptr is a pointer to an image that contains an object that
resembles the indexed object with a belief interval of bi. In this way, given a
query image object, one first determines the corresponding distinct object it
belongs to, from which one can obtain all objects that are similar to it. By
sorting the list in non-ascending order, the system can control the degree of
similarity desired.
• An object may also be represented by an object signature. An image signa-
ture is obtained by superimposing all the object signatures of the objects in
the image [Rabitti and Savino, 1991]. The signature file access method can
then be used to speed up the retrieval process. A query image's set of sig-
natures can be obtained, and its image signature is first used to prune away
images that are irrelevant. Candidate images are then further examined by
comparing their object signatures against those of the query image.
The object-based approach is, however, limited by current image analysis
techniques. Unless objects are very well defined, it still requires substantial
human intervention in order to ensure that the objects are correctly extracted.
3.3.3 Spatial relationship
In an object-based system, a query image with a ball above a box may also
result in images with a ball next to a box or a box above a ball being retrieved.
A more discriminating way to retrieve images is to facilitate a more precise
querying that specifies both the semantic objects in the images as well as the
spatial relationships between the objects. As an example, consider the query
"Retrieve all paintings with a house and a tree on its left". Here, the house
and tree are the objects while "to the left" is a spatial relationship between the
two. In [Chang et al., 1987, Chang et al., 1988], a semantic representation for
spatial relationship using a two-dimensional string (2-D string) was proposed.
An image is first preprocessed to obtain the symbols that represent the objects
it contains. The 2-D string representation is then a projection of the symbols
along the x-axis and the y-axis, and consists of a pair of one-dimensional strings
(1-D strings), each representing the ordering and spatial relationships of the
objects along the projected axis. For example, consider an image with three
objects such that O1 is to the left of O2, which is to the left of O3. The projection
on the x-axis results in the 1-D string O1 < O2 < O3, where "<" is a spatial
operator that denotes "to the west or to the south of". In [Chang et al., 1987],
only three spatial operators are used: "=" to mean "at the same spatial location
as", ":" to represent "in the same grid cell as", and "<" as explained. During
query processing, the 2-D string representation of the query image is obtained,
and compared against those in the database. Similarity retrieval is supported
using an exact representation and an approximate matching algorithm.
Variations and extensions to the 2-D strings have been explored [Chang
et al., 1989, Lee and Hsu, 1990, Costagliola et al., 1992, Lee et al., 1992].
In particular, a multi-level signature file access method has been adopted as
follows. An image can be partitioned into an M x N grid. For each object, an
M x N bit object signature can be obtained by setting bit (i-1)·M + j to 1 if
the object occurs in cell (i, j); otherwise the bit is cleared. An image signature
can then be obtained by superimposing the object signatures. Querying is
performed by determining the object and image signatures of the query image,
and using them to filter the images to be retrieved.
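The grid-signature construction just described can be sketched directly from the text's bit formula (a minimal sketch; the cell lists are assumed to use 1-based (i, j) coordinates):

```python
def object_signature(cells, m, n):
    """m*n-bit object signature for an M x N grid: set bit (i-1)*m + j
    for every cell (i, j) that the object occupies."""
    sig = 0
    for i, j in cells:
        sig |= 1 << ((i - 1) * m + j)
    return sig

def grid_image_signature(object_cells, m, n):
    """Superimpose (inclusive-OR) the object signatures of all objects."""
    sig = 0
    for cells in object_cells:
        sig |= object_signature(cells, m, n)
    return sig
```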
The effectiveness of exploiting spatial relationships, as already mentioned,
can be drastically affected by the orientation of the images since the relation-
ships between objects may no longer be preserved.
3.3.4 Texture
Texture is an important property that can be used as a cue for image retrieval.
In particular, because it can be extracted from both gray-level images as well
as color images, it can be used in many applications. However, the extraction
of texture information is a computationally intensive operation.
One of the most popular texture representations is the Tamura features
[Tamura et al., 1978]. While texture can be captured by six basic compu-
tational forms (coarseness, contrast, directionality, linelikeness, regularity and
roughness), it has been shown that the first three suffice to discriminate
between texture differences in images. As such, these three forms
(coarseness, contrast and directionality) have been widely used in texture recog-
nition. These three components are briefly summarized here:
• Coarseness. The coarseness component measures the scale of the texture
(for example, pebbles versus boulders). When two patterns differ only in
scale, then the magnified one is considered to be coarser. For patterns with
different structures, those that have larger element size or fewer element
repetitions are perceived to be coarser by the human eye. Coarseness can
be computed using moving windows of different sizes. The essence of the
method adopted in [Tamura et al., 1978] is to pick the coarsest texture as
the best size. For every region in an image, its coarseness is represented by
the largest best size texture, Sbest. The coarseness of the image can then be
obtained by taking the average of Sbest over the image.
• Contrast. The contrast component can be thought of as representing the
quality of the image. A good quality image is one that is sharp in contrast,
while a low quality image is blurred. The human eye can easily discriminate
between a sharp image and a blurred one. As an image's contrast can be varied
by stretching or shrinking its gray scale, the intensity of each pixel of an
image can be multiplied by a positive constant to derive different contrast
values. The contrast can then be obtained as a function of the variance of
the gray-level histogram [Tamura et al., 1978].
• Directionality. Directionality describes whether an image has a favored di-
rection (like grass) or whether it is isotropic (like a smooth object such as
glass). The human eye can easily differentiate a directional pattern from
one that is non-directional. In [Tamura et al., 1978], the degree of direc-
tionality is calculated using a histogram of local edge probabilities against
their directional angle. Although this measure does not categorize images as
directional or non-directional, this histogram representation can sufficiently
capture the global features of the images such as long lines and simple curves.
Clearly, texture can be modeled as a 3-tuple (coarseness, contrast, direction-
ality). Moreover, since images are alike if their coarseness, contrast and
directionality are similar, the Euclidean distance can be used as a measure of
the degree of similarity between images. To speed up the retrieval process, the
texture feature can be represented as a point in a 3-dimensional space, with
region search being used to prune the search space.
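A region search over this 3-dimensional texture space might be sketched as follows (illustrative only; a real system would use a spatial index rather than this linear scan):

```python
def similar_textures(query, database, radius):
    """Return ids of images whose (coarseness, contrast, directionality)
    point lies within `radius` of the query point, i.e. a region search
    under the Euclidean distance. `database` maps image id -> 3-tuple."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return [img_id for img_id, point in database.items()
            if dist(point, query) <= radius]
```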
There are other representations of texture, such as the Simultaneous Au-
toregressive (SAR) model and the Wold features [Francos et al., 1993]. Both
methods also represent texture as a vector of numbers, and compare images based
on the Euclidean distance. As such, a multi-dimensional indexing mechanism
can be used to index the texture features also.
3.3.5 Color
A natural way to retrieve colorful images would be to retrieve them by color.
The color composition of an image is a global property which does not require
knowledge of the component objects of an image. Moreover, color distribution
is independent of view and resolution, and color recognition can be carried out
automatically without human intervention.
A semantic representation for color is the color histogram, which cap-
tures the color composition of images [Swain, 1993]. Using the RGB color
space, the histogram comprises a set of "bins" each representing a color that
is obtained by a range of red, blue and green values. The number of pixels of
an image falling into each of these bins can be obtained by counting the pixels
with the corresponding color. The histogram is then normalized by dividing its
entries by the total number of pixels of the image. The normalized histogram is
size-independent and it enables images of different sizes to be compared mean-
ingfully. The degree of similarity between two images is determined by the
extent of the intersection between the histograms. Query by visual example
is possible by matching the histograms. Object recognition is also achieved
by using the color composition of the object. However, to support indexing
using color histograms, a multi-dimensional indexing method is necessary and
the number of dimensions required is of very high order (which is the num-
ber of distinct colors to be supported). The color histogram of an image is
mapped into a point in the multi-dimensional space, and a region query can be
performed to find matching images.
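Histogram normalization and the intersection measure can be sketched as follows (a minimal illustration of the approach in [Swain, 1993]; the bin layout is left abstract):

```python
def normalize(counts):
    """Divide each bin's pixel count by the total number of pixels,
    making histograms of different-sized images comparable."""
    n = sum(counts)
    return [c / n for c in counts]

def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: the sum of bin-wise
    minima, reaching 1.0 for identical color compositions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```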
However, it has become clear that color alone is not sufficient to characterize
an image. For example, consider two images - one with the top half blue and
bottom half red, while the other's top left and bottom right quadrants are red
and its bottom left and top right quadrants are blue. Although these two images
are similar in color composition, they are entirely different to a human observer.
This is because the ways the colors are clustered and the positions of the clusters
are very different from one another in the two images. As such, several recent
studies have proposed to integrate color and its spatial distribution to facilitate
image retrieval [Chua et al., 1997, Gong et al., 1995, Hsu et al., 1995, Lu et al.,
1994, Ooi et al., 1997]. Most of the indexing mechanisms proposed for color-
spatial information are multi-layered: the two-level B+-tree [Gong et al.,
1995], the three-tier color index [Lu et al., 1994] and the Sequenced Multi-Attribute
Tree (SMAT) [Ooi et al., 1997]. An exception to this trend is based on the
signature file approach [Chua et al., 1997].
3.4 Color-spatial hierarchical indexes
In this section, we describe three indexes that have been proposed to integrate
color and spatial information for image retrieval. All these schemes are hierar-
chical indexes in that multiple indexing mechanisms are integrated to form a
single index structure. The search process begins from the top level index, and
moves down to the lowest level index, traversing along the path that satisfies
the search criterion.
3.4.1 Two-level B+-tree structure
In [Gong et al., 1995], the color-spatial information of an image is modeled by
splitting the image into 9 equal sub-areas (3 x 3), and the color information
within each sub-area is represented by a color histogram. In this way, by
matching the corresponding color histograms of two images, one can obtain
a more accurate similarity (in terms of color-spatial information) between the
two images than the traditional histogram-based approach. Although color
histogram is a multi-dimensional representation, Gong et al. cleverly mapped
it into a numerical key. This not only turns the computationally intensive
matching process into simple numerical-key comparisons, it also facilitates the
exploitation of existing single-dimensional indexing structures such as the
B+-tree. As a result, a two-level B+-tree structure was proposed to speed
up the retrieval process. We shall first look at the retrieval technique, followed
by the transformation of color-histogram to a numerical key before proceeding
to examine the index structure.
The retrieval technique. Given an image, it is first processed to extract its
9 color histograms. Each histogram is then mapped into two levels of informa-
tion. The first level describes the composition of colors corresponding to the
histogram of the region. However, instead of using the full set of colors (which
is very large), the colors are grouped into 11 "bins" only. The grouping of
colors is based on the observation that some colors are perceived to be similar
by humans. This is accomplished in two steps:
• The RGB color space is transformed into Munsell's HVC color space [Miya-
hara and Yoshida, 1989]. This is necessary because it is not possible to
determine the similarity between two colors based on the RGB color space.
Instead, the HVC color space describes colors in terms of hue (the color type),
value (brightness) and chroma (saturation), and the perceptual differences
can be determined by the geometric distances.
• The HVC color space is grouped coarsely into 11 bins, each of which can be
distinguished from the others as a distinct color by subjective perception.
The grouping is based on the argument that two images with the same visual
content but taken with minor differences in illuminating conditions should
not be considered as different images.
Furthermore, instead of the traditional approach of using the normalized pixel
count to represent the proportion of the groups, each group is assigned a range
which bounds the percentage of pixels in the image with colors of the group.
A total of 9 disjoint ranges are predetermined and used: [0,5), [5,15), [15,25),
..., [65,75), [75,100]. Because of the groupings, two histograms are considered
to be similar if all the corresponding ranges of the 11 bins are the same. This
simplifies the histogram matching process, but the coarse grouping increases the
probability of retrieving irrelevant images, and of missing relevant images whose
color compositions fall into neighboring ranges.
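Mapping a pixel percentage to its predetermined range can be sketched as follows (a direct reading of the nine ranges listed above):

```python
# Upper bounds of the first eight ranges: [0,5), [5,15), ..., [65,75);
# anything at or above 75 falls into the final range [75,100].
RANGE_BOUNDS = [5, 15, 25, 35, 45, 55, 65, 75]

def range_index(percentage):
    """Return the index (0-8) of the disjoint range containing the
    given pixel percentage."""
    for idx, upper in enumerate(RANGE_BOUNDS):
        if percentage < upper:
            return idx
    return len(RANGE_BOUNDS)
```

Two histograms are then deemed similar when all eleven bins yield the same range indices.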
The second level of information contains the average H, average V, and
average C values of all the 11 histogram bins. As in the color composition, the
H, V and C values are grouped into 9, 4 and 4 groups respectively, with intervals
of 40°, 2.5 and 7.5. This level is used as a secondary similarity measure to
complement the histogram metrics in order to reduce the number of irrelevant
images retrieved.
During query retrieval, the query image is processed to extract its 9 histograms.
For each histogram, the two levels of information are obtained from
the sample query. The level 1 information is used to prune away dissimilar
images, and candidate images are further examined and compared on their H,
V and C group values.
The index: Two-level B+-tree structure. The above retrieval mechanism
has the nice property that only exact matches need to be performed: two
histograms are similar if they have the same range values for the 11 histogram
bins, and for each pair of bins, the groups for the H, V and C values are
the same. As such, the authors proposed that the first level information be
mapped into a composite key with 12 attributes: the first attribute indicates
the histogram region (one of the 9 regions), and each of the other 11 attributes
corresponds to one histogram bin and has a value that indicates its range (since
the set of ranges is predetermined, fixed and disjoint, a range is represented
by a number rather than by its bounds). Similarly, the second level
information is mapped into a 34-attribute composite key: the first attribute
represents the histogram region, and the other 33 attributes are split into 11
groups of 3 attributes, each group for a histogram bin, with one attribute for
the group number of the H value, one for the group number of the V value, and
one for the C value.
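This key construction can be sketched as follows. This is not the authors' code; the range boundaries come from the text above, and the group-interval arithmetic for H, V and C is our assumption based on the stated interval widths (40°, 2.5 and 7.5).

```python
# Sketch of the two composite keys (assumed layout, not the authors' code).

# The 9 predetermined pixel-percentage ranges: [0,5), [5,15), ..., [65,75), [75,100].
RANGE_BOUNDS = [0, 5, 15, 25, 35, 45, 55, 65, 75, 100]

def range_number(pct):
    """Map a pixel percentage to its range number (0..8)."""
    for r in range(len(RANGE_BOUNDS) - 1):
        if RANGE_BOUNDS[r] <= pct < RANGE_BOUNDS[r + 1]:
            return r
    return 8  # pct == 100 falls in the closed last range [75,100]

def level1_key(region, bin_pcts):
    """12-attribute key: histogram region followed by 11 range numbers."""
    assert len(bin_pcts) == 11
    return (region,) + tuple(range_number(p) for p in bin_pcts)

def level2_key(region, hvc_avgs):
    """34-attribute key: region followed by (H, V, C) group numbers per bin.
    H is split into groups of 40 degrees, V into groups of 2.5, and C into
    groups of 7.5 (assumed from the interval widths given in the text)."""
    assert len(hvc_avgs) == 11
    key = [region]
    for h, v, c in hvc_avgs:
        key += [int(h // 40), int(v // 2.5), int(c // 7.5)]
    return tuple(key)
```

Because the ranges and groups are fixed and disjoint, two images match exactly when their keys are equal, which is what makes a B+-tree lookup sufficient.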
[Figure: Level 1: B+-tree on normalized pixel count; Level 2: B+-tree on
average H, V and C values.]
Figure 3.3. The two-level B+-tree structure.
A two-level B+-tree can then be exploited to speed up the retrieval process;
Figure 3.3 shows the structure. The top level index is a B+-tree built on the
12-attribute key, and is used to facilitate the histogram matching process. Each
entry in the leaf nodes of this level is associated with an independent B+-
tree that is built on the 34-attribute key. This second level tree is devised to
facilitate the comparison of the average H, V and C values. Internal nodes
store the maximum values of the child nodes in order to direct the search.
Since images with the same histogram configuration will have the same first
part of the key, they can be found in the same leaf node of the top level tree,
and hence in the same second level tree associated with that leaf node. Thus the
images in the second level tree will be fetched only if matching at both levels
is successful.
3.4.2 Three-tier color index
To handle speedy image retrieval based on the positional information of color,
Lu, Ooi and Tan proposed a three-tier color index [Lu et al., 1994]. While layers
1 and 2 prune away irrelevant images based on colors, layer 3 matches images
based on their color positions as well. We shall first look at layers 1 and 3
individually and their motivations before presenting the index structure as a
whole. The second layer is the R-tree structure.
Layer 1: Dominant color classification. The first layer is the dominant
color classification. For each image, a fixed number of dominant colors is ex-
tracted. The dominant colors are those with the largest pixel counts.
Based on the dominant colors, the image can be assigned to a partition. In this
way, images with the same dominant colors can be found in the same partition.
The underlying assumption is that images with the same dominant colors tend
to be more similar than images that match on the less dominant colors. Thus,
during the image retrieval process, only a few partitions with the similar sets of
dominant colors need to be examined, while the other partitions with different
dominant colors can be ignored.
Let k denote the number of dominant colors. Then the number of classes is
given by:
number of classes = nCk = n! / ((n - k)! k!)
where n is the number of colors supported in the system. Figure 3.4 illustrates
this layer when k = 3.
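The class count is just the binomial coefficient, so it can be checked directly; a small illustration in Python:

```python
import math

# Number of dominant-color classes for n supported colors and k dominant colors.
def number_of_classes(n, k):
    return math.factorial(n) // (math.factorial(n - k) * math.factorial(k))

# e.g. with n = 11 supported colors and k = 3 dominant colors:
print(number_of_classes(11, 3))  # 165, equal to math.comb(11, 3)
```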
Layer 3: Multi-level color histogram. The third layer is a complete quad-
tree structure, called the multi-level color histogram, used to capture spatial
distribution of colors. The basic idea is to capture the set of histograms for
an image by recursively decomposing the image. For an image, its multi-level
color histogram comprises several levels. The top level (root) of the tree corre-
sponds to a histogram that gives the color composition of the entire image. The
second level consists of four histograms that represent the color composition of
the top left, top right, bottom left and bottom right quadrants of the image
respectively. At the next level, we have the set of histograms that are obtained
from further splitting each quadrant of the image into four equal parts, where
each histogram is a description of the color content of each smaller part. This
process is repeated for the number of levels desired. In general, at the ith level,
the image is subdivided into 4^(i-1) regular regions, and each region has its own
histogram to describe its color composition. For example, in Figure 3.4, the
third layer is a 3-level color histogram.
With multi-level color histograms, since every level captures the color com-
position of the entire image, any level can be used to compute the similarity
between two images. For a level, the degree of similarity is given by the sum of
the intersections of the corresponding pairs of histograms at the level. In other
words, at the ith level, the similarity value is computed as follows:
S_i = (1 / 4^(i-1)) * sum_{j=1}^{4^(i-1)} sum_{k=1}^{m} min(NH_k^j(Q), NH_k^j(D))

where m is the number of colors supported by the system, Q and D are the
query and database images, and NH_k^j(I) is the normalized pixel count of
the kth color in the jth histogram of the image I.
As the lower level of the tree reflects more closely the color composition and
distribution of the image, it is clear that the similarity value decreases as the
tree is traversed downwards. This observation leads to a filtering mechanism
during image retrieval. During query processing, the query image and the
database images are compared based on their color histograms. The top-level
histograms are first compared. If they match within some threshold value, the
next level will be searched and compared, and so on. Only when the threshold
value at the leaf level is met will the image be retrieved. The target image
will be "discarded" if the similarity value fails to meet the threshold at any
level of the tree. As it costs less to compute the similarity value at the higher
levels of the tree, a significant amount of processing time may be saved and
unnecessary accesses to irrelevant images can be minimized.
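The level-wise intersection and the top-down filtering described above can be sketched as follows. This is a simplified illustration, under the assumption that each region's histogram holds pixel counts normalized within that region:

```python
# Sketch of multi-level color histogram matching (illustrative, not the
# authors' code). A "level" is a list of 4^(i-1) per-region histograms.

def level_similarity(q_level, d_level):
    """S_i for one level: the average histogram intersection over the
    4^(i-1) regions of that level."""
    regions = len(q_level)
    total = 0.0
    for q_hist, d_hist in zip(q_level, d_level):
        total += sum(min(q, d) for q, d in zip(q_hist, d_hist))
    return total / regions

def matches(query_levels, db_levels, threshold):
    """Top-down filtering: descend to the next level only while the
    current level's similarity meets the threshold."""
    for q_level, d_level in zip(query_levels, db_levels):
        if level_similarity(q_level, d_level) < threshold:
            return False
    return True
```

Since the root level is a single cheap comparison, most non-matching images are rejected before the more numerous lower-level histograms are ever touched.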
The index: Three-tier color index. Figure 3.4 shows the three-tier color
index which employs three levels of pruning to speed up retrieval. The first
layer is the dominant color classification. It allows us to prune away images
belonging to classes that would never satisfy the query to narrow the search
space to some classes.
Layer 2 is a multi-dimensional R-tree structure to further prune away images
within the candidate partitions that are not relevant. This is achieved as
follows. For each partition, an R-tree is used to organize the images within the
class based on the proportion of the dominant colors in the images. Since the
dominant colors are sufficient to discriminate between images, the dimensional-
ity required is relatively small. Thus, images that are similar will be spatially
close to one another, and a region query will be able to restrict the search to
the relevant images within the partition.
Finally, the last layer, which is the multi-level color histogram, compares the
histograms of the query image with those of the remaining potential candidate
images. Images that fail the test need not be retrieved. Thus, we can see that
the three-tier color index can minimize accesses to the image collections to only
images that are most likely to satisfy the query.
3.4.3 SMAT: A height-balanced color-spatial index
In the two color-spatial approaches presented above, the spatial distribution of
colors is coarsely captured by the various histograms. There is no indication
of how the color is distributed in the image space within each space represented
by a histogram.
Another problem with the two approaches is that though the individual tree
structures (B+-tree, R-tree, Dominant Color Classification) employed in the
respective layers are height-balanced, the entire hierarchical index structure
may not be so. For example, in the two-level B+-tree structure, if the database
images are skewed such that many images have similar color compositions,
then a small number of the B+-trees at the second layer will be much larger
(and taller) than the rest. Retrieving these images will result in longer access
times. The same scenario holds for the three-tier color index. To resolve this
problem calls for a new notion of height-balancing, and new height-balanced
index structures to be developed.
In this section, we look at a height-balanced color-spatial index developed by
Ooi et al. [Ooi et al., 1997]. We shall describe the representation of the color-
spatial information, the algorithm to extract it and the retrieval technique
before looking at the proposed hierarchical index structure.
Representing the color-spatial information. It has been observed that
humans are prone to focus on large patches of colors, rather than on small
patches that are scattered around [Beck, 1967, Treisman and Paterson, 1980].
The resultant effect is that given two images, they will appear to be similar
[Figure: Tier 1: dominant color classification (partitions for k = 1, k = 2, ...);
Tier 3: multi-level color histogram.]
Figure 3.4. The three-tier color index.
if both of them have large patches (referred to as clusters) of similar colors at
roughly the same locations in the images. For example, Figure 3.5 shows three
images and the corresponding eight largest clusters, sorted in descending order.
These clusters have been extracted using the proposed color-spatial technique
to be discussed shortly. From the cluster representation of image A (Figure
3.5(b)), it can be seen that several clusters contain color 4 (pink). The cluster
representation in image B (Figure 3.5(d)) also shows that there are dominant
clusters containing color 4 (pink) that fall in the same region and intersect
those clusters in image A. Hence, the two images are "similar" in terms of
color and spatial information. Similarly, based on the cluster representation in
Figure 3.5(f), it is clear that image C is different from the other two images
since there is no common color or location between them. Based on this
observation, the work [Ooi et al., 1997] represented the color-spatial information
of an image as a set of single-colored clusters in the image space, and these
clusters are used to facilitate image retrieval.
Extracting the color-spatial information. To extract the color and spa-
tial information, a heuristic similar to the one adopted in [Hsu et al., 1995] was
employed. The heuristic, which comprises three phases, represents the color-
spatial information as a set of k single-colored regions, for some predetermined
value k which is expected to be small.
In the first phase, a set of k representative colors of an image is selected.
The colors selected are those with the largest number of pixel counts in the
image. This set of colors is called the dominant colors. In the second phase,
a set of clusters for each of the dominant colors is determined. The algo-
rithm adopted is based on the maximum entropy discretization method [Chiu
and Kolodziejczak, 1986]. Briefly, for each selected color in the first phase,
the maximum entropy discretization algorithm is applied to the image space
to extract the spatial information of the color. Initially, the entire image is
regarded as one whole region. In the first pass, the image is partitioned into
four regions, and the process is repeated on the four regions recursively. For
each region, an evaluation criterion is used to determine whether further par-
titioning is needed. The result of applying the algorithm is a set of
representative regions for each selected color. Each region is represented as a
rectangle within the image space.
At the end of phase two, a large set of single-colored clusters has been
derived. In phase three, these clusters are ranked (regardless of color) in de-
scending order of their sizes (area of the rectangles). The k largest clusters will
be picked as the dominant clusters to be used as the color-spatial information
of the image.
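The second phase can be illustrated with a simplified recursive quadrant split. This is only a sketch: the stopping rule below (a color-density threshold) stands in for the maximum entropy criterion, whose evaluation function is not spelled out in the text.

```python
# Sketch of the recursive quadrant decomposition of phase two
# (illustrative; the density-based stopping rule is an assumption).

def extract_regions(pixels, color, x0, y0, x1, y1, min_size=8, density=0.8):
    """Return rectangles (x0, y0, x1, y1) dominated by `color`.
    `pixels[y][x]` holds a color index; a region is kept whole once at
    least `density` of its pixels have the color."""
    w, h = x1 - x0, y1 - y0
    if w <= 0 or h <= 0:
        return []
    count = sum(1 for y in range(y0, y1) for x in range(x0, x1)
                if pixels[y][x] == color)
    if count == 0:
        return []
    if count / (w * h) >= density:
        return [(x0, y0, x1, y1)]   # uniform enough: keep the whole region
    if w <= min_size or h <= min_size:
        return []                   # too small to split further
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    regions = []
    for (a, b, c, d) in [(x0, y0, mx, my), (mx, y0, x1, my),
                         (x0, my, mx, y1), (mx, my, x1, y1)]:
        regions += extract_regions(pixels, color, a, b, c, d, min_size, density)
    return regions
```

Phase three then simply sorts the rectangles returned for all dominant colors by area and keeps the k largest.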
(a) Image A
(c) Image B
(e) Image C

(b) A's 8 largest clusters:

Cluster  Dominant Color  Xmin  Ymin  Xmax  Ymax  Area
1        17              116   0     147   114   3,534
2        17              147   0     173   114   2,964
3        4               20    0     30    114   1,140
4        4               30    0     40    114   1,140
5        17              61    8     116   15    385
6        17              61    0     116   7     385
7        4               0     0     19    17    323
8        4               0     18    19    35    323

(d) B's 8 largest clusters:

Cluster  Dominant Color  Xmin  Ymin  Xmax  Ymax  Area
1        4               147   0     173   114   2,964
2        4               0     0     23    114   2,622
3        40              86    20    105   104   1,596
4        37              60    25    76    114   1,424
5        4               72    21    86    114   1,302
6        4               78    15    147   31    1,104
7        4               24    62    38    114   728
8        4               24    0     78    13    702

(f) C's 8 largest clusters:

Cluster  Dominant Color  Xmin  Ymin  Xmax  Ymax  Area
1        3               150   0     166   114   1,824
2        3               0     0     12    114   1,368
3        3               166   0     173   114   798
4        39              35    3     54    26    437
5        39              80    3     105   19    400
6        42              34    47    56    65    396
7        39              108   45    157   53    392
8        39              30    27    54    43    384

Figure 3.5. Three images and their 8 largest clusters.
The similarity function used for image retrieval computes the degree of over-
lap between the rectangles of the source and target images. Two rectangles
overlap only if they have the same color, and they intersect in the image space;
the degree of overlap is given by the number of pixels intersected.
The retrieval process using the color-spatial information is as follows. The
image database is initially preprocessed to determine the clusters (color-spatial
information) of the images. Given a sample query image, its k clusters are first
extracted. The color-spatial information of each image in the database is then
compared with those of the query image using the similarity function described
above. The images can then be ranked based on the percentage of overlap,
retrieved and displayed in that order.
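The overlap-based similarity function can be sketched directly. Rectangles are taken as (color, x0, y0, x1, y1) tuples; this layout is an illustrative assumption, not the paper's representation.

```python
# Sketch of the cluster-overlap similarity function (assumed tuple layout).

def overlap(r1, r2):
    """Pixel overlap of two rectangles (color, x0, y0, x1, y1): non-zero
    only when the colors match and the rectangles intersect."""
    c1, ax0, ay0, ax1, ay1 = r1
    c2, bx0, by0, bx1, by1 = r2
    if c1 != c2:
        return 0
    w = min(ax1, bx1) - max(ax0, bx0)
    h = min(ay1, by1) - max(ay0, by0)
    return w * h if w > 0 and h > 0 else 0

def similarity(query_clusters, image_clusters):
    """Total overlap between the k clusters of the query and target images."""
    return sum(overlap(q, d) for q in query_clusters for d in image_clusters)
```

Comparing every query cluster against every cluster of every image is the O(N * k^2) cost that SMAT, described next, is designed to avoid.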
The index: Sequenced multi-attribute tree. Even though the approach
restricts the number of clusters per image to k, the number of cluster com-
parisons to be performed is still very large, about O(N · k^2), where N is the
number of images in the database. Since only a small number of images is
likely to match the sample image, a large number of unnecessary comparisons
are being performed. To minimize the expensive comparisons, an index struc-
ture, the Sequenced Multi-Attribute Tree (SMAT), is proposed. SMAT is based
on three observations on the similarity function of the color-spatial approach:
• Color must be matched before the spatial property as color is deemed a more
important feature.
• If two clusters of two images share the same spatial property but different
color content, then the two clusters will not contribute to the similarity
function.
• If two clusters of two images share the same color but with non-overlapping
spatial properties, then the two clusters will also not contribute to the sim-
ilarity function.
SMAT is a multi-tier tree structure, where each layer corresponds to an
indexing attribute. For example, the top layer can be based on color, the
second is based on color percentage or size of the cluster, and the last is based on
spatial property. Each layer can be constructed using any indexing mechanism.
For example, the top layer can be implemented using a single dimensional
indexing structure such as the B+-tree [Comer, 1979]. On the other hand, the
lowest layer can employ a multi-dimensional indexing structure like the R-tree
[Guttman, 1984]. Except for the lowest level, entries in the leaf nodes of all
levels point to the roots of the trees in the next level. Only the leaf nodes of the
lowest level tree contain pointers to the image data. Thus, SMAT essentially
consists of multiple trees integrated together in a hierarchical manner. To
reach the lowest layer of the SMAT where the actual images are pointed to,
the query must satisfy the conditions relating to the discriminating keys in all
the higher layers. Any condition violated in any layer will terminate the search
path prematurely.
In [Ooi et al., 1997], a variation of the R-tree structure [Guttman, 1984] was
employed to implement a 2-tier SMAT structure. Figure 3.6 shows the struc-
tural view of the SMAT structure implemented. The first layer discriminates
clusters based on color. Since color is a single-dimensional attribute, the R-tree
used at this layer is a single-dimensional R-tree (1-D R-tree). Each entry has
a color range that defines the data space of the subtree pointed to by its child
pointer. The color ranges of internal nodes do not overlap, unless they are
exactly the same range. This occurs only when the data is very skewed. En-
tries of the leaf nodes of the first layer R-tree are of the form (color-range, BR,
PTR), where BR defines the spatial bounding rectangle which contains all the
clusters' color rectangles within the image space, and PTR points to an R-tree
of the next layer. Spatial information is required at the leaf node for balanc-
ing purposes. Suppose, for a given color range, the next layer R-tree pointed
by PTR outgrows others and the next split involves its root node (PTR). By
splitting such a node, the height of SMAT will increase. To enable some form
of balancing, the node is split according to the splitting strategy adopted at
the second layer, but the entry is inserted into the leaf node of the first layer
instead. In other words, two entries with the same color range (at the first
layer) are created, but with different bounding rectangles.
The second layer is based on the spatial information of the clusters. Each
entry of the internal node contains a rectangle that defines its child node's data
space and a pointer pointing to the subtree. The second layer R-tree is like a
2-dimensional spatial R-tree structure. For the leaf nodes, entries are of the
form (color, coordinates, PTR). The color attribute contains the color of the
cluster, the coordinates attribute contains the four coordinates of the cluster,
and PTR is a pointer to the address in the database that contains the image
data (see Figure 3.6). The image data contains the ID of the image, and the
colors and coordinates of the k dominant clusters. This information is used
in computing the similarity function (we shall see how this is used when we
discuss the matching algorithm).
Matching and searching a SMAT. The matching algorithm retrieves im-
ages that are similar to a sample image. Given a sample image, the algorithm
extracts k dominant clusters. For each of the clusters extracted, it determines
the set of images that are similar to it. This is done by traversing SMAT to
determine the clusters that match the clusters of the sample image. It suffices
to know that the search algorithm returns a list of pointers to a file that con-
[Figure: Level 1 is a 1-D R-tree acting as the color discriminator; its leaf
entries (color range, bounding rectangle) point to Level 2, a 2-D R-tree acting
as the spatial discriminator, whose leaf entries (color, coordinates) point to
the image data (IMAGE-ID).]
Figure 3.6. The SMAT structure.
tains information on potential matching images. Recall that this information
includes the image id and the (color, cluster) pairs of the image. From this
information, the algorithm proceeds to compute the similarity value of the sam-
ple image and the candidate image, and rank the candidate image accordingly.
Since it is possible that other clusters of the sample image may also match
the same candidate image at a later iteration, the image ids are maintained
in a hash table to avoid subsequent comparisons and retrieval. Finally, all the
images can be retrieved based on the image ids.
The search algorithm of a SMAT structure is fairly straightforward, and
follows from the way an R-tree is searched. The algorithm descends the 1-D
R-tree from the root, and at each internal node, entries are checked. For each
color range that contains the search color, the subtree is searched. When a
leaf node is reached, the color of the search cluster is used to check for any
entries whose color range contains the color. For all color ranges that qualify,
their spatial bounding rectangles are checked to see if they intersect the search
cluster. For qualified entries, the search continues to the corresponding 2-D
R-trees at the next layer. While the traversal of the 1-D R-tree often leads to
a distinct path (unless there are duplicates), more than one subtree under the
2-D R-tree may need to be searched. Nevertheless, the search algorithm can
eliminate irrelevant clusters of the indexed images and examine only clusters
near the search area.
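The two-layer filtering logic can be illustrated with flat lists standing in for the R-trees. This is a simplification: a real SMAT traverses internal nodes at each layer, but the pruning tests (color range containment, then rectangle intersection) are the same.

```python
# Simplified sketch of a 2-tier SMAT lookup (flat lists instead of R-trees).

def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def smat_search(layer1, color, rect):
    """layer1: list of (color_lo, color_hi, bounding_rect, layer2_entries);
    layer2_entries: list of (color, rect, image_ptr). Returns pointers to
    images whose clusters share the color and overlap the search rectangle."""
    results = []
    for lo, hi, br, layer2 in layer1:
        if lo <= color <= hi and intersects(br, rect):   # layer-1 pruning
            for c, r, ptr in layer2:                     # layer-2 check
                if c == color and intersects(r, rect):
                    results.append(ptr)
    return results
```

Entries whose color range or bounding rectangle fails the test are never descended into, which is how irrelevant clusters are eliminated early.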
Inserting color clusters into SMAT. Inserting image clusters into a SMAT
raises some interesting issues concerning the growth of the tree. The first issue
concerns the initial loading of SMAT. In this case, the tree is not "mature" in
the sense that not all layers may have been constructed. The question of when
SMAT grows from one layer to the next arises. The second issue deals with the
height-balancing of SMAT. While the R-tree is height-balanced, SMAT may
not be fully height-balanced as images may be inserted towards one end of the
SMAT.
The strategy adopted lets SMAT grow downward until some criterion is met,
and grow upward when height imbalance occurs. Initially, the heights of all the
layers are predetermined. For a SMAT structure with k layers, L1, L2, ..., Lk,
let the predetermined height for layer Li be hi. Note that hi, for all i in [1, k],
changes dynamically as SMAT grows. During initial loading, SMAT is not
fully developed, and so hi is used to guide the growth of layer Li downward as
follows: layer Li+1 will appear only if all the nodes along the path leading to
the leaf node of layer Li in which the new record is to be inserted are full, and
the length of the path has reached hi. This is to ensure that the height of the
SMAT is maintained and not increased further unless necessary. To illustrate,
consider the I-D R-tree in Figure 3.6. Suppose, leaf node 1 is full and h1 is set
to 2, and a new cluster is to be inserted into leaf node 1. If node 3 is full, instead
of allowing the 1-D R-tree to grow, the tree grows downward by creating the
next layer tree, and the record is inserted there. On the other hand, if node 3
is not full, then creating the next layer will undoubtedly increase the height of
the search path by one. Instead, leaf node 1 should be split as normal.
Once all the layers of SMAT are developed, the issue of height-balancing
becomes a concern since it affects the retrieval time of SMAT. Although the
R-tree is height-balanced, SMAT may not be so. This happens especially if there
are a lot of clusters of a particular color. Thus, there is no guarantee that all the
trees in the second layer index will grow and shrink at the same rate. This
means that it is possible that a particular tree in a level may grow much faster
than the other trees in the same level, causing the SMAT to be skewed to one
side. That is to say, the basic SMAT structure can only be locally balanced,
but not globally height-balanced.
Since SMAT is a multi-tier structure, the concept of height-balance is
slightly different from that of a single-structure index. A SMAT structure is
height-balanced if the following two conditions are met:
• Each tree structure within a layer is height-balanced.
• The difference in the heights of trees within a layer, say Li, is at most ei for
some predetermined ei for each layer.
Figure 3.7 illustrates a height-balanced tree. As can be seen, in the worst case,
the difference in height between trees within a k-layer SMAT is sum_{i=2}^{k} e_i.
To keep SMAT height-balanced, the upper layers are allowed to grow once
the lowest layer has been established. The minimum heights of the trees at each
layer are maintained. If there is an increase in the height of a tree (at a layer)
as a result of an insertion, the new height of the tree is compared against the
minimum height at that layer. If the difference between the two is above a
certain predetermined threshold, then rebalancing is activated. Rebalancing is
performed as follows. Let the layer where rebalancing is needed be Li, and its
parent layer be Li-1. Let the root of the tree that causes height imbalance at
Li be Ri, and the leaf node of Li-1 that points to Ri be LNi. Let the entry
in LNi that points to Ri be I_old. The information at Ri is used to insert a
new entry, I_new, into LNi. I_old is set to point to the left child of Ri, and
I_new is set to point to the right child of Ri. Ri can then be removed. Note
that the corresponding bounding information in I_old needs to be updated too.
The insertion algorithm that SMAT adopts within a tree is similar to that
used in R-trees in that new clusters are added to the leaves, nodes that overflow
are split, and splits are propagated up the tree. The splitting algorithm adopted
is based on the quadratic-cost algorithm of R-tree by Guttman [Guttman, 1984].
[Figure: a k-layer SMAT in which the trees at layer i have heights between
h_i and h_i + e_i.]
Figure 3.7. A height-balanced SMAT.
The algorithm attempts to find a small-area split, but is not guaranteed to find
one with the smallest area possible. There is, however, the additional task of
handling height-balancing.
3.5 Signature-based color-spatial retrieval
In this section, we present a signature-based color-spatial retrieval technique
[Chua et al., 1997]. The mechanism involves several components, and we discuss
each of them in a subsection. First, the color-spatial information has to be
extracted and represented. Next, we describe the retrieval process that is based
on the color-spatial information. In particular, the retrieval process requires
a measure to compute the similarity between two images (in terms of their
color-spatial representation). We also discuss an approach which incorporates
the concept of perceptually similar colors and weighting of colors.
3.5.1 Representing the color-spatial information
The proposed color-spatial approach partitions each image into a grid of m x n
cells of equal size. Figure 3.8 shows an example of an image being partitioned
into a 4 x 8 grid. Instead of obtaining the color-spatial information at pixel-
level, the colors that can be used to represent a cell are determined. This
is done as follows. For a given color, each cell is examined to determine the
percentage of the total number of pixels in the cell having that color. If this
percentage is greater than a pre-defined threshold value, then the cell is said
to be represented by that color. This approach is equivalent to applying the
maximum entropy discretization algorithm [Chiu and Kolodziejczak, 1986] under
the assumption of uniform color distribution. Note that, depending on the
threshold value, a cell may have no color representative or it may have more
than one representative.
[Figure legend: unshaded cell, does not satisfy the threshold; shaded cell,
satisfies the threshold.]
Figure 3.8. An image partitioned into a 4 x 8 grid.
For the approach to be practical and useful, several issues have to be ad-
dressed. First, the number of colors can be very large, resulting in a large set of
color-spatial information. This is resolved by restricting the number of colors
for an image to a set of C colors (called the dominant colors) of the image. C
is expected to be small as most images are usually dominated by a few colors.
To select the C dominant colors, the heuristic employed in [Hsu et al., 1995]
is adapted. The heuristic works as follows. Two color histograms, Hi and He,
representing the color composition of the entire image and the center of the
image are obtained. First, Ci (Ci < C) colors that have the largest number of
pixels in Hi are picked. Next, the Ci colors picked are eliminated from con-
sideration when the remaining Ce (= C - Ci) colors are to be picked. The Ce
colors are obtained from the remaining colors with the largest number of pixels
in He. While the first set of colors represents the background colors, the second
set represents the object colors (based on the inherent assumption that objects
usually appear in the center of an image). Unlike the algorithm in [Hsu et al.,
1995], where the background and the object colors are selected alternately,
the modification is to reduce the probability that the most dominant color in
the center of the image (representing the object) is in fact one of the dominant
background colors. This is based on the observation that a significant portion
of the center region of an image can be covered by the background colors.
The second issue concerns the representation of the color-spatial information.
It turns out that the proposed approach has a very nice property - given a
color, a cell is either represented or not represented by it. As such, each cell
can be represented by a bit - if the cell satisfies the threshold value, the bit
is set; otherwise, it is cleared. Hence, for each color, a bitstream (called the
color signature) that captures the spatial distribution of that color is obtained.
In the color signature, bit (i · n + j) corresponds to cell (i, j). Referring
to Figure 3.8 again, suppose a color qualifies to be the representative of cells
0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30 and 31; its corresponding 32-bit color signature will
be 10001111001000110000000001100011. Given an image with k colors, there
will be k color signatures. These color signatures can be superimposed (bitwise
logical-OR) to obtain an image signature.
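The signature construction above can be reproduced in a few lines; this is a small sketch in which bit c of the signature corresponds to cell c, matching the 4 x 8 example:

```python
# Building a color signature for the 4 x 8 grid example: one bit per cell,
# set when the cell is represented by the color.

GRID_CELLS = 32  # 4 x 8 grid

def color_signature(cells):
    """Return the signature as a bit string, bit c corresponding to cell c."""
    bits = ['0'] * GRID_CELLS
    for c in cells:
        bits[c] = '1'
    return ''.join(bits)

def image_signature(signatures):
    """Superimpose (bitwise logical-OR) the color signatures of an image."""
    combined = 0
    for s in signatures:
        combined |= int(s, 2)
    return format(combined, '0{}b'.format(GRID_CELLS))

sig = color_signature([0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30, 31])
print(sig)  # 10001111001000110000000001100011 (the example from the text)
```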
3.5.2 The retrieval process
From the human perception point of view, two images are perceived to be alike
if the color compositions of the two images are similar, and the distributions of
the colors in the images are similar. Under the signature-based representation
of color-information, the above two points can be translated into the following
two conditions to facilitate efficient retrieval:
• The images have the same representative sets of colors.
• The signatures representing both images are similar in that they may only
differ in some of the bits. This only requires a simple operation (logical AND)
to compute the intersection between two images for a particular color.
We discuss in the next few subsections several similarity measures that have
been used [Chua et al., 1997] to indicate the similarity between two images
based on their signatures.
Basic similarity function. For the signature-based color-spatial approach,
recall that each bit in a signature represents a particular cell in the image.
Let Qi and Di denote the signatures of color i for a query image Q and a
database image D respectively. Then, the two images have the color i at
the same particular region (cell) if and only if the corresponding bits in both
signatures are set; otherwise the two images are not similar at the region. Let
the representative color sets of Q and D be CQ and CD respectively. Then,
the similarity measure, SIMbasic, between Q and D for a color i E CQ can be
determined as:
SIM_basic(Q, D, i) = BitSet(Q_i ∧ D_i) / BitSet(Q_i)   if color i ∈ C_D
                   = 0                                  otherwise       (3.1)

where BitSet(BS) denotes the number of bits in the bitstream BS that are set,
and '∧' represents the bitwise logical-AND operation. Now, if a large part of
108 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
cells in Q has the same color as that in D, then the similarity computed will
be close to 1. The similarity measure between two images Q and D is then
given by:
SIMbasic(Q, D) = Σ_{∀i ∈ CQ} SIMbasic(Q, D, i)
(3.2)
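Equations 3.1 and 3.2 can be sketched in code as follows; this is a minimal illustration with hypothetical 8-bit (8-cell) signatures, and the function names are ours:

```python
def bitset(bs):
    """BitSet: the number of set bits in a signature."""
    return bin(bs).count('1')

def sim_basic(q_sigs, d_sigs, i):
    """Equation 3.1: the fraction of color-i cells of Q that D shares;
    '&' computes the intersection of the two signatures."""
    if i not in d_sigs:
        return 0.0
    return bitset(q_sigs[i] & d_sigs[i]) / bitset(q_sigs[i])

def sim_basic_total(q_sigs, d_sigs):
    """Equation 3.2: summed over the representative colors of Q."""
    return sum(sim_basic(q_sigs, d_sigs, i) for i in q_sigs)

# Hypothetical signatures for two images.
q = {'red': 0b11110000, 'blue': 0b00001111}
d = {'red': 0b11000000}            # D has no blue cells at all
assert sim_basic(q, d, 'red') == 0.5   # 2 of Q's 4 red cells match
assert sim_basic(q, d, 'blue') == 0.0
assert sim_basic_total(q, d) == 0.5
```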
Similarity function with perceptually similar colors. Because of the
effectiveness of using perceptually similar colors [Niblack et al., 1993], Chua
et al. also incorporated the contributions of perceptually similar colors in their
similarity measure. To determine the degree of similarity between two colors,
the method proposed by Ioka [Ioka, 1989] was adopted. The method first
transforms colors in the RGB space to the CIE (Commission Internationale
de l'Eclairage) L*u*v* space, and the similarity between two colors can be
measured from the Euclidean distance between the colors in the CIE L*u*v*
space. The Euclidean distance between two colors, i and j, in the L*u*v*
space is computed as:
D(i, j) = sqrt( (Li* − Lj*)² + (ui* − uj*)² + (vi* − vj*)² )
(3.3)
Let M denote the number of L*u*v* colors the system can support. The degree
of similarity between two colors, i and j, is given by:
SIM(i, j) = 0                           if D(i, j) > p × Dmax
          = 1 − D(i, j) / (p × Dmax)    otherwise
(3.4)
where Dmax = max D(i, j), i ≠ j, 1 ≤ i, j ≤ M, and p is a predetermined
threshold value between 0 and 1 (in our study, we have arbitrarily set p to 0.2).
Essentially, p x Dmax represents the tolerance in which two colors are considered
to be similar. If SIM(i, j) > 0, then color i is said to be perceptually similar to
color j, and vice versa. The larger the value of SIM(i, j), the more similar the
two colors are. If SIM(i, j) = 0, it means that the two colors are not perceived
to be similar. The similarity values computed for all pairs of colors are stored
in an M × M matrix, called the color similarity matrix (denoted SM), where
entry (i, j) corresponds to the value of SIM(i, j). SM is stored in a flat file and
will be frequently used during the retrieval process to determine the similarity
between two colors.
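A sketch of how the color similarity matrix SM might be computed from Equations 3.3 and 3.4; the function names and the toy L*u*v* coordinates are ours:

```python
import math

def luv_distance(c1, c2):
    """Euclidean distance between two (L*, u*, v*) colors (Eq. 3.3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def similarity_matrix(colors, p=0.2):
    """Color similarity matrix SM (Eq. 3.4): SM[i][j] is 0 when the
    distance exceeds the tolerance p * Dmax, and falls off linearly
    from 1 otherwise."""
    M = len(colors)
    d_max = max(luv_distance(colors[i], colors[j])
                for i in range(M) for j in range(M) if i != j)
    tol = p * d_max
    return [[0.0 if luv_distance(ci, cj) > tol
             else 1.0 - luv_distance(ci, cj) / tol
             for cj in colors] for ci in colors]

# Three hypothetical colors spread along the L* axis.
sm = similarity_matrix([(0, 0, 0), (10, 0, 0), (100, 0, 0)], p=0.2)
assert sm[0][0] == 1.0    # a color is fully similar to itself
assert sm[0][1] == 0.5    # distance 10, tolerance 0.2 * 100 = 20
assert sm[0][2] == 0.0    # distance 100 exceeds the tolerance
```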
Under the signature approach, the contribution of the perceptually similar
colors of color i for query image Q and database image D is computed as
follows:
SIMpercept(Q, D, i) = Σ_{j ∈ Sp} ( BitSet(Qi | Dj) / BitSet(Qi) ) × SM(i, j)
(3.5)
IMAGE DATABASES 109
where Sp is the set of colors that are perceptually similar to color i as de-
rived from the color similarity matrix SM. SM(i,j) denotes the (i,j) entry
of matrix SM. To take the contributions of perceptually similar colors into
consideration, Equations 3.1 and 3.5 can be combined to obtain the perceived
similarity between two signatures on color i as follows:
SIMcolor-spatial(Q, D, i) = SIMbasic(Q, D, i) + SIMpercept(Q, D, i) (3.6)
Thus, the similarity measure for query image Q and database image D is the
sum of the similarity for each color in the representative set CQ for image Q,
and is given as follows:
SIMcolor-spatial(Q, D) = Σ_{∀i ∈ CQ} SIMcolor-spatial(Q, D, i) (3.7)
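Equations 3.5 through 3.7 can be sketched as follows, assuming the per-color signatures and the matrix SM (here a dictionary keyed by color pairs) have been computed beforehand; all names are illustrative:

```python
def bitset(bs):
    """Number of set bits (BitSet in the text)."""
    return bin(bs).count('1')

def sim_basic(q_sigs, d_sigs, i):
    """Equation 3.1."""
    if i not in d_sigs:
        return 0.0
    return bitset(q_sigs[i] & d_sigs[i]) / bitset(q_sigs[i])

def sim_percept(q_sigs, d_sigs, i, sm, sp):
    """Equation 3.5: sp[i] lists the colors perceptually similar to i
    (the set Sp), and sm[(i, j)] is the corresponding entry of SM."""
    return sum(bitset(q_sigs[i] & d_sigs[j]) / bitset(q_sigs[i]) * sm[(i, j)]
               for j in sp.get(i, []) if j in d_sigs)

def sim_color_spatial(q_sigs, d_sigs, sm, sp):
    """Equations 3.6 and 3.7: basic plus perceptual contributions,
    summed over the representative colors of Q."""
    return sum(sim_basic(q_sigs, d_sigs, i) +
               sim_percept(q_sigs, d_sigs, i, sm, sp)
               for i in q_sigs)

# D has no 'red', but its 'crimson' cells overlap Q's 'red' cells,
# so the perceptual term still contributes.
q = {'red': 0b1100}
d = {'crimson': 0b1000}
sp = {'red': ['crimson']}
sm = {('red', 'crimson'): 0.5}
assert sim_color_spatial(q, d, sm, sp) == 0.25
```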
Weighted similarity function. In the above similarity measure, all the
dominant colors have been implicitly assigned the same weight. However, in
some applications, it may be desirable to give the object colors a higher weight.
This is particularly useful when the object is at the center and the user is only
interested in retrieving images containing similar objects at similar locations.
The authors also proposed a weighted similarity measure which is given as
follows:
SIMweighted(Q, D) = Σ_{i ∈ Ci} SIMcolor-spatial(Q, D, i) +
                    wt × Σ_{i ∈ Cc} SIMcolor-spatial(Q, D, i)
(3.8)
where Ci and Cc are the sets of background and object colors of Q respectively,
and wt (> 1) is the weight given to the object colors. A weight greater than
1 can be assigned to the object colors to give a higher score to images whose
object colors are similar to those of the query image.
3.6 Summary
In this chapter, we have surveyed content-based indexing mechanisms for image
database systems. We have looked at various methods of representing and
organizing image features such as color, shape and texture in order to facilitate
speedy retrieval of images, and how similarity retrievals can be supported. In
particular, we have provided a more in-depth discussion of color-spatial techniques that
exploit colors as well as their spatial distribution for image retrieval.
As images will continue to play an important role in many applications,
we believe the need for efficient and effective retrieval techniques and access
methods will increase. While much work has been done in recent years,
there remains much to be explored in this field. In what follows, we outline several
promising areas (not meant to be exhaustive) that require further research.
Performance evaluation
This chapter has presented a representative set of indexes for content-based im-
age retrievals. Unlike other related areas such as spatial databases, the number
of indexes proposed to facilitate speedy retrieval of images is still very small.
This is probably because content-based image retrieval has been largely studied
by researchers in the pattern recognition and imaging communities, whose focus
has been on extracting and understanding features of the image content, and
on studying the retrieval effectiveness of the features (rather than on efficiency
issues). It is not surprising then that the indexes discussed have not been
extensively evaluated. Besides [Ooi et al., 1997], which reported a preliminary
performance comparison demonstrating that SMAT outperforms the R-tree in most
cases, most of the other works have only been compared with the sequential scanning
approach.
We believe that a comparative study is not only necessary but will be useful
for application designers and practitioners to pick the best method for their
applications. It will also help researchers to design better indexes that overcome
the weaknesses and preserve the strengths of existing techniques. Another
aspect of performance study, which is applicable for indexes in general, is the
issue of scalability. Again, most of the existing work has been performed on
small databases. How well such indexes scale is unclear until they
have been put to the test. Readers are referred to [Zobel et al., 1996] for
some guidelines on comparative performance studies of indexing techniques.
More on access methods
The focus of this chapter has been on content-based access methods. There are
many other content-based retrieval techniques that have been proposed in the
literature [Aslandogan et al., 1995, Chua et al., 1994, Gudivada and Raghavan,
1995, Hirata et al., 1996, Iannizzotto et al., 1996, Nabil et al., 1996] and shown
to be effective (in terms of recall and precision). These works, however, have
not addressed the issue of speedy retrievals. Designing efficient access methods
for these promising methods will make them more practical and useful.
Another promising direction is to further explore color and its spatial dis-
tribution. One issue is to exploit the colors that are perceptually similar. For
example, out of the 16.7 million shades of color displayable on a 24-bit
color monitor, the human eye can only differentiate up to 350,000 shades. As
such, colors that are perceived to be similar should contribute to the comparison
of color similarity. While some work has been done in this direction [Chua
et al., 1997, Niblack et al., 1993], perceptually similar colors are considered in
the computation of the degree of similarity, rather than being modeled in the
feature representation. We believe the latter can be more effective in pruning
the search space. Another issue is to exploit texture and color for segmentation
of an image space. Indexing of clusters based on both texture and color may
be more effective.
Concurrent access and distributed indexing
Traditionally, image retrieval systems have been used for archival systems that
are usually static in that the images are rarely updated. As such, the issue of
supporting concurrent accesses is not critical. Instead, in such applications,
the access methods should be designed to exploit this static characteristic.
However, as multimedia applications proliferate, we expect to see more
real-time applications as well as applications running in parallel or distributed
environments. In both cases, existing techniques will have to be extended to
support concurrent accesses. Some techniques have been developed for centralized
systems [Bayer and Schkolnick, 1977, Sagiv, 1986, Ng and Kameda, 1993]
as well as for parallel and distributed environments [Achyutuni et al., 1996, Kroll
and Widmayer, 1994, Litwin et al., 1993b, Tsay and Li, 1994]. But we believe
more research tailored to image data, especially to techniques that involve
hierarchical structures, is needed.
Integration and optimization
The retrieval results of an image database system are usually not very precise.
The effectiveness of using the content of an image for retrieval depends very
much on the image representation and the similarity measure. It has been
reported that using colors and textures can achieve a retrieval effectiveness of
up to 60% in recall and precision [Chua et al., 1996]. Furthermore, different
retrieval models based on different combinations of visual attributes and text
descriptions achieve almost the same levels of retrieval effectiveness. Moreover,
each model is able to retrieve a different subset of relevant images. This is
because each image feature only captures a part of the image's semantics. The
problems then include selecting an "optimal" set of image features that best fits
an application, as well as developing techniques that can integrate them to
achieve optimal results. One promising method is to use content-based
techniques as the basis, but also to exploit the semantic meanings of the images and
queries to support concept-based queries. Such techniques have been known as
semantic-based retrieval techniques. Typically, some form of knowledge base
is required, rendering such techniques domain-specific. In [Chua et al., 1996],
the domain knowledge is supplied by users as part of a query. The query is
modeled as a hierarchy of concepts through a concept specification language.
Concepts are defined in terms of multiple image content attributes such
as text, colors and textures. Each concept has three components: its name,
its relationships with other concepts, and rules for its identification within the
images' contents. In answering queries, the respective indexes are used to speed
up the retrievals for concepts that are at the leaf of the hierarchy, and their
results combined based on the hierarchy of concepts defined. More studies are
certainly needed along this direction.
4 TEMPORAL DATABASES
Apart from some primary keys and keys that rarely change, many attributes
evolve and take new values over time. For example, in an employee relation,
employees' titles may change as they take on new responsibilities, as will their
salaries as a result of promotion or increment. Traditionally, when data is
updated, its old copy is discarded and the most recent version is captured.
Conventional databases that have been designed to capture only the most recent
data are known as snapshot databases. With the increasing awareness of the
values of the history of data, maintenance of old versions of records becomes
an important feature of database systems.
In an enterprise, the history of data is useful not only for control purposes,
but also for mining new knowledge to expand its business or to move on to a new
frontier. Historical data is increasingly becoming an integral part of corporate
databases despite its maintenance cost. In such databases, versions of records
are kept and the database grows as time progresses. Data is retrieved based
on the time for which it is valid or recorded. Databases that support the storage
and manipulation of time varying data are known as temporal databases.
In a temporal database, the temporal data is modeled as collections of line
segments. These line segments have a begin time, an end time, a time-invariant
attribute, and a time-varying attribute.
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
Temporal data can either be valid time
or transaction time data. Valid time represents the time interval when the
database fact is true in the modeled world, whereas transaction time is when a
transaction is committed. A less commonly used time is the user-defined time,
and more than one user-defined time is allowed.
A database that supports transaction time may be visualized as a sequence
of relations indexed by time and is referred to as a rollback database. The
database can be rolled back to a previous state. Here the rollback database
is distinguished from the traditional snapshot database where temporal at-
tributes are not supported and no rollback facility is supported. A database
that supports valid time records a history of the enterprise being modeled as
it is currently known. Unlike rollback databases, these historical databases al-
low retroactive changes to be made to the database as errors are identified. A
database that supports both time dimensions is known as a bitemporal database.
Whereas a rollback database views records as being valid at some time as of
that time, and a historical database always views records as being valid at some
moment as of now, a bitemporal database makes it possible to view records as
being valid at some moment relative to some other moment.
One of the challenges for temporal databases is to support efficient query
retrieval based on time and key. To support temporal queries efficiently, a
temporal index that indexes and manipulates data based on temporal relation-
ships is required. Like most indexing structures, the desirable properties of a
temporal index include efficient usage of disk space and speedy evaluation of
queries. Valid time intervals of a time-invariant object can overlap, but each
interval is usually closed. On the other hand, transaction time intervals of a
time-invariant object do not overlap, and its last interval is usually not closed.
Both properties present unique problems to the design of time indexes. In this
chapter, we briefly discuss the characteristics of temporal applications, tempo-
ral queries, and various promising structures for indexing temporal relations.
We also report on an evaluation of some of the indexing mechanisms to provide
insights on their relative performance.
4.1 Temporal databases
In this section, we briefly describe some of the terms and data types used in
temporal databases. For a complete list of terms and their definitions, please
refer to [Jensen, 1994].
An instant is a time point on an underlying time dimension. In our discus-
sions that follow, we use 0 to mark the beginning of time, and time point to
mean an instant on the discrete time axis. A time interval [Ts, Te] is the time
between two time points, Ts and Te, where Ts ≤ Te, with the inclusion of the
end time. Note that the closed range representation is equivalent to the non-closed
range representation, since [Ts, Te] = [Ts, Te + 1). A chronon is a non-decomposable
time interval of some fixed minimal duration. In some applications, chronons
have been used to represent an interval. A span or time span is a directed du-
ration of time. It is the length of the time with no specific starting and ending
time points. A lifespan of a record is the time when it is defined. A lifespan
of a version (tuple) of a record is the time in which it is defined with certain
time-varying key values. For indexing structures that support time intervals,
start time and version lifespan are two parameters that may affect their query
and storage efficiency.
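The equivalence [Ts, Te] = [Ts, Te + 1) between closed and half-open intervals on a discrete time axis can be checked with a small sketch (the helper names are ours):

```python
def closed_to_half_open(ts, te):
    """On a discrete time axis, the closed interval [Ts, Te] covers
    exactly the same chronons as the half-open interval [Ts, Te + 1)."""
    return (ts, te + 1)

def in_closed(ts, te, t):
    return ts <= t <= te

def in_half_open(ts, te, t):
    return ts <= t < te

# Both representations cover exactly the same time points:
ts, te = 3, 7
hs, he = closed_to_half_open(ts, te)
assert all(in_closed(ts, te, t) == in_half_open(hs, he, t)
           for t in range(0, 12))
```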
4.1.1 Transaction time relations
Transaction time refers to the time when a new value is posted to the database
by a transaction [Jensen, 1994]. For example, suppose a transaction time rela-
tion is created at time Ti , so that Ti is the transaction time value for all the
tuples inserted at the creation of the relation. The lifespan of these tuples is
[Ti, NOW]. The right end of the lifespan is open at this time; it can be
assumed to have the value NOW to indicate a progressing time span. At time
Tj when a new version of an existing record is inserted, the lifespan of the new
version is [Tj , NOW], and that of the previous version is [Ti , Tj). Transaction
times which are system generated follow the serialization order of transactions,
and hence are monotonically increasing. As such, a transaction time database
can be rolled back to some previous state along its transaction time dimension.
There are two representations for transaction time intervals. One approach
is to model transaction time as an interval [Snodgrass, 1987] and the other is
to model transaction time using a time point [Jensen et al., 1991, Lomet and
Salzberg, 1989, Nascimento, 1996]. The latter approach implicitly models an
interval by using the time when a new version is inserted as the start time
of its transaction time, and the time point immediately before the insertion
of the next version as its transaction end time. In what follows,
we shall use the single time point representation to model transaction time.
However, explicit representation of transaction time intervals is often used for
performance reasons.
To illustrate the concept of temporal relations, we use a tourist relation that
keeps track of the movement of tourists to study the tourism industry. The
relation has a time invariant attribute, pid, and a time varying attribute, city.
At time 0, the relation is created and the transaction time value for the current
tuples is 0 (Table 4.1). The lifespan of these tuples is [0, NOW]. At time 3, the
tuple with pid = p1 is updated; the new city value is Los Angeles (Table 4.2).
Table 4.1. A tourist transaction time relation at time 0.
tuple pid city Tt
t1 p1 New York 0
t2 p2 Washington 0
t3 p3 New York 0
Table 4.2. The tourist transaction time relation at time 3.
tuple pid city Tt
t1 p1 New York 0
t2 p2 Washington 0
t3 p3 New York 0
t4 p1 Los Angeles 3
t5 p6 Seattle 3
To keep the history, a new tuple t4 is inserted. Thus, the lifespan for t1 is [0,
3) and the lifespan of t4 is [3, NOW].
In the transaction time relation, there are no retroactive updates (updates
that are valid in the past) and predictive updates (updates that will be valid
in the future). Each transaction is committed immediately with the current
transaction time. For instance, if at time 2, the city for p1 changes to Seattle,
this update cannot be committed at time 3. If a tuple will be updated at time
4, this update cannot be reflected in Table 4.2, because predictive update is
not supported in the transaction time relation. Note that time intervals that
are still valid at the present time point are not closed. In other words, the end
time progresses with the current time.
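A minimal sketch (the class and helper names are ours) of this insert-only behavior, reproducing Tables 4.1 and 4.2 and deriving version lifespans from the single transaction-time points, with NOW modeled as infinity:

```python
NOW = float('inf')  # stands for the progressing current time

class TransactionTimeRelation:
    """Insert-only relation: an update never modifies old tuples;
    the lifespan of a version is derived from the next version's Tt."""
    def __init__(self):
        self.tuples = []  # (pid, city, Tt), in transaction order

    def insert(self, pid, city, tt):
        # Transaction times follow the serialization order, so no
        # retroactive or predictive updates are possible.
        assert all(tt >= t for _, _, t in self.tuples)
        self.tuples.append((pid, city, tt))

    def lifespan(self, index):
        """[Tt, next version's Tt) for the version at `index`."""
        pid, _, tt = self.tuples[index]
        later = [t for p, _, t in self.tuples[index + 1:] if p == pid]
        return (tt, later[0] if later else NOW)

# Tables 4.1 and 4.2 from the text:
r = TransactionTimeRelation()
r.insert('p1', 'New York', 0)     # t1
r.insert('p2', 'Washington', 0)   # t2
r.insert('p3', 'New York', 0)     # t3
r.insert('p1', 'Los Angeles', 3)  # t4
r.insert('p6', 'Seattle', 3)      # t5
assert r.lifespan(0) == (0, 3)    # t1: [0, 3)
assert r.lifespan(3) == (3, NOW)  # t4: [3, NOW]
```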
4.1.2 Valid time relations
The transaction time dimension only represents the history of transactions; it
does not model real-world activity. We need a time dimension to model the history of
an enterprise such that the database can be rolled back to the right time-slice
with respect to the enterprise activity. Valid time is the time when a fact is
true. In a valid time relation, a time interval [Ts , Te] is used to indicate when
the tuple is true. Valid time intervals are usually supplied by the user, and each
Table 4.3. The tourist valid time relation at time 0.
tuple pid city Ts Te
t1 p1 New York 0 3
t2 p2 Washington 0 NOW
t3 p3 New York 0 NOW
Table 4.4. The tourist valid time relation at time 3.
tuple pid city Ts Te
t1 p1 New York 0 3
t6 p1 Seattle 2 3
t2 p2 Washington 0 NOW
t3 p3 New York 0 NOW
t4 p1 Los Angeles 3 NOW
t5 p6 Seattle 3 6
t7 p5 Washington 4 6
new tuple is inserted into the relation with its associated valid time interval.
A time-invariant key can have different versions with overlapping valid time,
provided the temporal attributes of these versions are different. Time intervals
that progress with the current time are open. Since valid times are usually determined by
users, new tuples often have closed intervals that end before or after the current
time NOW.
Tables 4.3 and 4.4 show the valid time relation of tourist. At time 0, the
tuples are inserted with their valid time ranges. Assume that in the period [2, 3], the city
for p1 is changed from New York to Seattle, and from time 3, it is changed
again to Los Angeles. The relation in Table 4.4 represents these updates. Note
also that the valid time relation in Table 4.4 can capture proactive insertions,
for example, tuple t7 which has the valid time interval [4, 6] appears in the
relation at time 3.
Unlike a transaction time relation, a valid time relation supports retroactive
and predictive updates. If an error is discovered in an older version of a record,
it is modified with the correct value, the old value being substituted by the new
one. Hence it is not possible to roll back to the past as in a transaction time
database.
Table 4.5. The tourist bitemporal relation at time 0.
tuple pid city Ts Te Tt
t1 p1 New York 0 3 0
t2 p2 Washington 0 NOW 0
t3 p3 New York 0 NOW 0
Table 4.6. The tourist bitemporal relation at time 5.
tuple pid city Ts Te Tt
t1 p1 New York 0 3 0
t6 p1 Seattle 2 3 3
t2 p2 Washington 0 NOW 0
t3 p3 New York 0 NOW 0
t4 p1 Los Angeles 3 NOW 3
t5 p6 Seattle 3 6 3
t7 p5 Washington 4 6 3
t8 p5 Washington 5 8 5
4.1.3 Bitemporal relations
In some applications, both the transaction time and valid time must be mod-
eled. This is to facilitate queries for records that are valid at some valid time
point and as of some transaction time point. A relation that supports both
times is known as a bitemporal relation, which has exactly one system sup-
ported valid time and exactly one system supported transaction time. Table 4.5
illustrates an example of the tourist bitemporal relation at time 0.
From Table 4.6, note that tuples t7 and t8, with the same pid and city
values, bear overlapping valid times [Ts, Te]. This is possible because the two
tuple versions have different transaction time values. In a valid time relation,
however, this situation cannot be represented.
Like a valid time relation, the bitemporal relation also supports retroactive
and predictive versioning.
4.2 Temporal queries
Various types of queries for temporal databases have been discussed in the
literature [Gunadhi and Segev, 1993, Salzberg, 1994, Shen et al., 1994]. As in
other application domains, temporal indexing structures must be able to support
a common set of simple and frequently used queries efficiently. In this section,
we describe a set of common temporal queries. These queries should be used
to benchmark the efficiency of a temporal index.
We use the tourist relation shown in Table 4.7 as an example in our discussion
that follows. We assume that the time granularity for this application is one
day for both valid and transaction time. Consider the first tuple. The object
with pid pI is at New York from day 0 to day 2 inclusive. Its transaction time
starts at day 1 and ends when there is an update to the tuple.
A set of canonical queries was initially proposed by Salzberg [Salzberg, 1994].
We extend this set of queries by further classifying temporal queries in each
query type based on the search predicates - intersection, inclusion, contain-
ment and point. Such finer classification can provide insights into the effec-
tiveness of the indexes on different kinds of search predicates. For queries
that involve only one time and one key, the key can either be a time-invariant
attribute or a time-varying attribute, and the time can either be valid time
or transaction time. The single time dimension queries are more meaningful
for valid time databases, but they can also be applied to transaction time; the
search remains the same although the semantics of time may be different. The
following constitutes the common set of temporal queries:
1. Time-slice queries. Find all valid versions during the given time interval
[Ts, Te]. For a valid time database, the answer is a list of tuples whose valid
times fall within the query time interval. For a transaction time database, the
answers are snapshots during the query time interval, and hence the predicate
"as of" is used for transaction time.
Based on the search operation on the temporal index, time-slice queries can
be further classified as:
• Intersection queries. Given a time interval [Ts , Te], retrieve all the
versions whose time intervals intersect it. For example, a valid time
query to find all tourists who are in the US during the interval [3, 7] would
return 9 tuples: t2, t3, t4, t5, t6, t7, t10, t12 and t14.
• Inclusion queries. Given a time interval [Ts, Te], retrieve all the versions
whose valid time intervals are included in it. For example, the query
"Find all tourists who stay in a city between day 3 and day 7" would
return 2 tuples: t5 and t10.
• Containment queries. Given a time interval [Ts, Te], retrieve all the
versions whose valid time intervals contain it. For example, the query
"Find all tourists who stay in a city from day 3 to day 5" would result
in 5 tuples: t3, t4, t7, t10 and t14.
• Point queries. Given a specific time point t (instant), retrieve all the
versions whose valid intervals contain the time point. Point queries
can be viewed as a special case of intersection queries or containment
queries where the time interval [Ts, Te] is reduced to a single time instant
t. For example, the query "Find all tourists who are in the US on day 1"
would result in 3 tuples: t1, t3 and t4.
2. Key-range time-slice queries. Find all tuples which are in a given key range
[ks, ke] and valid during the given time interval [Ts, Te]. It is a conjunction
of keys and time. Like the time-slice query, the time-slice part of
the query can assume one of the predicates described above. For example,
the query to find all tourists who are in New York during the interval [3,7]
is a key-range time-slice query with intersection predicate. The result of the
query is now 2 tuples instead: t3 and t6. As another example, the query
"Retrieve all tourists who are in cities with names beginning in the range
[D,N] on day 1" would be a point key-range time-slice query that results in
3 tuples: t1, t3 and t4.
The key-range time-slice query is an exact-match query if both ranges are
reduced to single values; that is, find the versions of the record with key k at
time t. An example of this category is "Find all tourists who visited New
York on day 1", which results in tuples t1 and t3.
3. Key queries. Find all the historical versions of the records in the given key
range [ks, ke]. Such a query is a pure key-range query over the whole lifespan.
For example, the query "Find all tourists who visited New York" is a past
versions query. This query will return the tuples: tl, t3, t6, t9 and tIl.
4. Bitemporal time-slice queries. Find all versions that are valid during the
given time interval [Ts, Te] as of a given transaction time Tt.
5. Bitemporal key-range time-slice queries. Find all versions which are in the
given key range [ks, ke] and valid during the given time interval [Ts, Te]
as of a given transaction time Tt.
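The four search predicates can be sketched against the tourist relation of Table 4.7; this illustrative snippet models NOW as infinity and reproduces the example results given above:

```python
NOW = float('inf')  # open right end of intervals still valid at present

# The tourist relation of Table 4.7: (tuple, pid, city, Ts, Te).
R = [
    ('t1',  'p1', 'New York',       0, 2),
    ('t2',  'p2', 'Washington',     5, NOW),
    ('t3',  'p3', 'New York',       0, 6),
    ('t4',  'p4', 'Detroit',        0, 7),
    ('t5',  'p5', 'Washington',     4, 6),
    ('t6',  'p5', 'New York',       7, NOW),
    ('t7',  'p6', 'Seattle',        3, NOW),
    ('t8',  'p4', 'Washington',    10, NOW),
    ('t9',  'p3', 'New York',      12, NOW),
    ('t10', 'p1', 'Los Angeles',    3, 6),
    ('t11', 'p7', 'New York',      14, NOW),
    ('t12', 'p1', 'Detroit',        7, 9),
    ('t13', 'p1', 'Detroit',       10, 12),
    ('t14', 'p9', 'Los Angeles',    3, 8),
    ('t15', 'p1', 'San Francisco', 13, NOW),
]

def intersection(ts, te):
    """Versions whose valid interval intersects [ts, te]."""
    return [t for t, _, _, s, e in R if s <= te and e >= ts]

def inclusion(ts, te):
    """Versions whose valid interval is included in [ts, te]."""
    return [t for t, _, _, s, e in R if ts <= s and e <= te]

def containment(ts, te):
    """Versions whose valid interval contains [ts, te]."""
    return [t for t, _, _, s, e in R if s <= ts and te <= e]

def point(t):
    """Versions whose valid interval contains the instant t."""
    return intersection(t, t)

# The examples from the text:
assert intersection(3, 7) == ['t2', 't3', 't4', 't5', 't6', 't7',
                              't10', 't12', 't14']
assert inclusion(3, 7) == ['t5', 't10']
assert containment(3, 5) == ['t3', 't4', 't7', 't10', 't14']
assert point(1) == ['t1', 't3', 't4']
```

A key-range time-slice query simply adds a conjunctive filter on the key attribute (for example, city) to the same time predicate.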
To answer time-slice queries, the index must be able to support retrieval
based on time. The key-range time-slice queries require the search to be based
on both key and line segments. To support valid time, an index must support
dynamic addition, deletion and update of data on the time-dimension, and
Table 4.7. A tourist relation for running examples.
tuple pid city period trans_time
t1 p1 New York [0, 2] 1
t2 p2 Washington [5, now] 1
t3 p3 New York [0, 6] 1
t4 p4 Detroit [0, 7] 2
t5 p5 Washington [4, 6] 2
t6 p5 New York [7, now] 3
t7 p6 Seattle [3, now] 3
t8 p4 Washington [10, now] 3
t9 p3 New York [12, now] 3
t10 p1 Los Angeles [3, 6] 3
t11 p7 New York [14, now] 4
t12 p1 Detroit [7, 9] 4
t13 p1 Detroit [10, 12] 5
t14 p9 Los Angeles [3, 8] 6
t15 p1 San Francisco [13, now] 6
support time that is beyond the current time. In other words, retroactive and
proactive updates are required. An index that has been designed for valid time
can be easily extended for transaction time even though a transaction database
can be thought of as an evolving collection of objects. The major differences
are that delete operations are not required for transaction time databases, and
time increases on one end dynamically as it progresses. However, it is much
more difficult to extend a transaction time index for indexing valid time data
since transaction time indexes are designed based on the fact that transaction
times do not overlap, and this property is quite often built into the index.
Further, some transaction time indexes are specifically designed for intervals
that are always appended from the current time, and do not support retroactive
updates or proactive insertions.
4.3 Temporal indexes
Without considering the semantics of time, temporal data can be indexed as
line segments based on its start time, end time, or the whole interval, together
with the time-varying attribute or time-invariant attribute. Indexing structures
based on start time or end time are straightforward and structurally similar to
existing indexes such as B+-tree [Comer, 1979]. Such an index is not efficient for
answering queries that involve time-slice since no information on the data space
is captured in the index. To search for time intervals with a given interval, a
large portion of the leaf nodes have to be scanned. To alleviate such a problem,
temporal data can be duplicated at the data buckets whose data space of time
intervals it intersects. However, duplication increases storage cost and the
height of the index, which affects the query cost. Alternatively, temporal data
can be indexed directly as line segments or mapped into point data and indexed
using multi-dimensional indexes. As such, most temporal indexes proposed so
far are mainly based on the conventional B+-tree and spatial indexes like the
R-tree [Guttman, 1984].
In this section, we review several promising indexes for temporal data. They
are the Time-Split B-tree [Lomet and Salzberg, 1989, Lomet and Salzberg,
1990b, Lomet and Salzberg, 1993], the Time Index [Elmasri et al., 1990], the
Append-Only tree [Gunadhi and Segev, 1993], the R-tree [Guttman, 1984], the
Time-Polygon tree [Shen et al., 1994], the Interval B-tree [Ang and Tan, 1995],
and the B+-tree with Linearized Order [Goh et al., 1996]. Where necessary,
we also discuss the extensions that have to be incorporated for such indexes to
facilitate retrieval by both key and time dimensions.
4.3.1 B-tree based indexes
The Time-Split B-tree. The Time-Split B-Tree (TSB-tree) [Lomet and
Salzberg, 1989, Lomet and Salzberg, 1990b, Lomet and Salzberg, 1993] is a
variant of the Write-Once B-Tree (WOBT) [Easton, 1986]. The TSB-tree is
one of the first temporal indexes that support search based on key attribute
and transaction time. An internal node contains entries of the form <att-value,
trans-time, Ptr>, where att-value is the time-invariant attribute value of
a record, trans-time is the timestamp of the record, and Ptr is a pointer to a
child node [Lomet and Salzberg, 1989].
Searching algorithms are affected by how a node is split and the information
it captures about its data space. Therefore, we shall begin by looking at the
splitting strategy. In the TSB-tree, two types of node splits are supported: key
value and time splits. A key split is similar to a node split in a conventional
B+-tree where a partition is made based on a key value. A TSB-tree after a key
split is shown in Figure 4.1. For the time split, an appropriate time is selected
to partition a node into two. Unlike a key split, all record entries that persist
through the split time are replicated in the new node, which stores entries with
time greater than the split time. Figure 4.2 shows TSB-tree time splitting, in
which the record <p1, Detroit, 4> is duplicated in both the historical and new nodes.
If the number of distinct attribute values in a node is more than ⌊M/2⌋ (where M is
TEMPORAL DATABASES 123
[Figure content: an index page over data pages holding p1 New York T=1, p2 Washington T=1 and p3 New York T=1, after insertion of record <p9, Los Angeles, 6>.]
Figure 4.1. A key split of a leaf node in the TSB-tree based on p3.
the maximum number of entries in a node), a key split is performed; otherwise
the node is split based on time. If no split time can be used other than the lowest
time value among the index items, a key split is executed instead of a time split.
To search based on key and time, the index keys and times of internal nodes are
used to guide the search. With data replication, data whose time
intersects the data space defined in an index entry is properly contained in
its subtree, and this enables fast search-space pruning.
The TSB-tree can only support transaction times in the sense that times of
the same invariant key must strictly be in increasing order. In other words,
there is no time overlapping among versions of a record. When a record is
updated, the existing record becomes a historical record, and a new version
of the record is inserted. The TSB-tree can answer all the basic queries on
transaction time and time-invariant key.
The major problem of the TSB-tree is that data replication could be severe,
and hence this may affect its storage requirements and query performance. As
noted, the index cannot be used for valid time data.
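The split-selection rules just described can be sketched as follows. The flat entry lists, the capacity constant M, and the helper names are illustrative assumptions, not the TSB-tree's actual page layout.

```python
# A minimal sketch of the TSB-tree split decision: key split when the node
# holds many distinct keys (or no usable split time exists), time split
# otherwise, replicating versions that persist through the split time.

M = 4  # assumed maximum number of entries per node

def split_node(entries, split_time):
    """Split an overflowing node.

    entries: list of (key, trans_time, payload) tuples.
    Returns (old_node, new_node) after a key split or a time split.
    """
    distinct_keys = {key for key, _, _ in entries}
    lowest_time = min(t for _, t, _ in entries)

    # Key split: enough distinct keys, or no split time above the lowest time.
    if len(distinct_keys) > M // 2 or split_time <= lowest_time:
        keys = sorted(distinct_keys)
        median = keys[len(keys) // 2]
        left = [e for e in entries if e[0] < median]
        right = [e for e in entries if e[0] >= median]
        return left, right

    # Time split: the version of each key alive at split_time is replicated
    # in the new (current) node; the old node becomes historical.
    current_version = {}
    for key, t, payload in sorted(entries, key=lambda e: e[1]):
        if t <= split_time:
            current_version[key] = (key, t, payload)  # latest version so far
    historical = [e for e in entries if e[1] <= split_time]
    new = list(current_version.values()) + [e for e in entries if e[1] > split_time]
    return historical, new
```

Run on the data of Figure 4.2, the time split at T=5 replicates <p1, Detroit, 4> in both resulting nodes, as the text describes.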
The Time Index. Elmasri et al. [Elmasri et al., 1990] proposed the time
index to provide access to temporal data valid in a given time interval. The
technique duplicates the data on some selected time intervals and indexes them
using a B+-tree-like structure. Duplications not only incur additional cost
124 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
[Figure content: data pages with versions p1 New York T=1, p1 Los Angeles T=3 and p1 Detroit T=4; record <p9, Los Angeles, 6> is inserted and T=5 is chosen as the split time.]
Figure 4.2. Time splitting in the TSB-tree.
in insertion and deletion, but also degrade the space utilization and query
efficiency. In the worst case, where all intervals start at different instants but end
at the same instant, the storage cost is of order O(n²).
As for the query operation, reporting all intersections with a long interval
requires on the order of O(n²) operations, since most of the buckets need to be searched. To
reduce the number of duplications, an incremental scheme is adopted which
only allows the leading buckets to keep all their id's, whereas others maintain
the starting or ending instants [Elmasri et al., 1990]. Figure 4.3 depicts the
time index constructed using the most current snapshot of the tourist relation
in Table 4.7. In the figure, the "+" and "-" signs indicate the starting instant
and ending instant of an interval respectively. The number of duplications
has been reduced; however, there are still many duplications for tuples having
long intervals. To search from an instant onward, all the leading id buckets
belonging to the same leaf node have to be read and checked. For instance, the
query "Find all persons who were in the United States from day 4 to day 6" can
be answered by locating indexing point 4, and reconstructing the list of valid
tuples from the leading bucket and subsequent entries right up to indexing
point 6. To insert or delete a long time interval, the number of leading id
buckets to be read and updated can be high, on the order of O(n).
The time-index is likely to be efficient for short query intervals and short time
intervals. For long data intervals, the amount of duplication can be significant.
[Figure content: a sequence of indexing points with leading id buckets and incremental entries such as (t1, t4, t7), (+t10), (+t2, -t1, -t10), (t2, t7), (+t8, +t11, -t7), (t2, t11, t12), (+t14), (+t13).]
Figure 4.3. The time index constructed from the tourist relation.
This will affect query efficiency as the tree becomes taller and the number of
leaf nodes increases. In addition, index support is provided for only a single
notion of time (in this case, valid time) and it is not clear how this can be
naturally extended to support temporal queries involving both transaction and
valid time. Elmasri et al. [Elmasri et al., 1990] also suggested that their time
index can be appended to regular indexes to facilitate processing of historical
queries involving other non-temporal search conditions. For example, if queries
such as "Find all persons who entered the United States via LA and remained from
day 4 to day 6" are expected on a regular basis, they may be supported by
attaching a time index structure to each leaf entry of a B+-tree constructed for
the attribute city. Answering the above query involves traversing the first B+-
tree to identify the leaf entry corresponding to attribute value "LA", followed
by an interval search on the time index found there. However, this approach
may not be scalable since the number of time indexes will certainly grow to be
exorbitantly large in any nontrivial database.
The Append-Only tree. The Append-Only tree (AP-tree) [Gunadhi and
Segev, 1993] introduced by Gunadhi and Segev is a straightforward extension
of the B+-tree for indexing append-only valid time data. In an AP-tree, leaf
nodes of the tree contain all the start times of a temporal relation. In a non-leaf
node, the pointer of each time value points to a child node in which this time value
is the smallest (this rule does not apply to the first child node of each
index node). The AP-tree is illustrated in Figure 4.4.
Since both the update of an existing record and insertion of a new version will
only cause incremental append to the database, every insertion to the AP-tree
[Figure content: an AP-tree of order 3 whose internal node holds time values 0, 3, 4 and 5; leaf entries t1, t3, t4 represent tuples with Ts=0, and t7, t10 represent tuples with Ts=3.]
Figure 4.4. An AP-tree structure of order 3.
will always be performed directly at the rightmost leaf node. All the subtrees
but the rightmost one of the AP-tree are 100% full. When the rightmost leaf
node is full, the node is not split, but instead a new rightmost leaf node is
created and is attached to the most appropriate ancestor node. Therefore, the
AP-tree may not be height-balanced. One such example is shown in Figure 4.5.
The AP-tree structure is simple and is small in the sense that it does not
maintain additional information about its data space. However, searching for
a record can be fairly inefficient. To search for a record whose interval falls
within a given time interval as in a time-slice query, the end time of the search
interval is used to get the leaf node that contains the record whose start time
is just before the search end time. From that node, the leaf nodes on its left
are scanned.
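The time-slice search just described can be sketched by modeling the AP-tree simply as its ordered leaf level; `bisect` stands in for the descent through internal nodes, and all names are illustrative assumptions.

```python
import bisect

# A sketch of the AP-tree time-slice search: locate the rightmost start time
# not after the search end time, then scan the leaf entries leftwards,
# keeping tuples whose intervals are still alive at the query start.

def time_slice(start_times, tuples_at, q_start, q_end, end_time_of):
    """Report tuples whose interval intersects [q_start, q_end].

    start_times: sorted distinct start times (the leaf level).
    tuples_at:   dict start_time -> list of tuple ids starting then.
    end_time_of: dict tuple id -> its end time.
    """
    pos = bisect.bisect_right(start_times, q_end) - 1
    result = []
    for i in range(pos, -1, -1):
        for tid in tuples_at[start_times[i]]:
            if end_time_of[tid] >= q_start:
                result.append(tid)
    return result
```

The leftward scan cannot be pruned by start times alone, since an arbitrarily early start time may still pair with a late end time; this is why the text calls the search fairly inefficient.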
To answer queries involving both key and time-slice, a two-level index tree
called the nested ST-Tree (NST) was proposed. The first level of an NST is
a B+-tree that indexes key values, and the second level is an AP-tree that
indexes temporal data that correspond to records with the same key value. In
the B+-tree, each leaf node entry has two pointers, with one pointing to the
current version of the record with this key, and the other pointing to the root
node of the AP subtree. A query involving only key value can directly access
the most recent version of the record through the B+-tree. Figure 4.6 shows the
structure of the NST. An index structure similar to the NST was also proposed
to index time-varying attributes and time. Since the temporal attribute is not
unique, the qualified tuples will have overlapping associated time intervals.
[Figure content: (a) insertion of start time 12 into a full AP-tree; (b) insertion of start times 13 and 14. A new rightmost leaf node is attached to an ancestor without rebalancing.]
Figure 4.5. Append in the AP-tree.
The AP-tree only supports monotonic appending with increasing time values.
Therefore, the variety of update operations that can be supported is limited.
The basic AP-tree itself can support queries involving only time-slice. Even
so, the search for time-slice queries is not efficient. A more expensive structure
such as the NST has to be used to answer key-time queries. Clearly, for
time-slice queries, it is more efficient to use the AP-tree than the NST.
On the other hand, for key-range time-slice and past-versions queries, the
NST is superior. We use the term AP-tree to refer to either of them,
and the context determines which structure we are referring to.
The Interval B-tree. The Interval B-tree [Ang and Tan, 1995], based on the
interval tree [Edelsbrunner, 1983], was proposed for indexing valid time intervals.
The underlying structure of the interval B-tree is a B+-tree constructed
from the end points of the valid time intervals.
The interval B-tree consists of three structures: primary structure, secondary
structure and tertiary structure. The primary structure is a B+-tree which is
[Figure content: a B+-tree for the key index, AP-trees for the time index, and the data tuples.]
Figure 4.6. A nested ST-tree structure.
used to index the end points of the valid time intervals. Initially, it has one
empty leaf node. New intervals are inserted into this leaf node. When it
overflows, a parent node of this leaf is created, and the middle value of the
points, say m, is passed up into the newly created index node. The valid time
intervals that fall to the left of m are placed in the left leaf bucket, and those falling to
the right of it are placed in the right leaf bucket. Intervals spanning m will be
stored in a secondary structure attached to m in the index node. Figure 4.7
shows the interval B-tree after inserting tuples t1, t2, t3 and t4 of Table 4.7.
Suppose the bucket capacity is 3. When t4 is inserted, the leaf bucket overflows,
and 6, the middle value of {0, 0, 5, 6, 7, now}, is chosen as the item for the
index node. The tuple t1 is stored in the left child of the new index node, while
t2, t3 and t4 are in the secondary structure of index item 6. At this moment,
the right leaf bucket is empty because no intervals fall to the right of 6.
[Figure content: the primary B+-tree with an index bucket holding item 6, a secondary structure containing t2[5, now], t3[0, 6] and t4[0, 7], and the left and right leaf buckets.]
Figure 4.7. An interval B-tree after inserting t1, t2, t3 and t4.
After the creation of the first index node, any further interval insertion will
proceed from the root node of the primary structure. If an interval spans over
an index item, it is attached to the secondary structure of this item. A long
valid time interval may span over several index items; however, it should be
attached to only one of them. The rule is as follows. All the items in an index
node are maintained as a binary search tree called the tertiary structure. The
first item that entered this index node is the root of the binary search tree, and
the subsequent items having smaller (larger) values will be in the left (right)
subtree. Thus, in this binary search tree, the first item found to be spanned by
the valid time interval is used to hold it. Figure 4.8 shows insertion of the rest
of the tuples in Table 4.7.
After insertion, the root of the binary tree in the tertiary structure is 6.
Suppose we have a tuple t16 with time interval [5, 15] to insert. Although the
period covers both 6 and 12 in the index node, since 6 is encountered first in
the binary tree of the tertiary structure, the tuple is attached to 6.
The efficiency of the index is heavily dependent on the distribution of data
and the values picked as index items. A poor choice of index values may cause most
of the intervals to be stored in the secondary structures, resulting in a small
B+-tree with large secondary structures.
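The attachment rule can be sketched as a walk over the tertiary binary search tree; the class and function names below are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of attaching an interval via the tertiary structure:
# index items form a binary search tree in arrival order, and an interval
# is attached to the first item it spans along the search path.

class TertiaryNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.secondary = []  # intervals attached to this index item

def insert_item(root, value):
    """Insert an index item, in arrival order, into the tertiary BST."""
    if root is None:
        return TertiaryNode(value)
    if value < root.value:
        root.left = insert_item(root.left, value)
    else:
        root.right = insert_item(root.right, value)
    return root

def attach_interval(root, start, end, tid):
    """Attach interval [start, end] to the first spanned item found."""
    node = root
    while node is not None:
        if start <= node.value <= end:
            node.secondary.append(tid)   # first spanned item wins
            return node.value
        node = node.left if end < node.value else node.right
    return None  # spans no index item; would go into a leaf bucket
```

With items 6 (inserted first, hence the root) and 12, the interval [5, 15] spans both but is attached to 6, matching the example in the text.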
[Figure content: the primary structure with index items 6 and 12, the tertiary binary search tree rooted at 6, secondary structures holding t2[5, now), t3[0, 6), t4[0, 7), t5[4, 6), t7[3, now), t10[3, 6) and t14[3, 8) at item 6, and t6[7, now), t8[10, now), t9[12, now) and t13[10, 12) at item 12, together with the leaf buckets.]
Figure 4.8. The interval B-tree after insertion of all tuples.
B+-tree with Linear Order. Temporal data can also be linearized so that
the B+-tree structure can be employed without any modification. Goh et al.
[Goh et al., 1996] adopted this approach, which involves three steps: mapping
temporal data into a two-dimensional space, linearizing the points, and building
a B+-tree on the ordered points.
In the first step, the temporal data is mapped into points in a triangular
two-dimensional space: a time interval [Ts, Te] is transformed to a point
[Ts, Te - Ts]. Figure 4.9 illustrates the transformation of the time intervals to the
spatial representation for the tourist relation. The x-axis denotes the discrete
time points in the interval [0, now], and the y-axis represents the time duration
of a tuple. The points on the line named time frontier represent tuples with
ending time of now. The time frontier will move dynamically along with the
progress of time.
In the second step, points in the two-dimensional space are mapped to a
one-dimensional space by defining a linear order on them. Given two points,
P1(x1, y1) and P2(x2, y2), the paper proposes three linear orders:
• D(iagonal)-order (<D). P1 <D P2 iff (a) (x1 + y1) < (x2 + y2); or (b) (x1 +
y1) = (x2 + y2) and x1 < x2.
[Figure content: the triangular two-dimensional space, with the x-axis giving discrete time points in [0, now] and the y-axis giving interval durations; a dashed time frontier marks tuples ending at now, with Tn lying outside it.]
Figure 4.9. Spatial representation of the tourist relation.
• V(ertical)-order (<V). P1 <V P2 iff (a) x2 + y2 = now and x1 < x2; or
(b) x1 + y1 ≠ now and x2 + y2 ≠ now and x1 < x2; or (c) x1 + y1 ≠ now
and x2 + y2 ≠ now and x1 = x2, and y1 < y2.
• H(orizontal)-order (<H). P1 <H P2 iff (a) x2 + y2 = now and y1 < y2; or
(b) x1 + y1 ≠ now and x2 + y2 ≠ now and y1 < y2; or (c) x1 + y1 ≠ now
and x2 + y2 ≠ now and y1 = y2, and x1 < x2.
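As a rough sketch, the interval-to-point mapping and the D-order can be realized as a sort key, with a plain sort standing in for the B+-tree built over the resulting total order; the V- and H-orders would be analogous comparison rules. Function names are illustrative assumptions.

```python
# Map each interval [Ts, Te] to the point (Ts, Te - Ts) and order points
# by the D(iagonal)-order: by the diagonal x + y, ties broken by x.

def to_point(ts, te):
    """Map a time interval to its triangular-space point."""
    return (ts, te - ts)

def d_order_key(point):
    """Sort key realizing the D-order."""
    x, y = point
    return (x + y, x)

# Example: intervals [0, 6], [5, 9] and [3, 6] map to points whose
# diagonals x + y equal the interval end times 6, 9 and 6.
points = [to_point(0, 6), to_point(5, 9), to_point(3, 6)]
ordered = sorted(points, key=d_order_key)
```

Since x + y = Ts + (Te - Ts) = Te, the D-order effectively sorts points by their end times, which is why queries on end times (such as "left on or after day 5") map to contiguous runs in the linearized space.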
[Figure content: the sweep patterns over the triangular space for (a) the D-order, (b) the V-order and (c) the H-order.]
Figure 4.10. The three orderings for points in the two-dimensional space.
Figure 4.10 provides a graphic representation of the three linear orders de-
fined above. Clearly, by linearizing the points using any of the above orders,
we can construct a B+-tree on the temporal data. For instance, if we order the
Figure 4.11. Organizing the spatial representation of the tourist relation using a B+-tree
and linearizing using the D-order.
points of the tourist relation using the D-order, the resultant B+-tree structure
is depicted in Figure 4.11.
A temporal query can be mapped to a spatial search on the two-dimensional
space, which in turn can be translated to a range search operation on the linear
space defined by the ordering relation. For example, consider the query "Find
all persons who left the United States on or after day 5." This query can be
efficiently handled by traversing the D-order B+-tree and retrieving all points in
the interval [(0,5), (14, 0)]. However, not all temporal queries can be efficiently
handled using the D-order. For example, consider the query "List all persons
who entered the United States on or before day 5". The D-order performs
poorly for this query, while the V-order is superior. The paper suggests that
different indexes (constructed using different ordering relations) be used to
support the various types of queries.
The main advantage of this method is the ease with which this indexing
scheme can be implemented using existing DBMSs. The performance analysis
shows that it is more efficient than the time index in terms of both storage
utilization and query efficiency. However, the index is more suitable for valid
times, which are mostly closed intervals. For data with open intervals, expen-
sive reorganization is necessary.
4.3.2 Spatial index based indexing methods
The R-tree. Unlike spatial applications where non-spatial data are usually
stored and indexed separately from spatial data, temporal attribute data such
as time-invariant key and time-varying key are indexed together with temporal
data. The time dimension can be viewed as one of the dimensions in a multi-
dimensional space and indexed using some existing methods [Rotem and Segev,
1987].
In this section, we discuss how the R-tree [Guttman, 1984] can be used to
index temporal data. The R-tree is a multi-dimensional generalization of the
B-tree that preserves the height-balance property. A detailed description of the
R-tree can be found in Chapter 2.
For temporal applications, to index temporal data and its key, the R-tree
can be implemented as a two-dimensional R-tree (2-D R-tree) or a three-
dimensional R-tree (3-D R-tree). To use a 2-D R-tree, time intervals [Ts, Te]
are treated as line segments in a two-dimensional space, with keys on the
other dimension. To index temporal data using a 3-D R-tree, the time intervals
and keys have to be mapped into points (key, Ts, Te) in a three-dimensional
space. Figure 4.12 shows examples of data partitioning for the tourist relation
(see Table 4.7).
Both implementations can handle the pure time query, key-time query and
pure key query of the query set. For the 2-D R-tree, all searches are performed
as intersection search. For the 3-D R-tree, search intervals must be mapped
into the search regions in the triangular space. Figure 4.13 shows the query
regions on the time dimension for the four search operations. As an example,
consider the intersection search. Let the query time interval be [QTs, QTe].
For an interval in the database to intersect the query interval, either its end
time must be in the interval or its start time must be in the interval. Thus, no
record with end time less than QTs needs to be considered, and no record with
start time after QTe needs to be examined. We then have the query region as
indicated by the shaded portion.
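The pruning argument above amounts to a rectangular query region in the (Ts, Te) plane, which can be sketched as follows; TMAX is an assumed stand-in for the largest representable time value.

```python
# A sketch of the intersection search for the 3-D mapping: a stored interval
# [Ts, Te] intersects the query [QTs, QTe] exactly when Te >= QTs and
# Ts <= QTe, i.e. the point (Ts, Te) lies in the shaded rectangular region
# of Figure 4.13(a).

TMAX = 10**9  # assumed upper bound on time values

def intersection_query_region(q_ts, q_te):
    """Return the search region as ((ts_lo, ts_hi), (te_lo, te_hi))."""
    return ((0, q_te), (q_ts, TMAX))

def intersects(ts, te, q_ts, q_te):
    """Test whether stored interval [ts, te] falls in the query region."""
    (ts_lo, ts_hi), (te_lo, te_hi) = intersection_query_region(q_ts, q_te)
    return ts_lo <= ts <= ts_hi and te_lo <= te <= te_hi
```

An R-tree would test node MBRs against this region to prune subtrees; the predicate here tests individual points for clarity.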
Here it is important to note that the R-tree cannot directly handle intervals
with open end-time. An entry in the internal node of the R-tree contains an
MBR that describes the data space of its child node. When data intervals are
not closed, the MBR cannot be defined properly, and this affects the splitting
algorithm, which makes use of space coverage to distribute the data into two
groups. It is possible to use the current time, or the largest time arising from
proactive insertions, as an estimate during node splitting and data insertion.
One of the characteristics of temporal databases is that the historical data
is stored for a long time, and no deletion of past data is allowed. The size
of the database grows as time progresses, and so do its indexes. Kolovson
and Stonebraker proposed variants [Kolovson, 1993, Kolovson and Stonebraker,
1991] of the R-tree to index historical data. The R-tree is used to index time
intervals on one dimension and non-temporal attribute on the other. Three
variants that store some of the nodes on optical disk were proposed. The
first variant (MD-RT) maintains the whole R-tree based index structure on
a magnetic disk. There is no migration from a magnetic disk to an optical
[Figure content: (a) tuples represented as lines in a two-dimensional space, with keys p1 to p9 on one axis and time 0 to now on the other; (b) tuples represented as points (key, Ts, Te) in a three-dimensional space.]
Figure 4.12. Space partitioning in the R-tree.
[Figure content: shaded query regions in the (Ts, Te) plane, bounded by QTs, QTe and Tmax, for (a) intersection search, (b) inclusion search, (c) containment search and (d) point search.]
Figure 4.13. Query regions for the R-tree on the time dimension.
disk needed. The second variant (MD/OD-RT-1) keeps the R-tree and its root
node on the magnetic disk, and moves the left-most part of the leaf nodes to
an optical disk when the size of the R-tree index reaches the pre-defined size. All
internal nodes, except the root node, whose child nodes are entirely on the
optical disk are recursively vacuumed to the optical disk.
The third variant (MD/OD-RT-2) maintains two R-trees, both rooted on
magnetic disk. The first resides entirely on the magnetic disk, whereas the
second stores the root node on the magnetic disk and the lower-level nodes
on the optical disk. When the size of the first R-tree reaches the expected size,
all the nodes below its root node are moved to the optical disk. Meanwhile, the
references of the first R-tree's root node are inserted into the proper position of
the second R-tree. The new records will be inserted into the first R-tree while
the search operations will be performed on both R-trees.
The data is stored in the leaf nodes, and nodes do overlap in their data
space for long intervals. In the case that the interval data collections have non-
uniform length distributions, overlap between bounding rectangles can be quite
severe due to some long intervals. To handle this shortcoming, the Segment R-
tree (SR-tree) [Kolovson and Stonebraker, 1991, Kolovson, 1993] was proposed.
The SR-tree stores interval records in both non-leaf nodes and leaf nodes. An
interval I is stored in the highest level node N of a tree if it spans at least one
of the intervals represented by N's child nodes. If an interval segment spans
the region covered by a node and extends the boundary of its parent node, it
will be cut into a spanning portion and one or more remnant portions. The
portions are stored in the separate parts of the index structure. Figure 4.14
shows the case in point.
[Figure content: a line segment P spans node C and extends A's boundary; it is cut into a spanning portion, stored higher in the tree, and a remnant portion.]
Figure 4.14. An SR-tree with spanning portion and remnant portion.
An improved version of the SR-tree, called the Skeleton SR-tree, was proposed
to pre-partition the entire domain of the interval data into several sub-regions
based on estimation of the number of data records and approximation of dis-
tribution of intervals. The overlap between data space of leaf nodes is reduced.
Such an estimation may be easy to derive for certain applications (for example,
video rental) that have little variation in version lifespan. For applications
with wide variance in interval lifespan, the pre-partitioning is not effective.
The Time-Polygon index. The Time-Polygon Index (TP-Index) was proposed
to index valid time databases [Shen et al., 1994]. Like the B+-tree
with linear order, the TP-Index maps the time interval [Ts, Te] into a point
[Ts, Te - Ts] in a triangular two-dimensional space. However, the triangular
temporal space is partitioned into groups such that each group is a cluster
of data points suited to a certain search pattern. Partitioning along the X- and
Y-dimensions, and parallel to the time frontier, produces five polygonal shapes
as shown in Figure 4.15. Polygons used in the TP-index are not minimum
bounding polygons. The polygons are derived through recursive partitioning,
and can be easily merged when the tree is collapsing. The structure of the TP-
index is like that of an R-tree. Figure 4.16 shows the partition of the temporal
space and the TP-tree structure of the tourist relation. To support proactive
additions of records (for example, Tn in Figure 4.16(a)), a virtual time frontier
that assumes the largest Te (Tmax) has to be introduced, and partitions that
are adjacent to the time frontier have to be extended outward.
A-shape B-shape C-shape D-shape E-shape
Figure 4.15. The five polygon shapes in TP-tree.
The TP-index was designed solely to index valid time and handle time-
slice queries. To enable the TP-index to support the time-invariant key, it is
extended to index data in a three-dimensional space [Jiang et al., 1996]. In the
data space, the x-axis and y-axis hold the same definitions as before; the z-axis
denotes the key values of the data points in the space (see Figure 4.17).
Initially, data points are bounded in the three-dimensional temporal space.
When overflow occurs, these data points are partitioned into groups such that
each group can be stored in one data page. Partitions must cluster the data
points to be suited for temporal search patterns. There are three partitions for
the TP-tree: y-partition introduces a plane parallel to the x-z plane (called the
[Figure content: (a) partitioning of the triangular temporal space for the tourist relation, with the time frontier (now) and Tn lying outside it; (b) the corresponding TP-tree structure, with data buckets for polygons 1 to 4.]
Figure 4.16. A TP-tree for the tourist relation.
[Figure content: the three-dimensional space, with the x-y plane and time frontier (now) as before and the z-axis holding the key dimension; data points are bounded within the space.]
Figure 4.17. A three-dimensional spatial rendition of the TP-tree.
y-plane); time-partition introduces a plane parallel to the time frontier (called
the time-plane); and key-partition introduces a plane parallel to the x-y plane
(called the key-plane). The y-partition and time-partition for different bound-
ing polygons are similar to those described in [Shen et al., 1994]. Note that
after the key-partition, the shapes of the resultant bounding polygons are the
same as those before the partitioning. Searching based on time is similar to that
proposed in [Shen et al., 1994], where the search time intervals must be mapped
into appropriate query regions. The query regions for the various search operations
on the time dimension are shown in Figure 4.18. For example, consider
the query interval [QTs, QTe] for an inclusion search. Since all matching
intervals must start from QTs, those intervals that start before QTs should be
excluded. Similarly, since the query interval ends at QTe, all intervals that end
after QTe should be excluded. The resultant query region is thus the shaded
region shown in the figure.
4.3.3 Methods for bi-temporal databases
Until recently, most research on temporal indexing has addressed the indexing
problem along only one of the two time dimensions. Kumar, Tsotras
and Faloutsos [Kumar et al., 1995] proposed two access methods, the Bitemporal
[Figure content: shaded query regions bounded by QTs, QTe and Tmax for (a) intersection search, (b) inclusion search, (c) containment search and (d) point search.]
Figure 4.18. Query regions for the TP-tree.
Interval Tree and Dual R-trees, for indexing both transaction and valid time
dimensions.
The Bitemporal Interval Tree makes use of the Interval Tree [Edelsbrunner, 1983]
to index a finite set U that contains V valid time points. An interval tree
consists of a full binary tree and a number of doubly-linked lists. The V time
points are in the leaves, and each internal node contains the middle value of its
two immediate children. If the starting point of an interval falls in the left
subtree of an internal node and the ending point falls in the right subtree, the
interval is stored in the doubly-linked lists associated with this internal node.
The left and right lists contain the starting and ending points respectively.
In the Bitemporal Interval Tree, the lists are transformed into "conceptual"
lists of pages to facilitate the splitting policies of the MVBT [Becker et al.,
1993] so as to answer the bitemporal pure-time-slice (BPT) query. By carefully
paginating the whole indexing structure, the index can answer a BPT query in
O(log_b V + log_b n + a) I/O operations.
The authors also proposed a method that employs two R-trees (2-R) to
divide bitemporal records on transaction time. This method aims to eliminate
the large overlapping of the mix of rectangles with known ending transaction
time and those extending to now. A front R-tree indexes the records whose
transaction time is up to now, whereas a back R-tree indexes the records whose
transaction time lifespan is closed.
[Figure content: (a) the original representation of the time dimensions, with records t1 and t2 open along the transaction time axis and t3 closed at transaction time 3; (b) the back R-tree, holding t3; (c) the front R-tree, holding t1 and t2.]
Figure 4.19. The two R-tree method.
In Figure 4.19(a), there are three records in the bitemporal space. Records
t1 and t2 have open transaction time lifespans, and the transaction time of t3 is
closed at time 3. Note that the three records overlap along the transaction time
axis. To avoid this kind of overlapping, and so improve the performance of
the R-tree, the dual R-tree method keeps the record with a closed transaction time
range, that is t3, in the back R-tree (Figure 4.19(b)) and the records with open
transaction time ranges, that is t1 and t2, in the front R-tree (Figure 4.19(c)).
In the front R-tree, a bitemporal record can be represented as an interval line
parallel to the valid time axis. As a result, the overlapping is reduced. A
bitemporal query is answered by two searches: one for rectangles in the back
R-tree and the other for intervals in the front R-tree. The front R-tree needs
a slightly more expensive search algorithm due to the open intervals.
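The record routing implied by the 2-R method can be sketched as follows, with plain lists standing in for the two R-trees; the migration step on closing a lifespan is our assumption about how such records would be handled, and all names are illustrative.

```python
# A hedged sketch of the 2-R method: records with a closed transaction-time
# range go to the back R-tree (as rectangles); records still open at `now`
# go to the front R-tree (as valid-time intervals at a start point).

NOW = None  # marker for an open transaction end time

back_tree = []   # (tid, (tt_start, tt_end), (vt_start, vt_end))
front_tree = []  # (tid, tt_start, (vt_start, vt_end))

def insert_record(tid, tt_start, tt_end, vt_start, vt_end):
    """Route a bitemporal record to the front or back structure."""
    if tt_end is NOW:
        front_tree.append((tid, tt_start, (vt_start, vt_end)))
    else:
        back_tree.append((tid, (tt_start, tt_end), (vt_start, vt_end)))

def close_record(tid, tt_end):
    """On closing a transaction lifespan, migrate front -> back (assumed)."""
    for i, (t, tt_start, vt) in enumerate(front_tree):
        if t == tid:
            del front_tree[i]
            back_tree.append((tid, (tt_start, tt_end), vt))
            return

def bitemporal_query(tt, vt):
    """Find records alive at transaction time tt and valid time vt."""
    hits = [tid for tid, (s, e), (vs, ve) in back_tree
            if s <= tt <= e and vs <= vt <= ve]            # rectangles
    hits += [tid for tid, s, (vs, ve) in front_tree
             if s <= tt and vs <= vt <= ve]                # open intervals
    return hits
```

The two list comprehensions mirror the two searches described in the text: a rectangle test on the back structure and an open-interval test on the front structure.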
While it is difficult to extend index structures such as the AP-tree and TSB-
tree for bitemporal indexing, the R-tree and the TP-tree can be extended with
additional dimensions. For example, a 5-D R-tree or TP-tree could be used to
index the time-invariant key, the transaction time interval and the valid time
interval. However, the extension entails the redesign of more complex node splitting
algorithms and query retrieval algorithms. With an increase in the number of
dimensions, spatial indexes may not perform as well.
4.4 Experimental study
Indexes are data structures that quickly identify the locations at which indexed
data items are stored. They are therefore used as a speed-up device in query
evaluation algorithms. Desirable properties for these indexes include efficient
storage utilization and efficient query retrieval. In other words, the use of disk
space should be efficient, which indirectly determines the query efficiency of
an index, and an index must be able to answer basic queries efficiently. In
addition, index construction and update costs should not be too high, although
they are often treated as less important selection factors.
Various performance studies have been conducted. The TP-index was shown to be
superior to the Time Index for valid time databases [Shen et al., 1994].
The result is expected, as replication in the Time Index can be severe,
resulting in a much bigger tree. The Interval B-tree was shown to be more
efficient than the Time Index and the R-tree [Ang and Tan, 1995]. It is argued
that the query efficiency of the interval tree is on the order of O(log n + F),
where F is the cost of reporting the intersections.
4.4.1 Implementation of index and buffer management
Four indexes, the TSB-tree, AP-tree, 2-D R-tree and TP-tree were implemented
in C on a SUN SPARC workstation. In this section, we restrict ourselves to
the study on the indexes built on time-invariant key and transaction time.
For a large collection of temporal data (such as one million versions), the
index size can become fairly large, and it is unlikely that the entirety of the
index fits in memory. Instead, some index pages will be paged out as the tree is
traversed, and have to be re-fetched at a later time when they are re-referenced.
To reduce page re-fetching, a priority-based buffer replacement strategy [Chan
et aI., 1992] is used. The strategy employs the least useful policy (LUF policy)
and has been designed based on the way an index is traversed. For a fair
comparison, the replacement algorithm was extended for the two-level NST
index structure. Under the strategy, priorities are assigned to index pages. An
index page is useful if it will be referenced again in a traversal of an index
structure; otherwise, the index page is useless in the current traversal. Useful
pages have higher priorities than useless pages. As the main concern of the work
is minimizing the effect of page re-fetching on the performance comparison, the
buffer size was fixed at 32 pages, which is sufficient for traversing trees
with heights of up to 5 levels.
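The policy can be illustrated with a small sketch. This is only our toy reconstruction of the idea, not the actual LUF implementation of [Chan et al., 1992]: pages marked useless are evicted before any useful page, with least-recently-used order breaking ties, and the class and method names are invented.

```python
class Buffer:
    """Toy priority-based buffer in the spirit of the LUF policy:
    a useless page is always evicted before a useful one."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.pages = {}    # page id -> useful flag
        self.order = []    # access order, oldest first
        self.fetches = 0   # number of (re-)fetches from disk

    def access(self, page, useful):
        if page not in self.pages:
            self.fetches += 1                 # page must be (re-)fetched
            if len(self.pages) >= self.capacity:
                # evict the oldest useless page if one exists, else plain LRU
                victim = next((p for p in self.order if not self.pages[p]),
                              self.order[0])
                self.order.remove(victim)
                del self.pages[victim]
        else:
            self.order.remove(page)           # buffer hit: refresh recency
        self.pages[page] = useful
        self.order.append(page)
```

With a two-page buffer, accessing a useful root and two useless leaves evicts the older leaf rather than the root, so a later access to the root incurs no re-fetch.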
4.4.2 Data and query sets
The data sets employed in the study were generated using an extended version
of the Time-Integrated Testbed of the Department of Computer Science,
University of Arizona. The temporal relations were generated using Poisson
distributions with different mean values in arrival time (start time of an in-
terval) and version lifespan. Each database contains 1,000,000 versions. The
time-invariant attribute is uniformly distributed over [1,10000], and the number
of versions per key is randomly determined. For each version, its time-varying
attribute value is uniformly distributed in [1, 100000]. For each different set of
mean arrival and duration time, the data is generated with the constraint that
simulates transaction time. The data is generated in one go and pre-sorted
based on the start time. Each tuple is then inserted into the index. By doing
so, we did not have to modify the existing R-tree splitting algorithm. This is not
ideal as the latest versions of transaction time data give rise to open rather
than closed intervals. However, apart from the R-tree, the presence or absence
of open intervals does not affect the other three indexes.
Among the basic queries, we shall look at just two of them: time-slice in-
tersection queries and key-range time-slice intersection queries. Being more
general, an intersection query is expected to yield more results than the inclu-
sion, containment and point queries.
Each set of queries contains 100 queries with different keys and time ranges.
The keys are randomly picked from the key domain, that is, [1,10000]. Where there
is a key-range search, a predetermined fixed range is used to determine the
end of the range. The starting time of each time range is generated using the
Poisson distribution, together with a fixed range. Should the ending time exceed
the current time, the ending time is set to the current time.
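The generation procedure can be sketched as follows. This is an illustrative reconstruction, not the testbed's actual code; the function name and the default values not stated in the text (such as the current-time bound and the seed) are our own assumptions, and the Poisson arrival process is simulated by exponentially distributed gaps between start times.

```python
import random

def generate_queries(n_queries=100, key_domain=10000, key_range=1000,
                     mean_gap=5.0, time_range=15000, current_time=5_000_000,
                     seed=1):
    """Generate key-range time-slice queries as described in the text."""
    rng = random.Random(seed)
    queries, t = [], 0.0
    for _ in range(n_queries):
        key_lo = rng.randint(1, key_domain)        # key uniform over [1, 10000]
        key_hi = key_lo + key_range                # predetermined fixed key range
        t += rng.expovariate(1.0 / mean_gap)       # Poisson arrivals: exponential gaps
        t_end = min(t + time_range, current_time)  # clamp ending time to current time
        queries.append((key_lo, key_hi, t, t_end))
    return queries
```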
4.4.3 Some experimental results on indexing invariant keys and transaction
time
We report on some experimental results on the performance of the indexes
that are built on the time-invariant key and transaction time. For time-slice
intersection queries, the mean inter-arrival time is fixed at λ = 5, and the
mean duration times are fixed at μ = 200, 500, 1000. For key-range time-slice
intersection queries, the key range is fixed at 1000 (10% of domain) when the
effect of time range is studied, and the time range is fixed at 15000 when the
effect of key range is studied.
On time-slice intersection query. Figures 4.20a and b show the perfor-
mance of the TSB-tree, AP-tree, TP-tree and (2-D) R-tree for time-slice inter-
section search queries under mean inter-arrival times of 2 and 5, and a fixed
mean duration time of 200. Figure 4.21 shows the effect of longer lifespan on
the four indexes. The performance of all four indexes is affected by the search
time range used in the query - the longer the search range the worse the
performance.
A comparison of the results summarized in Figures 4.20 and 4.21 reveals that
while the mean duration time has little effect on a few indexes, the inter-arrival
time has significant effect on the performance of most indexes. Longer mean
inter-arrival time means less overlap in time intervals. For indexes such as
the TSB-tree and TP-tree, shorter inter-arrival times mean time intervals of
different keys are clustered closely, and the same search range intersects more
intervals and hence more pages are accessed. The performance of the 2-D R-tree
and the AP-tree is affected by the duration of time intervals. For the
R-tree which indexes time intervals as line segments, the degraded performance
is due to the fact that the minimum bounding rectangles (MBR) in the internal
nodes have more overlap for longer line segments. For the AP-tree, the opposite
effect is observed. Two factors contributed to this. First, (recall that) the data
set is non-overlapping for each key value. Second, a longer duration essentially
"stretches" the lifespan of the relation. As a result, the number of nodes to be
scanned by AP-tree is smaller for longer duration for the same query range.
It is clear that the TSB-tree performs the best. This can be attributed to
the fact that the TSB-tree has a high degree of data clustering in both key
and time dimensions. On the contrary, the AP-tree is inferior to all the other
techniques. Its page accesses exceed 2500 pages! This is because, to search for
the intervals intersecting the query interval [Ts, Te] in the AP-tree, a leaf
node is first determined using Te. All leaf nodes on its right, which contain
intervals whose start time is larger than Te, are ignored. Leaf nodes on its left
must be searched.
[Plots of page accesses against query time interval (1000 to 30000) for the TSB-tree, R-tree, TP-tree and AP-tree: (a) (λ, μ) = (2, 200); (b) (λ, μ) = (5, 200).]
Figure 4.20. Effect of arrival rate on time-slice intersection query.
[Plot of page accesses against query time interval (1000 to 30000) for the TSB-tree, AP-tree, TP-tree and R-tree.]
Figure 4.21. Effect of longer lifespan on time-slice intersection query, (λ, μ) = (5, 500).
On key-range time-slice intersection query. The results for the key-
range time-slice intersection queries are very similar to that for the time-slice
queries. Here, we shall present the results when (λ, μ) = (5, 200).
In order to see the effect of key range, the query time range is kept constant,
and similarly, to see the effect of the query time range, the key range is fixed.
Figure 4.22a shows the result when the key range is fixed at 1000, while Fig-
ure 4.22b looks at the effect of varying the key range when the time range is
fixed at 15000 time units. Like the time-slice query results, it can be observed
that the AP-tree is also more expensive than the others due to its two-level
structure. With such a structure, each AP-tree in the second level of the nested
structure is small, and many of such small trees must be searched. It can be
seen also that the AP-tree is more sensitive to the key ranges than time ranges
(see Figure 4.22b). This is logical since the first level of the nested structure
is the B+-tree for keys and the key range determines the number of AP-trees
in the second level that need to be searched. As the key range increases, the
performance deteriorates. Whereas for a fixed time range, the average number
of leaf nodes that need to be searched does not differ greatly.
The TSB-tree retains its good performance in key-range time-slice query
because of its high degree of data clustering in both key and time dimensions.
[Plots of page accesses for the TSB-tree, AP-tree, R-tree and TP-tree: (a) against time interval (1000 to 30000) with the key range fixed at 1000; (b) against key range (100 to 5000) with the time range fixed at 15000.]
Figure 4.22. Performance of intersection search in key-range time-slice query, (λ, μ) = (5, 200).
To answer past versions query efficiently, it is important to cluster data by
the time-invariant key in an indexing structure. By linking all the past versions
of a given key together, the best performance of this query can be expected.
However, although the TSB-tree, AP-tree, R-tree and TP-tree do have some
features of data clustering by key, none of them provides an explicit method to
link the historical versions of a given key. Hence, a search based on the key is
required. Among these four indexes, the AP-tree is likely to be more efficient
for the past versions query. For each key that satisfies the search condition, the
whole second level AP-tree is retrieved for all the versions.
4.5 Summary
In this chapter, we have surveyed a number of promising temporal indexes.
Many of these indexes were proposed for either valid time or transaction time
databases. Researchers have only recently started to work on indexing bitemporal
databases. For transaction time databases, the TSB-tree approach is very efficient
as it manages to keep the volume of I/O accesses low and uses tight bounding
intervals to support fast search. However, it cannot handle disjoint intervals (or
overlapping intervals) that may be present in the valid time databases. Direct
application of B-trees such as the AP-tree by indexing on a single time point
(starting or ending) is efficient in terms of storage space but is not efficient
for any search that involves intervals. Its inefficiency is due to the fact that no
information of the actual data space in the child nodes is captured for pruning
the search space. Hence, a simple time-slice search requires the scanning of a
large proportion of leaf nodes.
Spatial indexes such as the R-tree can be used for indexing both transaction
times and valid times. To index open intervals that move with current time
NOW, splitting algorithms that split nodes based on area of data space must
be re-designed to handle the situation where one side of the MBR is moving
with time. The R-tree can be used to index temporal data as line segments
or points. As indicated by the experiments, the performance of the R-tree
indexing lines is not as good as that of the TP-tree. However, should the lines
be mapped into points, its efficiency should become comparable to that of the
TP-tree.
Like other applications, data distribution affects the performance of tem-
poral indexes. For bitemporal databases, different distributions may exist for
the time-invariant keys, time-varying keys, the number of versions per key,
the arrival of new time-invariant keys, and for each key, the arrival of next
transaction-time versions and next valid-time versions, and for the relationship
between the two times, such as whether they are strongly bound [Jensen and
Snodgrass, 1994]. Generally, the distribution of time-invariant keys is likely to
be dependent on the applications, where they can be mapped into some sequential order.
Likewise, the distribution of time-varying keys is fairly dependent on the
application: some may be in increasing order (for example, salary) while others
are likely to be more random. The arrival of new keys and the arrival of new
versions tend to follow a Poisson distribution.
5 TEXT DATABASES
Text databases provide rapid access to collections of digital documents. Such
databases have become ubiquitous: text search engines underlie the online text
repositories accessible via the Web and are central to digital libraries and online
corporate document management.
Perhaps the key feature distinguishing text databases from other kinds of
database is the way in which they are accessed. Queries to conventional
databases are exact logical expressions used to satisfy information needs such
as "how many accounts have a negative balance" or "which students are en-
rolled in computer science". In contrast, queries to text databases are used
to satisfy inexact information needs such as "what is the economic impact of
recycling" or "what factors led to George Bush's loss in the 1992 presidential
election". This inexactness is not because users are unable to express needs
precisely; it is because the needs deal with imprecise real-world concepts that
cannot be described in a formal system. That is, it is usually not possible to
translate such information needs into a logical query expression that will fetch
only the documents that are answers-an information need and its answers are
not mathematically related.
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
Thus there is no exact mechanism for determining whether a document is
an answer; instead, queries to text databases are used to identify documents
that are likely to be pertinent to the query, that is, likely to be relevant. These
documents may even contradict each other-commentators may disagree as to
why Bush lost the election, for example. Thus document databases must be
designed to answer informal queries and produce the most likely answers. The
study of techniques for identifying documents that are relevant to an informa-
tion need is known as information retrieval.
Since answers have only a loose, informal correspondence to queries it follows
that the performance of query evaluation techniques is not just a consequence of
how fast they are or how economical they are with system resources. It is also
necessary to consider how good they are at identifying relevant documents,
that is, their effectiveness. The effectiveness of query evaluation techniques
can be formally measured by the proportion of retrieved documents that are
relevant and by the proportion of the relevant documents that are retrieved;
determination of relevance must be made by a human assessor. (It follows
that experiments in information retrieval are expensive, and tend to rely on
standard document collections and query sets for which relevance judgments
have been made.)
Text databases can also be used for more traditional forms of access to data.
For example, in a database of newspaper articles each document will include
the article's text, but will also include information such as authorship, date of
creation, and so on. A possible entry in a database of correspondence is shown
in Figure 5.1. Fields such as date could be queried in conventional ways and do
not require exotic query evaluation methods. It is the use of informal querying
that makes information retrieval systems different to other kinds of database.
In this chapter we describe the ways in which text databases might be ac-
cessed, kinds of queries, index structures to support these queries, and query
evaluation techniques.
5.1 Querying text databases
Simple text engines are familiar to anyone who uses the document repositories
available via the web. These engines can be used to find information about,
say, some individual-to find their home page perhaps-or to search for re-
search papers on a given topic. Typical queries are a list of keywords that the
user guesses will identify the desired information; the system responds with a
list of hits, some of which are relevant and some of which are (in the context
of the query) obviously junk. Based on information retrieval theory, the
better systems use effective query evaluation techniques that return relatively
few irrelevant documents.
From: Albert Einstein
Sender address: Old Grove Rd, Nassau Point, Peconic, Long Island
To: F.D. Roosevelt, President of the United States
Recipient address: White House, Washington D.C.
Date: 2nd August 1939
Sir:
Some recent work by E. Fermi and L. Szilard, which has been
communicated to me in manuscript, leads me to expect that the element
uranium may be turned into a new and important source of energy in
the immediate future. Certain aspects of the situation seem to call for
watchfulness and, if necessary, quick action on the part of the
administration. I believe, therefore, that it is my duty to bring to your
attention the following facts and recommendations.
In the course of the last four months it has been made
probable-through the work of Joliot in France as well as Fermi and
Szilard in America-that it may become possible to set up nuclear chain
reactions in a large mass of uranium, by which vast amounts of power
and large quantities of new radium-like elements would be generated.
Now it appears almost certain that this could be achieved in the
immediate future.
This new phenomenon would also lead to the construction of bombs, and
it is conceivable-though much less certain-that extremely powerful
bombs of a new type may thus be constructed ...
Figure 5.1. Example entry in newspaper database.
At the most abstract level, text databases are like conventional databases:
given a query, each entry in the database is compared to the query to determine
whether it is an answer. To allow this process to be efficient a data structure
known as an index is used. Central to effective information retrieval is the
ability to use all the terms (that is, words) in a document to compare it to a
query. That is, it is necessary to index every term in every document.
It is possible to automatically select a subset of the words in a document
to represent its content and to index these words only, or to manually assign
descriptive words or subject categories. However, automatic selection of
keywords is in general not successful; and, perhaps surprisingly, automatic indexing
of all words gives more effective retrieval than does manual indexing [Salton,
1989]. Moreover the cost of manual indexing for a realistically-sized database
is prohibitive. Thus searches on document databases use content-the full text
of each document-rather than descriptors of some kind.
5.1.1 Boolean queries
There are two principal approaches to querying text databases: Boolean and
ranked. Boolean query languages were for many years chosen for commercial
information retrieval systems. The basic concept is straightforward-queries
are Boolean expressions in which the atoms are words and are combined with
Boolean operators. For example the query
uranium AND
( (nuclear AND energy) OR (atomic AND bomb) )
could be used to retrieve the example document in Figure 5.1. Such queries are
effectively equivalent to conventional database queries (and as we discuss below
are evaluated in a similar way) but it is not easy for a typical user to translate
an information need into a Boolean query. Making good use of Boolean infor-
mation retrieval systems requires professional information providers who are
experts at interpreting user requests and translating them into formal queries.
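With each term mapped to the set of documents containing it, Boolean evaluation reduces to set operations. The following Python sketch (with invented document numbers) evaluates the query above:

```python
# Inverted index mapping each term to the set of documents containing it;
# the document numbers are illustrative only.
index = {
    "uranium": {3, 10, 12, 29},
    "nuclear": {5, 12, 29},
    "energy":  {12, 40},
    "atomic":  {7, 29},
    "bomb":    {29, 51},
}

def docs(term):
    """Documents containing a term (the empty set for unknown terms)."""
    return index.get(term, set())

# uranium AND ((nuclear AND energy) OR (atomic AND bomb))
answers = docs("uranium") & ((docs("nuclear") & docs("energy")) |
                             (docs("atomic") & docs("bomb")))
print(sorted(answers))  # prints [12, 29]
```

AND maps to set intersection and OR to set union, which is exactly how such queries are evaluated over inverted lists (as discussed later in the chapter).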
There are several ways in which Boolean query languages for text retrieval
can be extended to give the potential for better effectiveness. One extension of
particular value to English text is stemming or suffixing. In its simplest form,
suffixing allows partial match on strings, so that for example the query term
bomb*
would match any word starting with the string bomb. This allows users to match
variant forms of the same word, such as bomb, bombs, bombing, bombardier,
and so on. Alternatively automatic stemmers can be used; these are algorithms
that recognize the standard suffixes used in English (such as -ed, -es, -ation,
and -ness) and remove them prior to indexing [Harman, 1991, Lovins, 1968,
Porter, 1980]. Stemming is a form of word normalization; another, basic form
is case conversion.
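Both ideas can be illustrated with a toy Python sketch. This is far cruder than the published Lovins or Porter stemmers; the suffix list, the minimum-stem-length rule, and the function names are our own simplifications.

```python
# Suffixes checked longest-first so that "-ation" is tried before "-s".
SUFFIXES = ("ation", "ness", "ing", "ed", "es", "s")

def stem(word):
    """Toy suffix stripper: case conversion followed by suffix removal."""
    word = word.lower()                      # basic normalization
    for suf in SUFFIXES:
        # keep at least a 3-letter stem so "sing" is not reduced to "s"
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def matches(pattern, word):
    """Partial string match in the style of the query term bomb*."""
    if pattern.endswith("*"):
        return word.lower().startswith(pattern[:-1].lower())
    return word.lower() == pattern.lower()
```

For example, `stem` maps bombed, bombs and bombing all to bomb, while the wildcard `matches("bomb*", ...)` also accepts bombardier, just as in the example above.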
Another language extension is to allow querying on word proximity, and
in particular adjacency. In the query above, there was no requirement that
nuclear and energy be nearby in the text. If it is specified in the query
that they must be proximate or adjacent then it is more likely that retrieved
documents will contain these words as a phrase. The Boolean query languages
used in commercial text databases, such as the ISO standard 8777 or Common
Command Language, allow the user to require that two words are to be located
within any fixed number of word positions from each other.
Well-designed interfaces can also help to improve effectiveness, by for ex-
ample providing access to an online thesaurus that can be used to expand
the query. Such extensions however have no impact on the underlying query
evaluation mechanism.
5.1.2 Ranked queries
The other principal approach to text retrieval is ranking, in which a query
is an expression in natural language or a list of keywords; each document is
compared to the query and assigned a numerical similarity; and the documents
with the highest similarity values are retrieved for presentation to the user. In
contrast to Boolean queries, there is no precise delineation between answers
and non-answers; potentially every document in the database has a non-zero
similarity but only the first few documents presented for viewing (or, in the
case of information filtering [Belkin and Croft, 1992], those above a chosen
threshold) are seen by the user. There is a probabilistic assumption that the
highest-ranked documents are those most likely to be relevant; thus as the user
moves through the list of ranked documents the density of relevant documents
should diminish.
In many contexts ranked queries are simply lists of keywords, but in others
they may be substantial blocks of text. For example, the abstract of a paper-
or even a whole paper-could be used as a query to find other papers with
a similar topic; experiments with ranking have shown that longer queries are
better at identifying relevant documents. Thus a typical query might be a list
of keywords such as
nuclear atomic energy power
or a natural language description such as
Relevant documents will discuss the use of nuclear or atomic
energy as a power source.
The functions used to score documents with respect to queries are known as
similarity measures. Many years of information retrieval experiments, with
both small document collections and databases of gigabytes of text, have iden-
tified several families of effective similarity measures. (These experiments have
also shown that ranking is typically more effective than Boolean retrieval, even
for queries formulated by an expert.) We do not survey similarity measures
in this chapter, but instead illustratively focus on one: the cosine measure.
This measure is one of the most effective and has proven successful across a
wide range of databases, and is interesting because it makes use of at least as
much index information as other effective similarity measures. Discussion of
the cosine measure thus allows us to explain what information an index must
store.
Intuitively, we would like a document and query to be regarded as similar
if: most of the query terms occur in the document; they are frequent in the
document; the density of these words in the document is high; some allowance
is made for the "importance" of words, where one would usually regard a word
such as uranium to be more discriminating (and therefore more important)
than a word such as the. Mathematically these concepts can be captured as
follows. The cosine similarity of a document d and query q can be computed as

    C(q, d) = ( Σ_{t ∈ q ∩ d} w_{q,t} · w_{d,t} ) / ( W_q · W_d )

where w_{x,t} is the importance of word t in x and W_x is the length of x. In this
formulation of the cosine measure it can be seen that the numerator is high if
important words (that is, high w_{x,t} words) are in both query and document,
and that division by length ensures C(q, d) is high only if the document is dense
with query terms. Thus, given two documents containing the same query terms
with the same frequencies, the shorter of the two will have higher similarity.
Word importance is an abstract concept, but in practical ranking is effectively
captured by the formulations

    w_{q,t} = (log f_{q,t} + 1) · (log (N / f_t) + 1)    and
    w_{d,t} = log f_{d,t} + 1

Here f_{x,t} is the frequency of occurrence of t in x (that is, the number of times
term t occurs in document or query x) and there are N documents in the
database, of which f_t contain t. Thus a word that is rare in the collection
(that is, has a high inverse document frequency) or frequent in either query or
document attracts a high weight. The lengths are usually computed as

    W_d = sqrt( Σ_{t ∈ d} w_{d,t}² )

so that length is essentially a function of the number of distinct words. Note
that for a given query Wq is a constant and thus has no impact on the ranking
and is not calculated.
In principle, then, query evaluation for a query q consists of computing the
similarity C(q, d) for every document d in the database, then returning to the
user the documents with highest similarity.
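This process can be sketched directly from the formulations above. The Python code below is an illustration of ours (the names are invented); it uses natural logarithms, and the constant W_q is omitted since it does not affect the ordering.

```python
import math
from collections import Counter

def rank(query_terms, documents):
    """Rank documents (lists of terms) against a query by the cosine measure."""
    N = len(documents)
    doc_freqs = [Counter(d) for d in documents]      # f_{d,t} for each document
    f = Counter(t for c in doc_freqs for t in c)     # f_t: documents containing t
    q = Counter(query_terms)                         # f_{q,t}
    scores = []
    for d, fd in enumerate(doc_freqs):
        # document length W_d = sqrt( sum of w_{d,t}^2 over terms in d )
        W_d = math.sqrt(sum((math.log(ft) + 1) ** 2 for ft in fd.values()))
        num = 0.0
        for t, fqt in q.items():
            if t in fd:                              # sum over t in q and d
                w_qt = (math.log(fqt) + 1) * (math.log(N / f[t]) + 1)
                w_dt = math.log(fd[t]) + 1
                num += w_qt * w_dt
        scores.append((num / W_d if W_d else 0.0, d))
    return sorted(scores, reverse=True)              # highest similarity first
```

On a toy collection, a short document containing both query terms outranks a longer one containing only one of them, as the discussion of density above predicts.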
As for queries to traditional databases, it is valuable to try and improve a
ranked query before evaluating it, by removing noise and transforming it into a
better description of the information need. In particular, stopwords are usually
removed; these are frequent, non-discriminating words such as the and closed-
class or function words such as however that carry no meaning. Elimination
of stopwords has little impact on effectiveness but is important for efficiency,
because these words are so common. After stopping the query above might be
transformed to
Relevant documents discuss nuclear atomic energy power source
Stemming is as valuable for ranking as it is for Boolean queries, for example
yielding
relev document discus nuclear atom energ power source
for the query above. Elementary natural language techniques can also prove
valuable; such techniques include recognition and deletion of key phrases, such
as "we discuss" or "in this paper", and recognition of proper names and aliases,
so that for example "USA" and "United States" are indexed together. However,
while such techniques change the set of terms available for indexing, they do
not change the methods used to construct an index or to retrieve documents.
For further information on ranking and information retrieval, there are sev-
eral good textbooks [Frakes and Baeza-Yates, 1992, Salton, 1989, Salton and
McGill, 1983, van Rijsbergen, 1979, Witten et al., 1994]. Recent research de-
velopments in the area are presented in special issues of Communications of
the ACM [Fox, 1995] and Information Processing and Management [Harman,
1995a].
5.1.3 Indexing needs
The needs of querying determine the kinds of information that must be held
in an index. For both Boolean and ranked queries, the index must store every
distinct word occurring in the database and, for each word, the documents
the word occurs in. To support proximity queries the index must store the
positions at which each word occurs in each document; ordinal word numbers
are more useful than byte positions. To support ranked queries the index
must store the frequency of each word in each document. As we discuss later,
richer kinds of queries may require information about document structure. In
the following sections we describe index structures that have proved successful
for text databases, then explain query evaluation techniques that use these
structures.
5.2 Indexing
5.2.1 Inverted indexes
An index is a data structure for supporting a query evaluation technique. The
most commonly used structures for indexing text databases are inverted indexes,
[Diagram: a lexicon whose entries point to inverted lists of ordinal document numbers (for example 10, 29, 41), a mapping table, and the documents themselves.]
Figure 5.2. Arrangement of a simple inverted file.
a family of structures that can be readily adapted to each of the kinds of query-
ing discussed above. Inverted indexes are well-established-they have been used
in commercial text retrieval systems since before 1970-and in recent years re-
finements to inverted indexing have dramatically improved performance.
In outline an inverted index is extremely simple, consisting of a lexicon of the
distinct words to be indexed and for each word an inverted list of information
about that word. The lexicon must be organized to allow fast search for a
given word and each list should allow rapid processing to identify matching
documents. Thus in the most basic case the lexicon could be stored as an array
of words and each list as an array of ordinal document numbers. A mapping
table, also stored in an array, can then be used to map from document numbers
to matching documents. This arrangement is illustrated in Figure 5.2.
For example, each of the three query terms nuclear, energy, and uranium
has an entry in the lexicon (found, say, by binary search in the array) and
a corresponding pointer to the inverted list. Each list contains the document
number 12; the twelfth position in the mapping table thus points to a document
containing all of the query terms.
5.2.2 Search structures
For conventional databases, design of the search structure is crucial to perfor-
mance. For text databases, the major bottleneck is usually the fetching and
processing of the inverted lists, and any structure that allows reasonably fast
access to the distinct words of the database is likely to be satisfactory.
A typical arrangement would be to use a B-tree in which internal nodes con-
tain words and pointers to children and external leaves contain words, pointers
to inverted lists, and for each word the number of documents in the database
containing the word. For many text databases such a B-tree could easily be
held in memory, but the arrangement is also effective if space considerations
force B-tree nodes out to disk. Use of a B-tree means that the words can be
accessed in lexicographic order, allowing users to scan the lexicon and placing
words with the same root but variant suffixes together. If the lexicon is not
too large it is feasible to scan it for the strings that match a given pattern.
Other search structures have been proposed for lexicons but none offers any
clear advantage, while the logarithmic worst-case performance and good space
utilization of B-trees make them a desirable choice.
As a concrete example consider the database consisting of the 3 Gb of text
used in the first three years of the ongoing TREC information retrieval
experiment [Harman, 1992, Harman, 1995b]. This database contains just over
1,000,000 documents, and, coincidentally, just over 1,000,000 distinct words
at an average of about 9 characters each. There are around 480 × 10^6 word
occurrences in total or, discounting repetitions of words within a document,
there are 220 × 10^6 word-document pairs. (Note that figures of this kind are
to a certain extent dependent on how words are defined: whether punctuation
such as apostrophes are part of words or delimit them, for example, or whether
words are distinguished by case.)
Thus the complete TREC lexicon can be stored in the leaves of a B-tree of
around 20 to 24 megabytes, given 9 bytes for each word, 4 bytes each for a count
and a pointer, and making an allowance for space wastage. Assuming a block
size of 8 kilobytes, and therefore a branching factor of 2^8 to 2^9, the total space
for all internal nodes of the B-tree would occupy no more than 128 kilobytes
and thus even in the worst case only the leaves need be held on disk. In a basic
representation the inverted lists would contain 220 × 10^6 document identifiers of
four bytes each, or a little under 1 gigabyte in total. This high ratio of inverted
list size to lexicon size is typical of text databases, and is the reason that, in
contrast to other database applications, inverted lists are not stored directly
in the B-tree: their size would prohibit scanning of the lexicon.
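The arithmetic behind these estimates can be checked directly. In this sketch the per-entry byte counts come from the text, while the 30% wastage allowance is our own assumption.

```python
import math

# Back-of-envelope check of the lexicon figures quoted above. The per-entry
# byte counts come from the text; the 30% wastage allowance is an assumption.
WORDS = 1_000_000
LEAF_ENTRY = 9 + 4 + 4          # word + document count + list pointer, bytes
BLOCK = 8 * 1024                # 8 kilobyte B-tree blocks
WASTAGE = 1.3

leaf_bytes = WORDS * LEAF_ENTRY * WASTAGE
fanout = int(BLOCK / ((9 + 4) * WASTAGE))   # internal entry: word + child pointer
leaf_blocks = math.ceil(leaf_bytes / BLOCK)
internal_blocks = math.ceil(leaf_blocks / fanout)

print(f"leaves: ~{leaf_bytes / 2**20:.1f} MB in {leaf_blocks} blocks")
print(f"branching factor ~{fanout}; internal nodes: {internal_blocks * BLOCK // 1024} KB")
```

With these assumptions the leaves come to roughly 21 megabytes, the branching factor falls between 2^8 and 2^9, and a single level of internal nodes fits comfortably under the 128 kilobyte bound.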
5.2.3 Inverted lists
A basic inverted list consists of a series of document identifiers, as illustrated
in Figure 5.2. But such a list does not support the kinds of queries discussed
above; ranking requires word frequencies and proximity requires word positions.
Addition of frequency information to a list is straightforward: each docu-
ment identifier is followed by a frequency count for that word in that document.
Addition of word positions is only a little more difficult, but can add consider-
ably to index size: each document identifier is followed by a frequency count f,
then by f ordinal word positions. Thus the inverted list for uranium might be
3:1(61),
10:2(14,106),
12:1(9),
29:4(22,36,98,202), ...
representing that the word uranium occurs in document 3 once, at position 61;
in document 10 twice, at positions 14 and 106; in document 12 once, at position
9; and so on. The punctuation is of course only for the benefit of the reader;
the list is stored as the sequence
3 1 61 10 2 14 106 12 1 9 29 4 22 36 98 202 ...
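A sketch of how such a flat sequence is parsed back into document-frequency-positions entries, using the uranium list from the text:

```python
def decode_list(flat):
    """Parse a flat inverted list with positions back into
    (document number, frequency, positions) triples."""
    entries, i = [], 0
    while i < len(flat):
        doc, freq = flat[i], flat[i + 1]
        entries.append((doc, freq, flat[i + 2 : i + 2 + freq]))
        i += 2 + freq
    return entries

# The uranium list from the text, as stored
flat = [3, 1, 61, 10, 2, 14, 106, 12, 1, 9, 29, 4, 22, 36, 98, 202]
for doc, freq, pos in decode_list(flat):
    print(f"{doc}:{freq}({','.join(map(str, pos))})")
```

The frequency field doubles as the count of position entries to consume, which is what makes the punctuation-free representation decodable.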
For the 3 gigabytes of TREC data discussed above the index would contain
220 x 10^6 document identifiers, 220 x 10^6 frequencies, and 480 x 10^6
positions.
Query processing (explained in detail below) involves retrieving the inverted
list corresponding to each term in the query, then processing the list to extract
document numbers and, if necessary, frequencies and positions. A typical query
term occurs in up to 1% of the stored documents, and may occur in many more,
so in a larger collection the typical retrieved inverted list will contain thousands
or tens of thousands of document identifiers. Fetching and processing of these
lists is the major bottleneck in query evaluation, and any improvement can
yield big reductions in query evaluation time.
The first issue to address is the physical layout of the inverted lists on disk.
The two costs of accessing data from disk are the head-positioning time (seek
and latency) and the per-bit transfer costs. A programmer cannot directly
improve transfer costs, which on current desktop machines allow transmission
of approximately 10 megabytes per second. But repositioning of the disk head
can be largely avoided by storing each inverted list contiguously, or as close to
contiguously as the operating system will allow. A contiguous file can be fetched
around ten times faster than a file of 8 kilobyte blocks randomly scattered on a
disk, so dramatic gains can result from storing each inverted list so that it can
be fetched with a single read operation. Experimental results have shown that,
despite "interference" by the underlying file system (such as organizing files
into randomly-placed blocks and employing header blocks to locate the parts of
the file), the various optimizations used by operating systems allow large files
to be fetched at close to the maximum dictated by the transfer rate.
TEXT DATABASES 161
In some early implementations of inverted files, each list was stored as a
linked list with one node per document, resulting in both appalling perform-
ance-allowing only a few kilobytes to be fetched each second-and large in-
verted files, because of the additional requirement for pointers. It was imple-
mentations such as these that gave inverted files a reputation for inefficiency;
a related problem was that use of linked lists discouraged programmers from
maintaining inverted lists in sorted order, thus adding further to query evalu-
ation costs. However, the strategy of storing inverted lists contiguously does
present problems for update. These issues are considered further below.
Even with inverted lists stored contiguously they have significant space re-
quirements, with, in a simple implementation, 4 bytes for each word occurrence
(for the in-document position) and a further 8 bytes (for the document number
and frequency) for each word-document pair, giving approximately 4 gigabytes
for the 3 gigabyte collection described above. It is clearly desirable that this
space be reduced, not only to conserve disk usage but because reduction in
size cuts transfer costs and thus, potentially at least, reduces query evaluation
times. As a simple first step to reducing size we could question our assumptions:
why, for example, have 4 bytes for the document number? At around 1,000,000
documents 20 bits is adequate, increasing the complexity of processing the in-
verted list but reducing size significantly. Similarly, 4 bytes is excessive for a
frequency or a word position. Space can also be saved by applying a stoplist,
that is, not indexing the common words that contribute most to index size.
Such ad hoc approaches, however, will at best halve the size of the index, to
perhaps 70% of the size of the indexed data.
Much greater reductions in size-that is, compression-result from more
principled methods for efficient representation of integers [Bell et al., 1993,
Bookstein et al., 1992, Choueka et al., 1988, Moffat and Zobel, 1996, Witten
et al., 1994]. We assume in the following discussion that the numbers to be
compressed are positive integers only, but it is straightforward to adapt these
coding schemes to embrace zero and negative numbers.
One simple family of representations is the Elias codes [Elias, 1975]. The
Elias codes represent integers in a variable number of bits, and contiguous
sequences of Elias codes are uniquely decodable. The basic code is unary, in
which each number x is represented by a string of x bits. For example, below
are some numbers in decimal and their equivalent in unary.
x          unary
1          0
2          10
3          110
20         11111111111111111110
7, 3, 6    1111110110111110
In the last line is shown a sequence of numbers; although no punctuation is
given the sequence can be separated into the constituent numbers-that is, the
sequence is uniquely decodable, an essential property for any such compression
scheme.
Unary is not particularly efficient for large numbers-"large" in this context
means "about 4"-but it provides the first step in the Elias family. The next
step is the gamma code, in which each number x is factored as 2^(p-1) + d.
For example, 1 = 2^(1-1) + 0 and 20 = 2^(5-1) + 4. Storing p in unary, using
p bits, and d in binary, using p - 1 bits, gives another uniquely decodable
representation. (In all but the last line of the following table a comma is used
to separate the unary and binary parts of each gamma code, but no such
separator is required in practice.)
x          gamma
1          0,
2          10,0
3          10,1
20         11110,0100
7, 3, 6    1101110111010
The gamma code for a natural number x requires 2*floor(log2 x) + 1 bits, so
that (decimal) 1,000,000 requires 39 bits. The next Elias code is delta, in which
x is factored as for gamma but p is represented using gamma rather than unary.
x          delta
1          0,
2          100,0
3          100,1
20         11001,0100
7, 3, 6    10111100110110
Using delta, 1,000,000 is represented in 29 bits; as we discuss below this saving
can, in conjunction with other manipulations, yield excellent compression.
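The three Elias codes can be sketched in a few lines. Bitstrings are built as Python character strings for clarity rather than packed bits, so this is a demonstration of the coding rules, not an efficient implementation.

```python
def unary(x):
    # x is represented in x bits: (x - 1) one-bits and a terminating zero
    return "1" * (x - 1) + "0"

def binary(d, width):
    # d in exactly `width` bits; width may be zero
    return format(d, "b").zfill(width) if width else ""

def gamma(x):
    # Factor x as 2^(p-1) + d; store p in unary (p bits), d in binary (p-1 bits)
    p = x.bit_length()
    return unary(p) + binary(x - (1 << (p - 1)), p - 1)

def delta(x):
    # As gamma, but the prefix p is itself gamma-coded
    p = x.bit_length()
    return gamma(p) + binary(x - (1 << (p - 1)), p - 1)

def decode_gamma(bits):
    # Unique decodability: split a concatenation of gamma codes, no separators
    out, i = [], 0
    while i < len(bits):
        p = bits.index("0", i) - i + 1          # length of the unary prefix
        out.append((1 << (p - 1)) + int("0" + bits[i + p : i + 2 * p - 1], 2))
        i += 2 * p - 1
    return out

print(unary(20))                                      # 11111111111111111110
print(gamma(20), delta(20))                           # 111100100 110010100
print(decode_gamma(gamma(7) + gamma(3) + gamma(6)))   # [7, 3, 6]
```

The outputs match the tables above once the commas, which are only for the reader, are removed.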
Another family of representations is the Golomb codes [Golomb, 1966, Gal-
lager and Van Voorhis, 1975]. These codes are of particular interest because, as
we discuss below, for this application they yield optimal whole-bit compression.
In the Golomb codes a single integer parameter b is used to model the
distribution of values to be represented; this value can be approximated as

    b ~ 0.69 x (average x).

Given b, the number x is factored as 1 + (k - 1) x b + d, where 0 <= d < b.
The value k is represented in unary and d in binary; but since b may not be a
power of 2 the number of bits used to represent d can vary between floor(log2 b)
and ceil(log2 b). Computing r = ceil(log2 b) and g = 2^r - b, the value d is
encoded in r - 1 bits if d < g, and as d + g in r bits otherwise.
For example, suppose b is 11, so that r is 4 and g is 5. Then the numbers 1
to 5 are represented by the sequence of codes 0,000 to 0,100 (where the range
of suffixes is 0 to 4, represented in 3 bits each) and 6 to 11 are represented by
0,1010 to 0,1111 (for suffixes 5 to 10, in 4 bits each). The codes are uniquely
decodable and, as for all such codes, all sequences of bits are a valid code.
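A sketch of the Golomb coder just described, checked against the b = 11 example from the text:

```python
from math import ceil, log2

def golomb(x, b):
    """Golomb code for x >= 1 with parameter b (a sketch; b would be
    chosen as roughly 0.69 times the average value to be coded)."""
    k, d = divmod(x - 1, b)          # x = 1 + k*b + d, so the unary part is k+1
    r = ceil(log2(b))
    g = 2 ** r - b
    prefix = "1" * k + "0"           # unary code for the quotient part
    if d < g:                        # short (r-1 bit) suffixes for small d
        return prefix + format(d, "b").zfill(r - 1)
    return prefix + format(d + g, "b").zfill(r)

for x in [1, 5, 6, 11, 12]:
    print(x, golomb(x, 11))   # 0000, 0100, 01010, 01111, 10000
```

With b = 11 the suffixes 0 to 4 take 3 bits and 5 to 10 take 4 bits, exactly as in the worked example above.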
Variable-bit coding is a necessary tool for compression of inverted lists. How-
ever, applying variable-bit codes to inverted lists in their raw form does not
yield particularly good compression; for example, the average document num-
ber only requires one or two bits fewer than the maximum number, and as the
examples above show the coding schemes do not directly result in significant
reductions in size.
A simple property of inverted lists provides the basis for much greater com-
pression. Most of the numbers stored in inverted lists (the document numbers
and the positions) are strictly increasing; by taking the difference between
adjacent numbers of the same kind, the values to be stored become much
smaller. Our example inverted list can be written as
3:1(61),
10-3:2(14,106-14),
12-10:1(9),
29-12:4(22,36-22,98-36,202-98), ...
that is,
3:1(61),
7:2(14,92),
2:1(9),
17:4(22,14,62,104), ...
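The gap transformation and its inverse are trivial to implement; a sketch using the numbers from the example list:

```python
def to_gaps(nums):
    # Replace each value by its difference from the preceding one
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

def from_gaps(gaps):
    # Invert the transformation by running summation
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([3, 10, 12, 29]))      # [3, 7, 2, 17]
print(to_gaps([22, 36, 98, 202]))    # [22, 14, 62, 104]
print(from_gaps([3, 7, 2, 17]))      # [3, 10, 12, 29]
```

The small gap values, rather than the raw document numbers, are what the variable-bit codes above compress so effectively.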
Considering for the moment just the document numbers, the sequence resulting
from taking differences forms a Bernoulli distribution, for which the Golomb
codes are an optimal representation [Bell et al., 1993]. An inverted index con-
sisting of a lexicon and, for each indexed word, an inverted list of Golomb-coded
document numbers occupies under 10% of the size of the indexed data. For
the 3 gigabyte database discussed above such an inverted index requires about
190 megabytes. Delta codes can also be used, at a small loss of compression
efficiency. Using gamma codes for frequencies and delta codes for word posi-
tions, an inverted file typically occupies about 22% of the size of the indexed
data, or under 700 megabytes in our practical example-one sixth of the space
required for the uncompressed index.
This space saving does come at a cost: processing effort required to de-
code inverted lists. However, on current desktop machines the time spent in
decompression is more than offset by the time saved in data transfer [Moffat
and Zobel, 1996], and in new architectures the gap between processor speed
and disk transfer rates is continuing to widen, favoring the use of compression.
Thus inverted file compression saves both space and time. Further refinements
to representation of inverted files are discussed in Section 5.3.
Although the successful application of compression to inverted files is fairly
recent, compression is already used in several commercial text database systems
and some of the Internet search engines. The public-domain MG text database
system was developed to demonstrate the application of compression to this
domain [Bell et al., 1995, Witten et al., 1994].
5.2.4 Index construction
There are several possible approaches to index construction for text databases,
which can be broadly classified as either one-pass or two-pass, that is, according
to the number of times the text is inspected during index construction. We
first outline the possibilities, then describe two of the more efficient methods
in detail.
The concept of indexing has often been described as "inversion"-provision
of access to records according to content. Inversion is often implemented as a
sorting process, and indeed a common algorithm given in textbooks for gener-
ating an inverted file is as follows:
1. For each document d in the collection and each word t in d, write a pair
(t, d) to a file.
2. Sort the file with t as a primary sort key and d as a secondary sort key.
This algorithm is, however, almost absurdly wasteful-the document numbers
are already sorted, but sorting algorithms will gain little advantage from this
partial sorting. Moreover, the volume of index information dictates an expen-
sive external sort.
Better solutions use a dynamic structure containing the distinct words in
the database, where each node in the structure points to a dynamic list of
1. While the internal buffer is not full, get documents; for each docu-
ment d, extract the distinct words and for each word t,
(a) If t has already occurred in a previous document, add d to t's
document list.
(b) Otherwise add t to the structure of distinct words and create a
document list for t containing d.
2. When the internal buffer is full, write it to disk to give a partial index,
with the inverted lists stored according to word order. Clear the buffer
and return to step 1.
3. Merge the partial indexes to give the final inverted file.
Figure 5.3. Single-pass index construction algorithm using temporary files.
the document numbers containing that word. Initially the word structure is
empty; as documents are processed new words are added, and for existing
words new document numbers are added to the words' lists of occurrences
(together with the positions of the word in each document). However, in a
naive implementation the costs will still be high because of the difficulties
of maintaining structures of words and lists without frequent disk accesses.
Minimizing the use of disk is the key to fast index construction.
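The in-memory inversion step shared by both methods can be sketched as a dictionary from word to postings; the sample documents are invented for illustration.

```python
from collections import defaultdict

def invert(documents):
    """Build word -> list of (document number, word positions) postings.
    Document numbers come out sorted for free because documents are
    processed in order; the sample texts are invented for illustration."""
    index = defaultdict(list)
    for docnum, text in enumerate(documents, start=1):
        positions = defaultdict(list)
        for wordpos, word in enumerate(text.lower().split(), start=1):
            positions[word].append(wordpos)     # ordinal word positions
        for word, pos in positions.items():
            index[word].append((docnum, pos))
    return index

docs = ["enriched uranium", "uranium ore", "iron ore mining"]
index = invert(docs)
print(index["uranium"])   # [(1, [2]), (2, [1])]
print(index["ore"])       # [(2, [2]), (3, [2])]
```

The fast methods differ not in this step but in how its output is moved to disk without random accesses.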
There are two fast index construction methods, both of which use a dedi-
cated in-memory buffer as a temporary store. In the first method, shown in
Figure 5.3, the buffer is used to store complete partial indexes and the database
is processed in a single pass. Note that compression is as useful during indexing
as it is in the finished index-if the partial indexes are constructed and stored
compressed, more documents can be indexed before the internal buffer is filled,
and less temporary space is required for the partial indexes.
The main disadvantage of this method in practice is the use of temporary
space for the partial indexes, which will exceed the size of the final index because
the indexed words must be repeated between files; and further space is required
for merging. Note that given a fixed-size internal buffer the asymptotic cost of
the merging grows more quickly than does the volume of data to be indexed.
This is not usually a problem in practice because, at least historically, growth
in database size has been matched by improvements in technology, but the
single-pass algorithm is not suitable for "huge" databases.
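Step 3 of Figure 5.3, merging word-ordered partial indexes into a final index, can be sketched with a multiway merge. On disk each run would be a file read sequentially; here the runs are small in-memory lists.

```python
import heapq
from itertools import groupby

def merge_partial_indexes(runs):
    """Merge partial indexes, each a word-ordered list of (word, postings)
    pairs, into one final index. On disk each run would be a file read
    sequentially; here the runs are small in-memory lists."""
    merged = heapq.merge(*runs)                  # one word-ordered stream
    final = []
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, lst in group:
            postings.extend(lst)                 # earlier runs hold smaller
        final.append((word, postings))           # document numbers, so the
    return final                                 # result stays sorted

run1 = [("ore", [2]), ("uranium", [1, 2])]       # from documents 1-2
run2 = [("mining", [3]), ("ore", [3])]           # from document 3
print(merge_partial_indexes([run1, run2]))
```

Because each run is already word-ordered and covers an increasing range of document numbers, concatenating the matching postings preserves sorted order without further work.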
The alternative efficient method, however, has neither of these problems.
This method is outlined in Figure 5.4. Given memory for a complete lexicon
1. Extract the distinct words from each document, and for each word
count the number of documents in which it appears. (Additional statis-
tics are required if word positions are to be stored.)
2. Use the complete lexicon and occurrence counts to create an empty,
template inverted index, to be progressively filled in during the second
pass. The template index contains each distinct word and, for each
word, contiguous space for the word's document list.
3. Initialize the second pass by creating, in the internal buffer, an empty
document list for each term in the lexicon.
4. While the internal buffer is not full, get documents; for each docu-
ment d, extract the distinct words and, for each word t, add d to t's
document list.
5. When the buffer is full, write the partial index into appropriate parts
of the template index, clear the document lists, and go to step 4.
Figure 5.4. Two-pass index construction algorithm.
and for a fixed buffer to be used as a temporary store, a text database can
be rapidly indexed in two passes using no temporary disk space at all [Witten
et al., 1994]. In this method, the first pass is used to construct the lexicon
and a skeleton for the complete index. The skeleton is progressively filled in
during the second pass, by writing the contents of the buffer when it becomes
full; note that each writing of the buffer requires only a single pass through the
disk, thus minimizing disk head movement.
Both methods are highly efficient in practice, indexing about half a gigabyte
of text per hour on a large desktop machine. Indeed the principal costs tend
not to be the indexing itself but the auxiliary processes such as the parser for
extraction of words from each document.
5.2.5 Index update
Compared to records in conventional databases, each record in a text database
contains a large number of items to be indexed-usually hundreds and often
thousands or more. Index update is therefore expensive: insertion of a single
record involves changing the inverted list of every word occurring in that record.
Since these changes can increase the length of the inverted lists, so that (if
stored contiguously) they may no longer fit at their current location on disk,
update also involves moving lists to allow for such increase. The cost of update
is the most significant technical difficulty faced in implementation of a text
database system. In this section we describe approaches to update of indexes for
text databases, principally considering record insertions, as these are by far the
most common update operation to text databases: in contrast to conventional
databases, in which every record in a table may be modified daily by operations
such as "add interest to every account balance", there are no bulk updates, and
a great many text databases are used to store streams of incoming data such
as newspaper articles, court transcripts, and completed documents of one kind
or another.
There is no single clever strategy that dramatically reduces update costs
(which, for similar reasons, are also a problem for the alternative technology of
signature files). There are however several strategies for ameliorating update
costs, by using temporary space, by trading update time against query evalua-
tion time, and by deferring the availability of new documents. We now outline
some of these strategies.
Updating the index as each record is inserted is costly, but the per-record
cost rapidly diminishes if insertions are batched, say into groups of R records,
and all of the corresponding index updates handled at once. Such aggrega-
tion of updates is effective because records share many words (in particular the
common words, whose inverted lists are the most expensive to access and up-
date), and because the changes to the inverted lists can be handled in order of
appearance on disk, minimizing head movement-net seek time will be almost
unchanged compared to updating the inverted file for a single record. Varying
R trades the per-record cost of update against the delay until the record be-
comes available. In some environments, for example, it may be quite reasonable
to process all insertions overnight, in which case the amortized update cost is
negligible but the database will be unavailable while the index is modified.
In other environments, the downtime and the delay in availability of new
records are unacceptable. However, simple variants of the batching strategy
can still be used. For example, if the new records are not indexed immediately
that does not mean that they are unavailable; they can be held in a pool that
is exhaustively searched during evaluation of each query. If this pool is large
enough that exhaustive searching is an unreasonable expense, the pool can be
treated as a mini-database and indexed accordingly.
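A sketch of querying a main index together with an unindexed pool that is scanned exhaustively; the index contents and pool documents are invented for illustration, and only conjunctive queries are shown.

```python
def query_with_pool(terms, index, pool):
    """Conjunctive query over a main inverted index plus a pool of
    not-yet-indexed documents that is scanned exhaustively. `index` maps
    term -> sorted document-number list; `pool` maps document number ->
    text. All names and contents here are invented for illustration."""
    lists = [set(index.get(t, [])) for t in terms]
    hits = set.intersection(*lists) if lists else set()
    for docnum, text in pool.items():        # brute-force scan of the pool
        words = set(text.lower().split())
        if all(t in words for t in terms):
            hits.add(docnum)
    return sorted(hits)

index = {"uranium": [1, 2], "ore": [2, 5]}
pool = {7: "uranium ore deposits", 8: "iron ore"}
print(query_with_pool(["uranium", "ore"], index, pool))   # [2, 7]
```

New records become searchable immediately on entering the pool, while their index updates can be deferred and batched.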
Once we grant the existence of a pool index, further cost ameliorations are
possible. In particular, the main index can be updated on the fly, with each in-
verted list updated as the opportunity arises-when that inverted list is fetched
as part of query evaluation, for example, or when a moment of inactivity allows
the machine to schedule the update.
A further amelioration is to consider the organization of each inverted list
on disk. Contiguous storage is clearly preferable for fast query evaluation, but
does not allow the fastest update for the reasons discussed above. However, it
does allow reasonable update. A simple free-list of available space can be used
to maintain the index, for example, typically resulting in space utilization of
around 67%-an unfortunate increase in index size, but not a disaster given
the small initial size.
An alternative is to carve each list into blocks in some way. Here again
there is a trade-off, since long blocks are highly wasteful of space-the average
inverted list is kilobytes but the median is only tens of bytes-but short blocks
are in effect a linked list. One approach that has been suggested is to use a
linked list of blocks, each one twice the length of its predecessor [Faloutsos
and Jagadish, 1992]. However, if applied to all the lists this solution does not
reduce storage costs and increases query evaluation costs. To see why, consider
how the individual blocks must be allocated. Either each block size must be
stored in a separate file or blocks must be managed within a single file via a
scheme such as the buddy system; in either case significant head movements
are required to fetch a single inverted list. Moreover, in either case the trailing
block in each list will be only partially used, giving average space utilization of
75%. In the presence of update some of the blocks of each size will be unused,
further reducing space utilization. Thus the scheme uses only slightly less space
than contiguous storage but adversely impacts query evaluation. The volume
of data read and written during update is reduced (in both cases the whole
list must be read; in the contiguous case, if there is no room for expansion the
whole list must be written elsewhere, whereas in the blocked case only the end
of the list must be written), but more separate disk accesses are required for
the blocked lists.
A practical compromise is to partition only the longest lists into fixed- or
variable-length blocks, and use conventional space management strategies to
manage the rest so that these lists are stored contiguously. A block size that
reflects the organization of the underlying file system is likely to give good
performance. Note that maintaining the contents of a contiguous list in sorted
order is not a significant overhead-even if updates (as opposed to insertions)
are frequent, the cost of inserting a number into an array in memory is dwarfed
by the cost of reading or writing the array to disk-and maintaining sorted
order significantly reduces the cost of query evaluation.
5.2.6 Signature files
Our presentation of inverted files has been rather clear-cut, specifying exactly
how text should be indexed with only limited options for variations that might
improve performance. We are able to present the material in this way be-
cause, currently at least, the technology is fairly settled. There is no compet-
ing methodology for indexing text that efficiently supports evaluation of query
types such as ranking and proximity. Inverted files have not always held such
a position, however. An alternative technology for more limited applications is
signature files.
In signature files, each record is represented by a fixed-length bitstring, or
signature [Pfaltz et al., 1980]. The words in the record are hashed to decide
which bits are set to 1; a record is probabilistically likely to contain a given word
if all the bits in its signature that correspond to that word are set. As in all hash-
based methods an explicit vocabulary is not required. Naive query evaluation
requires inspection of all the signatures. However, only those bit positions
corresponding to the query terms need to be inspected, so, by transposing the
array of signatures into an array of bitslices, rapid evaluation of conjunctive
queries is possible [Roberts, 1979]. Further improvements can be obtained by
organizing the slices into a multi-level structure [Kent et al., 1990, Sacks-Davis
et al., 1987]. Once likely matches are identified these records must be retrieved
and post-processed to verify whether they contain the query terms.
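A minimal sketch of signature construction and conjunctive matching follows. The signature width, the number of bits set per word, and the use of SHA-256 are arbitrary choices for illustration; the bitslice transposition used for fast evaluation is not shown, and candidates must still be verified against the actual records.

```python
import hashlib

WIDTH = 64    # bits per signature; real systems use much wider signatures
HASHES = 3    # bits set per word; both values are arbitrary here

def word_bits(word):
    # Derive HASHES bit positions from a stable hash of the word
    digest = hashlib.sha256(word.encode()).digest()
    return {digest[i] % WIDTH for i in range(HASHES)}

def signature(text):
    sig = 0
    for word in text.lower().split():
        for bit in word_bits(word):
            sig |= 1 << bit
    return sig

def candidates(terms, signatures):
    # A record probably contains every query term if all the terms' bits
    # are set in its signature; matches may be false and must be verified.
    qbits = 0
    for term in terms:
        for bit in word_bits(term):
            qbits |= 1 << bit
    return [i for i, sig in enumerate(signatures) if sig & qbits == qbits]

docs = ["enriched uranium", "uranium ore", "iron ore mining"]
sigs = [signature(d) for d in docs]
print(candidates(["uranium"], sigs))   # always includes 0 and 1
```

There are no false negatives, so every true match appears among the candidates; false positives become more likely as signatures fill with set bits, which is one reason long, variable-length records suit this method poorly.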
Signature files are well-suited to many of the older text database appli-
cations, which featured: fixed-length documents such as abstracts; machines
with small memories and large numbers of users; and simple Boolean and adja-
cency queries. Compared to the traditional linked-list inverted files, signature
files are rather smaller and give significantly better evaluation times. How-
ever, signatures are not effective for current text applications, partly because
they are poor at indexing databases whose records vary dramatically in length;
and partly because they do not provide efficient evaluation mechanisms for the
rich query paradigms that users now expect for text databases, including not
only ranked and proximity queries but the structured-based querying discussed
below. Moreover, they are not as compact as the current inverted file imple-
mentations, which radically improve on the implementations of only a few years
ago [Zobel et al., 1992, Zobel et al., 1995a].
5.3 Query evaluation
5.3.1 Boolean queries
Boolean query evaluation is, conceptually, a straightforward application of el-
ementary algorithms. Assuming the inverted lists are stored in sorted order
(and neglecting for the moment queries involving phrases or proximity) each
operation is a simple linear merge of two sorted lists, with intersection for AND
and union for OR. The temporary space required to represent the result of the
merge is at most one slot for each document in the database.
Evaluation is only made slightly more complex by introduction of proximity
queries. An intersecting merge is used to find the documents containing the
words that must be proximate; then a comparison of positions is used to check
that the words are appropriately close within the documents. Note that the
word positions should be represented as ordinal word occurrences rather than
byte positions, or it is not possible to reliably identify whether two words are
actually adjacent.
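The merge operations and the proximity check can be sketched as follows, using ordinal word positions as the text recommends:

```python
def intersect(a, b):
    # Linear merge of two sorted document-number lists (AND)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    # Linear merge for OR
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def adjacent(pos1, pos2):
    # True if some occurrence of the second word immediately follows the
    # first, comparing ordinal word positions rather than byte offsets
    return any(q - p == 1 for p in pos1 for q in pos2)

print(intersect([3, 10, 12, 29], [10, 29, 40]))   # [10, 29]
print(union([3, 12], [10, 12]))                   # [3, 10, 12]
print(adjacent([14, 106], [15, 200]))             # True
```

For a proximity query, `intersect` first finds the common documents and the position comparison is then applied within each.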
5.3.2 Ranked queries
The principle of ranking was sketched out above: a similarity measure such as
cosine is used to allocate a numerical score to each document in the collection
with respect to the query, then the documents with the highest scores are
retrieved for presentation to the user. In this section we explain how an index
can be used to rapidly compute the scores for the highest-ranked documents.
Reformulating the cosine measure as

    C(q, d) = (sum over t in q&d of Sq,d,t) / (Wq . Wd),

where Sq,d,t = wq,t . wd,t, it can be seen that, for any document d, the value
Sq,d,t is non-zero only if t occurs in q, that is, if t is a query term. The
numerator, sum over t in q&d of Sq,d,t, can be computed considering only query
terms; thus all the information required to compute the numerators is available
in an inverted file. (For the remainder of this discussion we assume that each
inverted list consists of document-number and frequency pairs (d, fd,t), and
that position information is either not stored or is ignored by the ranking
process.) The query length Wq is
unnecessary, but the document lengths Wd must be precomputed and stored in
a separate structure; with efficient representations these lengths can be stored
in a few bits each [Moffat et al., 1994].
Using the inverted file, the cosine similarity of a document d and query q can
be computed as in the elementary ranking algorithm in Figure 5.5. An array of
accumulators is used to store, for each document in the database, the running
total of the partial sum over t in q&d of Sq,d,t. For a typical database and query, once
index processing is complete a reasonable fraction of the accumulators will be
non-zero. These accumulators are then normalized by the document lengths,
and a partial sort such as a heapsort is used to identify the k documents with
the highest cosine values.
The elementary ranking algorithm provides reasonable performance, and in-
deed has been employed in many practical information retrieval systems. How-
ever, it does have significant costs that in many environments are unacceptable,
particularly for larger document collections.

1. Create an array A of accumulators, one for each document d in the
   database, and for each d initialize Ad <- 0.
2. For each term t in the query,
   (a) Compute the term weight wq,t.
   (b) Retrieve the inverted list for t from disk.
   (c) For each entry (d, fd,t) in the inverted list, compute wd,t and
       set Ad <- Ad + Sq,d,t.
3. Divide each non-zero accumulator Ad by the document length Wd.
4. Identify the k highest accumulator values (where k is the number of
   documents to be presented to the user) and retrieve the corresponding
   documents.

Figure 5.5. Elementary ranking algorithm using an array of accumulators.

First, ranked queries are often expressed in natural language, and therefore
contain a large number of query terms; from the point of view of effectiveness
this is beneficial because increasing the number of query terms can significantly
improve the likelihood that the query will locate relevant documents. Second,
some of the query terms may
occur in a good fraction of the records in the database. The inverted lists for
these query terms must be retrieved and processed in full, and some of them
may be long. Third, the array of accumulators, which contains a floating point
value for each document in the database, is accessed frequently and randomly
and hence must be stored in memory; and a separate array is required for each
simultaneous query. Fourth, the array of document lengths must be either held
in memory or fetched in full for each query.
In combination, there is substantial use of disk traffic, for inverted list re-
trieval; memory, for accumulators and document lengths; and processor time,
for decompression, accumulator update, and accumulator normalization. We
need to consider ways to reduce all these costs.
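The elementary algorithm of Figure 5.5 can be sketched as follows. The tf x idf weighting used here is one common choice among several, and the tiny index and document lengths are invented for illustration.

```python
import heapq
import math
from collections import Counter

def rank(query_terms, index, doc_lengths, k=2):
    """Elementary accumulator-based ranking in the style of Figure 5.5.
    `index` maps term -> list of (document number, in-document frequency);
    the tf x idf weighting is one common choice, not the only one."""
    N = len(doc_lengths)
    A = [0.0] * (N + 1)                 # one accumulator per document, 1-based
    for t, f_qt in Counter(query_terms).items():
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(1 + N / len(postings))
        w_qt = f_qt * idf
        for d, f_dt in postings:        # Ad <- Ad + Sq,d,t
            A[d] += w_qt * (f_dt * idf)
    for d in range(1, N + 1):           # normalize by document length Wd
        A[d] /= doc_lengths[d - 1]
    return heapq.nlargest(k, range(1, N + 1), key=lambda d: A[d])

index = {"uranium": [(1, 2), (3, 1)], "ore": [(2, 1), (3, 2)]}
lengths = [1.5, 1.0, 2.0]               # invented document lengths
print(rank(["uranium", "ore"], index, lengths))   # [3, 1]
```

The costs listed above are all visible here: one inverted list fetched per query term, one accumulator slot per document, and a length array touched during normalization.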
An observation that allows savings in all of these resources is that a to-
tal ranking is unnecessary-in response to a given query users are only inter-
ested in a tiny subset of the document collection. Thus it is not necessary to
compute the similarity of every document. Using simple heuristics, several of
which are discussed below, it is straightforward to drastically prune the number
of accumulators required without degrading retrieval effectiveness. (However,
note that two methods can highly rank completely different documents. That
is, maintenance of effectiveness does not imply that the same documents are
fetched, but only that the same proportion of fetched documents are relevant.)
Once the number of accumulators is reduced, index reorganizations can be used
to reduce the other resource requirements.
A straightforward approach to reducing the number of accumulators is to
restrict their number to some fixed value Amax where Amax « N, the number
of documents. In simple versions of such algorithms [Moffat and Zobel, 1996],
query terms are processed in order of decreasing importance as measured by
their inverse document frequency; each (d, fd,t) pair is decoded and d, if not
previously encountered, is only allocated an accumulator if the limit Amax has
not yet been met. Thereafter only existing accumulators can be updated, and
(d, fd,t) pairs referring to other documents are ignored. Thus only documents
containing rare (high inverse document frequency) terms are allocated accu-
mulators, on the heuristic assumption that documents without such terms are
unlikely to be relevant.
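This accumulator-limiting strategy can be sketched as follows. The in-memory index layout and the simple TF×IDF partial similarity are illustrative assumptions, not the exact formulation of Moffat and Zobel; all names are hypothetical:

```python
import math

def limited_ranking(query_terms, index, N, A_max):
    """Rank documents using at most A_max accumulators.

    index: dict mapping term -> list of (doc, f_dt) pairs.
    Terms are processed in decreasing order of inverse document
    frequency (rarest first); once A_max accumulators exist,
    pairs for previously unseen documents are ignored.
    """
    # Fewest postings = rarest = highest inverse document frequency.
    terms = sorted(query_terms, key=lambda t: len(index.get(t, [])))
    acc = {}
    for t in terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(N / len(postings))
        for d, f_dt in postings:
            if d in acc:
                acc[d] += f_dt * idf
            elif len(acc) < A_max:
                acc[d] = f_dt * idf
            # else: the (d, f_dt) pair is simply discarded
    return sorted(acc.items(), key=lambda item: -item[1])
```

Documents containing only common terms never obtain an accumulator, mirroring the heuristic described above.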
Experimentally there was no impact on effectiveness with Amax set so that
only around 2% of the documents have an accumulator, thus reducing memory
requirements by about a factor of 15 (although there is only one-fiftieth of the
number of accumulators, each accumulator now requires a document number
and is stored in a sparse data structure), and eliminating some of the computa-
tional requirement for accumulator update. Since most of the (d, fd,t) pairs
in each inverted list are no longer used (particularly in the long inverted lists
of common terms), the decompression of these pairs is wasted effort. Most
of the decompression can be avoided by introducing a small amount of inter-
nal structure into each inverted list to allow the unused (d, fd,t) pairs to be
skipped, slightly increasing disk traffic but halving processing costs. This in-
ternal structure can also be used to accelerate Boolean query processing. With
these improvements the remaining important bottleneck in processing is the
disk traffic.
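The effect of such internal structure can be sketched by storing each inverted list as a sequence of blocks tagged with their document-number range, so that a block whose range contains no accumulator need not be decoded at all. The blocked layout and names below are illustrative assumptions, not the authors' exact skipping design:

```python
def update_from_skipped_list(blocks, acc, weight):
    """Update existing accumulators from a blocked inverted list.

    blocks: list of (min_doc, max_doc, pairs), where pairs is the
    (notionally compressed) run of (doc, f_dt) pairs in the block.
    A block is decoded only if its document-number range contains
    a document that already holds an accumulator, so most of the
    decompression work is skipped.  Returns the number of blocks
    actually decoded.
    """
    candidates = sorted(acc)
    decoded = 0
    for lo, hi, pairs in blocks:
        # Skip the block outright if no accumulator doc falls inside it.
        if not any(lo <= d <= hi for d in candidates):
            continue
        decoded += 1
        for d, f_dt in pairs:        # the expensive decode step
            if d in acc:
                acc[d] += f_dt * weight
    return decoded
```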
An alternative method further reduces processing costs and also reduces disk
traffic [Persin et al., 1996]. The basic idea is that by only allowing sufficiently
large Sq,d,t values to create an accumulator, the number of accumulators will be
reduced. The principle underlying "sufficiently large" is that, because accumu-
lator values grow as inverted lists are processed and because Sq,d,t values tend
to diminish if inverted lists are processed in decreasing order of inverse docu-
ment frequency, the effect of adding further Sq,d,t terms to the accumulators is
increasingly marginal: further terms are not only unlikely to bring new documents
into the top k but cannot even significantly perturb the ranking. By comparing each
Sq,d,t value to two current thresholds (one to check whether the value should
be considered at all and one to check whether it warrants a new accumulator),
small Sq,d,t values can be filtered and the number of accumulators restricted.
TEXT DATABASES 173
The thresholds are increased as inverted lists are processed. This method, like
the skipping method, drastically reduces memory requirements without de-
grading retrieval effectiveness, but it requires two parameters to control the
degree of filtering.
If the inverted files are designed appropriately disk traffic can also be dra-
matically reduced. The principle of the index design is that inverted lists are
sorted by within-document frequencies rather than by document number. For
example, consider the inverted list
(5,3)(9,2)(12,2)(16,5)(21,1)(25,2)(32,4) ,
representing that the term being indexed occurs three times in document 5,
twice in document 9, and so on. If the list is ordered first by decreasing within-
document frequencies, with a secondary sort by document number, then it
becomes
(16,5)(32,4)(5,3)(9,2)(12,2)(25,2)(21,1).
With this ordering, all of the sufficiently large Sq,d,t values in each inverted
list are at the start; once a small Sq,d,t value is reached then fetching and
processing of that inverted list can terminate. In the experiments of Persin
et al. this allowed a five-fold reduction in disk traffic and processing time.
A potential drawback to this reorganization of inverted lists is that the docu-
ment numbers are no longer sorted, so that the compression strategy described
above is not strictly applicable. However, a straightforward modification of it
yields equally good compression. First, the frequencies are stored in decreasing
order, so the duplicate frequencies are redundant and can be omitted. Sec-
ond, in practice most of the frequencies are either 1 or 2, and compressing
the sorted document numbers of a given frequency yields good space saving.
Overall, frequency sorting slightly reduces index size.
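The reorganization and the redundancy removal can be sketched together: sort by decreasing within-document frequency with a secondary sort by document number, then store each distinct frequency once, followed by the sorted document numbers that share it. This is a minimal illustration of the idea, not the exact compressed representation:

```python
def frequency_sort(postings):
    """Reorder an inverted list of (doc, f_dt) pairs by decreasing
    within-document frequency (secondary sort by document number),
    grouping it into (frequency, sorted-doc-list) runs so that
    duplicate frequencies are stored only once.
    """
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    groups = []
    for d, f in ordered:
        if groups and groups[-1][0] == f:
            groups[-1][1].append(d)
        else:
            groups.append((f, [d]))
    return groups
```

Applied to the example list above, this produces one run per distinct frequency; the sorted document numbers within each run can then be difference-encoded as usual.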
Another alternative, also based on frequency-sorted inverted lists, is to in-
terleave the processing of the inverted lists rather than process them sequen-
tially [Persin, 1996]. In the query evaluation methods described above, each
inverted list is processed sequentially from the beginning until either the list is
exhausted or the frequencies are judged to be sufficiently small that they will
not affect the ranking; once processing of an inverted list is complete, it is not
revisited. But consider two terms t and t' occurring in documents d and d' re-
spectively. Even if t is rarer than t' and has higher inverse document frequency,
so that t's inverted list is processed first, it may well be that Sq,d,t is less than
Sq,d',t' if t is much less frequent in d than t' is in d'. It follows that, if we are
to observe the principle that high Sq,d,t values should be processed first, it is
inappropriate to process the whole of the inverted list for t before commencing
the list for t'.
1. Create an empty set of accumulators.
2. For each term t in the query, identify the highest within-document
   frequency fd,t for that term and compute the partial similarity Sq,d,t.
3. While the largest unprocessed Sq,d,t value is sufficiently large,
   (a) Find the query term t with the largest unprocessed Sq,d,t value.
   (b) If there is an accumulator Ad present in the set of accumulators,
       set Ad ← Ad + Sq,d,t.
   (c) Otherwise, if the number of accumulators is less than Amax, create
       a new accumulator Ad and set Ad ← Sq,d,t.
   (d) Compute the next highest Sq,d,t value for t.
4. Divide each accumulator Ad by the document length Wd.
5. Identify the k highest accumulator values and retrieve the correspond-
   ing documents.

Figure 5.6. Interleaved ranking algorithm using limited accumulators
In interleaved ranking, processing consists of considering the partial simi-
larity values Sq,d,t in order of strictly non-increasing magnitude, independent
of the inverted lists in which they occur. Efficiency gains result from two
heuristics: limiting the number of accumulators so that only the larger Sq,d,t
values can create an accumulator; stopping when the next greatest Sq,d,t value
is sufficiently small and is unlikely to affect the relative order of the high-
est ranked documents. Whether an Sq,d,t value is "sufficiently small" can be
heuristically determined by examining the current accumulator values. An al-
ternative approach is to explicitly bound the time required to evaluate a query,
and terminate processing when the time bound is reached.
Such processing is supported by frequency-sorted indexes, in which the highest
frequencies in each list (and thus the highest Sq,d,t values in each list) are at the
start, and (d, fd,t) pairs can be retrieved from each list in decreasing order.
Interleaved query evaluation is shown in Figure 5.6.
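The algorithm of Figure 5.6 can be sketched as follows, assuming frequency-sorted in-memory inverted lists and a simple TF×IDF partial similarity. The heap-based selection of the largest unprocessed Sq,d,t value, the fixed stopping threshold, and all names are illustrative assumptions:

```python
import heapq
import math

def interleaved_rank(query_terms, index, N, lengths, A_max, threshold, k):
    """Interleaved ranking with limited accumulators (Figure 5.6).

    index maps each term to its frequency-sorted inverted list
    (decreasing f_dt), so the largest partial similarities for a
    term come first; a heap then yields the globally largest
    unprocessed S_qdt value across all lists.
    """
    heap = []
    for t in query_terms:
        postings = index.get(t)
        if not postings:
            continue
        idf = math.log(N / len(postings))
        d, f = postings[0]
        # Negate for Python's min-heap; remember position in the list.
        heapq.heappush(heap, (-(f * idf), t, 0, idf))
    acc = {}
    while heap:
        neg_s, t, i, idf = heapq.heappop(heap)
        s = -neg_s
        if s < threshold:                  # step 3: value too small, stop
            break
        d = index[t][i][0]
        if d in acc:
            acc[d] += s                    # step 3(b)
        elif len(acc) < A_max:
            acc[d] = s                     # step 3(c)
        if i + 1 < len(index[t]):          # step 3(d): next S_qdt for t
            d2, f2 = index[t][i + 1]
            heapq.heappush(heap, (-(f2 * idf), t, i + 1, idf))
    ranked = sorted(acc, key=lambda d: acc[d] / lengths[d], reverse=True)
    return ranked[:k]                      # steps 4 and 5
```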
The main potential disadvantage of interleaved ranking is that inverted lists
are fetched on demand, piecemeal, rather than with a single read. Thus fetching
the whole list at once incurs the overhead of retrieving unnecessary data, while
fetching the list at need can incur the overhead of unnecessary disk activity. In
practice, however, the problem does not appear to be significant: in most cases
all of the required (d, fd,t) pairs are in the first few kilobytes of each inverted list,
so fetching a single disk block from the start of each list is sufficient [Brown,
1995]. Moreover, in some cases not even the first block is required; if the
maximum fd,t value for each term is held with the term in the lexicon, it is
possible to identify that, for some terms, no Sq,d,t value will be sufficiently large.
These are not the only possible approaches for improving the basic ranking
algorithm. Elimination of stopwords can be used to reduce the computation
costs. However, it is sometimes difficult to determine the correct set of stop-
words for a particular document collection. For example, in a database of
articles from the Wall Street Journal within the TREC collection, the word
"text" (not a particularly common word in English) is encountered in every
document in the collection.
Other proposals have been based on dynamic stopping conditions. One is
that the number of accumulators be limited by considering only documents that
contain a term with a sufficiently high inverse document frequency [Harman
and Candela, 1990]. Another possible stopping condition is to reduce the num-
ber of (d, fd,t) pairs by computing an upper bound for the similarity of the
current document being considered, and ignoring Sq,d,t if the computed upper
bound is smaller than the weight of the least important document in the set
of answers [Lucarella, 1988]. The efficiency of the basic ranking algorithm can
also be improved using the assumption that only the k top-ranked documents are
to be retrieved [Buckley and Lewit, 1985]. In this method, query processing is
terminated when the upper bound of the similarity of the (k+1)st document be-
comes less than the similarity of the kth document. However, these schemes do
not provide the dramatic improvements given by the methods discussed above.
5.4 Refinements to text databases
5.4.1 Structure and fields
Traditional text retrieval systems regard each document as an unstructured
sequence or bag of words. However, documents consist of fields such as titles,
sections, and paragraphs. These components often conform to a hierarchical
structure that can be represented by a formal schema such as an SGML docu-
ment type definition [Goldfarb, 1990].
Compared to traditional database applications, text objects conforming to
the same schema can vary widely in both structure and size. Consider, for
example, a collection of documents relating to the technical details for the
products of a manufacturing company. These documents might include mem-
oranda, engineering reports, and surveys of technical literature, all written to
conform to the company's official proforma. They might also include other
memoranda written by office staff without reference to the official forms, letters
that have little structure in common with either of the other classes of
memoranda, documents from external sources, and so on. Yet all these docu-
ments must be searched as a single collection. The lack of uniformity among
the documents in a single collection makes indexing and retrieval more complex
than if the documents had uniform structure and size.
<letter>
<head><from>Mark Twain</from>
<to>W. D. Howells</to>
<date>15 June 1872</date>
</head>
<body><sentence> Friend Howells
</sentence> <sentence>
Could you tell me how I could get a copy of your portrait as
published in Hearth & Home? </sentence><sentence>
I hear so much talk
about it as being among the finest works of art which have
yet appeared in that journal, that I feel a strong desire to see it.
</sentence><sentence> Is it suitable for framing?
</sentence> ... </body>
</letter>
Figure 5.7. SGML document illustrating hierarchical structure.
We illustrate structure by considering a collection of documents in which
markup (such as SGML tags) is included in the text to represent the structural
information. Consider for example the document in Figure 5.7, which is a letter
consisting of a head and body. The head consists of three fields (from, to, and
date) and the body consists of a number of sentences. Each structural unit is
delimited by a start tag and an end tag. For example, a sentence starts with
a <sentence> tag and ends with a </sentence> tag. The document forms a
simple tree, in which the text is in the leaves and each structural unit is a node.
Structured documents can be queried in the traditional way, as if they were
no more than a sequence of words, but query languages can take advantage of
the structure to provide more effective retrieval. A simple example of a query
involving structure is
find documents with a chapter whose title contains the phrase "metal fatigue"
If such queries are to be evaluated efficiently they require support from indexing
mechanisms. One possibility is to use conventional relational or object-oriented
database technology to store and index the leaf elements of the hierarchical
structure, and maintain the relationships between these leaf elements and the
higher level elements of the document structure in other relations (or object
classes). Join operations can then be used to reconstruct the original docu-
ments or document components. The problem with using such technology is
that a large number of database objects may be required to store the infor-
mation from a single document, so that it is expensive both to search across
the document and to retrieve it for presentation. For these reasons specialized
indexing techniques for structured documents have been developed.
Perhaps the simplest method for supporting structure is to index the docu-
ments and process queries as for unstructured documents, so that the result of
query resolution is a set of documents that potentially match the query; these
documents can then be filtered to remove false matches. As a general prin-
ciple, it is always possible to trade the size and complexity of indexes against
post-retrieval processing on fetched documents: there is a tradeoff between the
amount of information in the index and the number of false matches that must
be filtered out at query time, and indeed for just about any class of data and in-
dex type it is possible to conceive of queries that cannot be completely resolved
using the index. It is often the case, however, that addition of a relatively
small amount of information to an index can greatly reduce the number of false
matches to process; consider how adding positional information eliminates the
need to check whether query terms are adjacent in retrieved documents. More-
over, the cost of query evaluation via inverted lists of known length is usually
much more predictable than the cost of processing an (unknown) number of
false matches. We therefore consider query evaluation techniques that involve
increased index complexity and reduced post-retrieval processing.
One approach is to encode document structure in the index. For each doc-
ument containing a given word, rather than storing the document number and
the ordinal positions at which the word occurs, it is possible to store, say, the
document number; the chapter number within the document; the paragraph
within the chapter; and finally the position within the paragraph.
Indexes for hierarchically structured documents require that considerably
more information be stored for each word occurrence, but the magnitudes of
the numbers involved are rather smaller, the "take difference and encode" com-
pression strategies can still be applied, and there is plenty of scope to remove re-
dundancy: if a word occurs twice in a document, the document number is only
stored once; if it occurs twice in a chapter, the chapter number is only stored
once; and so on. Experiments have shown that, compressed, the size of such an
index roughly doubles compared to storing ordinal word positions, from about
22% of the data size to 44% of the data size [Thom et al., 1995]. The resulting
indexes allow much more powerful queries to be evaluated directly, without
recourse to false matching.
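The redundancy removal for hierarchical coordinates can be sketched as prefix omission over sorted (document, chapter, paragraph, word) tuples: each entry records how many leading components it shares with its predecessor, so repeated document and chapter numbers are stored only once. The encoding below is a minimal illustration, not the book's exact compressed format:

```python
def prefix_omit(coords):
    """Compress sorted hierarchical coordinates, e.g. tuples of
    (doc, chapter, paragraph, word), by omitting components shared
    with the previous entry: each entry becomes (n_shared, tail...).
    """
    out, prev = [], ()
    for c in sorted(coords):
        shared = 0
        while shared < len(prev) and prev[shared] == c[shared]:
            shared += 1
        out.append((shared,) + c[shared:])
        prev = c
    return out

def prefix_restore(entries):
    """Invert prefix_omit, recovering the full coordinate tuples."""
    out, prev = [], ()
    for e in entries:
        shared = e[0]
        c = prev[:shared] + e[1:]
        out.append(c)
        prev = c
    return out
```

The omitted prefixes and the small remaining tail values are then well suited to the difference-and-encode compression schemes discussed earlier.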
Rather than encode the structural information within the inverted indexes,
another approach is to maintain simple word position indexes for each term in
the database and record the structural information in separate indexes.
In order to represent the positions of the words and the markup symbols,
the words in each document are given consecutive integer numbers and the
markup symbols are given intermediate rational numbers. Thus, for example,
a certain word might occur at position 66, the start tag for a paragraph at
position 53.5, and the end tag at 69.1, from which it can be deduced that
the word occurs in the paragraph. The positions between a start tag and the
corresponding end tag constitute an interval.
Evaluating Boolean queries with conventional text indexes involves merging
the inverted lists of the query terms. In contrast, the processing of structural
queries involves merging the inverted lists of word positions and inverted lists
of intervals. For example, processing the query
find sentences containing "fatigue"
involves merging the inverted lists of word positions for the term "fatigue" and
the inverted list of intervals for the tag sentence to identify a set of intervals
containing the word.
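Assuming both inverted lists have already been fetched into sorted Python lists (word occurrences as positions, tag occurrences as (start, end) intervals under the numbering scheme above), the containment merge might be sketched as:

```python
from bisect import bisect_left

def intervals_containing(word_positions, intervals):
    """Return the (start, end) intervals that contain at least one
    occurrence of the word.  word_positions is sorted, so the first
    candidate occurrence in each interval is found by binary search.
    """
    hits = []
    for start, end in intervals:
        i = bisect_left(word_positions, start)
        if i < len(word_positions) and word_positions[i] <= end:
            hits.append((start, end))
    return hits
```

A production merge would instead advance through both sorted lists in a single pass, but the containment test is the same.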
An approach to query on structure based on text intervals was formalized
as the GCL (Generalized Concordance Lists) model [Clarke et al., 1995]. The
GCL model includes an algebra that incorporates operators to eliminate inter-
vals that wholly contain (or are wholly contained in) other intervals. These
operators are important for efficient query processing. GCL evolved from two
earlier structured text retrieval languages developed at the University of Wa-
terloo [Burkowski, 1992, Gonnet and Tompa, 1987], one of which, the Pat text
searching system, was developed for use with the New Oxford English Dictio-
nary. Dao et al. [Dao et al., 1996] extended the GCL model to manage recursive
structures (such as lists within lists).
Compared to the approach of incorporating document structure within the
inverted indexes, the GCL model and its variants have two important advan-
tages: queries on structure only (such as "find documents containing lists") can
be evaluated efficiently using the interval index; and the GCL model does not
require that the document structure be hierarchical. On the other hand, it is
expensive to create and manipulate inverted lists of commonly occurring tags
(such as section or paragraph) that are contained in every document so that,
for hierarchical document collections, incorporating document structure within
the inverted index is likely to have performance advantages. For example, a
simple query to find sentences containing two given terms only requires, with a
hierarchical index, that the inverted lists for the query terms be retrieved and
processed; while with the interval approach it is also necessary to fetch and
process the inverted list of sentence tags.
5.4.2 Pattern matching
Standard query languages for text databases include pattern matching con-
structs such as wildcard characters and other forms of partial specification of
query terms. In particular, in both ranking and Boolean queries users often
use query terms such as comput* to match all words starting with the letters
comput, and more general patterns may also be used. A common approach
is to scan the lexicon to find all terms that satisfy the pattern matching con-
struct and then retrieve all the corresponding inverted lists. Since the lexicon
is ordered, prefix queries, where patterns are of the form X*, can be evaluated
efficiently since, with a lexicon structure such as a B-tree, all possible matching
terms are stored contiguously. However, other pattern queries can require a
linear scan of the whole lexicon. The problem, in a large lexicon, is to rapidly
find all terms matching the specified pattern.
A standard solution is to use a trie or a suffix tree [Morrison, 1968, Gonnet
and Baeza-Yates, 1991], which indexes every substring in the lexicon. Tries
provide extremely fast access to substrings but have a serious drawback in this
application: the need for random access means that they must be stored in
memory and, at typically eight to ten times the size of the indexed lexicon,
for TREC this implies up to 100 megabytes of memory. Unless speed
is the only constraint, smaller structures are preferable.
One alternative is to use a permuted dictionary [Bratley and Choueka, 1982,
Gonnet and Baeza-Yates, 1991] containing all possible rotations of each word in
the lexicon, so that, for example, the word range would contribute the original
form |range and the rotations range|, ange|r, nge|ra, ge|ran, and e|rang,
where | indicates the beginning of a word. The resulting set of strings is then
sorted lexicographically. Using this mechanism, all patterns of the form X*,
*X, *X*, and X*Y can be rapidly processed by binary search on the permuted
lexicon. The permuted lexicon can be implemented as an array of pointers, one
to each character of the original lexicon, making it about four times the size of
the indexed data. Update of the structure is fairly slow.
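The rotation and search steps can be sketched as follows. For clarity the sketch stores the rotated strings themselves rather than an array of pointers into the lexicon, and handles only *X (suffix) patterns; the function names are illustrative:

```python
from bisect import bisect_left

def build_permuted(lexicon):
    """All rotations of each '|'-prefixed word, sorted; each
    rotation is paired with the original word."""
    rots = []
    for w in lexicon:
        m = '|' + w                      # '|' marks the word start
        for i in range(len(m)):
            rots.append((m[i:] + m[:i], w))
    rots.sort()
    return rots

def match_suffix(permuted, pattern):
    """Evaluate a '*X' pattern by binary search: rotating so the
    wildcard is at the end means searching for rotations that
    begin with 'X|'.
    """
    key = pattern.lstrip('*') + '|'
    words, i = set(), bisect_left(permuted, (key,))
    while i < len(permuted) and permuted[i][0].startswith(key):
        words.add(permuted[i][1])
        i += 1
    return words
```

Patterns of the other forms (X*, *X*, X*Y) reduce to analogous prefix searches after choosing the appropriate rotation of the pattern.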
Another approach is to index the lexicon with compressed inverted files [Zo-
bel et al., 1993]. The lexicon is treated as a database that can be accessed using
an index of fixed length substrings of length n, or n-grams. To retrieve strings
that match a pattern, all of the n-grams in the pattern are extracted, the words
in the lexicon that contain these substrings are identified via the index; and
these words are checked against the pattern for false matches. This approach
provides general pattern matching at a smaller overhead, with indexes
of around the same size as the indexed data; matching is significantly slower
than with the methods discussed above but still much faster than exhaustive
search. A related approach is to index n-grams with signature files [Owolabi
and McGregor, 1988], which can have similar performance for short strings.
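The n-gram approach can be sketched as a small inverted index over the lexicon: the posting sets of the pattern's n-grams are intersected to produce candidates, which a final direct check filters for false matches. The function names and the choice of bigrams are illustrative assumptions:

```python
import re

def build_ngram_index(lexicon, n=2):
    """Inverted index mapping each n-gram to the lexicon words
    containing it."""
    index = {}
    for w in lexicon:
        for i in range(len(w) - n + 1):
            index.setdefault(w[i:i + n], set()).add(w)
    return index

def match_pattern(index, lexicon, pattern, n=2):
    """Find lexicon words matching a pattern such as 'ran*ge':
    intersect the posting sets of the pattern's n-grams, then
    discard false matches with a direct check.
    """
    grams = []
    for piece in pattern.split('*'):
        grams += [piece[i:i + n] for i in range(len(piece) - n + 1)]
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(lexicon)        # no usable n-gram: full scan
    rx = re.compile('^' + '.*'.join(map(re.escape, pattern.split('*'))) + '$')
    return {w for w in candidates if rx.match(w)}
```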
5.4.3 Phonetic matching
Pattern matching is not the only kind of string matching of value for text
databases. Another kind of matching is by similarity of sound: to identify
strings that, if voiced, may have the same pronunciation. Such matching is
of particular value for databases of names; consider for example a telephone
directory enquiry line.
To provide such matching it is necessary to have a mechanism for determin-
ing whether two strings may sound alike (that is, a similarity measure) and, if
matching is to be fast, an indexing technique. Thus phonetic matching is a form
of ranking. Many phonetic similarity measures have been proposed. The best
known (and oldest) is the Soundex algorithm [Hall and Dowling, 1980, Kukich,
1992] and its derivatives, in which strings are reduced to simple codes and are
deemed to sound alike if they have the same encoding. Despite the popularity of
Soundex, however, it is not an effective phonetic matching method. Far better
matching is given by lexicographic methods such as n-gram similarities, which
use the number of n-grams in common between two strings; edit distances,
which use the number of changes required to transform one string to another;
and phonetically-based edit distances, which make allowance for the similarity
of pronunciation of the characters involved [Zobel and Dart, 1995, Zobel and
Dart, 1996].
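Of these measures, edit distance has the most direct formulation; the sketch below is the classic dynamic-programming version with unit costs. The phonetically-based variants of Zobel and Dart alter only the substitution penalty for similar-sounding characters, which is not reproduced here:

```python
def edit_distance(s, t):
    """Number of single-character insertions, deletions, and
    substitutions needed to turn s into t, computed row by row.
    """
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete from s
                           cur[j - 1] + 1,            # insert into s
                           prev[j - 1] + (cs != ct))) # substitute
        prev = cur
    return prev[-1]
```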
An n-gram index can be used to accelerate matching, by selecting the strings
that have short sequences of characters in common with the query string to be
subsequently checked directly by the similarity measure. The speed-up available
by such indexes is limited, however, because typically 10% of the strings are
selected by the index as candidates.
5.4.4 Passage retrieval
Documents in text databases can be extremely large; one of the documents in
the TREC collection, for example, is considerably longer than Tolstoy's War
and Peace. Retrieval of smaller units of information than whole documents
has several advantages: it reduces disk traffic; small units are more likely to
be useful to the user; and they may represent blocks of relevant material from
otherwise irrelevant text. Such smaller units, or passages, could be logical units
such as sections or series of paragraphs, or might simply be any contiguous
sequence of words.
Passages can be used to determine the most relevant documents in a collec-
tion, on the principle that it is better to identify as relevant a document that
contains at least one short passage of text with a high number of query terms
rather than a document with the query terms spread thinly across its whole
length. Experiments with the TREC collection and other databases show that
use of passages can significantly improve effectiveness [Callan, 1994, Hearst and
Plaunt, 1993, Kaszkiel and Zobel, 1997, Knaus et al., 1995, Mittendorf and
Schauble, 1994, Salton et al., 1993, Wilkinson, 1994, Zobel et al., 1995b]. Use
of passages does increase the cost of ranking, because more distinct items must
be ranked, but the various techniques described earlier for reducing the cost of
ranking are as applicable to passages as they are to whole documents.
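As a minimal illustration of passage scoring, consider the simplest case of fixed-length passages: slide a window across the document and score each window by its count of query-term occurrences. This sketch is an assumption-laden simplification (real passage-ranking measures weight terms, as in the similarity measures above); all names are illustrative:

```python
def best_passage(doc_terms, query_terms, window=50):
    """Slide a window of `window` words across the document,
    counting query-term occurrences, and return the start offset
    and score of the best window (computed incrementally).
    """
    q = set(query_terms)
    hits = [1 if w in q else 0 for w in doc_terms]
    score = sum(hits[:window])
    best, best_start = score, 0
    for start in range(1, max(1, len(hits) - window + 1)):
        score += hits[start + window - 1] - hits[start - 1]
        if score > best:
            best, best_start = score, start
    return best_start, best
```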
5.4.5 Query expansion and combination of evidence
Improvement of effectiveness, that is, finding similarity measures that are better
at identifying relevant documents, is a principal goal of research in information
retrieval. Passage retrieval is one approach to improving effectiveness. Two
other approaches of importance are query expansion and combination of evi-
dence.
The longer a query, the more likely it is to be effective. It follows that it can
be helpful to introduce further query terms, that is, to expand the query. One
such approach is thesaural expansion, in which either users are encouraged
to add new query terms drawn from a thesaurus or such terms are added
automatically. Another approach is relevance feedback: after some documents
have been returned as matches, the user can indicate which of these are relevant;
the system can then automatically extract likely additional query terms from
these documents and use them to identify further matches. A recent innovation
is automatic query expansion, in which, based on the statistical observation that
the most highly-ranked documents have a reasonable likelihood of relevance,
these documents are assumed to be relevant and used as sources of further
query terms. All of these methods can improve performance, with relevance
feedback in particular proving successful [Salton, 1989].
A curious feature of document retrieval is that different approaches to mea-
suring similarity can give very different rankings and yet be equally effective.
That is, different measures identify different documents, because they use differ-
ent forms of evidence to construe relevance. This property can be exploited by
explicitly combining the similarities from different measures, which frequently
leads to improved effectiveness [Fox and Shaw, 1993].
5.5 Summary
We have reviewed querying and indexing for text databases. Since queries to
text databases are inherently approximate, text querying paradigms must be
judged by their effectiveness, that is, whether they allow users to readily locate
relevant documents. Research in information retrieval has identified statistical
ranking techniques, based on similarity measures, that can be used for effective
querying. The task of text query evaluation is to compute these measures
efficiently, or to efficiently compute heuristic approximations to these measures
that allow faster response without compromising effectiveness.
The last decade has seen vast improvements in text query evaluation and
text indexes. First, compression has been successfully applied to inverted files,
reducing the space requirements of an index with full positional information to
less than 25% of that of the indexed data, or less than 10% for an index with
only the document-level information required for ranking. This compares very
favorably with the space required for traditional inverted file or signature file
implementations. Use of compression has no impact on overall query evaluation
time, since the additional processing costs are offset by savings in disk traffic.
Also, compression makes possible new efficient index construction techniques.
Second, improved algorithms have led to further dramatic reductions in the
costs of text query evaluation, and in particular of ranking, giving savings in
memory requirements, processing costs, and disk traffic.
Currently, however, the needs of document database systems are rapidly
changing, driven by the rapid expansion of the Web and in the use of intranets
and corporate databases. We have described some of the new requirements for
text databases, including the need to index and retrieve documents according
to structure and the need to identify relevant passages within text collections.
Improved retrieval methodologies are being proposed and consequently there is
a need to support new evaluation modes such as query expansion and combina-
tion of evidence. These improvements are not yet well understood, and before
they can be used in practice new indexing and query evaluation techniques
are required. Future research in text database indexing will have to meet the
demands of these advanced kinds of querying.
Notes
1. The ongoing TREC text retrieval experiment, involving participants from around
the world, is an NIST-funded initiative that provides queries, large test collections, and
blind evaluation of ranking techniques. Prior to TREC the cost of relevance judgments had
restricted ranking experiments to toy collections of a few thousand documents.
2. Some of the online search engines, such as AltaVista, report the number of occurrences
of each query term. Currently (the start of 1997) these numbers often run up to a million or
so, against a database of around ten million records, showing that meaningful query terms
can indeed occur in a large fraction of the database.
3. Note, however, that text databases are free of some of the costs of traditional databases.
Although text database index processing can seem exorbitantly expensive in comparison to
the cost of processing a query against, say, a file of bank account records, there is no equiv-
alent in the text domain to the concept of join. All queries are to the same table and query
evaluation has linear asymptotic complexity.
4. Fractional-bit codes such as those produced by arithmetic coding require less space,
but are not appropriate for this application because they give relatively slow decompression.
5. The effectiveness of solutions of this kind depends on the overall design of the database
system. Most current text database systems are implemented as some form of client-server
architecture, with the data and server resident on one machine and, to simplify locking, with
a single server process handling all queries and updates (perhaps via multiple threads) and
communicating with multiple clients.
6. The array of document lengths is not strictly necessary. Instead of storing each
document frequency as fd,t and storing the Wd values separately, it would be possible to store
normalized frequencies fd,t/Wd in the inverted lists and dispense with the Wd array. However,
such normalization is incompatible with compression and on balance degrades overall query
evaluation time because of the increased disk traffic. Note that the array of Wd values can
be compacted to a few bits per entry without loss of effectiveness [Moffat et al., 1994].
6 EMERGING APPLICATIONS
Because performance is a crucial issue in database systems, indexing techniques
have always been an area of intense research and development. Advances in
indexing techniques are primarily driven by the need to support different
data models, such as the object-oriented data model, and different data types,
such as image and text data. However, advances in computer architectures
may also require significant extensions to traditional indexing techniques. Such
extensions are required to fully exploit the performance potential of new archi-
tectures, such as in the case of parallel architectures, or to cope with limited
computing resources, such as in the case of mobile computing systems. New
application areas also play an important role in dictating extensions to indexing
techniques and in offering wider contexts in which traditional techniques can
be used.
In this chapter we cover a number of additional topics, some of which are in
an early stage of research. We first discuss extensions to index organizations
required by advances in computer system architectures. In particular, in Sec-
tion 6.1 we discuss indexing techniques for parallel and distributed database
systems. We outline the main issues and present two techniques, based on B-
tree and hashing, respectively. In Section 6.2 we discuss indexing techniques
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
for databases on mobile computing systems. In this section, we first briefly de-
scribe a reference architecture for mobile computing systems and then discuss
two indexing approaches. Following those two sections, we focus on extensions
required by new application areas. In particular, Section 6.3 and Section 6.4
discuss indexing issues for data warehousing systems and for the Web, respectively. Data warehousing and the Web are currently "hot" areas in the database
field and have interesting requirements with respect to indexing organizations.
We then conclude this chapter by discussing in Section 6.5 indexing techniques
for constraint databases. Constraint databases are able to store and manipulate infinite relations and are, therefore, particularly suited for applications such as spatial and temporal ones.
6.1 Indexing techniques for parallel and distributed databases
Parallel and distributed systems represent an important architectural approach to efficiently supporting mission-critical applications that require fast processing of very large amounts of data. The availability of fast networks, like 10 Mb/sec
Ethernet or 100 Mb/sec to 1 Gb/sec Ultranet [Litwin et al., 1993a], makes it
possible to process in parallel large volumes of data without any communication
bottleneck.
In a distributed or parallel database system, a set-oriented database object
such as a relation may be horizontally partitioned and each partition stored at a
database node. Such a node is called the store node for the data object [Choy and
Mohan, 1996] and the number of nodes storing partitions of the data object is
called the partitioning degree. Data are accessed from application programs and
users residing on client nodes. A client node may or may not reside on the same physical node as a store node. A query addressed to a given data
object can be executed in parallel over the partitions into which the data object
has been decomposed, thus achieving substantial performance improvements.
In practice, however, efficient parallel query processing entails many issues, such
as parallel join execution techniques, optimal processor allocation, and suitable
indexing techniques. In particular, if indexing techniques are not designed
properly, they may undermine the performance gains of parallel processing.
Data structures for distributed and parallel database systems should satisfy
several requirements [Litwin et al., 1993a]. Data structures should gracefully
scale up with the partitioning degree. The addition of a new store node to a data
object should not require extensive reorganization of the data structure. There
should be no central node through which searches and updates to the data
structure must go. Therefore, no central directories or similar notions should
exist. Finally, maintenance operations on the data structure, like insertions or
deletions, should not require updates to the client nodes.
In the remainder of this section, we present two data structures. The first is
based on organizing the access structure on two levels. Given a query, the top-
most global level is used to detect the nodes where data relevant to the query
are stored; the lowest local level of the access structure is used to retrieve the
actual data satisfying the query. There is one local level of the data structure
for each partition node of the indexed data object. The second data structure
is a distributed extension of the well-known linear hashing technique [Litwin,
1980]. This data structure does not require any global component. A query is sent by the issuing client to the store node that, according to the information the client has, contains the required data. If the data are not found
at that store node, the query is forwarded by that node to the appropriate store
node.
6.1.1 Two-tier indexing technique
Two simple approaches to indexing data in a distributed database can be de-
vised based, respectively, on the notions of local index and global index [Choy
and Mohan, 1996]. Under the first approach, a separate local index is main-
tained at each store node of a given data object. Therefore, each local index
is maintained for the respective partition like a conventional index on a non-
partitioned object. This approach requires a number of local indexes equal to
the number of partitions. A key lookup requires sending the key value to all the
local indexes to perform local searches. Such an approach is therefore convenient
when qualifying records are found in most partitions. If, however, qualifying
records are only found in a small fraction of partitions, this approach is very
inefficient and in particular does not scale up for a large number of partitions.
The main advantages of this approach are that no centralized structure exists,
and updates are efficient because an update to a record in a partition only
involves modifications to the local index associated with the partition.
Under the global index approach, a single, centralized index exists that in-
dexes all records in all partitions. This approach requires that globally unique record identifiers (RIDs) be stored in the index entries. Indeed, two different records in
two different partitions may happen to have the same (local) RID and there-
fore at a global level, a mechanism to uniquely identify such records must be in
place. A simple approach is to concatenate each local RID with the partition
identifier [Choy and Mohan, 1996]. The global index can be stored at any node
and may be partitioned.
The global approach allows the direct identification, without requiring use-
less local searches, of the records having a given key value. However, it has sev-
eral disadvantages. First, remote updates are required whenever a partition is
modified. Remote updates are expensive because of the two-phase commit protocols that must be applied whenever distributed transactions are performed.
Second, a remote shared lock must be acquired on the index, whenever a par-
tition is read, to ensure serializability. Third, the global index approach is
not efficient for complex queries requiring intersection or union of lists of RIDs
returned by searches on different global indexes, if these global indexes are lo-
cated at different sites. In such a case, long lists of RIDs must be exchanged
among sites. Storing all the global indexes at the same site would not be a
viable solution. The site storing all the global indexes would become a hot
spot, thus reducing parallelism.
An alternative approach, called two-tier index, has been proposed [Choy and Mohan, 1996], which tries to combine the advantages of the above two approaches.
Under the two-tier index approach, a local index is maintained for each parti-
tion. An additional coarse global index is superimposed on the local indexes.
Such a global index keeps for each key value the identifier of the partition stor-
ing records with this key value. The coarse global index is, however, optional.
Its allocation may or may not be required by the database administrator depending on the query patterns. The coarse global index may be located at any
site and may be partitioned.
An important requirement is that the overall index structure should be main-
tained consistent with respect to the indexed objects. Therefore, updates to
any of the local indexes have to be propagated, if needed, to the coarse global
index. However, compared to the global index approach, the two-tier index
approach is much more efficient with respect to updates. Whenever a record
having a key value v is removed from a partition, the global coarse index needs
to be modified only if the removed record is the last one in its partition having
v as key value. By contrast, if other records with key value v are stored in the
partition, the coarse global index need not be modified. Of course, the local
index needs to be modified in both cases. Insertions are handled according to
the same principle. Whenever a new record is inserted into a partition, the
coarse global index needs to be modified only if the newly inserted record has
a key value which is not already in the local index. Algorithms for efficient
maintenance operations and locking protocols have also been proposed [Choy
and Mohan, 1996].
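The maintenance principle just described can be sketched in a few lines. The following is a hypothetical in-memory model (the class and its structure are illustrative, not taken from [Choy and Mohan, 1996]); its point is that the coarse global index is touched only on the first insertion, or last deletion, of a key value within a partition.

```python
# Illustrative in-memory sketch of two-tier index maintenance.
# The coarse global index maps each key value to the set of partitions
# holding records with that value; each local index maps a key value to
# the local RIDs of matching records within one partition.

class TwoTierIndex:
    def __init__(self, num_partitions):
        self.local = [dict() for _ in range(num_partitions)]  # key -> set of local RIDs
        self.coarse = {}                                      # key -> set of partition ids

    def insert(self, part, key, rid):
        bucket = self.local[part].setdefault(key, set())
        first_in_partition = not bucket   # key not yet in this local index
        bucket.add(rid)
        if first_in_partition:
            # Only now must the coarse global index be updated.
            self.coarse.setdefault(key, set()).add(part)

    def delete(self, part, key, rid):
        bucket = self.local[part][key]
        bucket.discard(rid)
        if not bucket:                    # last record with this key in the partition
            del self.local[part][key]
            self.coarse[key].discard(part)
            if not self.coarse[key]:
                del self.coarse[key]

    def lookup(self, key):
        # Route the search only to the partitions named by the coarse index.
        return {p: self.local[p][key] for p in self.coarse.get(key, set())}
```

For example, a second insertion of the same key value into the same partition updates only the local index, leaving the coarse global index untouched.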
With respect to query performance, the two-tier index approach has the
same advantage as the global index approach. The coarse global index allows
the direct identification of the partitions containing records with the searched
key value. Then, the search is routed to the identified partitions where the
local indexes are searched to determine the records containing the key value.
However, unlike the global index approach, the two-tier approach maximizes
opportunity for parallelism. Once the partitions are identified from the coarse
global index, the search can be performed in parallel on the local indexes of
the identified partitions. In addition, the two-tier approach provides more
opportunities for optimization. For example, if a search condition is not very
selective with respect to the number of partitions, the coarse global index can be
bypassed and the search request simply broadcast to all the local indexes (as in the local index approach).
It has been shown that the two-tier index represents a versatile and scalable
indexing technique for use in distributed database systems [Choy and Mohan,
1996]. Many issues are still open to investigation. In particular, the two-tier
index structure can be extended to a multi-tier index structure, where the index
organization consists of more than two levels. Query optimization strategies
and cost models need to be developed and analyzed.
6.1.2 Distributed linear hashing
The distributed linear hashing technique, also called LH*, has been proposed
in a precise architectural framework. Basically, the availability of very fast
networks makes it more efficient to retrieve data from the RAM of another
processor than from a local disk [Litwin et al., 1993a]. A system consisting of
hundreds, or even thousands, of processors interconnected by a fast network
would be able to provide a large, distributed RAM store adequate for large amounts of data. By exploiting parallelism in query execution, such a system would be much more efficient than systems based on more traditional architectures. Such an architecture may be highly dynamic, with new nodes added
as more storage is required. Therefore, there is a need for access structures for use in systems with a very large number of nodes, hundreds or thousands, that are able to scale gracefully. A given file, in such a system, may be shared by
several clients. Clients may issue both retrieval and update operations.
Distributed linear hashing has been proposed with the goal of addressing
the above requirements. An important feature of this organization is that it
does not require any centralized directory and is rather efficient. It has been
proved [Litwin et al., 1993a] that retrieval of a data item given its key value
usually requires two messages, and four in the worst case. In the remainder of
this section, we first briefly review the linear hashing technique and then we
discuss the distributed linear hashing in more detail.
Linear hashing. Linear hashing organizes a file into a collection of buckets.
The number of buckets linearly increases as the number of data items in the
file grows. In particular, whenever a bucket b overflows, an additional bucket
is allocated. Because of the dynamic bucket allocation, the hash function must
be dynamically modified to be able to address also the newly allocated buckets.
Therefore, as in other hashing techniques, different hashing functions need to be
used because more bits of the hashed value are used as the address space grows.
In particular, linear hashing uses two functions h_i and h_{i+1}, i = 0, 1, 2, ....
Function h_i generates addresses in the range [0, N x 2^i - 1], where N is the
number of buckets that are initially allocated (N can also be equal to 1). A
commonly used function [Litwin et al., 1993a] is:

    h_i(C) = C mod (N x 2^i)

where C is the key value. Each bucket has a parameter called the bucket level,
denoting which hash function, between h_i and h_{i+1}, must be used to address
the bucket.
Whenever a bucket overflows, a new bucket is added and a split operation is
performed. However, the bucket which is split is not usually the bucket which
generated the overflow. Rather, another bucket is split. The bucket to split
is determined by a special parameter n, called split pointer. Once the split is
performed, the split pointer is properly modified. It always denotes the leftmost
bucket which uses function h_i. Once a bucket is split, the bucket level of the two buckets involved in the split is incremented by one, thus replacing function h_i with h_{i+1} for these two buckets.
Consider the example in Figure 6.1(a) adapted from [Litwin et al., 1993a].
In the example, we assume that N = 1. Suppose that the key value 145 is
added. The insertion of such a key results in an overflow for the second bucket
and in the addition of a third bucket. However, the bucket which is split is not
the second one; it is the first one. Figure 6.1(b) illustrates the structure after
the insertion and splitting. Note that a special overflow bucket is added to the
second bucket to store the record with key value 145. Because n is equal to 0,
the first bucket is split; the hash function to use for the first and third buckets
(the newly allocated one) is h_2. Figure 6.1(c) illustrates the organization after
the insertion of records with key values 6, 12, 360, and 18. Those insertions
do not cause any overflow. Suppose now that a record with key value 7 is
inserted. Such an insertion results in an overflow for bucket 1. Because n is equal to 1, bucket number 1 is split. Figure 6.1(d) illustrates the resulting organization. Note that the hash function to use for the second and fourth buckets is now h_2. Because all buckets have the same bucket level, that is, 2, the split pointer is assigned 0.
Retrieval of a record, given its key, is very efficient. It is performed according
to the following simple algorithm (A1).

    Let C be the key to be searched, then
    a ← h_i(C);
    if a < n then a ← h_{i+1}(C).    (A1)
[Figure 6.1, panels (a)-(d), not reproduced: they show the bucket contents at each stage. In (a) both buckets use h_1 and the split pointer is 0; in (b) and (c) the three buckets use h_2, h_1, h_2 and the split pointer is 1; in (d) all four buckets use h_2 and the split pointer is 0.]
Figure 6.1. Organization of a file under linear hashing.
Basically, the second step checks whether the bucket obtained by applying function h_i to the key has already been split. If so, function h_{i+1} is to be used. The index (i or i + 1) to be used for a bucket is the bucket level, whereas i + 1 is the file level.
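The scheme above can be sketched as follows. This is a minimal illustration: the bucket capacity and the initial number of buckets are illustrative parameters, and overflow chains are simplified to lists that may temporarily exceed capacity.

```python
# Minimal sketch of linear hashing. Capacity and n_initial are
# illustrative choices, not fixed by the technique.

class LinearHashFile:
    def __init__(self, n_initial=1, capacity=2):
        self.N = n_initial
        self.i = 0          # file level: h_i and h_{i+1} are in use
        self.n = 0          # split pointer: leftmost bucket still using h_i
        self.capacity = capacity
        self.buckets = [[] for _ in range(n_initial)]

    def _h(self, level, key):
        # h_level(C) = C mod (N x 2^level)
        return key % (self.N * 2 ** level)

    def _address(self, key):
        # Algorithm (A1): buckets below the split pointer use h_{i+1}.
        a = self._h(self.i, key)
        if a < self.n:
            a = self._h(self.i + 1, key)
        return a

    def insert(self, key):
        a = self._address(key)
        self.buckets[a].append(key)
        if len(self.buckets[a]) > self.capacity:
            self._split()

    def _split(self):
        # Split the bucket designated by the split pointer, which is
        # generally NOT the bucket that overflowed.
        b = self.n
        self.buckets.append([])
        old, self.buckets[b] = self.buckets[b], []
        self.n += 1
        if self.n >= self.N * 2 ** self.i:
            self.n, self.i = 0, self.i + 1
        for key in old:     # redistribute under the new addressing
            self.buckets[self._address(key)].append(key)

    def search(self, key):
        return key in self.buckets[self._address(key)]
```

Inserting the keys 216, 32, 153, 321, 145 (from Figure 6.1) into a file with N = 1 and capacity 2 triggers two splits, after which every key is found by a single application of the address calculation.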
LH*. In the distributed version of linear hashing, each bucket of the distributed file is actually the RAM of a node in the system. Therefore, the hash
function returns identifiers of store nodes. Note that LH* could be used also
if the data were stored in the disks of the various nodes rather than in RAM.
However, LH* is particularly suited for systems with a very large number of
nodes, as is the case when using RAM for storing a (large) database.
Data stored at the various nodes are directly manipulated by clients. A
client can perform searches or updates. Whenever a client issues an operation,
for example a search, the first step to perform is the address calculation to determine the store node affected by the operation. Calculating such addresses
requires, according to algorithm (A1), that the client be aware of the up-to-date
values of nand i. Satisfying such constraints in an environment where there is
a large number of clients and store nodes is quite difficult. Propagating those
values, whenever they change, is not feasible given the large number of clients.
Therefore, LH* does not require that clients have a consistent view of i and n.
Rather, each client may have its own view for such parameters, and therefore
each client may have an image of the file that may differ from the actual file.
Also, the image of a file a client has may differ from the images other clients
have. We denote by i' and n' the view that a client has of the file parameters
i and n.
The basic principle of LH* is to let a client use its own local parameters
for computing the identifier of the node affected by the operation the client
wishes to perform on the file. Therefore, the address calculation is performed
by using algorithm (A1) with the difference that the client's local parameters
are used. That is, the address is computed in terms of parameters i' and n'
instead of i and n. The request is then forwarded to the store node, whose
address is returned by the address calculation step. Because a client may not
have correct values for the file parameters, the store node may not be the correct
one. An addressing error thus arises. In order to handle such an error, another
basic principle is that each store node performs its own address calculation;
such step is called server address calculation. Note that each store node knows
the level of the bucket it stores; however, it does not know the current value of
n. The server address calculation is thus performed according to the following
algorithm (A2).
    Let C be the key to be searched
    Let a be the address of store node s
    Let j be the level of the bucket stored at s, then
    a' ← h_j(C);
    if a ≠ a' then
        a'' ← h_{j-1}(C);
        if a'' > a and a'' < a' then a' ← a''.    (A2)
The address a' returned by the above algorithm is the address of the store
node to which the request should be forwarded if an addressing error has oc-
curred.
Therefore, whenever a store node receives a request, it performs its own
address calculation. If the calculated address is its own address, the address
calculated by the client is the correct one (therefore, the client has an up-to-
date image of the file). If not, the server forwards the request to the store node
whose address has been returned by the server address calculation, according to
the above algorithm. The recipient of the forwarded operation checks again the
address, by performing again the server address calculation, and may perhaps
forward the request to a third store node. It has been, however, formally
proved [Litwin et al., 1993a] that the third recipient is the final one. Therefore,
delivering the request to the correct store node requires forwarding the request
at most twice.
As a final step, a client image adjustment is performed by the store node first
contacted by the client, if an addressing error occurred. The store node simply
returns to the client its own values for i and n, so that the client image becomes
closer to the actual image.
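Algorithms (A1), (A2), and the forwarding chain can be sketched as follows. The mapping from node addresses to bucket levels is an illustrative stand-in for the distributed file state, and N = 1 is assumed, as in the examples.

```python
def h(level, key, N=1):
    # Hash function h_level(C) = C mod (N x 2^level).
    return key % (N * 2 ** level)

def client_address(key, i_prime, n_prime):
    # Algorithm (A1), computed on the client's possibly stale image
    # (i', n') of the file parameters.
    a = h(i_prime, key)
    if a < n_prime:
        a = h(i_prime + 1, key)
    return a

def server_address(key, a, j):
    # Algorithm (A2): j is the level of the bucket stored at address a.
    a_new = h(j, key)
    if a_new != a:
        a_alt = h(j - 1, key)
        if a < a_alt < a_new:
            a_new = a_alt
    return a_new

def route(key, i_prime, n_prime, bucket_levels):
    """Deliver a request: returns (final address, number of forwards).
    bucket_levels maps each store-node address to its bucket level."""
    a = client_address(key, i_prime, n_prime)
    forwards = 0
    while True:
        a_next = server_address(key, a, bucket_levels[a])
        if a_next == a:
            return a, forwards
        a, forwards = a_next, forwards + 1
```

With the file states of Figure 6.2, a client holding i' = n' = 0 and inserting key 7 reaches store node 1 after one forward in case (a), and store node 3 after two forwards in case (b), consistent with the at-most-two-forwards guarantee.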
To illustrate, consider the example in Figure 6.2(a). The example includes
a client having 0 as value for both n' and i'. Suppose that the client wishes
to insert a new record with key value 7. The client address calculation returns 0 as the store node. The request is then sent to store node 0. This store node
Figure 6.2. Message exchanges in distributed linear hashing when performing insertion of
a new key.
performs the address calculation according to algorithm (A2). The first step of
the calculation returns 3 (as can easily be verified by computing 7 mod 4).
Note, however, that sending the request to store node 3 would result in an error
because there is no such store node. The check performed by the other steps of
the algorithm prevents such a situation by generating the address of store node
1 (by applying function h_{j-1}). The request is then forwarded to store node 1.
Store node 1 again performs the calculation. The calculation returns 1 and the
record can therefore be inserted at store node 1.
To illustrate a situation where two forwards are performed, consider the
example in Figure 6.2(b) where four store nodes are allocated and each store
node has a local level equal to 2. As in the above case, the request is forwarded
from store node 0 to store node 1. Store node 1 performs the address calculation
which returns 3. The request is then forwarded again to store node 3 where
the key is finally stored.
Whenever an overflow occurs at one store node, a split operation must be
performed. As for linear hashing, the store node to split is not necessarily the
one where the overflow occurs. To determine the store node to split the values
of n and i must be known. One of the proposed approaches to splitting [Litwin et al., 1993a] is based on maintaining such information at a fixed store node
called the split coordinator. Whenever an overflow occurs at a store node, the node notifies the coordinator, which then starts the splitting of the proper node
and calculates the new values for n and i, as follows:

    n ← n + 1;
    if n ≥ 2^i then n ← 0, i ← i + 1.
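As a minimal sketch of the coordinator's bookkeeping, assuming N = 1 initial bucket as in the examples above:

```python
def advance_split_pointer(n, i, N=1):
    # After each split: advance the split pointer; once every bucket at
    # level i has been split, reset it and move to the next file level.
    n += 1
    if n >= N * 2 ** i:
        n, i = 0, i + 1
    return n, i
```

For instance, starting from (n, i) = (0, 0), successive splits yield (0, 1), then (1, 1), then (0, 2).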
Retrieval in LH* is extremely efficient. It takes a minimum of two messages (one for sending the request and the other for receiving the reply) and a maximum of four. The worst case, with a cost of four messages, arises when two
forward messages are required. Extensive simulation experiments have shown,
however, that the average performance is very close to the optimal performance.
Other indexing techniques have also been proposed, as variations of the same principles of LH*, to support order-preserving indexing [Litwin et al., 1994] and multi-attribute indexing [Litwin and Neimat, 1996].
6.2 Indexing issues in mobile computing
Cellular communications, wireless LAN, radio links, and satellite services are
rapidly expanding technologies. Such technologies will make it possible for
mobile users to access information independently of their actual locations.
Mobile computing refers to this new emerging technology extending computer
networks to deal with mobile hosts, which retain their network connections even while moving. This kind of computation is expected to be very useful for
mail-enabled applications, by which, using personal communicators, users will
be able to receive and send electronic mail from any location, as well as be
alerted about certain predefined conditions (such as a train being late or traffic
conditions on a given route), irrespective of time and location [Imielinski and
Badrinath, 1994].
The typical architecture of a mobile network (see Figure 6.3) consists of two
distinct sets of entities: mobile hosts (MHs) and fixed hosts (FHs). Some of the
fixed hosts, called Mobile Support Stations (MSSs) are equipped with a wireless
interface. By using such a wireless interface, a MSS is able to communicate with MHs residing in the same cell. A cell is the area in which the signal sent by a MSS can be received by MHs. The diameter of a cell, as well as the
available bandwidth, may vary according to the specific wireless technology.
For example, the diameter of a cell ranges from a few meters for infrared technology to 1 or 2 miles for radio or satellite networks. With respect to the bandwidth,
LANs using infrared technology have transfer rates of the order of 1-2 Mb/sec,
whereas WANs have poorer performance [Lee, 1989, Salomone, 1995].
The message sent by a MSS is broadcasted within a cell. The MHs filter
the messages according to their destination address. On the other hand, MHs
Figure 6.3. Reference architecture of a mobile network.
located in the same cell can communicate only by sending messages to the MSS
associated with that cell. MSSs are connected to other FHs through a fixed
network, used to support communication among cells. The fixed network is
static, whereas the wireless network is mobile, since MHs may change their position (and therefore the cell in which they reside) over time.
MSSs provide commonly used application software, so that a mobile user
can download the software from the closest MSS and run it on the palmtop or
execute it remotely on the MSS. Each MH is associated with a specific MSS,
called Home MSS. A Home MSS for a MH maintains specific information about
the MH itself, such as the user profile, logic files, access rights, and user private
files. The association between a MH and a MSS is replicated through the
network. Additionally, a user may register as a visitor under some other MSSs.
Thus, a MSS is responsible for keeping track of the addresses of users who are
currently residing in the cell supervised by the MSS itself.
MHs can be classified into dumb terminals and walkstations [Imielinski and Badrinath, 1994]. In the first case, they are diskless hosts (such as palmtops) with reduced memory and computing capabilities. Walkstations are
comparable to classical workstations, and can both receive and send messages
on the wireless network. In any case, MHs are not usually connected to any direct power source; they run on small batteries and communicate on narrow-bandwidth wireless channels.
The communication channel between a MSS and MHs consists of a downlink, by which information flows from the MSS to MHs, and an uplink, by which information flows from MHs to the MSS. In general, information can be acquired by a MH under two different modes:
• Interactive/On-demand: The client requests a piece of data on the uplink
channel and the MSS responds by sending these data to the client on the
downlink channel.
• Data broadcasting: Periodic broadcasting of data is performed by the MSS on the downlink channel. This type of communication is unidirectional. The MHs do not send any specific data requests to the MSS. Rather, they filter data coming from the downlink channel, according to user-specified filters.
In general, combined solutions are used. Typically, the most frequently demanded items are periodically broadcasted, creating a sort of storage on the air [Imielinski et al., 1994a]. The main advantage of data broadcasting is that it scales well as the number of MHs grows, since its cost is independent of the number of MHs. The on-demand mode should be used for data items that are seldom required.
The main problem of broadcasting is related to energy consumption. Indeed, MHs are in general powered by a battery. The lifetime of a battery is very short and is expected to increase only 20% over the next 10 years [Sheng et al., 1992]. When a MH is listening to the channel, the CPU must be in active mode to examine data packets. This operation is very expensive from an energy point of view, because often only a few data packets are of interest for a particular MH.
It is therefore important for the MH to run under two different modes:
• Doze mode: The MH is not disconnected from the network but it is not
active.

• Active mode: The MH performs its usual activities; when the MH is listening
to the channel, it should be in active mode.
Clearly, an important issue is how to switch from doze mode to active mode in a clever way, so that energy dissipation is reduced without incurring a loss of information. Indeed, if a MH is in doze mode when the information of interest is being broadcasted, such information is lost by the MH.
Figure 6.4. MH and MSS interaction.
Approaches to reduce energy dissipation are therefore important for several
reasons. First of all, they make it possible to use smaller and less powerful
batteries to run the same applications for the same time. Moreover, the same
batteries can also run for a longer time, resulting in a monetary saving. In order to develop such efficient solutions, allowing MHs to switch in a timely fashion from doze mode to active mode and vice versa, indexing approaches have been proposed.
In the next subsection, the general issues related to the development of an index structure for data broadcasting are described, whereas Subsection 6.2.2
illustrates some specific indexing data structures. The discussion follows the
approaches presented in [Imielinski et al., 1994a].
6.2.1 A general index structure for broadcasted data
We assume that, without losing the generality of the discussion, broadcasted
data consist of a number of records identified by a key. Each MSS periodically
broadcasts the file containing such data, on the downlink channel (also called
broadcast channel). Clients receive the broadcasted data and filter them. Fil-
tering is performed by a simple pattern matching operation against the key
value. Thus, clients remain in doze mode most of the time and tune in periodi-
cally to the broadcast channel, to download the required data (see Figure 6.4).
To provide selective tuning, the server must broadcast, together with data, also
a directory that indicates the points in time on the broadcast channel at which particular records are broadcasted. The first issue to address is how MHs access
the directory. Two solutions are possible:
1. MHs cache a copy of the directory.
This solution has several disadvantages. First of all, when MHs change the
cell where they reside, the cached directory may no longer be valid
and the cache must be refreshed. This problem, together with the fact
that broadcasted data can change between successive broadcasts, with a
consequent change of the directory, may generate excessive traffic between
clients and the server. Moreover, if many different files are broadcasted on
different channels, the storage occupancy at clients may become too high,
and storage in MHs is usually a scarce resource.
Figure 6.5. A general organization for broadcasted data.
2. The directory is broadcasted in the form of an index on the broadcast channel.
This solution has several advantages. When no index is used, the client,
in order to filter the required data records, has to tune into the channel
for, on average, half the time it takes to broadcast the file. This is not
acceptable, because the MH, in order to tune into the channel, must be
in active mode, thus consuming scarce battery resources. Broadcasting the
directory together with the data allows the MH to selectively tune into the
channel, becoming active only when data of interest are being broadcasted.
For the above reasons, broadcasting the directory together with the data
is the preferred solution. It is usually assumed that only one channel exists.
Multiple channels can always be regarded as a single channel with capacity equivalent to the combined capacity of the corresponding channels.
Figure 6.5 shows a general organization for broadcasted data (including the directory). Each broadcasted version of the file, together with all the interleaved index information, is called a bcast. A bcast consists of a certain number of buckets, each representing the smallest unit that can be read by a MH (thus, a bucket is equivalent to the notion of a block in disk organizations). Pointers to specific buckets are specified as an offset from the bucket containing the pointer to the bucket to which the pointer points. The time to get the data pointed to by an offset s is given by (s - 1) x T, where T is the time to broadcast a bucket.
Figure 6.6 shows the general protocol for retrieving broadcasted data:
1. The MH tunes into the channel and looks for the offset pointing to the
next index bucket. During this operation, the MH must be in active mode.
A common assumption is that each bucket contains the offset to the next
index bucket. Thus, this step requires only one bucket access. Let n be the
determined offset.
EMERGING APPLICATIONS 199
Figure 6.6. The general protocol for retrieving broadcasted data.
2. The MH switches to doze mode until time (n - 1) x T. At that time, the
MH tunes into the channel (thus, it is again in active mode) and, following a
chain of pointers, determines the offset m, corresponding to the first bucket
containing data of interest (with respect to the considered key value).
3. The MH switches to doze mode until time (m - 1) x T. At that time, the
MH tunes into the channel (thus, it is again in active mode) and retrieves
data of interest.
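The timing arithmetic of the three steps above can be made concrete with a small simulation. The bucket layout, field names and the value of T below are illustrative assumptions, not taken from any actual system.

```python
T = 1.0  # time (arbitrary units) to broadcast one bucket

def wait_for(offset, T=T):
    # Doze time before the bucket at relative offset s is broadcast: (s - 1) x T.
    return (offset - 1) * T

# The channel is modeled as a list of buckets at positions 0..n-1.  Every
# bucket stores the relative offset of the next index bucket; an index bucket
# additionally maps key values to relative offsets of matching data buckets.
channel = [
    {"next_index": 3},                              # pos 0: data bucket
    {"next_index": 2},                              # pos 1: data bucket
    {"next_index": 1},                              # pos 2: data bucket
    {"next_index": 6, "index": {"a": 1, "b": 2}},   # pos 3: index bucket
    {"data": ("a", "record-1")},                    # pos 4: data for key "a"
    {"data": ("b", "record-2")},                    # pos 5: data for key "b"
]

def retrieve(key, start_pos):
    tuned, clock = 0, 0.0
    # Step 1: read the current bucket to learn where the next index bucket is.
    n = channel[start_pos]["next_index"]; tuned += 1
    # Step 2: doze until the index bucket, then read it.
    clock += wait_for(n); pos = start_pos + n
    m = channel[pos]["index"][key]; tuned += 1
    # Step 3: doze until the data bucket, then read it.
    clock += wait_for(m); pos += m
    tuned += 1
    return channel[pos]["data"], tuned, clock

record, tuned, waited = retrieve("b", start_pos=0)
```

Here `tuned` counts the buckets actually read (the tuning time, in bucket units), while `waited` accumulates the (s - 1) x T doze intervals.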
In general, no new indexing structures are required to implement the pre-
vious protocol. Rather, existing data structures can be extended to efficiently
support the new data organization. The main issues are therefore how to
define efficient data organizations, that is, how data and index buckets
must be interleaved and which parameters to use in order to compare
different data organizations. The considered parameters are the following:
• Access time: It is the average duration from the instant in which a client
wants to access records with a specific key value to the instant when all
required records are downloaded by the client. The access time is based on
the following two parameters:
Probe time: The duration from the instant in which a client wants to
access records with a specific key value to the instant when the nearest
index information related to the relevant data is obtained by the client.
Bcast wait: The duration from the point the index information related
to the relevant data is encountered to the point when all required records
are downloaded.
Note that if one parameter is reduced, the other increases.
• Tuning time: It is the time spent by a client listening to the channel. Thus
it measures the time during which the client is in active mode and therefore
determines the power consumed by the client to retrieve the relevant data.
The use of a directory reduces the tuning time, increasing at the same time
the access time. It is therefore important to determine a good bucket interleaving
in order to obtain a good trade-off between access time (thus reducing the time
the client has to wait for relevant data) and tuning time (thus reducing battery
consumption).
With respect to disk organization, the tuning time corresponds to the access
time, in terms of block accesses. However, the tuning time is fixed for each
bucket, whereas the disk access time depends on the position of the head. There
is no disk parameter corresponding to the access time. Finally, we recall that
other indexing techniques, based on hash functions, have also been proposed
[Imielinski et al., 1994b]. However, in the remainder of this chapter we do not
consider such techniques.
6.2.2 Specific solutions to indexing broadcasted data
With respect to the general data organization proposed in Subsection 6.2.1,
several specific indexing approaches have been proposed. In the following,
we survey some of these approaches [Imielinski et al., 1994a, Imielinski et al.,
1994b].
With respect to how parameters are chosen, index organizations can be
classified into configurable indexes and non-configurable indexes. In the latter
case, parameter values are fixed. In the former case, the organizations are
parameterized: by changing the parameter values, the trade-off between the
costs changes. This allows the same organization to be used to satisfy different
user requirements.
Index organizations can also be classified into clustered and non-clustered or-
ganizations. In the first case, all records with the same value for the key
attribute are stored consecutively in the file. Non-clustered organizations are
often obtained from clustered organizations, by decomposing the file into clus-
tered subcomponents. For this reason, in the following, we do not consider
organizations for non-clustered files.
Non-configurable indexing. Non-configurable index organizations can be
classified according to their behavior with respect to access and tuning time.
An optimal strategy with respect to the access time can be simply obtained
by not broadcasting the directory. On the other hand, an optimal strategy
with respect to the tuning time is obtained by broadcasting the complete index
at the beginning of the bcast. Since in practice both access and tuning time
are of interest, the above algorithms have only theoretical significance. Several
intermediate solutions have therefore been devised.

Figure 6.7. Bcast organization in the (1, m) indexing method.

Figure 6.8. Bcast organization in the distributed indexing method.
The (1, m) indexing [Imielinski et al., 1994a] is an index allocation method
in which the complete index is broadcasted m times during a bcast (see Fig-
ure 6.7). All buckets have an offset to the beginning of the next index segment.
The first bucket of each index segment has a tuple containing, in the first field,
the attribute value of the last record broadcasted before that segment and, in
the second field, an offset pointing to the beginning of the next bcast.
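As a rough sketch, the layout of a (1, m) bcast can be generated by interleaving m copies of the complete index with m equal data segments; the bucket contents below are placeholder strings, not a real index format.

```python
# Hypothetical sketch of a (1, m) bcast layout: the full index is broadcast
# m times, once before each of m equal slices of the data file.

def one_m_bcast(data_buckets, index_buckets, m):
    """Interleave the complete index m times with m data segments."""
    assert len(data_buckets) % m == 0, "assume the data divides evenly"
    seg = len(data_buckets) // m
    bcast = []
    for i in range(m):
        bcast.extend(index_buckets)                     # the full index, again
        bcast.extend(data_buckets[i * seg:(i + 1) * seg])
    return bcast

data = [f"d{i}" for i in range(6)]
idx = ["i0", "i1"]
layout = one_m_bcast(data, idx, m=3)
# layout: ['i0','i1','d0','d1','i0','i1','d2','d3','i0','i1','d4','d5']
```

The replication visible in `layout` is exactly what the distributed index described next tries to avoid.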
The main problem of the (1, m) index organization is related to the repli-
cation of the index buckets. The distributed indexing [Imielinski et al., 1994a]
is a technique in which the index is partially replicated (see Figure 6.8). In-
deed, there is no need to replicate the complete index between successive data
blocks. Rather, it is sufficient to make available only the portion of the index re-
lated to the data buckets which follow it. Thus, the distributed index, with
respect to the (1, m) index, interleaves data buckets with relevant index buckets
only. Several distributed indices can be defined by changing the degree of
replication [Imielinski et al., 1994a].
The distributed index guarantees performance comparable to that of the
optimal algorithms, with respect to both the access time and the tuning time.
Figure 6.9. Bcast organization in the flexible indexing method.
The (1, m) index has a good tuning time. However, due to the index replication,
the access time is high.
Configurable indexing. Configurable index organizations are parameter-
ized in such a way that, depending on the values of the parameters, the ratio
between the access and tuning time can be modified. The first configurable in-
dex that has been proposed is called flexible indexing [Imielinski et al., 1994b].
In such an organization, data records are assumed to be sorted in ascending (or
descending) order and the data file is divided into p data segments. It is as-
sumed that each bucket contains the offset to the beginning of the next data
segment. Depending on the chosen value for p, the trade-off between access
time and tuning time changes. The first bucket of each data segment contains
a control part, consisting of the control index, as well as some data records
(see Figure 6.9). The control index is a binary index which helps locate data
buckets containing records with a given key value.
Each index entry is a pair, consisting of a key value and an offset to a data
bucket. The control index is divided in two parts, the binary control index and
the local index. The binary control index supports searches for keys preceding
the ones stored in the current data segment and in the following ones. It
contains flog2 il tuples, where i is the number of data segments following the
one under consideration. The first tuple of the binary control index consists of
the key of the first data record in the current data bucket and an offset to the
beginning of the next bcast. The k-th of the following tuples consists of the key
of the first data record of the (⌊i/2^(k-1)⌋ + 1)-th data segment, followed by the
offset to the first data bucket of that data segment.
The local index supports searches inside the data segment in which it is
contained. It consists of m tuples, where m is a parameter which depends on
several factors, including the number of tuples a bucket can hold. The local
index partitions the data segment into m + 1 subsegments. Each tuple contains
the key of the first data record of a subsegment and the offset to the first data
bucket of that subsegment.
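The shape of the binary control index can be illustrated with a small helper, under the assumption (made here purely for illustration) that the k-th tuple points to the (⌊i/2^(k-1)⌋ + 1)-th data segment; treat both the rule and the resulting numbers as a sketch rather than a definitive reading of the method.

```python
import math

def binary_control_targets(i):
    """For i following data segments, return the assumed target segment of
    each of the ceil(log2(i)) tuples, for k = 1, 2, ...  The halving of the
    distance at each step is what enables a binary-search-like descent."""
    if i <= 1:
        return []
    n_tuples = math.ceil(math.log2(i))
    # k-th tuple -> (floor(i / 2**(k-1)) + 1)-th data segment (an assumption)
    return [i // 2 ** (k - 1) + 1 for k in range(1, n_tuples + 1)]

targets = binary_control_targets(8)   # 3 tuples, progressively closer targets
```

For i = 8 this yields three tuples whose targets halve in distance, which is what keeps the tuning time logarithmic in the number of segments.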
The access protocol is the following:
1. First, the offset of the next data segment is retrieved and the MH switches
to doze mode.
2. The MH tunes in again at the beginning of the designated next data segment
and performs the following steps:
• If the search key k is lower than the value contained in the first field
of the first tuple of the binary control index, the MH switches to doze
mode, waiting for the offset specified by the tuple, and again executes
step (2).
• If the previous condition is not satisfied, the MH scans the other tuples
of the binary control index, from top to bottom, until it reaches a tuple
whose key value is lower than k. If such tuple is reached, the MH
switches to doze mode, waiting for the offset specified by the tuple, and
again executes step (2).
• If the previous condition is not satisfied, the MH scans the local index, to
determine whether records with key value k are contained in the current
data segment. If this search succeeds, the offset is used to determine
the bucket in the current data subsegment from which the
retrieval of the data records starts. The retrieval terminates when the
last bucket of the searched subsegment is reached.
6.3 Indexing techniques for data warehousing systems
Recent years have witnessed an increasing interest in database systems able
to support efficient on-line analytical processing (OLAP). OLAP is a crucial
element of decision support systems, in that essential decisions are often taken on
the basis of information extracted from very large amounts of data. In most cases,
such data are stored in different, possibly heterogeneous, databases. Examples
of typical queries are [Chaudhuri and Dayal, 1996]:
• What are the sales volumes by region and product category for the last
year?
• How did the share price of computer manufacturers correlate with quarterly
profits over the past 10 years?
Because the requirements of OLAP applications are quite different from those of
traditional, transaction-oriented applications, specialized systems, known as
data warehousing systems, have been developed to effectively support these
applications. A data warehouse is a large, special-purpose database containing
data integrated from a number of independent sources and supporting users in
analyzing the data for patterns and anomalies [O'Neil and Quass, 1997]. Unlike
in traditional database systems, historical data, and not only current
data values, must be stored in a data warehouse. Moreover, data are updated
off-line and therefore no transactional issues are relevant here. By contrast,
typical OLAP queries are rather complex, often involving several joins and
aggregation operations. OLAP queries are in most cases "ad-hoc" queries, as
opposed to the repetitive transactions typical of traditional applications. It is
therefore important to develop sophisticated, complex indexing techniques to
provide adequate performance, also exploiting the fact that the update cost of
indexing structures is not a crucial problem.
A possible approach to efficiently process OLAP queries is to use material-
ization techniques to precompute queries. The main inconvenience of this ap-
proach is that precomputing all possible queries along all possible dimensions
is not feasible, especially if there is a very large number of dynamically vary-
ing selection predicates. Therefore, even though more frequent queries may be
precalculated, techniques are required to efficiently execute non-precalculated
queries.
In the remainder of this section, we first briefly review logical data organi-
zations in data warehousing systems and exemplify typical OLAP queries. We
then discuss a number of techniques supporting efficient query execution for
data warehousing systems. Some of those techniques, namely the join index and
the domain index, had initially been developed for traditional DBMSs. They
have, however, recently found a relevant application scope in data warehousing
systems. Other techniques, namely bitmap and projection indexes, have been
specifically developed for data warehousing systems. Some of them have been
incorporated in commercial systems [Edelstein, 1995, French, 1995]. Another
relevant technique which we do not discuss here is the bit-sliced index, whose
aim is the efficient computation of aggregate functions. We refer the reader
to [O'Neil and Quass, 1997] for a description of this technique.
6.3.1 Logical data organization
In a data warehouse, data are often organized according to a star schema
approach. Under this approach, for each group of related data there exists
a central fact table, also called a detail table, and several dimension tables.
fact table is usually very large, whereas each dimension table is usually smaller.
Every tuple (fact) in the fact table references a tuple in each of the dimension
tables, and may have additional attributes. References from the fact table to
the dimension tables are modeled through the usual mechanism of foreign
keys. Therefore, each tuple in the fact table is related to one tuple from each
of the dimension tables. Vice versa, each tuple from a dimension table may
be related to more than one tuple in the fact table. Dimension tables may, in
turn, be organized into several levels. A data warehouse may contain additional
summary tables containing pre-computed aggregate information.
As an example, consider a (classical) example of data concerning product
sales [O'Neil and Quass, 1997]. Such data are organized around a central
fact table, called Sales, and the following dimension tables: Time, contain-
ing information about the dates of the sales; Product, containing informa-
tion on the products sold; and finally, Customer, containing information about
the customers involved in the sales. The schema is graphically represented
in Figure 6.10. Alternative schema organization approaches exist, including
the snowflake schema and the fact constellation schema [Chaudhuri and Dayal,
1996]. The following discussion is, however, quite independent of the specific
schema approach adopted.
Many typical OLAP queries are based on placing restrictions on the dimen-
sion tables that result in restrictions on the tuples of the fact table. As an
example, consider the query asking for all sales of products, with price higher
than $50,000, to customers residing in California during July 1996. This
type of query is often referred to as a star-join query because it involves the join
of the same central fact table with several dimension tables. Another important
characteristic of OLAP queries is that aggregates must often be computed on
the results of a star-join query and aggregate functions may also be involved
in selecting relevant groups of tuples. An example of a query including aggre-
gate calculation is the query asking for the total dollar sales that were made
for a brand of products during the past 4 weeks to customers residing in New
England [O'Neil and Quass, 1997].
6.3.2 Join index and domain index
The join index technique [Valduriez, 1987] aims at optimizing relational joins
by precalculating them. This technique is optimal when the update frequency
is low.

Figure 6.10. An example of a star-schema database with a central fact table (SALES) and
several dimension tables.

Because in OLAP applications joins are very frequent and the update
frequency is low, the join index technique can be profitably used here.
There are several variations of join index. The basic one is the binary join
index which is formally defined as follows:
Given two tables R and S, and attributes A and B, respectively from R and S,
a binary equijoin index is

BJI = {(ri, sk) | ri.A = sk.B}

where ri (sk) denotes the row identifier (RID) of a tuple of R (S), and ri.A
(sk.B) denotes the value of attribute A (B) of the tuple whose RID is ri (sk).
Note that comparison operators other than equality can be used in a join
index. However, because most joins in OLAP queries are equijoins on
foreign keys, we restrict our discussion to the binary join index. Moreover, in
some variants of the join index technique, the primary key values of tuples in
one table can be used instead of the RIDs of these tuples.
A BJI can be implemented as a binary relation, and two copies may be kept,
one clustered on RIDs of R and the other clustered on RIDs of S. A BJI
may also include the actual values of the join columns, thus resulting in a set
of triples {(ri.A, ri, sk) | ri.A = sk.B}. This alternative is useful when, given a
value of the join column, the tuples from R and from S that join on that value
must be determined.
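A binary equijoin index can be sketched as a plain precomputed list of RID pairs; the tables, RIDs and attribute names below are invented for illustration, and a real system would of course store and cluster the pairs on disk rather than recompute them.

```python
# A sketch of a binary equijoin index (BJI) over two toy tables; each table
# maps a RID to a tuple represented as a dict of attribute values.

R = {"r1": {"A": 10}, "r2": {"A": 20}, "r3": {"A": 10}}
S = {"s1": {"B": 10}, "s2": {"B": 30}}

def binary_join_index(R, S, a, b):
    """Precompute all RID pairs (ri, sk) such that ri.a = sk.b."""
    return sorted((ri, sk)
                  for ri, rt in R.items()
                  for sk, st in S.items()
                  if rt[a] == st[b])

bji = binary_join_index(R, S, "A", "B")
# bji == [('r1', 's1'), ('r3', 's1')]: answering the equijoin later requires
# no comparison of attribute values at all, only a scan of the index.
```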
Join indexes are particularly suited to relate a tuple from a given dimen-
sion table to all the tuples in the fact table. For example, suppose that a
join index is allocated on relations Sales and Customer for the join predicate
Customer.customer_id = Sales.customer_id. Such a join index would list, for each
tuple of relation Customer (that is, for each customer), the RIDs of the tuples of
Sales verifying the join predicate (that is, the sales of that customer). Join
indexes may also be extended to support precomputed joins along several di-
mensions [Chaudhuri and Dayal, 1996].
Another relevant generalization of the join index notion is represented by
the domain index. A domain index is defined on a domain (for example, the
zip code) and it may index tuples from several tables. It associates with a value
of the domain the RIDs of the tuples, from all the indexed tables, having this
value in the indexed column. Therefore, a domain index may support equality
joins among any number of tables in the set of indexed tables.
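A domain index can be sketched as a map from each domain value to per-table RID lists; the table names, RIDs and zip codes below are invented for illustration.

```python
from collections import defaultdict

# A sketch of a domain index on a shared domain (zip codes): for every value
# it records, per indexed table, the RIDs of tuples holding that value.

def build_domain_index(tables, column):
    """tables: {table_name: {rid: row_dict}}; returns value -> {table: [rids]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, rows in tables.items():
        for rid, row in rows.items():
            index[row[column]][name].append(rid)
    return index

tables = {
    "Customer": {"c1": {"zip": "90210"}, "c2": {"zip": "10001"}},
    "Store":    {"t1": {"zip": "90210"}},
}
dix = build_domain_index(tables, "zip")
# dix["90210"] holds {"Customer": ["c1"], "Store": ["t1"]}: an equality join
# on zip between any pair of indexed tables can be read off the index alone.
```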
6.3.3 Bitmap index
In a traditional index, each key value is associated with the list of RIDs of
the tuples having this value for the indexed column. RID lists can be quite long.
Moreover, when using multiple indexes for the same table, intersection, union or
complement operations must be performed on such lists. Therefore, alternative,
more efficient implementations of RID lists are relevant.
The notion of bitmap index has been proposed as an efficient implementation
of RID lists. Basically, the idea is to represent the list of RIDs associated with
a key value through a vector of bits. Such a vector, usually referred to as a bitmap,
has a number of elements equal to the number of tuples in the indexed table.
Each tuple in the indexed table is assigned a distinct, unique bit position in
the bitmap; this position is called the ordinal number of the tuple in the relation.
Different tuples have different bit positions, that is, different ordinal numbers.
The ith element of the bitmap associated with a key value is equal to 1 if the
tuple, whose ordinal number is i, has this value for the indexed column; it is
equal to 0 otherwise. Figure 6.11 presents an example of a bitmap index entry
for an index allocated on the column package_type of relation Product. Because
the Product relation has 150 tuples, the bitmap consists of 150 bits. Consider
the entry related to key value equal to A; the bitmap contains 1 in position 1
to denote that the tuple, whose ordinal number is 001, has such value for the
indexed column. By contrast, the bitmap contains 0 in position 2 to denote
that the tuple, whose ordinal number is 002, does not have such value for the
indexed column.
Figure 6.11. An example of a bitmap index entry.

The bitmap representation is very efficient when the number of key values
in the indexed column is low (as an example, consider a column sex of a table
Person having only two values: Female and Male) [O'Neil and Quass, 1997]. In
such a case, the number of 0's in each bitmap is not high. By contrast, when the
number of values in the indexed column is very high, the number of 1's in each
bitmap is quite low, thus resulting in sparsely populated bitmaps. Compression
techniques must then be used. The main advantage of bitmaps is that they
result in significant improvements in processing time, because operations such
as intersection, union and complement of RID lists can be performed very
efficiently by using bit arithmetic. Operations required to compute aggregate
functions, typically counting the number of RIDs in a list, are also performed
very efficiently on bitmaps. Another important advantage of bitmaps is that
they are suitable for parallel implementation [O'Neil and Quass, 1997].
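The bit-arithmetic advantage can be illustrated with Python integers standing in for bitmaps; the column values are invented, and a real implementation would use fixed-size, possibly compressed bit vectors rather than arbitrary-precision integers.

```python
# A sketch of a bitmap index: one integer bitmap per key value, with bit i-1
# set when the tuple with ordinal number i has that key value.

def bitmap_index(values):
    """values[i] is the indexed column of the tuple with ordinal number i+1."""
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

package = bitmap_index(["A", "B", "A", "A", "C", "C"])
size    = bitmap_index([30, 30, 20, 30, 20, 30])

# Tuples with package_type = 'A' AND size = 30: a single bitwise AND replaces
# an intersection of two RID lists.
hits = package["A"] & size[30]
matching = [i + 1 for i in range(6) if hits >> i & 1]   # ordinal numbers
count_a = bin(package["A"]).count("1")                  # COUNT via popcount
```

Here `matching` recovers the ordinal numbers [1, 4], and `count_a` shows how a counting aggregate reduces to a population count on the bitmap.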
Note that the bitmap representation can be combined with the join index
technique, thus resulting in a bitmap join index [O'Neil and Graefe, 1995]. An
entry in a bitmap join index, allocated on a fact table and a dimension table,
will associate the RID of a tuple t from the dimension table with the bitmap of
the tuples in the fact table that join with t.

Figure 6.12. An example of a bitmap join index entry.
6.3.4 Projection index
A projection index is an access structure whose aim is to reduce the cost of
projections. The basic idea of this technique is as follows. Consider a column
C of a table T. A projection index on C consists of a vector having a number of
elements equal to the cardinality of T. The ith element of the vector contains
the value of C for the ith tuple of T. This technique is thus based, as is
the bitmap representation, on assigning ordinal numbers to tuples in tables.

Figure 6.13. An example of a projection index.

Determining the value of column C for a tuple, given the ordinal number of
this tuple, is very efficient. It only requires accessing the ith entry of the
vector. When the key values have a fixed length, the secondary storage page
containing the relevant vector entry is determined by a simple offset calculation.
This calculation is a function of the number of entries of the vector that can be
stored per page and of the ordinal number of the tuple. When the key values have
varying lengths, alternative approaches are possible. A maximum length can
be fixed for the key values. Alternatively, a B-tree can be used, having as key
values the ordinal numbers of tuples and associating with each ordinal number
the corresponding value of column C. Figure 6.13 presents an example of a
projection index.
Projection indexes are very useful when very few columns of the fact table
must be returned by the query and the tuples of the fact table are very large or
not well clustered. For typical OLAP queries, projection indexes are typically
best used in combination with bitmap join indexes. Recall that a typical query
restricts the tuples in the fact table through selections on the dimension tables.
The ordinal numbers of the fact tuples satisfying the restrictions on the dimension
tables are retrieved from the bitmap join indexes. By using these ordinal num-
bers, projection indexes can then be accessed to perform the actual projection.
Note that the actual tuples of the fact table need not be accessed at all.
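This combination can be sketched as follows; the bitmap, ordinal numbers and unit_sales values are all invented, with the integer bitmap standing in for one entry of a hypothetical bitmap join index.

```python
# Answering a query from a bitmap join index entry plus a projection index,
# without ever touching the fact-table tuples themselves.

# Projection index on SALES.unit_sales: entry i-1 holds the value of the
# fact tuple with ordinal number i.
unit_sales = [50, 20, 30, 70, 50, 50, 70, 20]

# Bitmap from a bitmap join index entry: bit i-1 set for every fact tuple
# (ordinal number i) that joins with the restricted dimension tuple.
bitmap = 0b01000101          # fact tuples 1, 3 and 7

ordinals = [i + 1 for i in range(len(unit_sales)) if bitmap >> i & 1]
projected = [unit_sales[i - 1] for i in ordinals]
total = sum(projected)       # an aggregate computed straight off the index
```

Here the restriction yields ordinal numbers [1, 3, 7], the projection index supplies their unit_sales values [50, 30, 70], and the aggregate (150) is computed without a single fact-table access.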
6.4 Indexing techniques for the Web
In the past five years, the World Wide Web has completely reshaped the
world of communication, computing and information exchange. By introduc-
ing graphical user interfaces and an intuitively simple concept of navigation,
the Web facilitated access to the Internet, which for about ten years was re-
stricted to a few universities and research laboratories. The appearance of advanced
navigation tools like Netscape and Microsoft Explorer made it easy for everyone
on the Internet to roam, browse and contribute to the Web information space.
With the rapid explosion of the amount of data available through the Inter-
net, locating and retrieving relevant information becomes more difficult. To
facilitate retrieval of information, many Internet providers (for example, stock
markets, private companies, universities) offer users the possibility of using so-
called search engines, which facilitate the search process. Search engines offer a
simple interface for query formulation and refinement, and a wide range of
search options and result reporting.
Moreover, with the growth of data on the Web, a number of special services
have appeared on the Internet whose major goal is searching through many differ-
ent information sources. Even the raw information they return to users becomes
the starting point for retrieval of relevant information (for example, e-mail ad-
dresses, phone numbers, Frequently Asked Questions files). Popular general-
purpose searching tools, such as Altavista (http://www.altavista.com/),
Webcrawler (http://www.webcrawler.com/), InfoSeek (http://www.infoseek.com/),
and Excite (http://www.excite.com/) have become indispensable in the toolkit
of everybody working with Internet information sources.
Internet technology poses specific requirements on such tools, both in
terms of time and space. Some indexing techniques used in standard text
databases were adapted to meet those requirements. Also, several new ap-
proaches were developed to overcome some limitations of standard techniques.
In the remainder of this section we present a short overview and classification
of indexing methods used in some Internet information systems such as WAIS,
Gopher, Archie, which became popular in the late 80s and early 90s. Then we
discuss some problems related to search engines on the Web. We conclude the
section with a brief overview of the main ideas underlying the Internet spiders
which combine indexing and navigation techniques on the Web.
6.4.1 WAIS, Gopher, Archie, Whois++
The importance of searching the information available through the Internet
was realized by the Internet community from the very first years. Searching
and retrieval tools were growing in both quantity and quality together with
the growth of the Internet itself. Such popular tools as Archie, Gopher, Whois,
WAIS [Bowman et al., 1994, Cheong, 1996] represented a good starting point for
a new generation of the Internet searching tools. Archie is a tool which searches
for relevant information in a distributed collection of FTP sites.2 Gopher is
a distributed information system which makes available hierarchical campus-
wide data collections and provides a simple text search interface. Whois (and
its advanced version Whois++) is a popular tool to query Internet sources
about people and other entities (for example, domains, networks, and hosts).
WAIS (Wide Area Information Server) is a distributed service with a simple
natural-language interface for looking up information in Internet databases.
Indexing techniques used in those tools are quite different. In particular, the
various tools can be classified into three groups [Bowman et al., 1994], depending
on the amount of information which is included in the indexes. The first group
includes tools which have very space-efficient indexes, but which only represent
the names of the files or menus they index. For example, Archie and Veronica index
the file and menu names of FTP and Gopher servers. Because these indexes
are very compact, a single index is able to support advanced forms of search.
Yet, the range of queries that can be supported by these systems is limited to
file names only, and content-based searches are possible only when the names
happen to reflect some of the contents.
The second group includes systems providing full-text indexing of data lo-
cated at individual sites. For example, a WAIS index records every keyword
in a set of documents located at a single site. Similar indexes are available for
individual Gopher and WWW servers.
The third group includes systems adopting solutions which are a compro-
mise between the approaches adopted by the systems in the other two groups.
Systems in the third group represent some of the contents of the objects they
index, based on selection procedures for including important keywords or ex-
cluding less important keywords. For example, Whois++ indexes templates
that are manually constructed by site administrators wishing to describe the
resources at their sites.
6.4.2 Search engines
The two main types of search against text files are based on sequential searching
and inverted indexes. The sequential search works well only when the search
is limited to a small area. Most pattern-based search tools like Unix's grep
use the sequential search. Inverted indexes (see Chapter 5 for an extensive
presentation) are a common tool in information retrieval systems [Frakes and
Baeza-Yates, 1992]. An inverted index stores in a table all word occurrences in
the set of indexed documents and indexes the table using a hash method
or a B-tree structure. Inverted indexes are very efficient with respect to query
evaluation but have a storage occupancy which, in the worst case, may be
equal to the size of the original text. To reduce the size of the table storing
the word occurrences, advanced inverted indexes use the trie indexing method
[Mehlhorn and Tsakalidis, 1990], which stores together words with common
EMERGING APPLICATIONS 213
initial characters (like "call" and "capture"). Moreover, various compression
methods can reduce the index size to 10%-30% of the text size (see
Chapter 5).
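As an illustration, the table of word occurrences can be sketched with a Python dict standing in for the hash or B-tree structure; the whitespace tokenization here is a deliberate simplification:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents
    containing it. `docs` is a dict {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "call the capture routine", 2: "index the call"}
index = build_inverted_index(docs)
print(index["call"])     # [1, 2]
print(index["capture"])  # [1]
```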
Another drawback of standard inverted indexes is that their basic data
structure requires the exact spelling of the words in the query. Any misspelling
(for example, when typing "Bhattacharya" or "Clemençon") would result in an
empty result set. To find the correct spelling, users must try different
possibilities by hand, which is frustrating and time consuming.
An example of a search engine that tolerates misspellings is Glimpse
[Manber and Wu, 1994]. Glimpse is based on the agrep search program [Wu
and Manber, 1992], which is similar in use to Unix's grep. Essentially,
Glimpse is a hybrid of the sequential-search and inverted-index techniques.
It is index-based, but it uses sequential search (the agrep program) for
approximate matching when the search area is small. To accommodate possible
misspellings, it allows a specified number of errors, which can be insertions,
deletions, or substitutions of characters in a word. It also supports
wild cards, regular expressions and Boolean queries like OR and AND. In most
cases, Glimpse requires a very small index, 2%-4% of the original text. How-
ever, the cost of the combination of indexing and sequential search is a longer
response time. For most queries, the search in Glimpse takes 3-15 seconds.
Such response times are unacceptable for classical database applications but
are quite tolerable in most personal applications, like navigating through the
Web.
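The error model used by agrep and Glimpse — a bounded number of insertions, deletions, or substitutions — can be sketched with the classical edit-distance recurrence. This is only an illustration of the matching criterion; agrep itself uses much faster bit-parallel algorithms:

```python
def edit_distance(a, b):
    # Standard dynamic-programming edit distance: insertions,
    # deletions and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_search(query, words, k):
    """Return the indexed words within k errors of the query."""
    return [w for w in words if edit_distance(query, w) <= k]

vocabulary = ["Bhattacharya", "Clemencon", "capture"]
print(approx_search("Bhatacharya", vocabulary, 1))  # ['Bhattacharya']
```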
Intensive development of different techniques for indexing Web documents
has resulted in the appearance of a number of advanced search engines. They
offer a wide range of features for query formulation and provide small index
sizes along with fast response times. However, building metasearchers which
provide unified query interfaces to multiple search engines is still a hard task.
This is because most search engines are largely incompatible. They provide
different query languages and use proprietary algorithms for ranking documents,
which makes it hard to merge data from different sources. Moreover, they do
not export enough information about a source's contents, which could support
better query evaluation. All these problems have led to the Stanford protocol
proposal for Internet retrieval and search (STARTS) [Gravano et al., 1997]. This
proposal is a group effort involving 11 companies and organizations. The proto-
col addresses and analyzes metasearch requirements and describes the facilities
that a source needs to provide in order to help a metasearcher. If implemented,
STARTS can significantly streamline the implementation of metasearchers, as
well as enhance the functionality they can offer.
214 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
6.4.3 Internet spiders
Users usually navigate through the Web to find information and resources by
following hypertext links. As the Web continues to grow, users may need to
traverse more and more links to locate what they are looking for. Indexing
tools like search engines only help when searching on a single site or predefined
set of sites. Therefore, a new family of programs, often called Web robots or
spiders, has been developed with the aim of providing more powerful search
facilities. Web spiders combine browsing and indexing [Cheong, 1996]. They
traverse the Web space by following hypertext links and retrieve and index new
Web documents. The most well-known Internet spiders are WWW Worm, Web
Crawler and Harvest.
The World Wide Web Worm (http://wwww.cs.colorado.com/wwww/) was
the first widely used Internet spider. It navigates through Web pages and
builds an index of titles and hypertext links of over 100,000 Web documents.
It provides users with a search interface. Similar to the systems in the first
group in our classification, the WWW Worm does not index the content of
documents.
WebCrawler (http://www.webcrawler.com/) is a resource discovery tool which
can speedily search for resources on the Web. It is able to build indexes
on Web documents and to navigate automatically on demand. WebCrawler
uses an incomplete breadth-first traversal to create an index (on both titles and
data content) and relies on an automatic navigation mechanism to find the rest
of the information.
The Harvest project [Bowman et al., 1995] addresses the problem of how
to make effective use of the Web information in the face of a rapid growth
in data volume, user base and data diversity. One of the Harvest goals is
to coordinate retrieval of information among a number of agents. Harvest
provides a very efficient means of gathering and distributing index information
and supports the construction of very different types of indexes customized
to each particular information collection. In addition, Harvest also provides
caching and replication support and uses Glimpse as a search engine.
6.5 Indexing techniques for constraint databases
The main idea of constraint languages is to state a set of relations (constraints)
among a set of objects in a given domain. It is the task of the constraint
satisfaction system (or constraint solver) to find a solution satisfying these
relations. An example of a constraint is F = 1.8C + 32, where C and F are,
respectively, the Celsius and Fahrenheit temperatures. The constraint defines
the relation between F and C. Constraints have been used for different purposes;
for example, they have been successfully integrated with logic programming
[Jaffar and Lassez, 1987]. The constraint programming paradigm is fully declar-
ative, since it specifies computations by specifying how these computations are
constrained. Moreover, it is very attractive as often constraints represent the
communication language of several high-level applications.
Although constraints have been used in several fields, only recently has this
paradigm been applied to databases. Traditionally, constraints have been used
to express conditions on the semantic correctness of data. Those constraints are
usually referred to as semantic integrity constraints. Integrity constraints have
no computational implications. Indeed, they are not used to execute queries
(even if they can be used to improve execution performance); they are only
used to check the validity of the database.
Constraints intended in a broader sense have lately been used in database
systems. Constraints can be added to relational database systems at different
levels [Kanellakis et al., 1995]. At the data level, they finitely represent
infinite relational tuples. Different logical theories can be used to model
different information. For example, the constraint X < 2 ∧ Y > 3, where X and Y
are integer variables, represents the infinite set of tuples having the X attribute
lower than 2 and the Y attribute greater than 3. A quantifier-free conjunction
of constraints is called a generalized tuple, and the possibly infinite set of
relational tuples it represents is called the extension of the generalized tuple. A
finite set of generalized tuples is called a generalized relation. Thus, a general-
ized relation represents a possibly infinite set of relational tuples, obtained as
the union of the extension of the generalized tuples contained in the relation.
A generalized database is a set of generalized relations. When constraints are
used to retrieve data, they restrict the search space of the computation,
increasing the expressive power of simple relational languages by allowing
arithmetic computations.
Constraints are a powerful mechanism for modeling spatial [Paredaens, 1995,
Paredaens et al., 1994] and temporal concepts [Kabanza et al., 1990, Koubarakis,
1994], where often infinite information should be represented. Consider for ex-
ample a spatial database consisting of a set of rectangles in the plane. A
possible representation of this database in the relational model is that of hav-
ing a relation R, containing a tuple of the form (n, a, b, c, d) for each rectangle.
In such a tuple, n is the name of the rectangle with corners (a, b), (a, d), (c, b)
and (c, d). In the generalized relational model, rectangles can be represented by
generalized tuples of the form (Z = n) ∧ (a ≤ X ≤ c) ∧ (b ≤ Y ≤ d), where X
and Y are real variables. The latter representation is more suitable for a larger
class of operations. Figure 6.14 shows the rectangles representing the extension
of the generalized tuples contained in a generalized relation r1 (white) and in a
generalized relation r2 (shaded). r1 contains the following generalized tuples:
Figure 6.14. Relations r1 (white) and r2 (shaded).
r1,1 : 1 ≤ X ≤ 4 ∧ 1 ≤ Y ≤ 2
r1,2 : 2 ≤ X ≤ 7 ∧ 2 ≤ Y ≤ 3
r1,3 : 3 ≤ X ≤ 6 ∧ −1 ≤ Y ≤ 1.5.
r2 contains the following tuples:
r2,1 : −3 ≤ X ≤ −1 ∧ 1 ≤ Y ≤ 3
r2,2 : 5 ≤ X ≤ 6 ∧ −3 ≤ Y ≤ 0.
Usually, spatial data are represented using the linear constraint theory.
Linear constraints have the form p(X1, ..., Xn) θ 0, where p is a linear polynomial
with real coefficients in the variables X1, ..., Xn and θ ∈ {=, ≠, ≤, <, ≥, >}. This
class of constraints is of particular interest. Indeed, a wide range of applications
use linear polynomials. Moreover, linear polynomials have been investigated in
various fields (linear programming, computational geometry) and therefore
several techniques have been developed to deal with them [Lassez, 1990].
From a temporal perspective, constraints are very useful to represent
situations that are infinitely repeated in time. For example, we may think of a
train that leaves each day at the same time. In such cases, dense-order constraints
are often used. Dense-order constraints are formulas of the form X θ Y
or X θ c, where X, Y are variables, c is a constant, and θ ∈ {=, ≠, ≤, <, ≥, >}.
The domain D is a countably infinite set (for example, the rational numbers)
with a binary relation which is a dense linear order.
It has been recognized [Kanellakis et al., 1995] that the integration of con-
straints in traditional databases must not compromise the efficiency of the sys-
tem. In particular, constraint query languages should preserve all the good fea-
tures of relational languages. For example, they should be closed and bottom-
up evaluable. With respect to relational databases, constraint databases should
also preserve efficiency. Thus, data structures for querying and updating con-
straint databases must be developed, with time and space complexities com-
parable to those of data structures for relational databases. Complexity of the
various operations is expressed in terms of input-output (I/O) operations. An
I/O operation is the operation of reading or writing one block of data from or
to a disk. The other parameters are: B, the number of items (generalized
tuples) that can be stored in one page; n, the number of pages needed to store
N generalized tuples (thus, n = N/B); and t, the number of pages needed to
store the T generalized tuples representing the result of a query evaluation
(thus, t = T/B).
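As a small numeric illustration of these parameters, with hypothetical values for N, T and B:

```python
import math

def blocking_parameters(N, T, B):
    """Compute n and t, the page counts used in the I/O bounds:
    n pages hold the N generalized tuples and t pages hold the
    T tuples of the query result (rounded up to whole pages)."""
    n = math.ceil(N / B)
    t = math.ceil(T / B)
    return n, t

# Hypothetical relation: 10,000 generalized tuples, 50 per page,
# and a query whose result contains 250 of them.
print(blocking_parameters(10_000, 250, 50))  # (200, 5)
```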
At least two constraint language features should be supported by index struc-
tures:
• ALL selection. It retrieves all generalized tuples contained in a specified
generalized relation whose extension is contained in the extension of a given
generalized tuple, specified in the query (called query generalized tuple).
From a spatial point of view, such selection corresponds to a range query.
• EXIST selection. It retrieves all generalized tuples contained in a specified
generalized relation whose extension has a non-empty intersection with the
extension of a query generalized tuple. Equivalently, it finds a generalized
relation that represents all relational tuples, implicitly represented by the
input generalized relation, that satisfy the query generalized tuple.
From a spatial point of view, such selection corresponds to an intersection
query.
Consider for example the generalized tuples representing the objects presented
in Figure 6.14. The EXIST selection with respect to the query generalized
tuple Y ≤ X − 1 and relation r1 returns all three generalized tuples
r1,1, r1,2 and r1,3. The ALL selection with respect to the query generalized
tuple Y ≤ X − 1 and relation r1 returns only the generalized tuple r1,3.
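When generalized tuples describe axis-aligned rectangles and the query is a half-plane, both selections reduce to simple corner tests. The following sketch (the corner choice assumes a non-negative slope a) reproduces the example above:

```python
def exist_select(rects, a, b):
    """EXIST selection for a half-plane query Y <= a*X + b (a >= 0):
    a rectangle (xmin, xmax, ymin, ymax) meets the half-plane iff its
    most favorable corner does, i.e. ymin <= a*xmax + b."""
    return [name for name, (x1, x2, y1, y2) in rects.items()
            if y1 <= a * x2 + b]

def all_select(rects, a, b):
    """ALL selection: the rectangle lies entirely inside the half-plane
    iff its least favorable corner does, i.e. ymax <= a*xmin + b."""
    return [name for name, (x1, x2, y1, y2) in rects.items()
            if y2 <= a * x1 + b]

# Relation r1 from Figure 6.14, queried with Y <= X - 1 (a = 1, b = -1).
r1 = {"r1,1": (1, 4, 1, 2), "r1,2": (2, 7, 2, 3), "r1,3": (3, 6, -1, 1.5)}
print(exist_select(r1, 1, -1))  # ['r1,1', 'r1,2', 'r1,3']
print(all_select(r1, 1, -1))    # ['r1,3']
```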
As constraints support the representation of infinite information, data struc-
tures defined to index relations (such as B-trees and B+-trees [Bayer and Mc-
Creight, 1972, Comer, 1979]) cannot be used in constraint databases, since they
rely on the assumption that the number of tuples is finite. For this reason, spe-
cific classes of constraints for which efficient indexing data structures can be
provided must be determined.
Due to the analogies between constraint databases and spatial databases,
efficient indexing techniques developed for spatial databases can often be ap-
plied to (linear) constraint databases. Efficient data structures are usually
218 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
required to process queries in O(log_B n + t) I/O operations, use O(n) blocks
of secondary storage, and perform insertions and deletions in O(log_B n) I/O
operations (this is the case for B-trees and B+-trees). Note that all complexities
are worst-case. For spatial problems, by contrast, data structures with optimal
worst-case complexity have been proposed only for some specific problems, in
general dealing with 1- or 2-dimensional spatial objects. Nevertheless, several
data structures proposed for management of spatial data behave quite well on
the average for different source data. Examples of such data structures are grid
files [Nievergelt et al., 1984], various quad-trees [Samet, 1989], z-orders [Oren-
stein, 1986], hB-trees [Lomet and Salzberg, 1990a], cell-trees [Gunther, 1989],
and various R-trees [Guttman, 1984, Sellis et al., 1987] (see Chapter 2).
Symmetrically, in the context of constraint databases two different classes of
techniques have been proposed, the first consisting of techniques with optimal
worst-case complexity, and the second consisting of techniques with good aver-
age bounds. Techniques belonging to the first class apply to (linear) generalized
tuples representing 1- or 2-dimensional spatial objects and often optimize only
the EXIST selection. Techniques belonging to the second class make it possible
to index more general generalized tuples by applying some approximation. In
the following, both approaches are surveyed.
6.5.1 Generalized 1-dimensional indexing
In relational databases, the 1-dimensional searching problem on a relational
attribute X is defined as follows:
Find all tuples such that their X attribute satisfies the condition a1 ≤ X ≤ a2.
The problem of 1-dimensional searching on a relational attribute X can be
reformulated in constraint databases, defining the problem of 1-dimensional
searching on the generalized relational attribute X, as follows:
Find a generalized relation that represents all tuples of the input generalized
relation such that their X attribute satisfies the condition a1 ≤ X ≤ a2.
A first trivial, but inefficient, solution to the generalized 1-dimensional
searching problem is to add the query range condition to each generalized tuple. In
this case, the new generalized tuples represent all the relational tuples whose
X attribute is between a1 and a2. This approach introduces a high level of
redundancy in the constraint representation. Moreover, several inconsistent
(with empty extension) generalized tuples can be generated.
A better solution can be defined for convex theories. A theory Φ is convex
if the projection of any generalized tuple defined using Φ on each variable X is
a single interval b1 ≤ X ≤ b2. This is true when the extension of the generalized
tuple represents a convex set. The dense-order theory and the real polynomial
inequality constraint theory are examples of convex theories. The solution is
based on the definition of a generalized 1-dimensional index on X as a set of
intervals, where each interval is associated with a set of generalized tuples and
represents the value of the search key for those tuples. Thus, each interval in
the index is the projection on the attribute X of a generalized tuple. By using
the above index, the determination of a generalized relation, representing all
tuples from the input generalized relation such that their X attribute satisfies a
given range condition a1 ≤ X ≤ a2, can be performed by adding the condition
to only those generalized tuples whose associated interval has a non-empty
intersection with a1 ≤ X ≤ a2. Insertion (deletion) of a given generalized tuple
is performed by computing its projection and inserting (deleting) the obtained
interval into (from) a set of intervals.
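A minimal sketch of such a generalized 1-dimensional index, with a plain list standing in for the external-memory interval structure:

```python
class Generalized1DIndex:
    """Index each generalized tuple by its X-projection interval
    (the convexity assumption guarantees a single interval per tuple).
    A list with a linear scan stands in for the interval structure a
    real implementation would use."""
    def __init__(self):
        self.entries = []  # (lo, hi, tuple_id)

    def insert(self, lo, hi, tuple_id):
        self.entries.append((lo, hi, tuple_id))

    def range_select(self, a1, a2):
        # Return the tuples to which the condition a1 <= X <= a2 must
        # be added: exactly those whose projection meets [a1, a2].
        return sorted(t for lo, hi, t in self.entries
                      if lo <= a2 and hi >= a1)

idx = Generalized1DIndex()
idx.insert(1, 4, "t1")
idx.insert(6, 9, "t2")
idx.insert(3, 7, "t3")
print(idx.range_select(5, 6))  # ['t2', 't3']
```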
From the previous discussion it follows that the generalized 1-dimensional
indexing problem reduces to the dynamic interval management problem on
secondary storage. Dynamic interval management is a well-known problem in
computational geometry, with many optimal solutions in internal memory
[Chiang and Tamassia, 1992]. Secondary storage solutions for the same problem
are, however, non-trivial, even for the static case. In the following, we survey
some of the proposed solutions for secondary storage.
Reduction to stabbing queries. A first class of proposals is based on the
reduction of the interval intersection problem to the stabbing query problem
[Chiang and Tamassia, 1992]. Given a set of 1-dimensional intervals, to answer
a stabbing query with respect to a point x, all intervals that contain x must be
reported.
The main idea of the reduction is the following [Kanellakis and Ramaswamy,
1996]. Intervals that intersect a query interval fall into four categories (see
Figure 6.15). Categories (1) and (2) can be easily located by sorting all the
intervals with respect to their left endpoint and using a B+-tree to locate all
intervals whose first endpoint lies in the query interval. Categories (3) and (4)
can be located by finding all data intervals which contain the first endpoint of
the query interval. This search represents a stabbing query.
By regarding an interval [x1, x2] as the point (x1, x2) in the plane, a stabbing
query reduces to a special case of the 2-dimensional range searching problem.
Indeed, all points (x1, x2) corresponding to intervals lie above the line X = Y.
An interval [x1, x2] belongs to a stabbing query with respect to a point x if and
only if the corresponding point (x1, x2) is contained in the region of the plane
represented by the constraint X ≤ x ∧ Y ≥ x. Such 2-sided queries have their
corner on the line X = Y. For this reason, they are called diagonal corner
queries (see Figure 6.16).
Figure 6.15. Categories of possible intersections of a query interval with a
database of intervals.
Figure 6.16. Reduction of the interval intersection problem to a diagonal-corner
searching problem with respect to x.
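The interval-to-point reduction can be sketched as follows; a linear scan over the transformed points stands in for a dedicated index structure such as the meta-block tree discussed next:

```python
def stabbing_query(intervals, x):
    """Answer a stabbing query through the point transformation: an
    interval [x1, x2] becomes the point (x1, x2), and the query becomes
    the diagonal corner query X <= x AND Y >= x, whose corner (x, x)
    lies on the line X = Y."""
    points = [(x1, x2) for x1, x2 in intervals]
    return [(x1, x2) for x1, x2 in points if x1 <= x and x2 >= x]

print(stabbing_query([(1, 5), (4, 9), (6, 8)], 5))  # [(1, 5), (4, 9)]
```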
The first data structure proposed to solve diagonal-corner queries is the
meta-block tree; it does not support deletions (it is semi-dynamic)
[Kanellakis and Ramaswamy, 1996]. The meta-block tree is fairly
complicated, has optimal worst-case space O(n) and optimal I/O query time
O(log_B n + t). Moreover, it has O(log_B n + (log_B^2 n)/B) amortized insert
I/O time.
A dynamic (thus, also supporting deletions) optimal solution to the stabbing
query problem [Arge and Vitter, 1996] is based on the definition of an
external memory version of the internal memory interval tree. The interval
tree for internal memory is a data structure to answer stabbing queries and
to store and update a set of intervals in optimal time [Chiang and Tamassia,
1992]. It consists of a binary tree over the interval endpoints. Intervals are
stored in secondary structures associated with internal nodes of the binary
tree. The extension of such a data structure to secondary storage entails two
issues. First, the fan-out of nodes must be increased. The fan-out that has
been chosen is √B [Arge and Vitter, 1996]. This fan-out makes it possible to
store all the needed information in internal nodes, increasing the height of the
tree only by a factor of two. If interval endpoints belong to a fixed set E, the
binary tree is replaced by a balanced tree, having √B as branching factor, over
the endpoints E. Each leaf represents B consecutive points from E. Segments
are associated with nodes, generalizing the idea of the internal memory data
structure. However, since a node now contains more endpoints, more than two
secondary structures are required to store the segments associated with a node.
The main problem of the previous structure is that it requires the interval
endpoints to belong to a fixed set. In order to remove this assumption, the
weight-balanced B-tree has been
introduced [Arge and Vitter, 1996]. The main difference between a B-tree and
a weight-balanced B-tree is that in the first case, for each internal node, the
number of children is fixed; in the second case, only the weight, that is, the
number of items stored under each node, is fixed. The weight-balanced B-tree
makes it possible to remove the assumption on the interval endpoints, while
still retaining optimal worst-case bounds for stabbing queries.
Revisiting Chazelle's algorithm. The solutions described above to solve
stabbing queries in secondary storage are fairly complex and rely on reducing
the interval intersection problem to special cases of the 2-dimensional
range searching problem. A different and much simpler approach to solve the
static (thus, not supporting insertions and deletions) generalized 1-dimensional
searching problem [Ramaswamy, 1997] is based on an algorithm developed by
Chazelle [Chazelle, 1986] for interval intersection in main memory and uses
only B+-trees, achieving optimal time and using linear space.
The proposed technique relies on the following consideration. A straightfor-
ward method to solve a stabbing query consists of identifying the set of unique
endpoints of the set of input intervals. Each endpoint is associated with the set
of intervals that contain such endpoint. These sets can then be indexed using
a B+-tree, taking endpoints as key values. To answer a stabbing query it is
sufficient to look for the endpoint nearest to the query point, on the right, and
examine the intervals associated with it, reporting those intervals that intersect
the query point.
This method is able to answer stabbing queries in O(log_B n). However, it
requires O(n^2) space. It has been shown [Ramaswamy, 1997] that the space
complexity can be reduced to O(n) by appropriately choosing the considered
endpoints. More precisely, let e1, e2, ..., e2n be the ordered list of all endpoints.
A set of windows W1, ..., Wp is constructed over endpoints w1 = e1, ...,
wp+1 = e2n such that Wj = [wj, wj+1), j = 1, ..., p. Thus, the windows
partition the interval between e1 and e2n into p contiguous intervals. Each
window Wj is associated with the list of intervals that intersect Wj.
The window-lists can be stored in a B+-tree, using their starting points as key
values. A stabbing query at point p can be answered by searching for the
query point and retrieving the window-list associated with the window that
p falls into. Each interval contained in that list is then examined, reporting
only the intervals intersecting the query point. Algorithms have been
proposed [Ramaswamy, 1997] to construct the windows appropriately, so that
queries can be answered by applying the previous algorithm in O(log_B n),
using only O(n) pages.
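A sketch of the simple variant described first — one elementary window per gap between consecutive distinct endpoints, so without the window-merging step that gives Chazelle's construction its O(n) space bound:

```python
import bisect

def build_window_lists(intervals):
    """Build one elementary window per gap between consecutive
    distinct endpoints, each associated with the intervals that
    intersect it. (The O(n)-space construction would merge windows;
    keeping every elementary window is the simple variant.)"""
    endpoints = sorted({e for iv in intervals for e in iv})
    windows = list(zip(endpoints, endpoints[1:]))
    lists = [[iv for iv in intervals if iv[0] <= hi and iv[1] >= lo]
             for lo, hi in windows]
    return endpoints, windows, lists

def stab(endpoints, windows, lists, p):
    # Locate the window containing p (a B+-tree search in the real
    # design), then report the intervals in its list that contain p.
    j = bisect.bisect_right(endpoints, p) - 1
    j = min(max(j, 0), len(windows) - 1)
    return [iv for iv in lists[j] if iv[0] <= p <= iv[1]]

eps, ws, ls = build_window_lists([(1, 5), (4, 9), (6, 8)])
print(stab(eps, ws, ls, 7))  # [(4, 9), (6, 8)]
```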
6.5.2 Indexing 2-dimensional linear constraints
The approaches briefly illustrated in Subsection 6.5.1 rely on the assumption
that index values are represented by intervals. Thus, they are able to index
generalized tuples using information about only one variable. Less work has
been done on defining techniques for 2-dimensional generalized tuples with
optimal worst-case complexity. One of these techniques [Bertino et al., 1997]
deals with index values represented by generalized tuples with two variables,
say X and Y, having the form C1 ∧ ... ∧ Cn, where each Ci, i = 1, ..., n,
has the form Ci ≡ Y θ aiX + bi, θ ∈ {≤, ≥}.
Besides the application to different types of generalized tuples, the main dif-
ference of this technique with respect to the ones presented in Subsection 6.5.1
is that it is defined for solving not only EXIST selection but also ALL selection.
In both cases, the query generalized tuple must represent a half-plane.
The main novelty of the approach is the reduction of both EXIST and ALL
selection problems, under the above assumptions, to a point location problem
from computational geometry [Preparata and Shamos, 1985]. The proof of such
reduction is based on the transformation of the extension of generalized tuples
from a primal plane to a dual plane. In particular, each generalized tuple is
transformed into a pair of non-intersecting, but possibly touching, open polygons³
in the plane, whereas a half-plane Y θ aX + b, θ ∈ {≤, ≥}, is translated into
the point (a, b).
This translation satisfies an interesting property. Indeed, the EXIST and the
ALL selection problems with respect to a half-plane query Y θ aX + b reduce
to the point location problem of the point (a, b) with respect to the constructed
open polygons. In particular, it can be shown that the point (a, b) belongs to
one of the open polygons constructed for a generalized tuple t iff the
line Y = aX + b does not intersect the interior of the figure representing the
extension of t (see Figure 6.17). Using this property, point location algorithms
for the dual plane, equivalent to the EXIST and ALL selections in the Euclidean
plane, have been proposed.
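The stated property can be checked directly in the primal plane: the line Y = aX + b misses the interior of a convex extension iff all of its vertices lie on one closed side of the line. A sketch, using rectangle r1,3 from Figure 6.14 (the dual-plane point location answers the same test in logarithmic time):

```python
def line_misses_interior(vertices, a, b):
    """True iff the line Y = a*X + b does not cut the interior of the
    convex polygon given by its vertices: all vertices must lie on one
    closed side of the line."""
    sides = [y - (a * x + b) for x, y in vertices]
    return all(s >= 0 for s in sides) or all(s <= 0 for s in sides)

rect = [(3, -1), (6, -1), (6, 1.5), (3, 1.5)]  # rectangle r1,3
print(line_misses_interior(rect, 1, -1))   # True: Y = X - 1 stays outside
print(line_misses_interior(rect, 0, 0))    # False: Y = 0 cuts through it
```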
The same open polygons have then been used to show that an optimal
dynamic solution to the ALL and EXIST selection problems exists, using simple
data structures such as B+-trees, if the angular coefficient of the line associated
with the half-plane query belongs to a predefined set.
6.5.3 Filtering
To facilitate the definition of indexing structures for arbitrary objects in spatial
databases, a filtering approach is often used. The same approach can be used
in constraint databases to index generalized tuples with complex extension.
Figure 6.17. (a) A polygon p representing the extension of a linear generalized tuple;
(b) a pair of open polygons representing p in the dual plane, together with the points
representing the lines q1, q2, q3, q4 in the dual plane.
Under the filtering approach, an object is approximated by using some other
object, having a simpler shape. The approximated objects are then used as
index objects. The evaluation of a query under such approach consists of two
steps, filtering and refinement. In the filtering step, an index is used to retrieve
only relevant objects, with respect to a certain query. To this purpose, the
approximated figures are used instead of the objects themselves. During the
refinement step, the set of objects retrieved by the filtering step is directly tested
with respect to the query, to determine the exact result. The main issue here
is the definition of "good" approximating objects, ensuring a specific degree of
filtering.
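A minimal sketch of filter-and-refine, with hypothetical circle objects approximated by bounding boxes (the object names and the exact test are illustrative only):

```python
def boxes_intersect(a, b):
    """Axis-aligned boxes (x1, y1, x2, y2) intersect iff they overlap
    on both axes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def filter_and_refine(objects, query_box, exact_test):
    """Two-step evaluation: filtering keeps only the objects whose
    approximation meets the query box; refinement runs the exact,
    possibly expensive, geometric test on the survivors."""
    candidates = [o for o in objects if boxes_intersect(o["box"], query_box)]
    return [o["name"] for o in candidates if exact_test(o)]

# Hypothetical objects: a disk approximated by its bounding box.
objects = [
    {"name": "c1", "box": (0, 0, 2, 2), "center": (1, 1), "r": 1},
    {"name": "c2", "box": (5, 5, 7, 7), "center": (6, 6), "r": 1},
]
query = (1.5, 1.5, 3, 3)

def circle_meets_query(o):
    # Exact test: the distance from the circle's center to the nearest
    # point of the query box must not exceed the radius.
    cx, cy = o["center"]
    qx1, qy1, qx2, qy2 = query
    nx = min(max(cx, qx1), qx2)
    ny = min(max(cy, qy1), qy2)
    return (cx - nx) ** 2 + (cy - ny) ** 2 <= o["r"] ** 2

print(filter_and_refine(objects, query, circle_meets_query))  # ['c1']
```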
The minimum bounding box (MBB) is commonly used in spatial databases to
filter objects. In 2-dimensional space, the MBB of a given object is
the smallest rectangle that encloses the object and whose edges are perpendicu-
lar to the standard coordinate axes. The previous definition can be generalized
to higher dimensions in a straightforward manner.
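Computing the MBB of a 2-dimensional object given by its vertices is straightforward; a sketch:

```python
def mbb(points):
    """Minimum bounding box of a point set (for a polygon, pass its
    vertex list): the smallest axis-aligned rectangle enclosing it,
    returned as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

triangle = [(1, 1), (4, 1), (2, 5)]
print(mbb(triangle))  # (1, 1, 4, 5)
```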
The filtering method based on MBB is simple and has a number of advan-
tages over index methods working directly on objects:
• It has a low storage cost, because only a small number of intervals are main-
tained in addition to each object.
• There is a clear separation between the complexity of the object geometry
and the complexity of the search. Index structures for (multidimensional) in-
tervals have better worst-case performance with respect to index techniques
working on arbitrary objects. Indeed, several index structures having close to
optimal worst-case bounds for managing (multidimensional) intervals have
been proposed (see Chapter 2). However, similar approaches have not been
defined yet for arbitrary objects.
The filtering approach based on MBBs, although appealing, has some
drawbacks. In particular, it may be ineffective if the set of objects returned by
the filtering step is too large: this means that there are too many intersecting
MBBs. Moreover, it does not scale well to high dimensions. The issue of
handling objects in spaces of high dimension is less crucial for spatial databases,
where we can generally rely on a dimension of 3 or less, but it is critical for
constraint databases.
In order to improve the selectivity of filtering, an approach has been pro-
posed, based on the notion of minimum bounding polybox [Brodsky et al., 1996].
A minimum bounding polybox for an object O is the minimum convex
polyhedron that encloses O and whose facets are normal to preselected axes. These
axes are not necessarily the standard coordinate axes and, furthermore, their
number is not determined by the dimension of the space. Algorithms for com-
puting optimal axes (according to specific optimality criteria with respect to
storage overhead or filtering rate) in d-dimensions have also been proposed
[Brodsky et al., 1996].
Notes
1. We assume that buckets are numbered starting from 0.
2. FTP is the Internet standard high-level protocol for file transfer.
3. An open polygon is a finite chain of line segments with the first and last segments
approaching ∞. An open polygon is upward (downward) open if both segments approach
+∞ (−∞).
References
Abel, D. J. and Smith, J. L. (1983). A data structure and algorithm based
on a linear key for a rectangle retrieval problem. International Journal of
Computer Vision, Graphics and Image Processing, 24(1):1-13.
Abel, D. J. and Smith, J. L. (1984). A data structure and query algorithm for
a database of areal entities. Australian Computing Journal, 16(4):147-154.
Achyutuni, K. J., Omiecinski, E., and Navathe, S. (1996). Two techniques for
on-line index modification in shared-nothing parallel systems. In Proc. 1996
ACM SIGMOD International Conference on Management of Data, pages
125-136.
Ang, C. and Tan, K. (1995). The Interval B-tree. Information Processing Let-
ters, 53(2):85-89.
Arge, L. and Vitter, J. (1996). Optimal dynamic interval management in
external memory. In Proc. 37th Symposium on Foundations of Computer
Science, pages 560-569.
Aslandogan, Y. A., Yu, C., Liu, C., and Nair, K. R. (1995). Design, implemen-
tation and evaluation of SCORE. In Proc. 11th International Conference on
Data Engineering, pages 280-287.
Bancilhon, F. and Ferran, G. (1994). ODMG-93: The object database standard.
IEEE Bulletin on Data Engineering, 17(4):3-14.
Banerjee, J. and Kim, W. (1986). Supporting VLSI geometry operations in a
database system. In Proc. 3rd International Conference on Data Engineer-
ing, pages 409-415.
Bartels, D. (1996). ODMG93 - The emerging object database standard. In
Proc. 12th International Conference on Data Engineering, pages 674-676.
Bayer, R. and McCreight, E. (1972). Organization and maintenance of large
ordered indices. Acta Informatica, 1(3):173-189.
Bayer, R. and Schkolnick, M. (1977). Concurrency of operations on B-trees.
Acta Informatica, 9:1-21.
Beck, J. (1967). Perceptual grouping produced by line figures. Perception and
Psychophysics, 2:491-495.
Becker, B., Gschwind, S., Ohler, T., Seeger, B., and Widmayer, P. (1993). On op-
timal multiversion access structures. In Proc. 3rd International Symposium
on Large Spatial Databases, pages 123-141.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985a). Empirical com-
parison of associative file structures. In Proc. International Conference on
Foundations of Data Organization, pages 315-319.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985b). An experiment
with balanced and unbalanced k-d trees for associative retrieval. In Proc.
9th International Conference on Computer Software and Applications, pages
256-262.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985c). Multikey retrieval
from k-d trees and quad trees. In Proc. 1985 ACM SIGMOD International
Conference on Management of Data, pages 291-301.
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-
tree: An efficient and robust access method for points and rectangles. In
Proc. 1990 ACM SIGMOD International Conference on Management of
Data, pages 322-331.
Belkin, N. and Croft, W. (1992). Information filtering and information retrieval:
Two sides of the same coin? Communications of the ACM, 35(12):29-38.
Bell, T., Moffat, A., Nevill-Manning, C., Witten, I., and Zobel, J. (1993). Data
compression in full-text retrieval systems. Journal of the American Society
for Information Science, 44(9):508-531.
Bell, T., Moffat, A., Witten, I., and Zobel, J. (1995). The MG retrieval system:
Compressing for space and speed. Communications of the ACM, 38(4):41-42.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative
searching. Communications of the ACM, 18(9):509-517.
Bentley, J. L. (1979a). Decomposable searching problems. Information Process-
ing Letters, 8(5):244-251.
Bentley, J. L. (1979b). Multidimensional binary search trees in database appli-
cations. IEEE Transactions on Software Engineering, 5(4):333-340.
Bentley, J. L. and Friedman, J. H. (1979). Data structures for range searching.
ACM Computing Surveys, 11(4):397-409.
Berchtold, S., Keim, D., and Kriegel, H. (1996). The X-tree: An index structure
for high-dimensional data. In Proc. 22nd International Conference on Very
Large Data Bases, pages 28-39.
Bertino, E. (1990). Query optimization using nested indices. In Proc. 2nd In-
ternational Conference on Extending Database Technology, pages 44-59.
Bertino, E. (1991a). An indexing technique for object-oriented databases. In
Proc. 7th International Conference on Data Engineering, pages 160-170.
Bertino, E. (1991b). Method precomputation in object-oriented databases. In
Proc. ACM-SIGOIS and IEEE-TC-OA International Conference on Orga-
nizational Computing Systems, pages 199-212.
Bertino, E. (1994). On indexing configuration in object-oriented databases.
VLDB Journal, 3(3):355-399.
Bertino, E., Catania, B., and Shidlovsky, B. (1997). Towards optimal two-
dimensional indexing for constraint databases. Technical Report TR-196-97,
Dipartimento di Scienze dell'Informazione, University of Milano, Italy.
Bertino, E. and Foscoli, P. (1995). Index organizations for object-oriented
database systems. IEEE Transactions on Knowledge and Data Engineering,
7(2):193-209.
Bertino, E. and Guglielmina, C. (1991). Optimization of object-oriented queries
using path indices. In Proc. International IEEE Workshop on Research Is-
sues on Data Engineering: Transaction and Query Processing, pages 140-
149.
Bertino, E. and Guglielmina, C. (1993). Path-index: An approach to the effi-
cient execution of object-oriented queries. Data and Knowledge Engineering,
6(1):239-256.
Bertino, E. and Kim, W. (1989). Indexing techniques for queries on nested
objects. IEEE Transactions on Knowledge and Data Engineering, 1(2):196-
214.
Bertino, E. and Martino, L. (1993). Object-Oriented Database Systems - Con-
cepts and Architectures. Addison-Wesley.
Bertino, E. and Quarati, A. (1991). An approach to support method invoca-
tions in object-oriented queries. In Proc. International IEEE Workshop on
Research Issues on Data Engineering: Transaction and Query Processing,
pages 163-169.
Blanken, H., Ijbema, A., Meek, P., and Akker, B. (1990). The generalized grid
file: Description and performance aspects. In Proc. 6th International Con-
ference on Data Engineering, pages 380-388.
Bookstein, A., Klein, S., and Raita, T. (1992). Model based concordance com-
pression. In Proc. IEEE Data Compression Conference, pages 82-91.
Bowman, C., Danzig, P., Hardy, D., Manber, U., and Schwartz, M. (1995). The
harvest information discovery and access system. Computer Networks and
ISDN Systems, 28(1-2):119-125.
Bowman, C., Danzig, P., Manber, U., and Schwartz, M. (1994). Scalable inter-
net discovery: Research problems and approaches. Communications of the
ACM, 37(8):98-107.
Bratley, P. and Choueka, Y. (1982). Processing truncated terms in document
retrieval systems. Information Processing & Management, 18(5):257-266.
Bretl, R., Maier, D., Otis, A., Penney, D., Schuchardt, B., Stein, J., Williams,
E., and Williams, M. (1989). The GemStone data management system.
In Object-Oriented Concepts, Databases, and Applications, pages 283-308.
Addison-Wesley.
Brinkhoff, T., Kriegel, H.-P., Schneider, R., and Seeger, B. (1994). Multi-step
processing of spatial joins. In Proc. 1994 ACM SIGMOD International Con-
ference on Management of Data, pages 197-208.
Brodsky, A., Lassez, C., Lassez, J., and Maher, M. (1996). Separability of poly-
hedra and a new approach to spatial storage. In Proc. 14th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, pages
54-65.
Brown, E. (1995). Fast evaluation of structured queries for information retrieval.
In Proc. 18th ACM-SIGIR International Conference on Research and De-
velopment in Information Retrieval, pages 30-38.
Buckley, C. and Lewit, A. (1985). Optimization of inverted vector searches. In
Proc. 8th ACM-SIGIR International Conference on Research and Develop-
ment in Information Retrieval, pages 97-110.
Burkowski, F. (1992). An algebra for hierarchically organized text-dominated
databases. Information Processing & Management, 28(3):333-348.
Callan, J. (1994). Passage-level evidence in document retrieval. In Proc. 17th
ACM-SIGIR International Conference on Research and Development in In-
formation Retrieval, pages 302-309.
Cattell, R. (1993). The Object Database Standard: ODMG-93 Release 1.2. Mor-
gan Kaufmann Publishers.
Cesarini, F. and Soda, G. (1982). Binary trees paging. Information Systems,
7(4):337-344.
Chan, C., Goh, C., and Ooi, B. C. (1997). Indexing OODB instances based on
access proximity. In Proc. 13th International Conference on Data Engineer-
ing, pages 14-21.
Chan, C. Y., Ooi, B. C., and Lu, H. (1992). Extensible buffer management of
indexes. In Proc. 18th International Conference on Very Large Data Bases,
pages 444-454.
Chang, J. M. and Fu, K. S. (1979). Extended k-d tree database organization:
A dynamic multi-attribute clustering method. In Proc. 3rd International
Conference on Computer Software and Applications, pages 39-43.
Chang, S. K. and Fu, K. S., editors (1980). Pictorial Information Systems.
Springer-Verlag.
Chang, S. K. and Hsu, A. (1992). Image information systems: Where do we
go from here? IEEE Transactions on Knowledge and Data Engineering,
4(5):431-442.
Chang, S. K., Jungert, E., and Li, Y. (1989). Representation and retrieval of
symbolic pictures using generalized 2D strings. In Proc. Visual Communi-
cations and Image Processing Conference, pages 1360-1372.
Chang, S. K., Shi, Q. Y., and Yan, C. W. (1987). Iconic indexing by 2-D strings.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(3):413-
428.
Chang, S. K., Yan, C. W., Dimitroff, D. C., and Arndt, T. (1988). An intel-
ligent image database system. IEEE Transactions on Software Engineering,
15(5):681-688.
Chaudhuri, S. and Dayal, U. (1996). Decision support, data warehousing, and
OLAP (tutorial notes). In Proc. 22nd International Conference on Very Large
Data Bases.
Chazelle, B. (1986). Filtering search: A new approach to query-answering.
SIAM Journal of Computing, 15(3):703-724.
Cheong, C. (1996). Internet agents. New Riders - Macmillan Publishing.
Chiang, Y. and Tamassia, R. (1992). Dynamic algorithms in computational
geometry. Proceedings of the IEEE, 80(9):1412-1434.
Chiu, D. K. Y. and Kolodziejczak, T. (1986). Synthesizing knowledge: A cluster
analysis approach using event-covering. IEEE Transactions on Systems, Man
and Cybernetics, 16(2):462-467.
Choenni, S., Bertino, E., Blanken, H., and Chang, T. (1994). On the selection
of optimal index configuration in 00 databases. In Proc. 10th International
Conference on Data Engineering, pages 526-537.
Choueka, Y., Fraenkel, A., and Klein, S. (1988). Compression of concordances in
full-text retrieval systems. In Proc. 11th ACM-SIGIR International Confer-
ence on Research and Development in Information Retrieval, pages 597-612.
Choy, D. and Mohan, C. (1996). Locking protocols for two-tier indexing of
partitioned data. In Proc. International Workshop on Advanced Transaction
Models and Architectures, pages 198-215.
Chua, T. S., Lim, S. K., and Pung, H. K. (1994). Content-based retrieval of
segmented images. In Proc. 2nd ACM Multimedia Conference, pages 211-
218.
Chua, T. S., Tan, K. L., and Ooi, B. C. (1997). Fast signature-based color-
spatial image retrieval. In Proc. 4th International Conference on Multimedia
Computing and Systems.
Chua, T. S., Teo, K. C., Ooi, B. C., and Tan, K. L. (1996). Using domain
knowledge in querying image database. In Proc. 3rd Multimedia Modeling
Conference, pages 339-354.
Clarke, C., Cormack, G., and Burkowski, F. (1995). An algebra for structured
text search and a framework for its implementation. Computer Journal,
38(1):43-56.
Cluet, S., Delobel, C., Lecluse, C., and Richard, P. (1989). Reloop, an algebra
based query language for an object-oriented database system. In Proc. 1st
International Conference on Deductive and Object Oriented Databases, pages
313-332.
Comer, D. (1979). The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-
137.
Costagliola, G., Tucci, M., and Chang, S. K. (1992). Representing and retrieving
symbolic pictures by spatial relations. In Visual Database Systems II, pages
49-59.
Dao, T., Sacks-Davis, R., and Thom, J. (1996). Indexing structured text for
queries on containment relationships. In Proc. 7th Australasian Database
Conference, pages 82-91.
Deux, O. (1990). The story of O2. IEEE Transactions on Knowledge and Data
Engineering, 2(1):91-108.
Eastman, C. M. and Zemankova, M. (1982). Partially specified nearest neighbor
using kd trees. Information Processing Letters, 15(2):53-56.
Easton, M. (1986). Key-sequence data sets in indelible storage. IBM Journal
of Research and Development, 30(12).
Edelsbrunner, H. (1983). A new approach to rectangular intersection. Interna-
tional Journal of Computational Mathematics, 13:209-219.
Edelstein, H. (1995). Faster data warehouses. In Information Week, pages 77-
88.
Elias, P. (1975). Universal codeword sets and representations of the integers.
IEEE Transactions on Information Theory, IT-21(2):194-203.
Elmasri, R., Wuu, G. T., and Kouramajian, V. (1990). The Time Index: An
access structure for temporal data. In Proc. 16th International Conference
on Very Large Data Bases, pages 1-12.
Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979). Extendible
hashing - A fast access method for dynamic files. ACM Transactions on
Database Systems, 4(3):315-344.
Faloutsos, C. (1988). Gray-codes for partial match and range queries. IEEE
Transactions on Software Engineering, 14(10):1381-1393.
Faloutsos, C., Equitz, W., Flickner, M., Niblack, W., Petkovic, D., and Bar-
ber, R. (1994). Efficient and effective querying by image content. Journal of
Intelligent Information Systems, 3(3):231-262.
Faloutsos, C. and Jagadish, H. (1992). On B-tree indices for skewed distri-
butions. In Proc. 18th International Conference on Very Large Databases,
pages 363-374.
Faloutsos, C. and Roseman, S. (1989). Fractals for secondary key retrieval. In
Proc. 1989 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, pages 247-252.
Finkel, R. A. and Bentley, J. L. (1974). Quad trees: A data structure for retrieval
on composite keys. Acta Informatica, 4:1-9.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani,
M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. (1995). Query
by image and video content: The QBIC system. IEEE Computer, 28(9):23-
32.
Fox, E., editor (1995). Communications of the ACM, volume 38(4). Special
issue on Digital Libraries.
Fox, E. and Shaw, J. (1993). Combination of multiple searches. In Proc. Text
Retrieval Conference (TREC), pages 35-44. National Institute of Standards
and Technology Special Publication 500-215.
Frakes, W. and Baeza-Yates, R., editors (1992). Information Retrieval: Data
Structures and Algorithms. Prentice-Hall.
Francos, J. M., Meiri, A. Z., and Porat, B. (1993). A unified texture model based
on a 2-D Wold-like decomposition. IEEE Transactions on Signal Processing,
pages 2665-2678.
Freeston, M. (1987). The BANG file: A new kind of grid file. In Proc. 1987
ACM SIGMOD International Conference on Management of Data, pages
260-269.
Freeston, M. (1995). A general solution of the n-dimensional B-tree problem.
In Proc. 1995 ACM SIGMOD International Conference on Management of
Data, pages 80-91.
French, C. (1995). One size fits all. In Proc. 1995 ACM SIGMOD International
Conference on Management of Data, pages 449-450.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1987). An algorithm for
finding best matches in logarithmic expected time. ACM Transactions on
Mathematical Software, 3(3):209-226.
Gallager, R. and Van Voorhis, D. (1975). Optimal source codes for geometrically
distributed integer alphabets. IEEE Transactions on Information Theory,
IT-21(2):228-230.
Gargantini, I. (1982). An effective way to represent quadtrees. Communications
of the ACM, 25(12):905-910.
Goh, C. H., Lu, H., Ooi, B. C., and Tan, K. L. (1996). Indexing temporal data
using B+-tree. Data and Knowledge Engineering, 18:147-165.
Goldfarb, C. (1990). The SGML Handbook. Oxford University Press.
Golomb, S. (1966). Run-length encodings. IEEE Transactions on Information
Theory, IT-12(3):399-401.
Gong, Y., Chua, H. C., and Guo, X. (1995). Image indexing and retrieval based
on color histograms. In Proc. 2nd Multimedia Modeling Conference, pages
115-126.
Gonnet, G. and Baeza-Yates, R. (1991). Handbook of data structures and algo-
rithms. Addison-Wesley, second edition.
Gonnet, G. and Tompa, F. (1987). Mind your grammar: A new approach
to modeling text. In Proc. 13th International Conference on Very Large
Databases, pages 339-346.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Com-
puting Surveys, 25(2):73-170.
Gravano, L., Chang, C., Garcia-Molina, H., and Paepcke, A. (1997). STARTS:
Stanford proposal for internet meta-searching. In Proc. 1997 ACM SIGMOD
International Conference on Management of Data.
Greene, D. (1989). An implementation and performance analysis of spatial data
access methods. In Proc. 5th International Conference on Data Engineering,
pages 606-615.
Gudivada, V. and Raghavan, R. (1995). Design and evaluation of algorithms
for image retrieval by spatial similarity. ACM Transactions on Information
Systems, 13(1):115-144.
Gunadhi, H. and Segev, A. (1993). Efficient indexing methods for temporal
relations. IEEE Transactions on Knowledge and Data Engineering, 5(3):496-
509.
Gunther, O. (1988). Efficient Structures for Geometric Data Management.
Springer-Verlag.
Gunther, O. (1989). The design of the cell tree: An object-oriented index struc-
ture for geometric databases. In Proc. 5th International Conference on Data
Engineering, pages 598-605.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching.
In Proc. 1984 ACM SIGMOD International Conference on Management of
Data, pages 47-57.
Hall, P. and Dowling, G. (1980). Approximate string matching. Computing
Surveys, 12(4):381-402.
Harman, D. (1991). How effective is suffixing? Journal of the American Society
for Information Science, 42(1):7-15.
Harman, D., editor (1992). Proc. TREC Text Retrieval Conference. National
Institute of Standards Special Publication 500-207.
Harman, D., editor (1995a). Information Processing & Management, volume
31(3). Special Issue: The Second Text Retrieval Conference (TREC-2).
Harman, D. (1995b). Overview of the second text retrieval conference (TREC-
2). Information Processing & Management, 31(3):271-289.
Harman, D. and Candela, G. (1990). Retrieving records from a gigabyte of
text on a minicomputer using statistical ranking. Journal of the American
Society for Information Science, 41(8):581-589.
Hearst, M. and Plaunt, C. (1993). Subtopic structuring for full-length document
access. In Proc. 16th ACM-SIGIR International Conference on Research and
Development in Information Retrieval, pages 59-68.
Henrich, A., Six, H.-W., and Widmayer, P. (1989a). The LSD tree: spatial access
to multidimensional point and non-point objects. In Proc. 15th International
Conference on Very Large Data Bases, pages 45-53.
Henrich, A., Six, H.-W., and Widmayer, P. (1989b). Paging binary trees with
external balancing. In Proc. International Workshop on Graphtheoretic Con-
cepts in Computer Science.
Hinrichs, K. (1985). Implementation of the grid file: Design concepts and ex-
perience. BIT, 25:569-592.
Hinrichs, K. and Nievergelt, J. (1983). The grid file: A data structure designed
to support proximity queries on spatial objects. In Proc. International Work-
shop on Graphtheoretic Concepts in Computer Science, pages 100-113.
Hirata, K., Hara, Y., Takano, H., and Kawasaki, S. (1996). Content-oriented
integration in hypermedia systems. In Proc. 1996 ACM Conference on Hy-
pertext, pages 11-21.
Hoel, E. and Samet, H. (1992). A qualitative comparison study of data struc-
tures for large line segment databases. In Proc. 1992 ACM SIGMOD Inter-
national Conference on Management of Data, pages 205-214.
Hsu, W., Chua, T. S., and Pung, H. K. (1995). An integrated color-spatial
approach to content-based image retrieval. In Proc. 3rd ACM Multimedia
Conference, pages 305-313.
Hutflesz, A., Six, H.-W., and Widmayer, P. (1990). The R-file: An efficient
access structure for proximity queries. In Proc. 6th International Conference
on Data Engineering, pages 372-379.
Iannizzotto, G., Vita, L., and Puliafito, A. (1996). A new shape distance for
content-based image retrieval. In Proc. 3rd Multimedia Modeling Conference,
pages 371-386.
Imielinski, T. and Badrinath, B. (1994). Mobile wireless computing: solutions
and challenges in data management. Communications of the ACM, 37(10):18-
28.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994a). Energy efficient
indexing on air. In Proc. 1994 ACM SIGMOD International Conference on
Management of Data, pages 25-36.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994b). Power efficient
filtering of data on air. In Proc. 4th International Conference on Extending
Database Technology, pages 245-258.
Ioka, M. (1989). A method of defining the similarity of images on the basis of
color information. Technical Report RT-0030, IBM Tokyo Research Lab.
Jaffar, J. and Lassez, J. (1987). Constraint logic programming. In Proc. 14th
Annual ACM Symposium on Principles of Programming Languages, pages
111-119.
Jagadish, H. V. (1991). A retrieval technique for similar shape. In Proc. 1991
ACM SIGMOD International Conference on Management of Data, pages
208-217.
Jea, K. F. and Lee, Y. C. (1990). Building efficient and flexible feature-based
indexes. Information Systems, 16(6):653-662.
Jenq, P., Woelk, D., Kim, W., and Lee, W. (1990). Query processing in dis-
tributed ORION. In Proc. 2nd International Conference on Extending Data-
base Technology, pages 169-187.
Jensen, C. S., editor (1994). A consensus glossary of temporal database concepts.
Jensen, C. S., Mark, L., and Roussopoulos, N. (1991). Incremental implemen-
tation model for relational databases with transaction time. IEEE Transac-
tions on Knowledge and Data Engineering, 3(4):461-473.
Jensen, C. S. and Snodgrass, R. (1994). Temporal specialization and generaliza-
tion. IEEE Transactions on Knowledge and Data Engineering, 6(6):954-974.
Jhingran, A. (1991). Precomputation in a complex object environment. In Proc.
7th IEEE International Conference on Data Engineering, pages 652-659.
Jiang, P., Ooi, B. C., and Tan, K. L. (1996). An experimental study of
temporal indexing structures. Unpublished manuscript, available at
http://www.iscs.nus.sg/ooibc/tp.ps.
Kabanza, F., Stevenne, J., and Wolper, P. (1990). Handling infinite temporal
data. In Proc. 9th ACM SIGACT-SIGMOD-SIGART Symposium on Prin-
ciples of Database Systems, pages 392-403.
Kanellakis, P., Kuper, G., and Revesz, P. (1995). Constraint query languages.
Journal of Computer and System Sciences, 51(1):26-52.
Kanellakis, P. and Ramaswamy, S. (1996). Indexing for data models with con-
straints and classes. Journal of Computer and System Sciences, 52(3) :589-
612.
Kaszkiel, M. and Zobel, J. (1997). Passage retrieval revisited. In Proc. 20th
A CM-SIGIR International Conference on Research and Development in In-
formation Retrieval.
Kemper, A., Kilger, C., and Moerkotte, G. (1994). Function materialization
in object bases: Design, realization and evaluation. IEEE Transactions on
Knowledge and Data Engineering, 6(4):587-608.
Kemper, A. and Kossmann, D. (1995). Adaptable pointer swizzling strategies in
object bases: Design, realization, and quantitative analysis. VLDB Journal,
4(3):519-566.
Kemper, A. and Moerkotte, G. (1992). Access support relations: An indexing
method for object bases. Information Systems, 17(2):117-145.
Kent, A., Sacks-Davis, R., and Ramamohanarao, K. (1990). A signature file
scheme based on multiple organizations for indexing very large text databases.
Journal of the American Society for Information Science, 41(7):508-534.
Kilger, C. and Moerkotte, G. (1994). Indexing multiple sets. In Proc. 20th
International Conference on Very Large Data Bases, pages 180-191.
Kim, K., Kim, W., Woelk, D., and Dale, A. (1988). Acyclic query processing in
object-oriented databases. In Proc. 7th International Conference on Entity-
Relationship Approach, pages 329-346.
Kim, W. (1989). A model of queries for object-oriented databases. In Proc. 15th
International Conference on Very Large Data Bases, pages 423-432.
Kim, W., Kim, K., and Dale, A. (1989). Indexing techniques for object-oriented
databases. In Object-Oriented Concepts, Databases, and Applications, pages
371-394. Addison-Wesley.
Knaus, D., Mittendorf, E., Schauble, P., and Sheridan, P. (1995). Highlighting
relevant passages for users of the interactive SPIDER retrieval system. In
Proc. 4th Text Retrieval Conference (TREC), pages 233-243.
Knuth, D. E. (1973). Fundamental Algorithms: The art of computer program-
ming, Volume 1. Addison-Wesley.
Knuth, D. E. and Wegner, L. M., editors (1992). Proc. IFIP TC2/WG2.6 2nd
Working Conference on Visual Database Systems. North-Holland.
Kolovson, C. (1993). Indexing techniques for historical databases. In Temporal
Databases: Theory, Design and Implementation, Chapter 17, pages 418-432.
A. Benjamin/Cummings.
Kolovson, C. and Stonebraker, M. (1991). Segment indexes: Dynamic indexing
techniques for multi-dimensional interval data. In Proc. 1991 ACM SIGMOD
International Conference on Management of Data, pages 138-147.
Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., and Protopapas, Z. (1996).
Fast nearest neighbor search in medical image databases. In Proc. 22nd In-
ternational Conference on Very Large Data Bases, pages 215-226.
Koubarakis, M. (1994). Database models for infinite and indefinite temporal
information. Information Systems, 19(2):141-173.
Kriegel, H. (1984). Performance comparison of index structures for multi-key
retrieval. In Proc. 1984 ACM SIGMOD International Conference on Man-
agement of Data, pages 186-196.
Kriegel, H. and Seeger, B. (1986). Multidimensional order preserving linear
hashing with partial expansion. In Proc. 1st International Conference on
Database Theory, pages 203-220.
Kriegel, H. and Seeger, B. (1988). PLOP-Hashing: A grid file without directory.
In Proc. 4th International Conference on Data Engineering, pages 369-376.
Kroll, B. and Widmayer, P. (1994). Distributing a search tree among a growing
number of processors. In Proc. 1994 ACM SIGMOD International Confer-
ence on Management of Data, pages 265-276.
Kukich, K. (1992). Techniques for automatically correcting words in text. Com-
puting Surveys, 24(4):377-440.
Kumar, A., Tsotras, V. J., and Faloutsos, C. (1995). Access methods for bi-
temporal databases. In Proc. International Workshop on Temporal Databases,
pages 235-254.
Kunii, T., editor (1989). Proc. IFIP TC2/WG2.6 1st Working Conference on
Visual Database Systems. North-Holland.
Larson, P. (1978). Dynamic hashing. BIT, 18:184-201.
Lassez, J. (1990). Querying constraints. In Proc. 9th ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems, pages 288-298.
Lee, D. T. and Wong, C. K. (1977). Worst-case analysis for region and partial
region searches in multidimensional binary search trees and balanced quad
trees. Acta Informatica, 9(1):23-29.
Lee, S. Y. and Hsu, F. J. (1990). 2D C-String: A new spatial knowledge repre-
sentation for image database system. Pattern Recognition, 23(10):1077-1087.
Lee, S. Y. and Leng, C. (1989). Partitioned signature files: Design issues and
performance evaluation. ACM Transactions on Office Information Systems,
7(2):158-180.
Lee, S. Y., Yang, M. C., and Chen, J. W. (1992). Signature file as a spatial
filter for iconic image database. Journal of Visual Languages and Computing,
3(4):373-397.
Lee, W. (1989). Mobile cellular telecommunication systems. McGraw-Hill.
Lin, K., Jagadish, H., and Faloutsos, C. (1995). The TV-tree: An index struc-
ture for high-dimensional data. VLDB Journal, 3(4):517-542.
Litwin, W. (1980). Linear hashing: A new tool for file and table addressing. In
Proc. 6th International Conference on Very Large Data Bases, pages 212-
223.
Litwin, W. and Neimat, M. (1996). k-RP*S: A scalable distributed data struc-
ture for high-performance multi-attribute access. In Proc. 4th Conference on
Parallel and Distributed Information Systems, pages 35-46.
Litwin, W., Neimat, M., and Schneider, D. (1993a). LH* - Linear hashing for
distributed files. In Proc. 1993 ACM SIGMOD International Conference on
Management of Data, pages 327-336.
Litwin, W., Neimat, M., and Schneider, D. (1994). RP*: A family of order-
preserving scalable data structures. In Proc. 20th International Conference
on Very Large Data Bases, pages 342-353.
Litwin, W., Neimat, N. A., and Schneider, D. A. (1993b). LH* - Linear hashing
for distributed files. In Proc. 1993 ACM SIGMOD International Conference
on Management of Data, pages 327-336.
Lomet, D. (1992). A review of recent work on multi-attribute access methods.
ACM SIGMOD Record, 21(3):56-63.
Lomet, D. and Salzberg, B. (1989). Access methods for multiversion data.
In Proc. 1989 ACM SIGMOD International Conference on Management of
Data, pages 315-324.
Lomet, D. and Salzberg, B. (1990a). The hB-tree: A multiattribute indexing
method with good guaranteed performance. ACM Transactions on Database
Systems, 15(4):625-658.
Lomet, D. and Salzberg, B. (1990b). The performance of a multiversion ac-
cess method. In Proc. 1990 ACM SIGMOD International Conference on
Management of Data, pages 353-363.
Lomet, D. and Salzberg, B. (1993). Transaction time databases. In Temporal
Databases: Theory, Design and Implementation, Chapter 16, pages 388-417.
A. Benjamin/Cummings.
Lovins, J. (1968). Development of a stemming algorithm. Mechanical Transla-
tion and Computational Linguistics, 11(1-2):22-31.
Low, C. C., Ooi, B. C., and Lu, H. (1992). H-trees: A dynamic associative search
index for OODB. In Proc. 1992 ACM SIGMOD International Conference on
Management of Data, pages 134-143.
Lu, H. and Ooi, B. C. (1993). Spatial indexing: Past and future. IEEE Bulletin
on Data Engineering, 16(3):16-21.
Lu, H., Ooi, B. C., and Tan, K. L. (1994). Efficient image retrieval by color con-
tents. In Proc. 1994 International Conference on Applications of Databases,
pages 95-108.
Lu, W. and Han, J. (1992). Distance-associated join indices for spatial range
search. In Proc. 8th International Conference on Data Engineering, pages
284-292.
Lucarella, D. (1988). A document retrieval system based upon nearest neighbor
searching. Journal of Information Science, 14:25-33.
Maier, D. and Stein, J. (1986). Indexing in an object-oriented database. In
Proc. IEEE Workshop on Object-Oriented DBMSs, pages 171-182.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet
representation. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 11(7):2091-2110.
Manber, U. and Wu, S. (1994). GLIMPSE: A tool to search through entire file
systems. In Proc. 1994 Winter USENIX Technical Conference, pages 23-32.
Maragos, P. (1989). Pattern spectrum and multiscale shape representation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):701-
716.
Maragos, P. and Schafer, R. W. (1986). Morphological skeleton representation
and coding of binary images. IEEE Transactions on Acoustics, Speech, and
Signal Processing, 34:1228-1244.
Matsuyama, T., Hao, L., and Nagao, M. (1984). A file organization for geo-
graphic information systems based on spatial proximity. International Jour-
nal on Computer Vision, Graphics, and Image Processing, 26(3):303-318.
Mehlhorn, K. and Tsakalidis, A. (1990). Data structures. In Handbook of The-
oretical Computer Science, Volume A, pages 301-341. Elsevier Publisher.
Mehrotra, R. and Gary, J. E. (1993). Feature-based retrieval of similar shapes.
In Proc. 9th International Conference on Data Engineering, pages 108-115.
Melton, J. (1996). An SQL3 snapshot. In Proc. 12th International Conference
on Data Engineering, pages 666-672.
Mittendorf, E. and Schauble, P. (1994). Document and passage retrieval based
on hidden Markov models. In Proc. 17th ACM-SIGIR International Confer-
ence on Research and Development in Information Retrieval, pages 318-327.
Miyahara, M. and Yoshida, Y. (1989). Mathematical transform of (R,G,B) color
data to Munsell (H,Y,C) color data. Journal of the Institute of Television
Engineers, 43(10):1129-1136.
Moffat, A. and Zobel, J. (1996). Self-indexing inverted files for fast text re-
trieval. ACM Transactions on Information Systems, 14(4):349-379.
Moffat, A., Zobel, J., and Sacks-Davis, R. (1994). Memory efficient ranking.
Information Processing & Management, 30(6):733-744.
Morrison, D. (1968). PATRICIA - Practical algorithm to retrieve information
coded in alphanumeric. Journal of the ACM, 15(4):514-534.
Morton, G. (1966). A computer oriented geodetic data base and a new technique
in file sequencing. Technical report, IBM Ltd., Ottawa, Canada.
Moss, J. (1992). Working with the persistent objects: to swizzle or not to swiz-
zle. IEEE Transactions on Software Engineering, 18(8):657-673.
Nabil, M., Ngu, A. H. H., and Shepherd, J. (1996). Picture similarity re-
trieval using the 2D projection interval representation. IEEE Transactions
on Knowledge and Data Engineering, 8(4):533-539.
Nagy, G. (1985). Image databases. Image and Vision Computing, 3(3):111-117.
Nascimento, M. A. (1996). Efficient Indexing of Temporal Database via B+-
trees. PhD thesis, School of Engineering and Applied Science, Southern
Methodist University.
Nelson, R. and Samet, H. (1987). A population analysis for hierarchical data
structures. In Proc. 1987 ACM SIGMOD International Conference on Man-
agement of Data, pages 270-277.
Ng, V. and Kameda, T. (1993). Concurrent accesses to R-trees. In Proc. 3rd
International Symposium on Advances in Spatial Databases, pages 142-161.
Niblack, W., Equitz, R. B. W., Glasman, M. F. E., Petkovic, D., Yanker, P.,
and Faloutsos, C. (1993). The QBIC project: Query images by content using
color, texture and shape. In Storage and Retrieval for Image and Video
Databases, Volume 1908, pages 173-187.
REFERENCES 239
Nievergelt, J. and Hinrichs, K. (1985). Storage and access structures for geo-
metric data bases. In Proc. International Conference on Foundations of Data
Organization, pages 335-345.
Nievergelt, J., Hinterberger, H., and Sevcik, K. C. (1984). The grid file: An
adaptable, symmetric multikey file structure. ACM Transactions on Database
Systems, 9(1):38-71.
Nievergelt, J. and Widmayer, P. (1997). Spatial data structures: Concepts and
design choices. In Algorithmic Foundations of GIS, pages 1-61. Springer-
Verlag.
Nori, A. (1996). Object relational database management systems (tutorial notes).
In Proc. 22nd International Conference on Very Large Data Bases.
ObjectStore (1995). ObjectStore C++ - User Guide Release 4.0.
Ogle, V. E. and Stonebraker, M. (1995). Chabot: Retrieval from a relational
database of images. IEEE Computer, 28(9):40-48.
Ohsawa, Y. and Sakauchi, M. (1983). The BD-tree: A new n-dimensional data
structure with highly efficient dynamic characteristics. In Proc. IFIP Congress,
pages 539-544.
Ohsawa, Y. and Sakauchi, M. (1990). A new tree type data structure with
homogeneous nodes suitable for a very large spatial database. In Proc. 6th
International Conference on Data Engineering, pages 296-303.
O'Neil, P. and Graefe, G. (1995). Multi-table joins through bitmapped join
indices. ACM SIGMOD Record, 24(3):8-11.
O'Neil, P. and Quass, D. (1997). Improved query performance with variant
indexes. In Proc. 1997 ACM SIGMOD International Conference on Man-
agement of Data.
Ooi, B. C. (1990). Efficient Query Processing in Geographic Information Systems. Springer-Verlag.
Ooi, B. C., McDonell, K. J., and Sacks-Davis, R. (1987). Spatial kd-tree: An
indexing mechanism for spatial databases. In Proc. 11th International Con-
ference on Computer Software and Applications.
Ooi, B. C., Sacks-Davis, R., and Han, J. (1993). Spatial indexing structures,
unpublished manuscript, available at http://www.iscs.nus.edu.sg/ooibc/.
Ooi, B. C., Sacks-Davis, R., and McDonell, K. J. (1991). Spatial indexing by bi-
nary decomposition and spatial bounding. Information Systems, 16(2):211-
237.
Ooi, B. C., Tan, K. L., and Chua, T. S. (1997). Fast image retrieval using color-
spatial information. Technical report, Department of Information Systems
and Computer Science, NUS, Singapore.
Orenstein, J. A. (1982). Multidimensional tries for associative searching. Infor-
mation Processing Letters, 14(4):150-157.
Orenstein, J. A. (1986). Spatial query processing in an object-oriented database
system. In Proc. 1986 ACM SIGMOD International Conference on Manage-
ment of Data, pages 326-336.
Orenstein, J. A. (1990). A comparison of spatial query processing techniques
for native and parameter spaces. In Proc. 1990 ACM SIGMOD International
Conference on Management of Data, pages 343-352.
Orenstein, J. A. and Merrett, T. H. (1984). A class of data structures for
associative searching. In Proc. 1984 ACM-SIGACT-SIGMOD Symposium
on Principles of Database Systems, pages 181-190.
Ouksel, M. and Scheuermann, P. (1981). Multidimensional B-trees: Analysis of
dynamic behavior. BIT, 21:401-418.
Overmars, M. H. and Leeuwen, J. V. (1982). Dynamic multi-dimensional data
structures based on Quad- and KD- trees. Acta Informatica, 17:267-285.
Owolabi, O. and McGregor, D. (1988). Fast approximate string matching. Soft-
ware - Practice and Experience, 18:387-393.
Papadias, D., Theodoridis, Y., Sellis, T., and Egenhofer, M. J. (1995). Topological
relations in the world of minimum bounding rectangles: A study with
R-trees. In Proc. 1995 ACM SIGMOD International Conference on Management
of Data, pages 92-103.
Paredaens, J. (1995). Spatial databases, the final frontier. In Proc. 5th Inter-
national Conference on Database Theory, pages 14-31.
Paredaens, J., Van den Bussche, J., and Van Gucht, D. (1994). Towards a theory
of spatial database queries. In Proc. 13th ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, pages 279-288.
Persin, M. (1996). Efficient implementation of text retrieval techniques. Mas-
ter's thesis, Department of Computer Science, RMIT, Melbourne, Australia.
Persin, M., Zobel, J., and Sacks-Davis, R. (1996). Filtered document retrieval
with frequency-sorted indexes. Journal of the American Society for Infor-
mation Science, 47(10):749-764.
Pfaltz, J., Berman, W., and Cagley, E. (1980). Partial-match retrieval using
indexed descriptor files. Communications of the ACM, 23(9):522-528.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
Preparata, F. and Shamos, M. (1985). Computational Geometry: An Introduc-
tion. Springer-Verlag.
Rabitti, F. and Savino, P. (1991). Image query processing based on multi-level
signatures. In Proc. 14th ACM-SIGIR International Conference on Research
and Development in Information Retrieval, pages 305-314.
Rabitti, F. and Stanchev, P. (1989). GRIM-DBMS: A graphical image database
management system. In Proc. IFIP TC2/WG2.6 1st Working Conference on
Visual Database Systems, pages 415-430.
Ramaswamy, S. (1997). Efficient indexing for constraints and temporal databases.
In Proc. 6th International Conference on Database Theory, pages 419-431.
Ramaswamy, S. and Kanellakis, P. (1995). OODB indexing by class-division.
In Proc. 1995 ACM SIGMOD International Conference on Management of
Data, pages 139-150.
Roberts, C. (1979). Partial-match retrieval via the method of superimposed
codes. Proceedings of the IEEE, 67(12):1624-1642.
Robinson, J. T. (1981). The K-D-B-tree: A search structure for large multi-dimensional
dynamic indexes. In Proc. 1981 ACM SIGMOD International
Conference on Management of Data, pages 10-18.
Rosenberg, J. B. (1985). Geographical data structures compared: A study of
data structures supporting region queries. IEEE Transactions on Computer
Aided Design, 4(1):53-67.
Rotem, D. (1991). Spatial join indices. In Proc. 7th International Conference
on Data Engineering, pages 500-509.
Rotem, D. and Segev, A. (1987). Physical organization of temporal data. In
Proc. 3rd International Conference on Data Engineering, pages 547-553.
Sacks-Davis, R., Kent, A., and Ramamohanarao, K. (1987). Multi-key access
methods based on superimposed coding techniques. ACM Transactions on
Database Systems, 12(4):655-696.
Sagiv, Y. (1986). Concurrent operations on B*-trees with overtaking. Journal
of Computer and System Sciences, 33(2):275-296.
Salomone, S. (1995). Radio days. Byte, Special Issue on Mobile Computing,
page 107.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis,
and Retrieval of Information by Computer. Addison-Wesley.
Salton, G., Allan, J., and Buckley, C. (1993). Approaches to passage retrieval
in full text information systems. In Proc. 16th ACM-SIGIR International
Conference on Research and Development in Information Retrieval, pages
49-58.
Salton, G. and McGill, M. (1983). Introduction to Modern Information Re-
trieval. McGraw-Hill.
Salzberg, B. (1994). On indexing spatial and temporal data. Information Sys-
tems, 19(6):447-465.
Samet, H. (1989). The Design and Analysis of Spatial Data Structures. Addison-Wesley.
Scheuermann, P. and Ouksel, M. (1982). Multidimensional B-trees for associa-
tive searching in database systems. Information Systems, 7(2):123-137.
Seeger, B. and Kriegel, H. (1988). Techniques for design and implementation of
efficient spatial access methods. In Proc. 14th International Conference on
Very Large Data Bases, pages 360-371.
Sellis, T., Roussopoulos, N., and Faloutsos, C. (1987). The R+-tree: A dynamic
index for multi-dimensional objects. In Proc. 13th International Conference
on Very Large Data Bases, pages 507-518.
Serra, J. (1988). Image Analysis and Mathematical Morphology, Volume 2, The-
oretical Advances. Academic Press.
Shamos, M. I. and Bentley, J. L. (1978). Optimal algorithm for structuring
geographic data. In Proc. 1st International Advanced Study Symposium on
Topological Data Structure for Geographic Information Systems.
Sharma, K. D. and Rani, R. (1985). Choosing optimal branching factors for
k-d-B trees. Information Systems, 10(1):127-134.
Shaw, G. and Zdonik, S. (1989). An object-oriented query algebra. In Proc.
2nd International Workshop on Database Programming Languages, pages
103-112.
Shen, H., Ooi, B. C., and Lu, H. (1994). The TP-index: A dynamic and ef-
ficient indexing mechanism for temporal databases. In Proc. 10th Interna-
tional Conference on Data Engineering, pages 274-281.
Sheng, S., Chandrasekaran, A., and Broderson, R. (1992). A portable multimedia
terminal for personal communications. IEEE Communications
Magazine, pages 64-75.
Shidlovsky, B. and Bertino, E. (1996). A graph-theoretic approach to indexing
in object-oriented databases. In Proc. 12th International Conference on Data
Engineering, pages 230-237.
Snodgrass, R. (1987). The temporal query language TQuel. ACM Transaction
on Database Systems, 12(2):247-298.
Sreenath, B. and Seshadri, S. (1994). The hcC-tree: An efficient index structure
for object oriented databases. In Proc. 20th International Conference on
Very Large Data Bases, pages 203-213.
Straube, D. and Ozsu, M. T. (1995). Query optimization and execution plan
generation in object-oriented data management systems. IEEE Transactions
on Knowledge and Data Engineering, 7(2):210-227.
Swain, M. J. (1993). Interactive indexing into image database. In Storage and
Retrieval for Image and Video Databases, Volume 1908, pages 95-103.
Tamminen, M. (1982). Efficient spatial access to a data base. In Proc. 1982
ACM SIGMOD International Conference on Management of Data, pages
200-206.
Tamura, H., Mori, S., and Yamawaki, T. (1978). Textural features corresponding
to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6):460-472.
Tamura, H. and Yokoya, N. (1984). Image database systems: A survey. Pattern
Recognition, 17(1):29-43.
Thom, J., Zobel, J., and Grima, B. (1995). Design of indexes for structured
document databases. Technical Report TR-95-8, Collaborative Information
Technology Research Institute, RMIT and The University of Melbourne.
Treisman, A. and Paterson, R. (1980). A feature integration theory of attention.
Cognitive Psychology, 12:97-136.
Tsay, J. J. and Li, H. C. (1994). Lock-free concurrent tree structures for mul-
tiprocessor systems. In Proc. 1994 International Conference on Parallel and
Distributed Systems, pages 544-549.
Valduriez, P. (1986). Optimization of complex database queries using join in-
dices. IEEE Bulletin on Data Engineering, 9(4):10-16.
Valduriez, P. (1987). Join indices. ACM Transactions on Database Systems,
12(2):218-246.
van Rijsbergen, C. (1979). Information Retrieval. Butterworths, second edition.
Whang, K. and Krishnamurthy, R. (1985). Multilevel grid files. Technical Report
RC-11516, IBM Thomas J. Watson Research Center.
Wilkinson, R. (1994). Effective retrieval of structured documents. In Proc. 17th
ACM-SIGIR International Conference on Research and Development in In-
formation Retrieval, pages 311-317.
Witten, I., Moffat, A., and Bell, T. (1994). Managing Gigabytes: Compressing
and Indexing Documents and Images. Van Nostrand Reinhold.
Wu, S. and Manber, U. (1992). Agrep - A fast approximate pattern-matching
tool. In Proc. 1992 Winter USENIX Technical Conference, pages 153-162.
Xie, Z. and Han, J. (1994). Join index hierarchy for supporting efficient navi-
gation in object-oriented databases. In Proc. 20th International Conference
on Very Large Data Bases, pages 522-533.
Zdonik, S. and Maier, D. (1989). Fundamentals of object-oriented databases.
In Readings in Object-Oriented Database Management Systems.
Zhou, Z. and Venetsanopoulos, A. N. (1988). Morphological skeleton represen-
tation and shape recognition. In Proc. IEEE 2nd International Conference
on ASSP, pages 948-951.
Zobel, J. and Dart, P. (1995). Finding approximate matches in large lexicons.
Software - Practice and Experience, 25(3):331-345.
Zobel, J. and Dart, P. (1996). Phonetic string matching: Lessons from infor-
mation retrieval. In Proc. 19th ACM-SIGIR International Conference on
Research and Development in Information Retrieval, pages 166-173.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1995a). Inverted files versus
signature files for text indexing. Technical Report TR-95-5, Collaborative
Information Technology Research Institute, RMIT and The University of
Melbourne.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1996). Guidelines for pre-
sentation and comparison of indexing techniques. ACM SIGMOD Record,
25(3):10-15.
Zobel, J., Moffat, A., and Sacks-Davis, R. (1992). An efficient indexing technique
for full-text database systems. In Proc. 18th International Conference
on Very Large Data Bases, pages 352-362.
Zobel, J., Moffat, A., and Sacks-Davis, R. (1993). Searching large lexicons for
partially specified terms using compressed inverted files. In Proc. 19th International
Conference on Very Large Data Bases, pages 290-301.
Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995b). Efficient
retrieval of partial documents. Information Processing & Management,
31(3):361-377.
About the Authors
Elisa Bertino is full professor of computer science in the Department of Com-
puter Science of the University of Milan. She has also been on the faculty
in the Department of Computer and Information Science of the University of
Genova, Italy. She has been a visiting researcher at the IBM Research Labo-
ratory (now Almaden) in San Jose, and at the Microelectronics and Computer
Technology Corporation in Austin, Texas. She is or has been on the editorial
board of the following scientific journals: IEEE Transactions on Knowledge and
Data Engineering, Theory and Practice of Object Systems Journal, Journal of
Computer Security, Very Large Database Systems Journal, Parallel and Distributed
Databases, and the International Journal of Information Technology. She
is currently serving as Program co-chair of the 1998 International Conference
on Data Engineering.
Beng Chin Ooi received his B.Sc. and Ph.D. in computer science from
Monash University, Australia, in 1985 and 1989 respectively. He was with
the Institute of Systems Science, Singapore, from 1989 to 1991 before joining
the Department of Information Systems and Computer Science at the National
University of Singapore, Singapore. His research interests include database
performance issues, database UI, multi-media databases and applications, and
GIS. He is the author of a monograph "Efficient Query Processing in Geographic
Information Systems" (Springer-Verlag, 1990). He has published many confer-
ence and journal papers and serves as a PC member in a number of international
conferences. He is currently on the editorial board of the following scientific
journals: International Journal of Geographical Information Systems, Journal
on Universal Computer Science, Geoinformatica and International Journal of
Information Technology.
Ron Sacks-Davis obtained his Ph.D. from the University of Melbourne in
1977. He currently holds the position of Professor and Institute Fellow at
RMIT. He has published widely in the areas of database management and
information retrieval and is an editor-in-chief of the International Journal on
Very Large Databases (VLDB) and a member of the VLDB Endowment Board.
Kian-Lee Tan received his Ph.D. in computer science, from the National
University of Singapore in 1994. He is currently a lecturer in the Depart-
ment of Information Systems and Computer Science, National University of
Singapore. He has published numerous papers in the areas of multimedia in-
formation retrieval, wireless computing, query processing and optimization in
multiprocessor and distributed systems.
Justin Zobel obtained his Ph.D. in computer science from the University
of Melbourne, where he was a member of staff from 1984 to 1990. He then
joined the Department of Computer Science at RMIT, where he is now a senior
lecturer. He has published widely in the areas of information retrieval, text
databases, indexing, compression, string matching, and genomic databases.
Boris Shidlovsky received his M.Sc. in applied mathematics and Ph.D. in
computer science from the University of Kiev, Ukraine, in 1984 and 1990 respec-
tively. He was an assistant professor in the Department of Computer Science
at the University of Kiev. From 1993 to 1996, he was with the Department of
Computer Engineering at the University of Salerno, Italy, and is currently a
member of the scientific staff at the RANK XEROX Research Center, Grenoble, France.
His research interests include the design and analysis of algorithms, indexing and
query optimization in advanced database systems, and the processing of
semistructured data on the Web.
Barbara Catania has been enrolled in a Ph.D. program in computer science at the
University of Milano, Italy, since November 1993. She received the Laurea
degree in computer science with honours from the University of Genova, Italy,
in 1993. She has also been a visiting researcher at the European Computer-Industry
Research Center, Munich, Germany, where she joined the ESPRIT
project IDEA, sponsored by the European Economic Community. Her main
research interests include constraint databases, deductive databases, and indexing
techniques for constraint and object-oriented databases.
Index
O2, 4
x-tree, 25
(1, m) index, 201
1-dimensional generalized tuple, 218
2-dimensional generalized tuple, 218, 222
access support relation, 16, 19
access time, 199, 200, 202
active mode, 196
address calculation, 191
adjacency
querying on, 154
aggregation, 7, 29
aggregation graph, 3
agrep, 213
ALL selection, 217, 222
Altavista, 211
AP-tree, 125-127
Archie, 211
B+-tree, 9, 20, 30
of color-spatial index, 91
with linear order, 129-132
B-tree, 2
for lexicons, 159
battery, 196, 198, 200
bcast wait, 199
BD-tree, 54-55
binary join index, 10, 206
bitemporal database, 114
bitemporal interval tree, 140
bitemporal relation, 118
bitmap, 207
bitmap join index, 209
bitslices, 169
Boolean queries
for text, 154-155
Boolean query evaluation
for text, 169-170
bounding rectangle, 40
bounding structure, 41
broadcast channel, 197
broadcasted data, 196
bucket, 198
BV-tree, 63-64
caching, 36
CG-tree, 24
CH-tree, 21
color, 90
CIE L*u*v, 108
color histogram, 90
Munsell HYC, 92
color index
of color-spatial index, 94
color-spatial index
for image, 91
compression
of inverted lists, 161-164
configurable index, 200, 202
constraint, 214
constraint programming, 214
constraint theory, 216, 218
content-based index
for image, 80
content-based retrieval
for image, 78
convex theory, 218
cosine measure, 155-156
data warehouse, 204
decision support system, 203
delta code, 162
detail table, 205
diagonal corner query, 219
dimension table, 205
distributed index, 201
distributed RAM, 189
doze mode, 196
dual plane, 222
dual R-tree, 140
dumb terminal, 195
dynamic interval management, 219
effectiveness
of ranking, 152
Elias codes, 161-162
emerging applications, 185-224
Excite, 211
EXIST selection, 217, 218
extension, 215
fact constellation schema, 205
fact table, 205
feature
color, 90
color-spatial, 91
semantic object, 87
shape, 84
spatial relationship, 88
texture, 89
feature extraction, 78
feature-based indexing, 78
file image, 191
file image adjustment, 192
filtering, 222
for ranking, 172
fixed host, 194
flexible indexing, 202
gamma code, 162
GBD-tree, 54-55
GemStone, 4
generalized 1-dimensional indexing, 218
generalized concordance lists
for text, 178
generalized database, 215
generalized relation, 215
generalized relational model, 215
generalized tuple, 215
Glimpse, 213
global index, 187
Golomb codes, 162-163
Gopher, 211
grid file, 64-67
H-tree, 23
Harvest, 214
hashing, 2
hB-tree, 49-51
hcC-tree, 24
image database, 77-112
image database system, 78
architecture, 79
index construction
for text, 164-166
index update
for text, 166-168
indexing
of documents, 153
indexing graph, 9
information retrieval, 152, 155-157
InfoSeek,211
infrared technology, 194
inheritance, 5, 20, 29
inheritance graph, 4
inheritance hierarchy, 20
interleaving
for ranking, 173
interval B-tree, 127-129
interval tree, 220
inverse document frequency, 156
inverted file
for image, 83
inverted index, 212
for text, 157-168
inverted lists
for text, 158, 160-164
join explicit, 5
join implicit, 5
join index, 10
join index hierarchy, 19
K-D-B-tree, 48-49
kd-tree, 46-48
non-homogeneous, 47
lexicons, 158-160
limiting accumulators
for ranking, 172
linear hashing, 189
local index, 187
locational keys, 70-71
LSD-tree, 55-56
mapping table, 158
materialization technique, 204
meta-block tree, 220
metasearcher, 213
method invocation, 3, 36
minimum bounding polybox, 224
minimum bounding rectangle, 41, 223
mobile host, 194
mobile network, 194
multi-index, 9, 17
navigational access, 2
nested attribute, 3
nested index, 14, 17
nested predicate, 5, 10, 29
nested-inherited index, 29
non-configurable index, 200
NST-tree, 126
object identifier, 3
object query language, 2, 5
object-oriented data model, 1, 3
object-oriented database, 1-38
object-relational database, 1
ObjectStore, 4
OLAP, 203
OQL, 2
ordinal number, 207
palmtop, 195
partition, 186
partitioning degree, 186
passage retrieval, 180-181
path, 7
path index, 15, 17
path instantiation, 7, 15
path splitting, 18
path-expression, 5
pattern matching
for text, 179-180
perceptually similar color, 108
phonetic matching
for text, 180
PLOP-hashing, 68-69
point location, 222
pointer swizzling, 2, 36
precomputed join, 207
probe time, 199
projection, 16
proximity
querying on, 154
query expansion
for text, 181
query graph, 6
query precomputation, 204
R+-tree, 25, 60-63
R*-tree, 59-60
R-file, 67-68
R-tree, 25, 56-59, 132-137
2-D R-tree, 133
3-D R-tree, 133
ranked query evaluation
for text, 170-175
ranking, 155-157
relevance
judgments, 152
of documents, 152
satellite network, 194
SC-index, 21
search engine, 211
semantic object, 87
sequential search, 212
set-oriented access, 2
SGML,175
shape, 84
signature file
for image, 84
for text, 168-169
of color-spatial index, 105
similarity, 155, 156
measures, 79, 82, 155
approximate match, 82
Euclidean distance, 83
exact match, 82
signature-based, 107
signature-based (weighted), 109
skd-tree, 51-54
SMAT
of color-spatial index, 96
snowflake schema, 205
spatial access method
for image, 83
spatial database, 39-75, 215
spatial index taxonomy, 42
non-overlapping, 43
overlapping, 44
transformation approach, 43
spatial operators, 39
adjacency, 40
containment, 40
intersection, 39, 41
spatial query processing, 40
approximation, 40
multi-step strategy, 42
spatial relationship, 88
SQL, 1
SQL-3, 2
stabbing query, 219
star schema, 205
stemming
of words, 154
stopwords, 156, 175
storage on the air, 196
structured documents, 175-178
indexing of, 177-178
suffixing
of words, 154
summary table, 205
temporal database, 113-149, 215
temporal index, 121-142
B+-tree with linear order, 129
temporal query, 119-121
bitemporal key-range time-slice, 120
bitemporal time-slice, 120
key, 120
key-range time-slice, 120
time-slice, 119
inclusion, 119
intersection, 119
point, 120
time-slice query
containment, 120
text database, 151-182
text indexing, 157-169
text passage retrieval, 180-181
texture, 89
time
lifespan, 115
time span, 115
transaction time, 114
valid time, 114
time index, 123-125
TP-index, 137-139
transaction time, 114-116
traversal strategy, 6
TREC, 159
TSB-tree, 122-123
tuning time, 200, 202
unary code, 161-162
valid time, 114, 116-117
variable-bit codes, 161-163
WAIS,211
walkstation, 195
Web Crawler, 214
Web navigation, 210
Web robot, 214
Webcrawler, 211
weight, 221
weight-balanced B-tree, 220
Whois,211
Whois++, 211
wireless interface, 194
WWW Worm, 214

More Related Content

What's hot (20)

PPTX
Object relational and extended relational databases
Suhad Jihad
 
PPTX
Data Redundancy & Update Anomalies
Jens Patel
 
ODP
Introduction to MongoDB
Dineesha Suraweera
 
PPTX
oops concept in java | object oriented programming in java
CPD INDIA
 
PPT
9. Document Oriented Databases
Fabio Fumarola
 
PDF
Dbms interview questions
Soba Arjun
 
PDF
Introduction to HBase
Avkash Chauhan
 
PPTX
ORDBMS.pptx
Anitta Antony
 
PPTX
Data Vault Overview
Empowered Holdings, LLC
 
PDF
librados
Patrick McGarry
 
PPT
Object Oriented Database Management System
Ajay Jha
 
PPT
Database Chapter 1
shahadat hossain
 
PDF
Paper: Oracle RAC and Oracle RAC One Node on Extended Distance (Stretched) Cl...
Markus Michalewicz
 
PPTX
PPL, OQL & oodbms
ramandeep brar
 
PPTX
Understanding LINQ in C#
MD. Shohag Mia
 
PDF
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
PgDay.Seoul
 
PDF
NoSQL et Big Data
acogoluegnes
 
PPTX
Inheritance in JAVA PPT
Pooja Jaiswal
 
PPTX
OOPs in Java
Ranjith Sekar
 
Object relational and extended relational databases
Suhad Jihad
 
Data Redundancy & Update Anomalies
Jens Patel
 
Introduction to MongoDB
Dineesha Suraweera
 
oops concept in java | object oriented programming in java
CPD INDIA
 
9. Document Oriented Databases
Fabio Fumarola
 
Dbms interview questions
Soba Arjun
 
Introduction to HBase
Avkash Chauhan
 
ORDBMS.pptx
Anitta Antony
 
Data Vault Overview
Empowered Holdings, LLC
 
librados
Patrick McGarry
 
Object Oriented Database Management System
Ajay Jha
 
Database Chapter 1
shahadat hossain
 
Paper: Oracle RAC and Oracle RAC One Node on Extended Distance (Stretched) Cl...
Markus Michalewicz
 
PPL, OQL & oodbms
ramandeep brar
 
Understanding LINQ in C#
MD. Shohag Mia
 
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
PgDay.Seoul
 
NoSQL et Big Data
acogoluegnes
 
Inheritance in JAVA PPT
Pooja Jaiswal
 
OOPs in Java
Ranjith Sekar
 

Similar to Indexing techniques for advanced database systems (20)

PDF
11.challenging issues of spatio temporal data mining
Alexander Decker
 
PDF
10.1.1.118.1099
Suresh Nannuri
 
PDF
Web_Mining_Overview_Nfaoui_El_Habib
El Habib NFAOUI
 
PDF
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
cscpconf
 
PDF
Algorithm for calculating relevance of documents in information retrieval sys...
IRJET Journal
 
PDF
An Advanced IR System of Relational Keyword Search Technique
paperpublications3
 
PDF
Research on ontology based information retrieval techniques
Kausar Mukadam
 
PDF
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr...
IJEACS
 
PDF
Az31349353
IJERA Editor
 
PPTX
2. DATABASE MODELING_Database Fundamentals.pptx
Javier Daza
 
PDF
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
IJMER
 
DOC
View the Microsoft Word document.doc
butest
 
DOC
View the Microsoft Word document.doc
butest
 
DOC
View the Microsoft Word document.doc
butest
 
DOC
Chapter1_C.doc
butest
 
PDF
A Systems Approach To Qualitative Data Management And Analysis
Michele Thomas
 
PDF
Spatio-Temporal Database and Its Models: A Review
IOSR Journals
 
PDF
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
PDF
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
PDF
Urm concept for sharing information inside of communities
Karel Charvat
 
11.challenging issues of spatio temporal data mining
Alexander Decker
 
10.1.1.118.1099
Suresh Nannuri
 
Web_Mining_Overview_Nfaoui_El_Habib
El Habib NFAOUI
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
cscpconf
 
Algorithm for calculating relevance of documents in information retrieval sys...
IRJET Journal
 
An Advanced IR System of Relational Keyword Search Technique
paperpublications3
 
Research on ontology based information retrieval techniques
Kausar Mukadam
 
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr...
IJEACS
 
Az31349353
IJERA Editor
 
2. DATABASE MODELING_Database Fundamentals.pptx
Javier Daza
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
IJMER
 
View the Microsoft Word document.doc
butest
 
View the Microsoft Word document.doc
butest
 
View the Microsoft Word document.doc
butest
 
Chapter1_C.doc
butest
 
A Systems Approach To Qualitative Data Management And Analysis
Michele Thomas
 
Spatio-Temporal Database and Its Models: A Review
IOSR Journals
 
Indexing techniques for advanced database systems

  • 2. The Kluwer International Series on ADVANCES IN DATABASE SYSTEMS Series Editor Ahmed K. Elmagarmid Purdue University West Lafayette, IN 47907 Other books in the Series: DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis by Alexander Thomasian ISBN: 0-7923-9741-X TIME-CONSTRAINED TRANSACTION MANAGEMENT: Real-Time Constraints in Database Transaction Systems by Nandit R. Soparkar, Henry F. Korth, Abraham Silberschatz ISBN: 0-7923-9752-5 SEARCHING MULTIMEDIA DATABASES BY CONTENT by Christos Faloutsos ISBN: 0-7923-9777-0 REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS by Abdelsalam A. Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava ISBN: 0-7923-9800-9 VIDEO DATABASE SYSTEMS: Issues, Products, and Applications by Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed ISBN: 0-7923-9872-6 DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS by Nabil R. Adam and Aryya Gangopadhyay ISBN: 0-7923-9924-2 INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES by Thomas A. Mueck and Martin L. Polaschek ISBN: 0-7923-9971-4
  • 3. INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS by Elisa Bertino University of Milano, Italy Beng Chin Ooi National University of Singapore, Singapore Ron Sacks-Davis RMIT, Australia Kian-Lee Tan National University of Singapore, Singapore Justin Zobel RMIT, Australia Boris Shidlovsky Grenoble Laboratory, France Barbara Catania University of Milano, Italy SPRINGER SCIENCE+BUSINESS MEDIA, LLC
  • 4. Library of Congress Cataloging-in-Publication Data A C.I.P. Catalogue record for this book is available from the Library of Congress. ISBN 978-1-4613-7856-3 ISBN 978-1-4615-6227-6 (eBook) DOI 10.1007/978-1-4615-6227-6 Copyright © 1997 Springer Science+Business Media New York Originally published by Kluwer Academic Publishers in 1997 Softcover reprint of the hardcover 1st edition 1997 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC. Printed on acid-free paper.
  • 5. Contents Preface VII 1. OBJECT-ORIENTED DATABASES 1 1.1 Object-oriented data model and query language 3 1.2 Index organizations for aggregation graphs 7 1.3 Index organizations for inheritance hierarchies 20 1.4 Integrated organizations 29 1.5 Caching and pointer swizzling 36 1.6 Summary 38 2. SPATIAL DATABASES 39 2.1 Query processing using approximations 40 2.2 A taxonomy of spatial indexes 42 2.3 Binary-tree based indexing techniques 46 2.4 B-tree based indexing techniques 56 2.5 Cell methods based on dynamic hashing 64 2.6 Spatial objects ordering 70 2.7 Comparative evaluation 71 2.8 Summary 73 3. IMAGE DATABASES 77 3.1 Image database systems 78 3.2 Indexing issues and basic mechanisms 80 3.3 A taxonomy on image indexes 84 3.4 Color-spatial hierarchical indexes 91 3.5 Signature-based color-spatial retrieval 105 3.6 Summary 109
  • 6. INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS 4. TEMPORAL DATABASES 113 4.1 Temporal databases 114 4.2 Temporal queries 119 4.3 Temporal indexes 121 4.4 Experimental study 142 4.5 Summary 148 5. TEXT DATABASES 151 5.1 Querying text databases 152 5.2 Indexing 157 5.3 Query evaluation 169 5.4 Refinements to text databases 175 5.5 Summary 181 6. EMERGING APPLICATIONS 185 6.1 Indexing techniques for parallel and distributed databases 186 6.2 Indexing issues in mobile computing 194 6.3 Indexing techniques for data warehousing systems 203 6.4 Indexing techniques for the Web 210 6.5 Indexing techniques for constraint databases 214 References 225 Index 247
  • 7. Preface Database management systems are widely accepted as a standard tool for manipulating large volumes of data on secondary storage. To enable fast access to stored data according to its content, databases use structures known as indexes. While indexes are optional, as data can always be located by exhaustive search, they are the primary means of reducing the volume of data that must be fetched and processed in response to a query. In practice large database files must be indexed to meet performance requirements. Recent years have seen explosive growth in use of new database applications such as CAD/CAM systems, spatial information systems, and multimedia information systems. The needs of these applications are far more complex than traditional business applications. They call for support of objects with complex data types, such as images and spatial objects, and for support of objects with wildly varying numbers of index terms, such as documents. Traditional indexing techniques such as the B-tree and its variants do not efficiently support these applications, and so new indexing mechanisms have been developed. As a result of the demand for database support for new applications, there has been a proliferation of new indexing techniques. The need for a book addressing indexing problems in advanced applications is evident. For practitioners and database and application developers, this book explains best practice, guiding selection of appropriate indexes for each application. For researchers, this book provides a foundation for development of new and more robust indexes. For newcomers, this book is an overview of the wide range of advanced indexing techniques. The book consists of six self-contained chapters, each handled by area experts: Chapters 1 and 6 by Bertino, Catania, and Shidlovsky, Chapters 2, 3 and 4 by Ooi and Tan, and Chapter 5 by Sacks-Davis and Zobel.
Each of the first five chapters discusses indexing problems and techniques for a different
  • 8. database application; the last chapter discusses indexing problems in emerging applications. In Chapter 1 we discuss indexes and query evaluation for object-oriented databases. Complex objects, variable-length objects, large objects, versions, and long transactions cannot be supported efficiently by relational database systems. The inadequacy of relational databases for these applications has provided the impetus for database researchers to develop object-oriented database systems, which capture sophisticated semantics and provide a close model of real-world applications. Object-oriented databases are a confluence of two technologies: databases and object-oriented programming languages. However, the concepts of object, method, message, aggregation and generalization introduce new problems to query evaluation. For example, aggregation allows an object to be retrieved through its composite objects or based on the attribute values of its component objects, while generalization allows an object to be retrieved as an instance of its superclass. Spatial data is large in volume and rich in structures and relationships. Queries that involve the use of spatial operators (such as spatial intersection and containment) are common. Operations involving these operators are expensive to compute, compared to operations such as join, and indexes are essential to reduction of query processing costs. Indexing in a spatial database is problematic because spatial objects can have non-zero extent and are associated with spatial coordinates, and many-to-many spatial relationships exist between spatial objects. Search is based not only on attribute values, but also on spatial properties. In Chapter 2, we address issues related to spatial indexing and analyze several promising indexing methods. Conventional databases only store the current facts of the organization they model.
Changes in the real world are reflected by overwriting out-of-date data with new facts. Monitoring these changes and past values of the data is, however, useful for tracking historical trends and time-varying events. In temporal databases, facts are not deleted but instead are associated with times, which are stored with the data to allow retrieval based on temporal relationships. To support efficient retrieval based on time, temporal indexes have been proposed. In Chapter 4, we describe and review temporal indexing mechanisms. In large collections of images, a natural and useful way to retrieve image data is by queries based on the contents of images. Such image-based queries can be specified symbolically by describing their contents in terms of image features such as color, shape, texture, objects, and the spatial relationships between them; or pictorially using sketches or example images. Supporting content-based retrieval of image data is a difficult problem and embraces technologies including image processing, user interface design, and database management.
  • 9. To provide efficient content-based retrieval, indexes based on image features are required. We consider feature-based indexing techniques in Chapter 3. Text data without uniform structure forms the main bulk of data in corporate repositories, digital libraries, legal and court databases, and document archives such as newspaper databases. Retrieval of documents is achieved through matching words and phrases in document and query, but for documents Boolean-style matching is not usually effective. Instead, approximate querying techniques are used to identify the documents that are most likely to be relevant to the query. Effectiveness can be enhanced by use of transformations such as stemming and methodologies such as feedback. To support fast text searching, however, indexing techniques such as special-purpose inverted files are required. In Chapter 5, we examine indexes and query evaluation for document databases. In the first five chapters we cover the indexing topics of greatest importance today. There are however many database applications that make use of indexing but do not fall into one of the above five areas, such as data warehousing, which has recently become an active research topic due to both its complexity and its commercial potential. Queries against warehouses require large numbers of joins and the calculation of aggregate functions. Another example is the use of indexes to minimize energy consumption in portable equipment used in a highly mobile environment. In Chapter 6 we discuss indexing mechanisms for several such emerging database applications. We are grateful to the many people and organizations who helped with this book, and with the research that made it possible. In particular we thank Timothy Arnold-Moore, Tat Seng Chua, Winston Chua, Cheng Hian Goh, Peng Jiang, Marcin Kaszkiel, Alan Kent, Ramamohanarao Kotagiri, Wan-Meng Lee, Alistair Moffat, Michael Persin, Yong Tai Tan, and Ross Wilkinson.
Dave Abel, Jiawei Han and Jürg Nievergelt read earlier drafts of several chapters, and provided helpful comments. We are also grateful to the Multimedia Database Systems group at RMIT, the RMIT Department of Computer Science, the Australian Research Council and the Department of Information Systems and Computer Science at the National University of Singapore. Elisa Bertino Barbara Catania Beng Chin Ooi Ron Sacks-Davis Boris Shidlovsky Kian-Lee Tan Justin Zobel
  • 10. 1 OBJECT-ORIENTED DATABASES There has been a growing acceptance of the object-oriented data model as the basis of next generation database management systems (DBMSs). Both pure object-oriented DBMSs (OODBMSs) and object-relational DBMSs (ORDBMSs) have been developed based on object-oriented concepts. Object-relational DBMSs, in particular, extend the SQL language by incorporating all the concepts of the object-oriented data model. A large number of products for both categories of DBMS are today available. In particular, all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996]. The widespread adoption of the object-oriented data model in the database area has been driven by the requirements posed by advanced applications, such as CAD/CAM, software engineering, workflow systems, geographic information systems, telecommunications, and multimedia information systems, just to name a few. These applications require effective support for the management of complex objects. For example, a typical advanced application requires handling text, graphics, bitmap pictures, sounds and animation files. Other crucial requirements derive from the evolutionary nature of applications and include multiple versions of the same data and long-lived transactions. The use of an object-oriented data model satisfies many of the above requirements. For E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
  • 11. example, an application's complex objects can be directly represented by the model, and therefore there is no need to flatten them into tuples, as when relational DBMSs are used. Moreover, the encapsulation property supports the integration of packages for handling complex objects. However, because of the increased complexity of the data model, and of the additional operational requirements, such as versions or long transactions, the design of an OODBMS or an ORDBMS poses several issues, both on the data model and languages, and on the architecture [Kim et al., 1989, Nori, 1996, Zdonik and Maier, 1989]. An important issue is related to the efficient support of both navigational and set-oriented accesses. Both types of accesses occur in applications typical of OODBMSs and ORDBMSs and both must be efficiently supported. Navigational access is based on traversing object references; a typical example is represented by graph traversal. Set-oriented access is based on the use of a high-level, declarative query language. Object query languages have today reached a certain degree of consolidation. A standard query language, known as OQL (Object Query Language), has been proposed as part of the ODMG standardization effort [Bartels, 1996, Cattell, 1993], whereas the SQL-3 standard, still under development, is expected to include all major object modeling concepts [Melton, 1996]. The two means of access are often complementary. A query selects a set of objects. The retrieved objects and their components are then accessed by using navigational capabilities [Bertino and Martino, 1993]. A brief summary of query languages is presented in Section 1.1. Different strategies and techniques are required to support the two access modalities above.
Efficient navigational access is based on caching techniques and transformation of navigation pointers into main-memory addresses (swizzling), whereas efficient execution of queries is achieved by the allocation of suitable access structures and the use of sophisticated query optimizers. Access structures typically used in relational DBMSs are based on variations of the B-tree structure [Comer, 1979] or on hashing techniques. An index is maintained on an attribute or combination of attributes of a relation. Since an object-oriented data model has many differences from the relational model, suitable indexing techniques must be developed to efficiently support object-oriented query languages. In this chapter we survey some of the issues associated with indexing techniques and we describe proposed approaches. Also, we briefly discuss caching and pointer swizzling techniques; for more details on these techniques we refer the reader to [Kemper and Kossmann, 1995]. In the remainder of this chapter, we cast our discussion in terms of the object-oriented data model typical of OODBMSs, because most of the work on indexing techniques has been developed in the framework of OODBMSs. However, most of the discussion applies to ORDBMSs as well.
  • 12. The remainder of the chapter is organized as follows. Section 1.1 presents an overview of the basic concepts of object-oriented data models, query languages, and query processing. For the purpose of the discussion, we consider an object-oriented database organized along two dimensions: aggregation and inheritance. Indexing techniques for each of those dimensions are discussed in Sections 1.2 and 1.3, respectively. Section 1.4 presents integrated organizations, supporting queries along both aggregation and inheritance graphs. Section 1.5 briefly discusses method precomputation, caching and swizzling. Finally, Section 1.6 presents some concluding remarks. 1.1 Object-oriented data model and query language An object-oriented data model is based on a number of concepts [Bertino and Martino, 1993, Cattell, 1993, Zdonik and Maier, 1989]: • Each real-world entity is modeled by an object. Each object is associated with a unique identifier (called an OID) that makes the object distinguishable from any other object in the database. OODBMSs provide objects with persistent and immutable identifiers: an object's identifier does not change even if the object modifies its state. • Each object has a set of instance attributes and methods (operations). The value of an attribute can be an object or a set of objects. The set of attributes of an object and the set of methods represent the object structure and behavior, respectively. • The attribute values represent the object's state. This state is accessed or modified by sending messages to the object to invoke the corresponding methods. • Objects sharing the same structure and behavior are grouped into classes. A class represents a template for a set of similar objects. Each object is an instance of some class. A class definition consists of a set of instance attributes (or simply attributes) and methods. The domain of an attribute may be an arbitrary class.
The definition of a class C results in a directed graph (called aggregation graph) of the classes rooted at C. An attribute of any class on an aggregation graph is a nested attribute of the class at the root of the graph. Objects, instances of a given class, have a value for each attribute defined by the class. All methods defined in a class can be invoked on the objects, instances of the class. • A class can be defined as a specialization of one or more classes. A class defined as a specialization is called a subclass and inherits attributes and methods from its superclasses.
  • 13. Figure 1.1. An object-oriented database schema. The specialization relationship among classes organizes them in an inheritance graph which is orthogonal to the aggregation graph. An example of an object-oriented database schema, which will be used as a running example, is graphically represented in Figure 1.1. In the graphical representation, a box represents a class. Within each box there are the names of the attributes of the class. Names labeled with a star denote multi-valued attributes. Two types of arcs are used in the representation. A simple arc from a class C to a class C' denotes that C' is the domain of an attribute of C. A bold arc from a class C to a class C' indicates that C is a superclass of C'. In the remainder of the discussion, we make the following assumptions. First, we consider classes as having the extensional notion of the set of their instances. Second, we make the assumption that the extent of a class does not include the instances of its subclasses. Queries are therefore made against classes. Note that in several systems, such as for example GemStone [Bretl et al., 1989], O2 [Deux, 1990], and ObjectStore [ObjectStore, 1995], classes do not have mandatory associated extensions. Therefore, applications have to use collections, or sets, to group instances of the same class. Different collections may be defined on the same class. Therefore, increased flexibility is achieved, even if the data model becomes more complex. When collections are the basis for queries, indexes are allocated on collections and not on classes [Maier and Stein, 1986]. In some cases, even though indexes are on collections, the definitions of the classes of the indexed objects must verify certain constraints for the index to be allocated on the collections. For example, in GemStone an attribute with
  • 14. an index allocated on it must be defined as a constrained attribute in the class definition, that is, a domain must be specified for the attribute. Similarly, ObjectStore requires that an attribute on which an index has to be allocated be declared as indexable in the class definition. As we discussed earlier, most OODBMSs provide an associative query language [Bancilhon and Ferran, 1994, Cluet et al., 1989, Kim, 1989, Shaw and Zdonik, 1989]. Here we summarize those features that most influence indexing techniques: • Nested predicates Because of objects' nested structures, most object-oriented query languages allow objects to be restricted by predicates on both nested and non-nested attributes of objects. An example of a query against the database schema of Figure 1.1 is: Retrieve the authors of books published by Kluwer. (Q1) This query contains the nested predicate "published by Kluwer". Nested predicates are usually expressed using path-expressions. For example, the nested predicate in the above query can be expressed as Author.books.publisher.name = "Kluwer". • Inheritance A query may apply to just a class, or to a class and to all its subclasses. An example of a query against the database schema of Figure 1.1 is: Retrieve all instances of class Book and all its subclasses published in 1991. (Q2) The above query applies to all the classes in the hierarchy rooted at class Book. • Methods A method can be used in a query as a derived attribute method or a predicate method. A derived attribute method has a function comparable to that of an attribute, in that it returns an object (or a value) to which comparisons can be applied. A predicate method returns the logical constants True or False. The value returned by a predicate method can then participate in the evaluation of the Boolean expression that determines whether the object satisfies the query.
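Path-expressions such as Author.books.publisher.name lend themselves to a simple recursive evaluation. As a rough illustration only (the classes and data below are hand-written Python stand-ins in the spirit of the Figure 1.1 schema, not the API or syntax of any OODBMS), query Q1 can be evaluated by walking the path from each Author object:

```python
# Illustrative sketch: evaluating the nested predicate of query Q1
# (Author.books.publisher.name = "Kluwer") by navigating object references.
# Class and attribute names follow Figure 1.1; the instances are invented.

class Publisher:
    def __init__(self, name):
        self.name = name

class Book:
    def __init__(self, title, publisher):
        self.title = title
        self.publisher = publisher

class Author:
    def __init__(self, name, books):
        self.name = name
        self.books = books  # multi-valued attribute

def eval_path(obj, path):
    """Resolve a path expression like 'books.publisher.name', fanning out
    over multi-valued attributes; returns the list of reachable values."""
    values = [obj]
    for attr in path.split("."):
        next_values = []
        for v in values:
            v = getattr(v, attr, None)
            if v is None:  # null link: a right-partial instantiation, dropped
                continue
            next_values.extend(v if isinstance(v, list) else [v])
        values = next_values
    return values

kluwer = Publisher("Kluwer")
addison = Publisher("Addison-Wesley")
authors = [
    Author("A[3]", [Book("B[2]", kluwer), Book("B[3]", kluwer)]),
    Author("A[4]", [Book("B[1]", addison)]),
]

# Q1: retrieve the authors of books published by Kluwer.
result = [a.name for a in authors
          if "Kluwer" in eval_path(a, "books.publisher.name")]
print(result)  # ['A[3]']
```

This is a forward traversal: each Author is instantiated and the path is followed toward Publisher, the opposite of the index-based reverse strategies discussed later in the chapter.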
A distinction often made in object-oriented query languages is between implicit joins (also called functional joins), deriving from the hierarchical nesting of objects, and explicit joins, similar to the relational join, where two objects are
  • 15. explicitly compared on the values of their attributes. Note that some query languages only support implicit joins. The motivation for this limitation is based on the argument that in relational systems joins are mostly used to recompose entities that were decomposed for normalization [Bretl et al., 1989] and to support relationships among entities. In object-oriented data models there is no need to normalize objects, since these models directly support complex objects and multivalued attributes. Moreover, relationships among entities are supported through object references; thus the same function that joins provide in the relational model to support relationships is provided more naturally by path-expressions. It therefore appears that in OODBMSs there is no strong need for explicit joins, especially if path-expressions are provided. An example of a path-expression (or simply path) is "Book.publisher.name", denoting the nested attribute "publisher.name" of class Book. The evaluation of a query with nested predicates may require the traversal of objects along aggregation graphs [Bertino, 1990, Jenq et al., 1990, Kim et al., 1988, Graefe, 1993, Straube and Ozsu, 1995]. Because in OODBMSs most joins are implicit joins along aggregation graphs, it is possible to exploit this fact by defining techniques that precompute implicit joins. We discuss these techniques in Section 1.2. In order to discuss the various index organizations, we need to summarize some topics concerning query processing and execution strategies. A query can be conveniently represented by a query graph [Kim et al., 1989]. The query execution strategies vary along two dimensions. The first dimension concerns the strategy used to traverse the query graph. Two basic class traversal strategies can be devised: • Forward traversal: the first class visited is the target class of the query (root of the query graph).
The remaining classes are traversed starting from the target class in any depth-first order. The forward traversal strategy for query Q1 is (Author Book Publisher). • Reverse traversal: the traversal of the query graph begins at the leaves and proceeds bottom-up along the graph. The reverse traversal strategy for query Q1 is (Publisher Book Author). The second dimension concerns the technique used to retrieve instances of the classes that are traversed for evaluating the query. There are two basic strategies for retrieving data from a visited class. The first strategy, called nested-loop, consists of instantiating separately each qualified instance of a class. The instance attributes are examined for qualification, if there are simple predicates on the instance attributes. If the instance qualifies, it is passed to its parent node (in the case of reverse traversal) or to its child node (in the case of forward traversal). The second strategy, called sort-domain, consists of instantiating all qualified instances of a class at once. Then all qualifying instances
  • 16. are passed to their parent or child node (depending on the traversal strategy used). The combination of the graph traversal strategies with instance retrieval strategies results in different query execution strategies. We refer the reader to [Bertino, 1990, Graefe, 1993, Jenq et al., 1990, Kim et al., 1988, Straube and Ozsu, 1995] for details on query processing strategies for object-oriented databases. 1.2 Index organizations for aggregation graphs In this section, we first present some preliminary definitions. We then present a number of indexing techniques that support efficient executions of implicit joins along aggregation graphs. Therefore, these indexing techniques can be used to efficiently implement class traversal strategies. Definition. Given an aggregation graph H, a path P is defined as C1.A1.A2.....An (n ≥ 1) where: • C1 is a class in H; • A1 is an attribute of class C1; • Ai is an attribute of a class Ci in H, such that Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n; len(P) = n denotes the length of the path; class(P) = {C1} ∪ {Ci | Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n} denotes the set of the classes along the path; dom(P) denotes the class domain of attribute An of class Cn; two classes Ci and Ci+1, 1 ≤ i ≤ n-1, are called neighbor classes in the path. □ A path is simply a branch in a given aggregation graph. Examples of paths in the database schema in Figure 1.1 are: • P1: Author.books.publisher.name len(P1)=3, class(P1)={Author, Book, Publisher}, dom(P1)=string • P2: Book.year len(P2)=1, class(P2)={Book}, dom(P2)=integer • P3: Organization.staff.books.publisher.name len(P3)=4, class(P3)={Organization, Author, Book, Publisher}, dom(P3)=string The concept of path is closely associated with that of path instantiation. A path instantiation is a sequence of objects found by instantiating a given path.
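The path properties len(P), class(P), and dom(P) defined above can be computed mechanically once the aggregation graph is known. A minimal sketch, assuming a hand-encoded dictionary form of the Figure 1.1 schema (both the encoding and the helper name are ours, introduced only for illustration):

```python
# Sketch: computing len(P), class(P), and dom(P) for a path over an
# aggregation graph. The SCHEMA encoding is an assumption made for
# illustration; class and attribute names follow Figure 1.1.

SCHEMA = {
    "Organization": {"staff": "Author"},
    "Author":       {"name": "string", "books": "Book"},
    "Book":         {"title": "string", "year": "integer",
                     "publisher": "Publisher"},
    "Publisher":    {"name": "string"},
}

def path_properties(path):
    """path is 'C1.A1.A2...An'; returns (len(P), class(P), dom(P))."""
    first, *attrs = path.split(".")
    classes, current = [first], first
    for attr in attrs:
        domain = SCHEMA[current][attr]
        if domain in SCHEMA:   # the domain is itself a class on the graph
            classes.append(domain)
        current = domain       # otherwise a primitive domain (string, ...)
    return len(attrs), classes, current

print(path_properties("Author.books.publisher.name"))
# (3, ['Author', 'Book', 'Publisher'], 'string')
print(path_properties("Book.year"))
# (1, ['Book'], 'integer')
print(path_properties("Organization.staff.books.publisher.name"))
# (4, ['Organization', 'Author', 'Book', 'Publisher'], 'string')
```

The three calls reproduce the properties of paths P1, P2, and P3 above.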
  • 17. The objects in Figure 1.2 are instances of the classes shown in Figure 1.1. The following are example instantiations of the path P3: • PI1 = O[1].A[4].B[1].P[2].Addison-Wesley (PI1 is shown in Figure 1.2 by arrows connecting the instances in PI1) • PI2 = O[2].A[3].B[2].P[4].Kluwer • PI3 = O[2].A[3].B[3].P[4].Kluwer Figure 1.2. Instances of classes of the database schema in Figure 1.1. The above path instantiations are all complete, that is, they start with an instance belonging to the first class of path P3 (that is, Organization), contain an instance for each class found along the path, and end with an instance of the class domain of the path (Publisher.name). Besides the complete instantiations, a path may have also partial instantiations. For example, A[2].B[4].P[2].Addison-Wesley is a left-partial instantiation, that is, its first component is not an instance of the first class of the path (Organization in the example), but rather an instance of a class following the first class along the path (Author in the example). Similarly, a right-partial instantiation of a path ends with an object which is not an instance of the class domain of the path. In other words, a right-partial instantiation is such that the last object in the instantiation contains a null value for the attribute referenced in the path. O[4] is a right-partial instantiation of path P3.
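The distinction between complete and right-partial instantiations can be made concrete with a small sketch. The object identifiers and links below are invented in the spirit of Figure 1.2 (the exact links are assumptions; a None-valued attribute plays the role of a null link):

```python
# Sketch: enumerating instantiations of a path over linked objects.
# OIDs and links are invented in the spirit of Figure 1.2; a None link
# makes the enumeration stop early, yielding a right-partial instantiation.

DB = {
    "O[1]": {"staff": ["A[4]"]},
    "O[4]": {"staff": None},            # null staff attribute
    "A[4]": {"books": ["B[1]"]},
    "B[1]": {"publisher": "P[2]"},
    "P[2]": {"name": "Addison-Wesley"},
}

def instantiations(oid, attrs):
    """Yield instantiations of the path starting at object oid."""
    if not attrs:
        yield [oid]
        return
    value = DB.get(oid, {}).get(attrs[0])
    if value is None:                   # right-partial: stops at this object
        yield [oid]
        return
    for v in (value if isinstance(value, list) else [value]):
        for tail in instantiations(v, attrs[1:]):
            yield [oid] + tail

# Path P3 = Organization.staff.books.publisher.name
P3_ATTRS = ["staff", "books", "publisher", "name"]
for inst in instantiations("O[1]", P3_ATTRS):
    print(".".join(inst))   # O[1].A[4].B[1].P[2].Addison-Wesley  (complete)
for inst in instantiations("O[4]", P3_ATTRS):
    print(".".join(inst))   # O[4]  (right-partial)
```

Starting the enumeration from an object other than an Organization instance (say, an Author) would produce the left-partial instantiations described above.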
  • 18. The last relevant concept we introduce here is the concept of indexing graph. The concept of indexing graphs (IG) was introduced in [Shidlovsky and Bertino, 1996] as an abstract representation of a set of indexes allocated along a path P. Given a path P = C1.A1.A2.....An, an indexing graph contains n + 1 vertices, one for each class Ci in the path plus an additional vertex denoting the class domain Cn.An of the path, and a set of directed arcs. A directed arc from vertex Ci to vertex Cj indicates that the indexing organization supports a direct association between each instance of Ci and instances of Cj obtained by traversing the path from the instance of Ci to class Cj. Note that if Ci and Cj are neighbor classes, the indexing organization materializes an implicit join between the classes. 1.2.1 Basic techniques Multi-index This organization was the first proposed for indexing aggregation graphs. It is based on allocating a B+-tree index on each class traversed by the path. Therefore, given a path P = C1.A1.A2.....An, a multi-index [Maier and Stein, 1986] is defined as a set of n simple indexes (called index components) I1, I2, ..., In, where Ii is an index defined on Ci.Ai, 1 ≤ i ≤ n. All indexes I1, I2, ..., In-1 are identity indexes, that is, they have OIDs as key values. Only the comparison operators == (identical to) and ~~ (not identical to) are supported on an identity index. The last index In can be either an identity index or an equality index, depending on the domain of An. An equality index is a regular index, like the ones used in relational DBMSs, whose key values are primitive objects, such as numbers or characters. An equality index supports comparison operators such as = (equal to), ~ (different from), <, ≤, >, ≥. As an example consider path P1=Author.books.publisher.name. There will be three indexes allocated for this path, as illustrated in Figure 1.3.
In the figure, each index is represented in tabular form. An index entry is represented as a row in the table. The first element of such a row is a key value (given in boldface), and the second element is the set of OIDs of objects holding this key value for the indexed attribute. The first index, I1, is allocated on Author.books; similarly, indexes I2 and I3 are allocated on Book.publisher and Publisher.name, respectively. Note that in the first index (I1) the special key value Null is used to record a right-partial instantiation. Therefore, the multi-index allows determining all path instantiations having null values for some attributes along the path. By contrast, determining left-partial instantiations does not require any special key value.
Figure 1.3. Multi-index for path P1 = Author.books.publisher.name (I1 on Author.books, I2 on Book.publisher, I3 on Publisher.name).

Under this organization, solving a nested predicate requires scanning a number of indexes equal to the path length. For example, to select all authors whose books were published by Kluwer (query Q1), the following steps are executed:

1. A look-up of index I3 with key value "Kluwer"; the result is {P[4]}.
2. A look-up of index I2 with key value P[4]; the result is {B[2], B[3]}.
3. A look-up of index I1 with key values B[2] and B[3]; the result is {A[3]}, which is the result of the query.

Therefore, under this organization the retrieval operation is performed by first scanning the last index allocated on the path. Then the results of this index lookup are used as keys for a search on the index preceding the last one in the path, and so forth until the first index is scanned. Therefore, this organization only supports reverse traversal strategies. Its major advantage, compared to others we describe later on, is the low update cost.

The indexing graph for the multi-index is as follows. Let P be a path of length n. The graph contains an arc from class Ci+1 to class Ci, for i = 1, ..., n. The IG for P3 = Organization.staff.books.publisher.name is shown in Figure 1.4.a.

Join index

The notion of join index was introduced to efficiently perform joins in relational databases [Valduriez, 1987]. However, the join index has also been used to efficiently implement complex objects. A binary equijoin index is defined as follows: given two relations R and S and attributes A and B, respectively from R and S, a binary equijoin index is the set BJI = {(ri, sk) | the value of A in tuple ri equals the value of B in tuple sk}, where
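The three retrieval steps above amount to a reverse traversal through the index components. A minimal sketch (ours, with plain dicts standing in for B+-trees; the entries mirror the chapter's running example):

```python
# Reverse traversal through a multi-index; dicts play the role of B+-trees.
I3 = {"Kluwer": {"P[4]"}, "Addison-Wesley": {"P[2]"}}      # Publisher.name
I2 = {"P[4]": {"B[2]", "B[3]"}, "P[2]": {"B[1]", "B[4]"}}  # Book.publisher
I1 = {"B[1]": {"A[4]"}, "B[2]": {"A[3]"},
      "B[3]": {"A[3]"}, "B[4]": {"A[2]"}}                  # Author.books

def reverse_traverse(key, indexes):
    """Scan the last index with the query key, then feed each resulting set
    of OIDs as keys into the preceding index, back to the first one."""
    current = indexes[-1].get(key, set())
    for index in reversed(indexes[:-1]):
        current = set().union(*(index.get(oid, set()) for oid in current))
    return current

# Q1: authors whose books were published by Kluwer
assert reverse_traverse("Kluwer", [I1, I2, I3]) == {"A[3]"}
```

Each step performs one batch of index look-ups, so a nested predicate over a path of length n costs n index scans, as the text notes.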
Figure 1.4. Indexing graphs: a) multi-index; b) join indexes; c) nested index; d) path index; e) access support relation.

• ri (sk) denotes the surrogate of a tuple of R (S);
• tuple ri (tuple sk) refers to the tuple having ri (sk) as surrogate.

A BJI is implemented as a binary relation, and two copies may be kept, one clustered on ri and the other on sk; each copy is implemented as a B+-tree. In aggregation graphs, a sequence of BJIs can be used in a multi-index organization to implement the various index components along a given path. We refer to such a sequence of join indexes as the JI organization. Consider path P1 = Author.books.publisher.name. The join indexes allocated for this path are listed below. They are illustrated together with some example index entries in Figure 1.5.

• The first join index BJI1 is on Author.books. The copy denoted as BJI1(a) in Figure 1.5 is clustered on OIDs of instances of Author, whereas the copy denoted as BJI1(b) is clustered on OIDs of instances of Book.
• The second join index BJI2 is on Book.publisher. The copy denoted as BJI2(a) in Figure 1.5 is clustered on OIDs of instances of Book, whereas the copy denoted as BJI2(b) is clustered on OIDs of instances of Publisher.
• The third join index BJI3 is on the attribute Publisher.name. The copy denoted as BJI3(a) in Figure 1.5 is clustered on OIDs of instances of Publisher,
whereas the copy denoted as BJI3(b) is clustered on values of attribute "name".

Figure 1.5. JI organization for path P1 = Author.books.publisher.name.

A JI organization supports both forward and reverse traversal strategies when both copies are allocated for each join index. Reverse traversal is suitable for solving queries such as query Q1 ("Retrieve the authors of books published by Kluwer."). Forward traversal arises when, given an object, all objects must be determined that are referenced directly or indirectly by this object. An example is the query "Determine the publishers of the books written by author A[3]". Reverse traversal is already supported by the multi-index. However, that technique does not support forward traversal, which must, therefore, be executed by directly accessing the objects. The usage of a sequence of JIs may make forward traversal faster when object accesses are expensive (for example, very large objects or non-optimal clustering). Moreover, forward traversal supported by a
sequence of JIs may be useful in complex queries when objects at the beginning of the path have already been selected as the effect of another predicate in the query. An example of a more complex query is "Select all books written by an author from AT&T Lab". Suppose that an index is allocated on attribute "Organization.name" and, moreover, a JI organization is allocated on the path P = Organization.staff.books. A possible query strategy could be to first select the OID of the organization named "AT&T Lab" using the index on attribute "Organization.name", and then use the JI organization in forward traversal to determine the books written by authors of the organization O[1] selected by the first index scan.

The IG for a JI organization along a path P is constructed as follows. For each pair of neighbor classes Ci and Ci+1 along path P, the graph contains two arcs (Ci, Ci+1) and (Ci+1, Ci). The former arc corresponds to the copy of the binary join index between Ci and Ci+1 clustered on class Ci, while the latter arc corresponds to the copy clustered on class Ci+1. The IG for the path P3 is presented in Figure 1.4.b.

Note that when the JI organization is used for forward traversal, the sequence of B+-trees searched in the traversal corresponds to a chain of arcs in the IG. Moreover, such a chain consists of left-to-right directed arcs only. By contrast, the use of the JI organization in a reverse traversal corresponds to a chain of arcs in the IG containing only right-to-left directed arcs.

The usage of join indexes in optimizing complex queries has been discussed in [Valduriez, 1986]. A major conclusion is that the most complex part (that is, the joins) of a query can be executed through join indexes, without accessing the base data. However, there are cases when traditional indexing (selection indexes on join attributes) is more efficient than the usage of a join index.
For example, a traditional index is more efficient than a join index when the query simply consists of a join preceded by a highly selective selection. The major conclusion is that join indexes are more suitable for complex queries, that is, queries involving several joins. The update costs for the JI organization are in general double the costs for the multi-index organization, since in the JI organization there are two copies of each join index. The update costs of the JI organization can, however, be reduced by allocating a single copy for one or more join indexes in the organization, rather than two copies. Allocating a single copy, however, makes forward or reverse traversal more expensive, depending on which copy is allocated, and therefore the correct allocation decision must be based on the expected query and update patterns and frequencies.
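The two clustered copies of each BJI, and the forward traversal they enable, can be sketched as follows (our own illustration; copy (a) is keyed on the left class, copy (b) on the right, and entries follow the running example):

```python
from collections import defaultdict

def make_bji(pairs):
    """Build the two copies of a binary join index from (left, right) pairs."""
    a, b = defaultdict(set), defaultdict(set)   # (a): forward copy, (b): reverse copy
    for left, right in pairs:
        a[left].add(right)
        b[right].add(left)
    return a, b

BJI1_a, BJI1_b = make_bji([("A[4]", "B[1]"), ("A[3]", "B[2]"),
                           ("A[3]", "B[3]"), ("A[2]", "B[4]")])   # Author.books
BJI2_a, BJI2_b = make_bji([("B[1]", "P[2]"), ("B[2]", "P[4]"),
                           ("B[3]", "P[4]"), ("B[4]", "P[2]")])   # Book.publisher

def forward(oids, copies):
    """Forward traversal: chain the (a) copies left to right along the path."""
    for copy in copies:
        oids = set().union(*(copy.get(o, set()) for o in oids))
    return oids

# "Determine the publishers of the books written by author A[3]"
assert forward({"A[3]"}, [BJI1_a, BJI2_a]) == {"P[4]"}
```

Dropping one copy of a BJI halves its update cost but removes the corresponding arc from the IG, disabling traversal in that direction, exactly the trade-off described above.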
Nested index

Both the previous organizations require, when solving a nested predicate, access to a number of indexes proportional to the path length. Different organizations have been proposed to reduce the number of indexes accessed. The first of these organizations is the nested index [Bertino and Kim, 1989], providing a direct association between an object of a class at the end of a path and the corresponding instances of the class at the beginning of the path. Consider path P1 = Author.books.publisher.name. A nested index allocated on this path contains as key values names of publishers. It associates with each publisher name the OIDs of authors that have written a book published by this publisher. Figure 1.6 shows some example entries for a nested index allocated on path P1.

Academic Press    Null
Addison-Wesley    A[2], A[4]
Elsevier          Null
Kluwer            A[3]
Microsoft         Null

Figure 1.6. Nested index for path P1 = Author.books.publisher.name.

Retrieval under this organization is quite efficient. A query such as Q1 is solved with only one index lookup. The major problem of this indexing technique is update operations, which require access to several objects in order to determine the index entries to be updated. For example, suppose that book B[4] is removed from the database. To update the index, the following steps must be executed:

1. Access object B[4] and determine the value of nested attribute "Book.publisher.name"; result: "Addison-Wesley".
2. Determine all instances of class Author having B[4] in the list of authored books; result: {A[2]}.
3. Remove A[2] from the index entry with key value equal to "Addison-Wesley"; after the removal the index entry for "Addison-Wesley" is {A[4]}.

As this example shows, update operations in general require both forward and backward traversals of objects.
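The three update steps above can be sketched as follows. This is our own illustration: the auxiliary dicts stand in for the forward references (book to publisher name) and reverse references (book to authors) that the update procedure must traverse, and we add the check that an author is dropped only when none of its remaining books carries the same key.

```python
# Nested-index maintenance on deletion of a book (sketch, our own data layout).
nested_index = {"Addison-Wesley": {"A[2]", "A[4]"}, "Kluwer": {"A[3]"}}
book_pub_name = {"B[4]": "Addison-Wesley", "B[1]": "Addison-Wesley"}  # forward refs
book_authors = {"B[4]": {"A[2]"}, "B[1]": {"A[4]"}}                   # reverse refs
author_books = {"A[2]": {"B[4]"}, "A[4]": {"B[1]"}, "A[3]": {"B[2]", "B[3]"}}

def remove_book(book):
    key = book_pub_name[book]              # step 1: forward traversal to the key value
    for author in book_authors[book]:      # step 2: reverse traversal to the authors
        author_books[author].discard(book)
        # step 3: drop the author unless another of its books keeps this key
        if not any(book_pub_name.get(b) == key for b in author_books[author]):
            nested_index[key].discard(author)

remove_book("B[4]")
assert nested_index["Addison-Wesley"] == {"A[4]"}
```

The sketch makes the cost structure visible: one forward chain per update plus one reverse chain, which is why the nested index becomes impractical when reverse references are absent.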
Forward traversal is required to determine the value of the indexed attribute (that is, the value of the attribute at the end of the path) for the modified object. Reverse traversal is required to determine the instances at the beginning of the path. The OIDs of those instances will
be removed from (added to) the entry associated with the key value determined by the forward traversal. Note that reverse traversal is very expensive when there are no reverse references among objects. In such a case, the nested index organization may not be usable.

Note that a nested index as defined above can only be used for reverse traversal. However, it would be possible, as for the JI organization, to allocate two copies of a nested index: the first having as key values the values of attribute An at the end of the path (examples of entries of this copy for path P1 are the ones we have shown earlier); the second having as key values the OIDs of the instances of the class at the beginning of the path. Therefore, for path P1 this second copy would have the entries illustrated in Figure 1.7.

A[1]    Null
A[2]    Addison-Wesley
A[3]    Kluwer
A[4]    Addison-Wesley
A[5]    Null

Figure 1.7. A nested index for path P1 = Author.books.publisher.name clustered on OIDs of instances of the class at the beginning of the path.

The use of the above nested index would be more efficient than forward traversal using the objects themselves. The IG for a nested index allocated on a path P contains only two arcs, namely (C1, Cn+1) and (Cn+1, C1). The former arc, however, is only inserted in the IG if the second copy of the nested index, supporting forward retrieval, is allocated. The IG for a nested index allocated on path P3 is shown in Figure 1.4.c.

Path index

A path index [Bertino and Kim, 1989] is based on a single index, like the nested index. The difference is that a path index provides an association between an object O at the end of a path and all instantiations ending with O. For a path of length n, the leaf-node records of a path index contain the instantiations implemented as records of n components. Example index entries for path P3 are given in Figure 1.8.
Note that a path index records, in addition to complete instantiations, left-partial and right-partial instantiations. Unlike the nested index, a path index can be used to solve nested predicates against all classes along the path. For example, the path index on P3 can be used to determine all authors of books published by Kluwer, or simply to find the books published by Kluwer.
Publisher.name      Path instantiations
Academic Press      Null
Addison-Wesley      O[1].A[4].B[1].P[2], A[2].B[4].P[2]
Elsevier            Null
Kluwer              O[2].A[3].B[2].P[4], O[2].A[3].B[3].P[4]
Null                O[4]

Figure 1.8. Path index for path P3 = Organization.staff.books.publisher.name.

This feature is also very useful when dealing with complex queries. It supports a special kind of projection, called projection on path instantiation [Bertino and Guglielmina, 1991, Bertino and Guglielmina, 1993]. This operation allows retrieving OIDs of several classes along the path with a single index lookup. For example, suppose we wish to determine all authors who have their books published by Kluwer in 1991. This query can be solved by first performing an index lookup with key value equal to Kluwer and then performing a projection on the positions of classes Author (pos=1) and Book (pos=2) on the selected index entries. That is, the first and second elements of each path instantiation verifying the nested predicate are extracted from the index. Therefore, the results of this projection in the above example are: {(A[3], B[2]), (A[3], B[3])}. Then the second element of each pair is extracted. The corresponding object is accessed and the predicate on attribute "year" is evaluated. If this predicate is satisfied, the first element of the pair is returned as the query result. For example, given the two pairs above, instances B[2] and B[3] of class Book would be accessed to verify whether the value of attribute "year" is 1991. Since only B[3] verifies the predicate, A[3] is returned as the query result. An analysis of query processing strategies using this operation is presented in [Bertino and Guglielmina, 1993].

Updates on a path index are expensive, since forward traversals are required, as in the case of the nested index. However, no reverse traversals are required.
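As an aside, the projection on path instantiations just described can be sketched as follows (our own illustration, using the P3-shaped entries of Figure 1.8; the "year" values are taken from the running example, where only B[3] was published in 1991):

```python
# Projection on path instantiations (sketch): index lookup, projection on the
# Author and Book positions, then a residual predicate on the Book objects.
path_index = {  # key value -> path instantiations (as in Figure 1.8)
    "Kluwer": [("O[2]", "A[3]", "B[2]", "P[4]"),
               ("O[2]", "A[3]", "B[3]", "P[4]")],
}
book_year = {"B[2]": 1986, "B[3]": 1991}   # illustrative "year" attribute values

def authors_published_by_in(publisher, year):
    # one index lookup, then projection on the Author and Book components
    pairs = {(inst[1], inst[2]) for inst in path_index.get(publisher, [])}
    # access each projected Book object and evaluate the "year" predicate
    return {author for author, book in pairs if book_year.get(book) == year}

assert authors_published_by_in("Kluwer", 1991) == {"A[3]"}
```

A single index lookup thus yields OIDs for two classes at once; only the residual predicate on "year" requires touching the objects themselves.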
Therefore, the path index organization can be used even when no reverse references among objects on the path are present.

The IG for a path index allocated on a path P contains n arcs, namely (Cn+1, Ci) for all i in the range 1, ..., n. The IG for a path index allocated on path P3 is shown in Figure 1.4.d.

Access support relation (ASR)

This approach is very similar to the path index in that it involves calculating all instantiations along a path and storing them in a relation. Given a path P = C1.A1.A2. ... .An, all path instantiations are stored as records in an (n+1)-ary relation. The ith attribute of that relation corresponds to the class Ci. Also, both complete and partial instantiations are represented in the table. Example index entries for path P3 are given in Figure 1.9. Two B+-trees are allocated on the first and last attributes (classes C1 and Cn+1) of the access relation to accelerate forward and reverse traversals. Like the path index, the ASR has a low retrieval cost and a quite high update cost.

Org     Author    Book    Publisher    Publisher.name
O[1]    A[4]      B[1]    P[2]         Addison-Wesley
O[2]    A[3]      B[2]    P[4]         Kluwer
O[2]    A[3]      B[3]    P[4]         Kluwer
O[4]    Null      Null    Null         Null
Null    A[2]      B[4]    P[2]         Addison-Wesley
Null    Null      Null    P[1]         Academic Press
Null    Null      Null    P[3]         Elsevier

Figure 1.9. Access support relation for path P3 = Organization.staff.books.publisher.name.

In the IG for an ASR allocated on a path P, any vertex for class Ci, i = 2, ..., n−1, has two incoming arcs (C1, Ci) and (Cn.An, Ci). Figure 1.4.e presents the indexing graph for the ASR for path P3. It contains arcs outgoing from the first and last classes in the path, on which the two B+-trees are allocated.

Comparison

A comparison among three of the basic indexing techniques, namely multi-index, nested index and path index, has been presented in [Bertino and Kim, 1989]. An important parameter in the evaluations is the degree of reference sharing. Two objects share a reference if they reference the same object as the value of an attribute. Therefore, this degree models the topology of references among objects. A more accurate model of reference topology was developed in [Bertino and Foscoli, 1995]. The main results of the comparison can be summarized as follows. For retrieval the nested index has the lowest cost, as expected, and the path index has lower cost than the multi-index.
The nested index has better performance than the path index for retrieval, because a path index contains OIDs of instances of all classes along the path, while the nested index contains OIDs of instances of only the first class in the path. However, a single path index allows predicates to be solved for all classes along the path, while the nested index does not. For update the multi-index has the lowest cost. The nested index has a slightly lower cost than the path index for path length 2. For paths
longer than 2, the nested index has a slightly lower cost than the path index if updates are on the first two classes of the path; otherwise the nested index has significantly higher cost than the path index. Note, however, that the update costs for the nested index are computed under the hypothesis that there are reverse references among objects. When there are no reverse references, update operations for the nested index become much more expensive.

1.2.2 Advanced index organizations

Each of the basic organizations described in the previous subsection is biased towards a specific kind of operation (retrieval or update). No organization supports retrieval and update operations equally well. In this subsection, we present some advanced approaches which are characterized by a customization component. Such a component allows tailoring the organizations with respect to specific query and update patterns and frequencies. The customization requires detecting an index configuration which is optimal for a given set of operations along the indexed path.

Path splitting

The path splitting approach [Bertino, 1994, Choenni et al., 1994] overcomes the problem of biased performance of the three basic techniques, namely high update costs in the nested and path index and high retrieval costs in the multi-index. The approach is based on splitting a path into several shorter subpaths, and allocating on each subpath one among the following basic organizations: multi-index, nested index, path index. For example, path P3 = Organization.staff.books.publisher.name could be split into two subpaths:

• P31 = Organization.staff.books with a multi-index allocated
• P32 = Book.publisher.name with a path index allocated.

An algorithm determining optimal configurations for paths has been developed [Bertino, 1994]. The algorithm takes as input the frequency of retrieval, insert, and delete operations for classes along the path.
Moreover, it takes into account whether reverse references exist among objects, as well as all logical and physical data characteristics. The algorithm determines the optimal splitting of a path into subpaths, and the organization to use for each subpath. The algorithm also considers, for each subpath, the choice of allocating no index. An interesting result obtained by running the algorithm is that when the degrees of reference sharing along a path are very low (that is, close to 1) and reverse references are allocated among objects, the best index configuration consists of allocating no index on the path.
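The splitting algorithm itself is not reproduced here; purely as a hypothetical illustration of the search space it explores, the sketch below enumerates every way to cut a path into contiguous subpaths and picks the cheapest configuration under a caller-supplied cost function. The cost model is a toy placeholder, not the one of [Bertino, 1994], which weighs operation frequencies and physical data characteristics.

```python
from itertools import combinations

def best_split(classes, cost):
    """Exhaustively enumerate cut points between classes and return the
    (total_cost, subpaths) pair minimizing the supplied cost function."""
    n = len(classes)
    best = None
    for k in range(n):                               # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            subpaths = [tuple(classes[bounds[i]:bounds[i + 1]])
                        for i in range(len(bounds) - 1)]
            total = sum(cost(sp) for sp in subpaths)
            if best is None or total < best[0]:
                best = (total, subpaths)
    return best

# toy cost: longer subpaths cost quadratically more to maintain
total, split = best_split(["Organization", "Author", "Book", "Publisher"],
                          lambda sp: len(sp) ** 2)
```

Under this toy cost the optimum degenerates to one index per class; the real algorithm's cost model is what makes longer subpaths (and hence nested or path indexes) win for retrieval-heavy workloads.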
The overall index configuration obtained according to the path splitting approach can simply be represented by an IG. As an example, consider the IG for the configuration of path P3 consisting of subpaths P31 with a multi-index allocated and P32 with a path index allocated, shown in Figure 1.10.a.

Figure 1.10. Indexing graphs for advanced techniques: a) path splitting; b) ASR decomposition; c) join index hierarchy.

ASR decomposition

Under the ASR organization one table is maintained for all instantiations along the path. Similarly to the path splitting approach, a path may be decomposed and different access relations allocated for each subpath. Even though [Kemper and Moerkotte, 1992] proves some properties of the ASR decomposition, it does not provide any criteria or algorithm for "optimal" partitioning. Figure 1.10.b shows the IG corresponding to a case where the ASR allocated on path P3 is decomposed into two partitions.

Join index hierarchy

This is another approach based on the join index [Valduriez, 1987]. A complete join index hierarchy (JIH) consists of basic join indexes and derived join indexes [Xie and Han, 1994]. Basic indexes, which form the base of the JI hierarchy, are supported for pairs of neighbor classes in a path P, whereas derived indexes are supported for pairs of non-neighbor classes. Derived join indexes are built from basic join indexes and, possibly, other derived join indexes. For the path P3, Figure 1.11 shows the derived join index between class Author (pos=2) and attribute Publisher.name (pos=5). Maintenance of the complete JI hierarchy is expensive in terms of both storage and update costs. Therefore, a partial JI hierarchy which contains all basic JIs and only several derived indexes seems to be more efficient for
most real cases. In the partial hierarchy, any derived join index needed for executing a query but not included in the partial JI hierarchy is derived from the indexes in the partial JI hierarchy through a sequence of join operations. The selection of the derived JIs to be included in the partial JI hierarchy is driven by some heuristics and metrics. As the performance tests reported in [Xie and Han, 1994] show, a partial JI hierarchy behaves better than the complete JI hierarchy and the ASR organization.

Author    Publisher.name
A[2]      Addison-Wesley
A[3]      Kluwer
A[4]      Addison-Wesley

Figure 1.11. Derived join index between Author and Publisher.name.

An IG corresponding to a partial JI hierarchy is characterized by the following property. If it contains an arc from class Ci to class Cj, then it contains the arc from Cj to Ci as well. Figure 1.10.c shows the IG of a partial JI hierarchy for path P3. Such a partial JI hierarchy supports basic join indexes for the following pairs of neighbor classes: (Organization, Author), (Author, Book), (Book, Publisher), (Publisher, Publisher.name). It moreover supports an additional derived join index for the pair (Author, Publisher.name).

1.3 Index organizations for inheritance hierarchies

As we discussed in Section 1.1, an object-oriented query may apply to a class only, or to a class and all its direct and indirect subclasses. Since an attribute of a class C is inherited by all its subclasses, a relevant issue concerns how to efficiently evaluate a predicate against such an attribute when the scope of the query is the inheritance hierarchy rooted at C. In this section we discuss indexing techniques addressing this issue. The various approaches are analyzed with respect to storage overhead, update and retrieval costs. Retrieval costs, in particular, depend on whether the query is a point query or a range query.
In a B+-tree index, a point query retrieves one leaf node only; the query predicate is usually an equality predicate. By contrast, a range query specifies an interval (or a set) of values for the search key and may require retrieving several leaf nodes.

Consider an attribute A defined in a class C and inherited by all its subclasses. A query against attribute A is a single-class query (SC-query) if the query scope consists of only one class from the inheritance hierarchy rooted at C. Otherwise, the query is a class-hierarchy query (CH-query) and its scope
includes a subhierarchy of the inheritance hierarchy, that is, some class in the hierarchy with all its subclasses. A CH-query is a rooted CH-query if the root of the subhierarchy in the scope coincides with the root class C. Otherwise, the query is a partial CH-query.

Consider the database schema shown in Figure 1.1. Consider the inheritance hierarchy rooted at class Book and queries against its attribute "year", which is inherited by classes Manual and Handbook. An example of an SC-query is a query which retrieves instances of one of the classes in the hierarchy (Book, Manual or Handbook). The query against the attribute "year" which retrieves instances of all three classes is a rooted CH-query. If the class Manual had a subclass called Manual_on_CD, then a query with classes Manual and Manual_on_CD in the scope would be a partial CH-query.

SC-index and CH-tree

The inheritance hierarchy indexing problem was first addressed in [Kim et al., 1989], where two possible approaches are proposed. The first approach, called single-class index (SC-index), is based on maintaining a separate B+-tree on the indexed attribute for each class in the inheritance hierarchy. Therefore, if the inheritance hierarchy has m classes, the SC-index requires m B+-trees. As an example, consider the inheritance hierarchy rooted at class Book in Figure 1.1. If the attribute "year" is frequently referred to in queries against this hierarchy, the SC-index approach requires building three indexes, one for each class in the hierarchy, namely Book, Manual and Handbook. The evaluation of a predicate against the attribute "year" would then require scanning the three indexes and performing the union of the results.

Book                 Manual             Handbook
1986    B[2]         1990    M[1]       1990    H[1]
1990    B[4]         1993    M[2]
1991    B[1], B[3]

Figure 1.12. SC-index organization for the inheritance hierarchy rooted at class Book.
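The SC-index evaluation just described, one lookup per class followed by a union, can be sketched as follows (our own illustration; index contents follow the chapter's example entries):

```python
# SC-index organization: one per-class index; a CH-query unions lookups over
# every class in its scope. Dicts stand in for the per-class B+-trees.
sc_index = {
    "Book":     {1986: {"B[2]"}, 1990: {"B[4]"}, 1991: {"B[1]", "B[3]"}},
    "Manual":   {1990: {"M[1]"}, 1993: {"M[2]"}},
    "Handbook": {1990: {"H[1]"}},
}

def ch_query(year, scope):
    """Scan one B+-tree per class in the scope and union the results."""
    return set().union(*(sc_index[c].get(year, set()) for c in scope))

# SC-query: a single class in the scope
assert ch_query(1990, ["Book"]) == {"B[4]"}
# rooted CH-query: Book and all its subclasses
assert ch_query(1990, ["Book", "Manual", "Handbook"]) == {"B[4]", "M[1]", "H[1]"}
```

The cost asymmetry is visible directly: an SC-query touches one index, while a CH-query over m classes must scan all m indexes.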
The three indexes on the attribute "year" for the classes in the inheritance hierarchy rooted at class Book are shown in Figure 1.12. This approach is very efficient for SC-queries. However, it is not optimal for CH-queries, because it requires scanning all the indexes allocated on the classes in the queried inheritance hierarchy.

The second approach, called class-hierarchy index (CH-tree), is based on maintaining a unique B+-tree for all classes in the hierarchy. An index entry in a leaf node may thus contain the OIDs of instances of any class in the
indexed inheritance hierarchy. A CH-tree allocated on the attribute "year" for the inheritance hierarchy rooted at class Book is shown in Figure 1.13.

Book, Manual, Handbook
1986    (Book, {B[2]})
1990    (Book, {B[4]}), (Manual, {M[1]}), (Handbook, {H[1]})
1991    (Book, {B[1], B[3]})
1993    (Manual, {M[2]})

Figure 1.13. Entries of CH-tree for the inheritance hierarchy rooted at class Book.

Note, from the figure, that the entry with key value equal to 1990 contains three sets of OIDs. The first set contains the OIDs of the instances of Book (B[4] in the example), whereas the second and third sets contain OIDs of manuals (M[1]) and handbooks (H[1]), respectively. Generally, a leaf node in a CH-tree consists of a key value, a key directory, and, for each class in the inheritance hierarchy, the number of elements in the list of OIDs for instances of this class that hold the key value in the indexed attribute, together with the list of OIDs itself. The key directory contains an entry for each class that has instances with the key value in the indexed attribute. An entry for a class consists of the class identifier and the offset in the index record where the list of OIDs for the class is located.

Under the CH-tree organization, an SC-query is evaluated as follows. Let C be the class against which the query is issued. The index is scanned to find the leaf-node record with the key value satisfying the query predicate. Then the key directory is accessed to determine the offset in the index record where the list of OIDs of instances of C is located. If there is no entry for class C, then there are no instances of C satisfying the predicate. A CH-query is processed in the same way, except that the lookup in the key directory is executed for each class involved in the query. In general, the performance of the CH-tree has an inverse trend with respect to the SC-index.
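The CH-tree leaf-record structure and the two evaluation procedures can be sketched as follows (our own illustration; a dict plays the role of the key directory, and entries follow Figure 1.13):

```python
# CH-tree sketch: one shared index; each leaf entry carries a key directory
# mapping every class holding the key value to its list of OIDs.
ch_tree = {
    1986: {"Book": ["B[2]"]},
    1990: {"Book": ["B[4]"], "Manual": ["M[1]"], "Handbook": ["H[1]"]},
    1991: {"Book": ["B[1]", "B[3]"]},
    1993: {"Manual": ["M[2]"]},
}

def lookup(year, scope):
    """One index scan; the key directory is then probed once per class in the
    scope (an SC-query passes a single class, a CH-query several)."""
    directory = ch_tree.get(year, {})
    return {oid for cls in scope for oid in directory.get(cls, [])}

assert lookup(1990, ["Manual"]) == {"M[1]"}                        # SC-query
assert lookup(1990, ["Book", "Manual", "Handbook"]) == {"B[4]", "M[1]", "H[1]"}
```

Both query kinds cost a single index scan here, which is exactly why the CH-tree wins for CH-queries but drags unnecessary per-class entries through memory when the scope is a single class.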
The CH-tree is more efficient for queries whose access scope involves all classes (or a significant subset of the classes) in the indexed inheritance hierarchy, whereas an SC-index is effective for queries against a single class. By contrast, the CH-tree retrieves many unnecessary leaf-node pages when the query applies to a single class only. Results of an extensive evaluation of the two indexing techniques have been reported in [Kim et al., 1989]. An important parameter in the evaluation is the distribution of key values across the classes in the inheritance hierarchy. In general, if each key value is taken by instances of only one class C (that is, a disjoint distribution), the CH-tree is less efficient than the SC-index. Conversely, if each key value is taken by instances of several classes, the CH-tree performs
better. Also, the update cost for the CH-tree is higher than in the SC-index, because a B+-tree for a single class is expected to be much smaller than a single index maintained for the entire hierarchy.

H-tree

The skewed performance of the SC-index and CH-tree for SC- and CH-queries led to more attempts to overcome the problem. The H-tree [Low et al., 1992] is a variant of the SC-index which aims at improving the performance of the SC-index for CH-queries. Like the SC-index, a separate B+-tree is maintained on the indexed attribute for each class in the inheritance hierarchy. However, unlike the SC-index, in the H-tree the B+-trees are linked based on their class-subclass relationships by pointers in the internal nodes of the B+-trees. For each pair of classes C and C' in the inheritance hierarchy, such that class C' is a direct subclass of C, a set of additional pointers is maintained from the internal nodes of the B+-tree allocated on class C to internal nodes in the B+-tree allocated on class C'. The pointers connect internal-node separators for the same values of the indexed attribute. Figure 1.14 shows a fragment of an H-tree allocated on the inheritance hierarchy rooted at class Book which indexes the "year" attribute.

Figure 1.14. Fragment of the H-tree organization for the inheritance hierarchy rooted at class Book (B+-trees of Book, Manual and Handbook linked by internal-node pointers).

To execute a CH-query, the H-tree performs a complete scan on the B+-tree allocated on the query class, followed by a partial search on each of the B+-trees allocated on the other classes in the subhierarchy rooted at the query class. The partial search is performed by following the additional pointers from the B+-tree allocated on the root class of the queried inheritance hierarchy to the B+-trees of its subclasses. Unfortunately, the usage of those additional pointers solves the problem of low performance only partially.
Although the H-tree reduces the number of accesses to the B+-tree internal
nodes, it still requires accessing more leaf-node pages than are accessed under the SC-index organization. Moreover, the reduced query cost is achieved at the expense of additional storage overhead for the pointers between B+-trees. As a consequence, the update cost in the H-tree is higher than in the SC-index.

CG-tree The CG-tree [Kilger and Moerkotte, 1994] enhances the H-tree by collecting all pointers between the indexes of different classes in special nodes which form one additional level located just before the leaf-node level of the B+-trees. Given an inheritance hierarchy of m classes, the CG-tree maintains m B+-trees, one for each class. In each B+-tree, an additional level between the internal and leaf nodes is included. Each node at this level contains a vector of m elements (called class directory) of leaf-node references. There is one element in the array for each class in the indexed inheritance hierarchy. The ith component of the class directory contains a reference to the leaf node containing those entries of the class Ci that have the same key value. The position i of class Ci is given by the preorder traversal of the inheritance hierarchy. The CG-tree has better performance than the H-tree, as it avoids reading unnecessary internal nodes. However, it may still require reading unnecessary leaf nodes. Moreover, the CG-tree has a high storage overhead and update cost because of the class directories.

hcC-tree The hcC-tree [Sreenath and Seshadri, 1994] is another organization attempting to combine the advantages of the SC-index and the CH-tree. Like the CH-tree, it is based on maintaining a single B+-tree-like data structure to index the entire inheritance hierarchy. In addition to the usual internal and leaf nodes of a standard B+-tree used for indexing the attribute values, it includes a new type of nodes, the so-called OID nodes.
The OID nodes lie one level below the leaf nodes and contain the lists of OIDs related to the attribute values. Given an inheritance hierarchy with m classes, the hcC-tree maintains m + 1 chains of OID nodes: m class chains (one chain for each class) and one hierarchy chain of OID nodes corresponding to the entire inheritance hierarchy. The class chain for a class C groups the OIDs belonging to C, and the hierarchy chain groups the OIDs of all instances of all the classes in the inheritance hierarchy. Practically, a class chain looks like the chain of leaf nodes in an SC-index, whereas the hierarchy chain is similar to the chain of leaf nodes in a CH-tree. The OID nodes are referenced by entries in the leaf nodes. Each leaf-node entry, in addition to key values, contains a bitmap with m bits and a set P of (m + 1) pointers. Each bit in the bitmap corresponds to a class in the
inheritance hierarchy such that, if the ith bit is set, the ith pointer in P points to the first node in the class chain for the class containing OIDs with the key value. Each internal-node entry consists of a key value, a node pointer and an m-bit bitmap. For SC-queries, the performance of the hcC-tree is comparable to that of the SC-index, as it requires searching only one class chain. For rooted CH-queries over a key range, the hcC-tree's performance is comparable to that of the CH-tree, as it requires searching only the hierarchy chain. However, for partial CH-queries over a key range, the hcC-tree behaves like the SC-index, because it requires searching a number of class chains equal to the number of classes in the query class scope. Furthermore, as the hcC-tree stores each OID twice (in one class chain and in the hierarchy chain), it incurs a high storage overhead and update cost.

x-tree All the above approaches basically use one of two mutually exclusive grouping methods. The SC-index, the H-tree and the CG-tree group attribute values in the leaf nodes of a B+-tree on the basis of the class in which the instances with the value appear. By contrast, the CH-tree and the hcC-tree group on the values of the indexed attribute regardless of the class the instances with the value belong to. Because of this dichotomy, the various indexing techniques behave differently for different queries. Indexing techniques based on the first grouping method are always more efficient for SC-queries, whereas techniques based on the second grouping method are always more efficient for CH-queries. The above considerations have led researchers to the insight that the search space for class-hierarchy indexing is actually 2-dimensional, with the indexing attribute values extended along one dimension (attribute-dimension) and the classes in the hierarchy extended along the second dimension (class-dimension).
As a result, grouping of the indexed values should extend in both directions. In such a case, the several techniques supporting multi-dimensional indexing, like the R-tree, quad-tree, grid file, etc. [Ooi, 1990], can be used for indexing an inheritance hierarchy. Figure 1.15 represents data from the inheritance hierarchy rooted at class Book as a 2-dimensional search space. Using such a representation, the query Q2 "Retrieve all instances of class Book and all its subclasses printed in 1991" becomes a rectangular domain in the data plane. The x-tree [Chan et al., 1997] is a dynamic indexing technique similar to the R-tree [Guttman, 1984] and R*-tree [Beckmann et al., 1990]. Data are stored in the leaf nodes, which all appear at the same level of the tree. Each leaf-node entry consists of the key value K, the object identifier oid and the identifier cid of the class the object belongs to. If all entries with the same key value K do not fit in one leaf node, two or more nodes are allocated and all node entries with the same class identifier are grouped together.
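To make the 2-dimensional view concrete, here is a minimal sketch (our own illustrative Python, not the x-tree's actual page layout): each index entry is a point (attribute value, class), and a CH-query such as Q2 is a rectangle over a key range and a set of classes.

```python
# Illustrative sketch of the 2-dimensional search space for class-hierarchy
# indexing. Entry and ch_query are our names, not from [Chan et al., 1997].
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    key: int   # indexed attribute value, e.g. "year"
    cid: str   # class identifier
    oid: str   # object identifier

def ch_query(entries, classes, lo, hi):
    """Return OIDs of instances of `classes` with lo <= key <= hi,
    i.e. the points inside the query rectangle."""
    return [e.oid for e in entries
            if e.cid in classes and lo <= e.key <= hi]

entries = [Entry(1990, "Book", "B[1]"), Entry(1991, "Book", "B[2]"),
           Entry(1991, "Manual", "M[1]"), Entry(1993, "Handbook", "H[1]")]

# Q2: all instances of Book and its subclasses printed in 1991
print(ch_query(entries, {"Book", "Manual", "Handbook"}, 1991, 1991))
# ['B[2]', 'M[1]']
```

A real x-tree organizes such points in a paged tree rather than a flat list; the sketch only shows why a CH-query is a rectangle in this plane.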
Figure 1.15. Objects from the hierarchy rooted at Book as a 2-dimensional search plane (attribute dimension: years 1986-1993; class dimension: Book, Manual, Handbook; objects B[1]-B[4], M[1], M[2], H[1]; query Q2 shown as a rectangle).

The internal nodes contain entries of the form (cidSet, Kmin, Kmax, P), where cidSet is a subset of the classes in the indexed inheritance hierarchy, [Kmin, Kmax] is a subrange of the attribute domain, and P is a pointer to a child node at the next level. In the internal nodes of the x-tree, all node entries with the same set of classes are clustered together into the same record. As the node-splitting strategy in the R-tree is more complicated than in a B+-tree and often depends on the data shape and distribution, the x-tree uses some heuristics for node splitting based on a special proximity cost metric. The heuristic generates a list of candidate node splits along both the class-dimension and the attribute-dimension. The candidates are generated on the basis of a low proximity cost of the split. After the generation step, the best candidate is selected as the final node split. As performance tests show, the x-tree outperforms the CH-tree for most types of query. As can be expected, the only exception is for queries against all the classes in the indexed inheritance hierarchy. In such a case, the x-tree fetches about 80% more pages than the CH-tree. Also, like the R-tree, which has a lower space utilization than the B+-tree, the x-tree is higher and requires larger storage space than the CH-tree.

Good worst-case indexing techniques The x-tree is more efficient than all the previous index organizations for a wide range of queries and data distributions. Yet, it does not have a good worst-case performance, because it uses the R-tree as underlying data structure and some heuristics for node splitting.
An approach with a proven good worst-case performance was proposed in [Kanellakis and Ramaswamy, 1996, Ramaswamy and Kanellakis, 1995]. A key assumption is that the class-dimension in the 2-dimensional data space is static,
that is, no classes in the hierarchy may be removed or inserted even though objects of the classes may be updated.

Figure 1.16. Class-division: a) Example hierarchy with classes {A}, {B}, {C}, {D}, {E}, {F}; b) Binary tree on the class-dimension; c) A CH-query against class C in the 2-dimensional data space.

This reduces indexing the inheritance hierarchy to a special case of external dynamic 2-dimensional range searching in which the data in the 2-dimensional space are points whose y-coordinates belong to a static set corresponding to the set of classes. A given class hierarchy H is preprocessed as follows. We create a family G where each member is a set of classes from H. After the preprocessing, B+-tree indexes are maintained for the union of the classes in each member of G. If a CH-query is against class C in the hierarchy H, a subset of indexes is queried, which exactly covers the subhierarchy of C and which involves at most q indexes, where q is a small integer. On the other hand, a class is allowed to appear in at most a small number r of members of G, so an object can have at most r replicas. Updates are processed by changing all replicas. In other words, the preprocessing solves the following combinatorial problem, which is named class-division of H according to maximal replication factor r and maximal query factor q:

Input: Class hierarchy H with m classes, and positive integers r and q.

Output: A family G, whose members are sets of classes from H, such that (1) No class appears in more than r members of G. (2) For any class C in H, with C' its set of subclasses in H including C itself, there are at most q members of G that exactly cover C' (the union of at most q members of G is C').
The SC-index is an example of class-division with q = m and r = 1. Similarly, class-division is possible for q = 1 and r = m, when B+-tree indexes are maintained for all subhierarchies in H and each object can have up to m replicas. In the general case, there exists the following efficient space-time tradeoff:
For any class hierarchy H with m classes, it is possible to perform class-division of H according to r = ⌈log2 m⌉ + 1 and q = 2⌈log2 m⌉. To prove this, we recall that every CH-query can be represented as a 2-dimensional range query, with two ranges extended along the attribute-dimension and the class-dimension. To make all classes from a subhierarchy contiguous along the class-dimension, the classes of H should be sorted according to the preorder hierarchy traversal. When performing such a traversal of H, we build a binary tree on the class-dimension. The leaves of the tree, scanned from left to right, contain the classes in preorder-traversal order of H, while an internal node contains the union of the classes in all leaves of its subtree. Therefore, the tree has m leaves and ⌈log2 m⌉ + 1 levels. In Figure 1.16.a the class hierarchy consists of six classes and the preorder traversal of the hierarchy is given by {A, B, C, D, E, F}. The binary tree built for the hierarchy is given in Figure 1.16.b. Once the tree is built, the family G is obtained by generating family members for all nodes of the tree. Because each class is present in at most one node on each level and the binary tree has ⌈log2 m⌉ + 1 levels, no object has more than ⌈log2 m⌉ + 1 replicas in G. A CH-query corresponds to a range along the class-dimension in the preorder sort. To minimize the number of members of G (or nodes of the tree) covering the query class range, we select those nodes vi of the binary tree which are completely contained in the query range while their parents are not. The query issued against class C (see Figure 1.16.c) gives the class range {A, B, C}, and the minimal cover for the range is given by nodes {A, B} and {C} (see the shaded nodes in Figure 1.16.b). In the worst case, the query class range has two such nodes vi on each level of the tree and 2⌈log2 m⌉ nodes in total.
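The construction just described, a binary tree over the preorder-sorted classes and the minimal cover of a query range, can be sketched as follows (illustrative Python; the function names and the tuple-of-classes representation are ours, not from the cited papers):

```python
# Sketch of class-division for the six-class hierarchy of Figure 1.16.
# Members of G correspond to nodes of a binary tree over the preorder list.

def build_levels(classes):
    """Build the binary tree bottom-up; returns levels, leaves first."""
    levels, level = [], [(c,) for c in classes]
    while len(level) > 1:
        levels.append(level)
        # pair up adjacent nodes to form the next level
        level = [tuple(sum(level[i:i + 2], ()))
                 for i in range(0, len(level), 2)]
    levels.append(level)  # root level
    return levels

def minimal_cover(levels, lo, hi):
    """Cover the contiguous preorder range [lo, hi] with tree nodes that
    lie inside the range while their parents do not (largest first)."""
    target = {levels[0][i][0] for i in range(lo, hi + 1)}
    cover = []
    for level in reversed(levels):          # root level first
        for node in level:
            covered = set(sum(cover, ()))
            if set(node) <= target - covered:
                cover.append(node)
    return cover

levels = build_levels(["A", "B", "C", "D", "E", "F"])
# CH-query against class C covers the preorder range {A, B, C}:
print(minimal_cover(levels, 0, 2))
# [('A', 'B'), ('C',)]
```

The output reproduces the cover {A, B}, {C} of Figure 1.16.c; at most two nodes per level can enter the cover, which is where the 2⌈log2 m⌉ query bound comes from.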
That is, one can answer class-indexing queries on any class by looking at no more than 2⌈log2 m⌉ indexes. This gives the time-space tradeoff previously stated. As a B+-tree is maintained for each member of G, this tradeoff allows one to construct an efficient data structure in external storage which occupies O(log2 m (N/B)) pages and has worst-case I/O query time O(log2 m logB N + T/B), where B is the size of the external-memory page, m is the number of classes in the inheritance hierarchy, N is the number of objects in the inheritance hierarchy and T is the number of objects the query retrieves. The update time in such a structure is O(log2 m logB N). The above schema provides the worst-case complexities for any class hierarchy. However, for many hierarchies, the values of r and q may be further improved by using heuristics, some of which were discussed in [Ramaswamy and Kanellakis, 1995]. Also, an improvement of the data structure that reduces the query time from O(log2 m logB N + T/B) to O(logB N + log2 B + T/B) was proposed in [Kanellakis and Ramaswamy, 1996].
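As a quick sanity check of the stated bounds for the six-class hierarchy of Figure 1.16 (illustrative arithmetic only):

```python
# Replication and query factors of the class-division tradeoff for m = 6.
import math

m = 6                               # classes of Figure 1.16
r = math.ceil(math.log2(m)) + 1     # maximal replication factor
q = 2 * math.ceil(math.log2(m))     # maximal query factor
print(r, q)
# 4 6
```

So each object has at most 4 replicas, and any CH-query touches at most 6 of the B+-tree indexes.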
1.4 Integrated organizations

Even though we have addressed indexing techniques separately for each dimension along which an object database is organized (namely, aggregation and inheritance), most object-oriented queries involve classes along both dimensions. Such queries typically contain nested predicates and have as a target any number of classes in a given inheritance hierarchy. The query that retrieves all books and manuals written by authors from AT&T Lab. is an example of such queries. Developing integrated indexing techniques able to support such queries is crucial. In principle, every indexing technique defined for one dimension could be combined with any technique defined for the other dimension. However, no integrated indexing technique has been proposed, with the exception of the nested-inherited index [Bertino and Foscoli, 1995], which we describe in the remainder of this section. The nested-inherited index is defined as a combination of concepts from the nested index, the join index and the CH-tree techniques. In order to present this indexing technique, we need some additional definitions. To simplify the following discussion, we make the assumption that a class occurs only once in a path. First we recall that, given a class C, C' denotes the set of classes in the inheritance hierarchy rooted at C. As an example, consider the object-oriented schema in Figure 1.1: Book' = {Book, Manual, Handbook}. Given a path P = C1.A1.A2...An (n ≥ 1), the scope of P is defined as the union of the sets Ci', for Ci ∈ class(P). Class C1 is the root of the scope. Given a class C in the scope of a path, the position of C is given by an integer i such that C belongs to the inheritance hierarchy rooted at class Ci, where Ci ∈ class(P). The scope of a path simply represents the set of all classes along the path and all their subclasses.
For example, consider the path P = Organization.staff.books.publisher.name: scope(P) = {Organization, Author, Book, Manual, Handbook, Publisher}. Class Organization is the root of P. Class Organization has position one, class Author has position two, classes Book, Manual and Handbook have position three, and class Publisher has position four. In the remainder of the discussion, given an object O, we will use the term parent object to denote an object that references O. For example, the parents of the instance M[1] of class Manual are objects A[1] and A[4], instances of class Author. Given a path P = C1.A1.A2...An, the nested-inherited index associates with a value v of attribute An the OIDs of the instances of each class in the scope of P having v as value of the (nested) attribute An. A nested-inherited index on path P = Organization.staff.books.publisher.name associates with a given publisher name all organizations having in their staff authors of books or manuals or
handbooks published by the publisher. Similarly for all the other classes in the scope. Logically, the index will contain the following entries:

Academic Press (Publisher, {P[1]})
Addison-Wesley (Organization, {O[1]}), (Author, {A[1], A[2], A[4]}), (Book, {B[1], B[4]}), (Manual, {M[1]}), (Publisher, {P[2]})
Elsevier (Organization, {O[3]}), (Author, {A[5]}), (Handbook, {H[1]}), (Publisher, {P[3]})
Kluwer (Organization, {O[2]}), (Author, {A[3]}), (Book, {B[2], B[3]}), (Publisher, {P[4]})
Microsoft (Manual, {M[2]}), (Publisher, {P[5]})

Figure 1.17. Nested-inherited index for path P = Organization.Author.Book.Publisher.

The nested-inherited index, as the nested index and the path index, supports efficient retrieval operations. However, unlike those two organizations, the nested-inherited index does not require object traversals for update operations, because of some additional information that is stored in the index. The format of a non-leaf node has a structure similar to that of traditional indexes based on the B+-tree. The record in a leaf node, called primary record, has a different structure. It contains the following information:

• record-length
• key-length
• key-value
• class-directory
• for each class in the path scope, the number of elements in the list of OIDs for the objects that hold the key-value in the indexed attribute, and the list of OIDs.

The class-directory contains a number of entries equal to the number of classes having instances with the key-value in the indexed attribute. For each such class Ci, an entry in the directory contains:

• the class identifier
• the offset in the primary record where the list of OIDs of Ci instances is stored
• the pointer to an auxiliary record where the list of parents is stored for each instance of Ci.

An auxiliary record is allocated for each class, except for the root class of the path and for its subclasses. An auxiliary record consists of a sequence of 4-tuples. A 4-tuple has the form: (oid_i, pointer to primary record, no-oids, {p-oid_i1, ..., p-oid_ij}). There are as many 4-tuples as the number of instances of Ci having the key-value in the indexed attribute. For an object Oi, the tuple contains the identifier of Oi, the pointer to the primary record, the number of parent objects of Oi, and the list of parent objects. In the 4-tuple definition above, no-oids denotes the number of parent objects, and p-oid_ij denotes the j-th parent of Oi. Auxiliary records are stored in different pages than primary records. Given a primary record, there are several auxiliary records that are connected to it. A second B+-tree is superimposed on the auxiliary records. The second B+-tree indexes the 4-tuples based on the OIDs that appear as the first elements of the 4-tuples. Therefore, the index organization actually consists of two indexes. The first, called the primary index, is keyed on the values of attribute An. It associates with a value v of An the set of OIDs of instances of all classes relative to the path that have v as value of the (nested) attribute. The second index, called the auxiliary index, has OIDs as indexing keys. It associates with the OID of an object O the list of OIDs of the parents of O. Leaf-node records in the primary index contain pointers to the leaf-node records in the auxiliary index, and vice versa. The reason for the auxiliary index is to provide all the information for updating the primary index without accessing the objects themselves. Recall that when updates are executed, the nested index may require object forward and reverse traversals, while the path index only requires forward traversals.
By contrast, the nested-inherited index does not require any access to the objects. The reason for this organization will become clearer when discussing the operations. Figure 1.18 provides an example of the partial index contents for the objects shown in Figure 1.2. The IG for a nested-inherited index contains three sets of arcs. First, because the primary index associates each value of attribute Cn.An with the instances of all classes in the scope of the indexed path, the IG contains arcs from vertex Cn.An to classes Ci, where i = 1, ..., n. Second, it contains arcs from Ci to Cn.An, i = 2, ..., n. Finally, the IG contains arcs from Ci+1 to Ci, i = 1, ..., n - 1. The IG for the path P = Organization.staff.books.publisher.name is shown in Figure 1.19. We now discuss how retrieval, insert, and delete operations are performed on the nested-inherited index. For ease of presentation, we will use examples
Figure 1.18. Example of index contents in a nested-inherited index: a non-leaf node record in the primary B+-tree; the primary record for key value "Addison-Wesley" with its class directory (Organization: O[1]; Author: A[1], A[2], A[4]; Book: B[1], B[4]; Manual: M[1]; Publisher: P[2]); an auxiliary record for class Author; and a non-leaf node record in the auxiliary B+-tree.

Figure 1.19. Indexing graph of the nested-inherited index for path P = Organization.Author.Book.Publisher.name.
to describe the operations. Formal algorithms are presented in [Bertino and Foscoli, 1995].

Retrieval The nested-inherited index supports a fast evaluation of predicates on the indexed attribute for queries having as target any class, or class hierarchy, in the scope of the path ending with the indexed attribute. As an example, consider a query that retrieves the organizations whose staff members have published books with Addison-Wesley. This query is executed by first performing a lookup on the primary index with key value equal to "Addison-Wesley". The primary record is then accessed. A lookup in the class directory is executed to determine the offset where the OIDs of Organization instances are stored. Then those OIDs are fetched and returned as the result of the query. For our query, the result is {O[1]}. We now consider a query that retrieves the books published by Addison-Wesley. The same steps as before are executed. The only difference is that the class-directory lookup is executed for classes Book, Manual, and Handbook. Since the entry for class Handbook is empty, only the record portions for classes Book and Manual are accessed, with offsets obtained from the class-directory. The query result, {B[1], B[4], M[1]}, is generated by merging the lists of OIDs returned for classes Book and Manual. Therefore, the retrieval operation is similar to retrieval in a CH-tree [Kim et al., 1989]. The main difference, however, is that a nested-inherited index can be used for queries on all class hierarchies found along a given path. By contrast, the CH-tree is allocated on a single inheritance hierarchy. Therefore, if a path has length n, the number of CH-trees allocated would be n.

Insert Suppose that a new book B[5] with author A[4] is created, with P[2] as value of attribute "publisher". B[5] is therefore a new parent of P[2].
The overall effect of the insertion in the index must be that B[5] is added to the primary record with key-value equal to "Addison-Wesley", and to the parent list of P[2]. The following steps are executed:

1. The auxiliary index is accessed with key-value equal to P[2].
2. The 4-tuple of P[2] is retrieved and modified by adding B[5] to the list of P[2] parents.
3. From the 4-tuple of P[2] the pointer to the primary record is determined.
4. The primary record is accessed.
5. A look-up is executed on the class directory in the primary record to determine the offset where the OIDs of the class Book are stored.
6. B[5] is added to the list of OIDs stored at the offset determined at the previous step.
7. A 4-tuple for B[5] is inserted in the auxiliary index with {A[4]} as the author list.

Note that there is no need to execute a look-up of the primary index, since the address of the primary record can be directly determined from the auxiliary record.

Delete Suppose now that manual M[1] is removed. The overall effect of this operation on the index must be that M[1] and all instances referencing M[1] (that is, O[1], A[1] and A[4]) be eliminated from the primary record with key-value equal to "Addison-Wesley". Moreover, the 4-tuples for instances M[1], O[1], A[1] and A[4] must be eliminated. Finally, M[1] must be eliminated from the parent list of P[2]. Note that the update to the parent list of P[2] may not be needed if P[2] is removed as well; in this case it may be better to accumulate several delete operations on the same index. However, we will include that update to exemplify the algorithm.

1. The value of attribute "publisher" of M[1] is determined. This value is the OID P[2].
2. The auxiliary index is accessed with key-value equal to P[2].
3. The 4-tuple of P[2] is retrieved and modified by removing M[1] from the list of parents of P[2].
4. From the 4-tuple of P[2] the pointer to the primary record is determined.
5. The primary record is accessed.
6. A look-up is executed on the class-directory in the primary record to determine the offset where the OIDs of the class Manual are stored and the pointer to the auxiliary record for class Manual.
7. M[1] is removed from the list of OIDs stored at the offset determined at the previous step.
8. The auxiliary record of class Manual is accessed and the 4-tuple containing as first element the OID M[1] is determined. From this tuple, the OIDs of the M[1] parents are determined. Those are A[1] and A[4]. Then the 4-tuple of M[1] is removed.
9.
The 4-tuples of A[1] and A[4] are accessed to retrieve the parent lists.
10. A lookup is executed on the class-directory in the primary record to determine the offset where the OIDs of the class Author are stored.
11. A[1] and A[4] are removed from the list of OIDs stored at the offset determined at the previous step.
12. A lookup is executed on the class-directory in the primary record to determine the offset where the OIDs of class Organization are stored.
13. O[1] is removed from the list of OIDs stored at the offset determined at the previous step.

The delete operation may appear rather costly. However, note that the primary record is accessed only once from secondary storage. Several modifications may be required on this record. However, the record can be kept in memory and written back after all modifications have been executed. Also note that the algorithm may require accessing several auxiliary records. However, they are all connected to the same primary record. Therefore, they are likely to be in the same page.

A preliminary comparison among the nested-inherited index and two other organizations has been presented in [Bertino, 1991a, Bertino and Foscoli, 1995]. The first of the two organizations is a multi-index organization and simply consists of allocating an index on each class in the scope of the path. In the example of path P = Organization.staff.books.publisher.name, seven indexes would be allocated. The second organization, called inherited-multi-index, consists of allocating an inherited index on each inheritance hierarchy found along the path. Therefore, the inherited-multi-index is a combination of the CH-tree organization (defined for inheritance hierarchies) with the multi-index organization (defined for the aggregation hierarchy). For the same path P, there would be a CH-tree rooted at class Book (thus indexing Book, Manual and Handbook), and three B+-tree indexes on classes Organization, Author and Publisher.
Major results from the comparison are the following:

• The nested-inherited index has the best retrieval performance.
• The nested-inherited index has quite good performance for the insert operation, since it requires an additional cost of at most three I/O operations with respect to the other two organizations.
• The delete operation for the nested-inherited index has in the worst case an additional cost of 4 × i (where i is the position of the class in the path) with respect to the other organizations.

An accurate model of those costs has been recently developed in [Bertino and Foscoli, 1995].
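The interplay of primary and auxiliary records in the insert operation described earlier can be sketched with plain dictionaries standing in for the two B+-trees (illustrative Python; the record layout is greatly simplified and the names are ours, not from [Bertino and Foscoli, 1995]):

```python
# Simplified model of a nested-inherited index for key "Addison-Wesley".
# primary: key value -> class directory (class -> list of OIDs)
primary = {
    "Addison-Wesley": {"Organization": ["O[1]"],
                       "Author": ["A[1]", "A[2]", "A[4]"],
                       "Book": ["B[1]", "B[4]"],
                       "Manual": ["M[1]"],
                       "Publisher": ["P[2]"]},
}
# auxiliary: OID -> (key of its primary record, list of parent OIDs)
auxiliary = {
    "P[2]": ("Addison-Wesley", ["B[1]", "B[4]", "M[1]"]),
}

def insert(new_oid, cls, referenced_oid, parents):
    """Insert without touching the objects: the auxiliary entry of the
    referenced object leads straight to the right primary record."""
    key, parent_list = auxiliary[referenced_oid]      # steps 1-3
    parent_list.append(new_oid)                       # new parent of P[2]
    primary[key].setdefault(cls, []).append(new_oid)  # steps 4-6
    auxiliary[new_oid] = (key, list(parents))         # step 7

insert("B[5]", "Book", "P[2]", ["A[4]"])
print(primary["Addison-Wesley"]["Book"])
# ['B[1]', 'B[4]', 'B[5]']
```

The point of the sketch is that no object is dereferenced: the auxiliary entry of P[2] already stores both its parent list and the location of the primary record.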
The nested-inherited index does not support any customization with respect to the operation profile (see Subsection 1.2.1). Nevertheless, it may be successfully used in the path-splitting approach, together with other basic techniques, as an index allocated on some subpath which contains one or more inheritance hierarchies.

1.5 Caching and pointer swizzling

The indexing techniques we have discussed so far are based on object structures, that is, on object attributes. Another possibility is to provide indexing based on object behavior, that is, on method results [Bretl et al., 1989]. Techniques based on this approach have been proposed in [Bertino, 1991b, Bertino and Quarati, 1991, Jhingran, 1991, Kemper et al., 1994]. Most techniques are based on precomputing or caching the results of method invocations. Moreover, precomputed results can be stored in an index, or other access structures, so that it is possible to efficiently evaluate queries containing the invocation of the method. A major issue of this approach is how to detect when the computed method results are no longer valid. In most approaches some dependency information is kept. This dependency information keeps track of which objects (and possibly which attributes of each object) have been used to compute a given method. When an object is modified, all precomputed method results that have used that object are invalidated. Different solutions to the problem of dependencies can be devised, depending also on the characteristics of the method. In the approach proposed in [Kemper et al., 1994], a special structure (implemented as a relation) keeps track of these dependencies. A dependency records the fact that the object whose identifier is oidj has been used in computing the method of name method_name with input parameters <oid1, oid2, ..., oidk>.
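The dependency bookkeeping described above can be sketched with a relation-like table (illustrative Python; the method name and cached value are invented for the example, and real systems would store the table persistently):

```python
# Relation-like dependency table: each row records that used_oid was read
# while computing method_name on the given input parameters.
dependencies = [
    ("B[1]", "price_with_tax", ("B[1]",)),   # hypothetical method
    ("P[2]", "price_with_tax", ("B[1]",)),   # it also read P[2]
]
# Precomputed method results, keyed by (method, input parameters).
cache = {("price_with_tax", ("B[1]",)): 42.0}

def invalidate(modified_oid):
    """Drop every cached method result computed from modified_oid."""
    for used_oid, method, params in dependencies:
        if used_oid == modified_oid:
            cache.pop((method, params), None)

invalidate("P[2]")
print(cache)
# {} -- the cached result had read P[2], so it is no longer valid
```

Modifying an object unrelated to any dependency row would leave the cache untouched; only results whose computation actually read the modified object are discarded.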
Note that the input parameters also include the identifier of the object to which the message invoking the method has been sent. A more sophisticated approach has been proposed in [Bertino and Quarati, 1991]. If a method is local, that is, uses only the attributes of the object upon which it has been invoked, all dependencies are kept within the object itself. Those dependencies are coded as bit-strings; therefore they require a minimal space overhead. If a method is not local, that is, uses attributes of other objects, all dependencies are stored in a special object. All objects whose attributes have been used in the precomputation of a method have a reference to this special object. This approach is similar to the one proposed in [Kemper
et al., 1994]. The main difference is that in the approach proposed by Bertino and Quarati, dependencies are not stored in a single data structure; rather, they are distributed among several "special objects". The main advantage of this approach is that it provides greater flexibility with respect to object allocation and clustering. For example, a "special object" may be clustered together with one of the objects used in the precomputation of the method, depending on the expected update frequencies. To further reduce the need for invalidation, it is important to determine the actual attributes used in the precomputation of a method. As noted in [Kemper et al., 1994], not all attributes are used in executing all methods. Rather, each method is likely to require a small fraction of an object's attributes. Two basic approaches can be devised exploiting this observation. The first approach is called static and is based on inspecting the method implementation. Therefore, for each method the system keeps the list of attributes used in the method. In this way, when an attribute is modified, the system has only to invalidate a method if the method uses the modified attribute. Note, however, that an inspection of method implementations actually determines all attributes that can possibly be used when the method is executed. Depending on the method execution flow, some attributes may never be used in computing a method on a given object. This problem is solved by the dynamic approach. Under this approach, the attributes used by a method are actually determined only when the method is precomputed. Upon precomputation of the method, the system keeps track of all attributes actually accessed during the method execution. Therefore, the same method precomputed on different objects may use different sets of attributes for each one of these objects.
Performance studies of method precomputation have been carried out in [Jhingran, 1991, Kemper et al., 1994]. Besides caching and precomputing, a closely related class of techniques, commonly referred to as "pointer swizzling" [Kemper and Kossmann, 1995, Moss, 1992], was investigated for managing references among main-memory resident persistent objects. Pointer swizzling is a technique to optimize accesses through such references to objects residing in main memory. Generally, each time an object is referenced through its OID, the system has to determine whether the object is already in main memory by performing a table lookup. If the object is not already in main memory, it must be loaded from secondary storage. The basic idea of pointer swizzling is to materialize the address of a main-memory resident persistent object in order to avoid the table lookup. Thus, pointer swizzling converts database objects from an external (persistent) format containing OIDs into an internal (main-memory) format replacing the OIDs by the main-memory addresses of the referenced objects. Though the choice of a specific swizzling strategy is strongly influenced by the characteristics of the underlying object lookup mechanism, a systematic classification of pointer swizzling techniques, quite independent of system characteristics, has been developed [Moss, 1992]. Later, this classification was extended, and a new dimension of swizzling techniques, concerning whether swizzled objects can be replaced from the main-memory buffer, was proposed [Kemper and Kossmann, 1995].

1.6 Summary

In this chapter, we have discussed a number of indexing techniques specifically tailored for object-oriented databases. We have first presented indexing techniques supporting an efficient evaluation of implicit joins among objects. Several techniques have been developed. None of them, however, is optimal with respect to both retrieval and update costs. Techniques providing lower retrieval costs, such as path indexes or access relations, have greater update costs compared to techniques, such as the multi-index, which however have greater retrieval costs. Then we have discussed indexing techniques for inheritance hierarchies. Finally, we have presented an indexing technique that provides integrated support for queries on both aggregation and inheritance hierarchies [Bertino and Foscoli, 1995]. Overall, an open problem is to determine how all those indexing techniques perform for different types of queries. Studies along that direction have been carried out in [Bertino, 1990, Kemper and Moerkotte, 1992, Valduriez, 1986]. Similar studies should be undertaken for all the other techniques. Another open problem concerns optimal index allocation. In the chapter we have also briefly discussed techniques for an efficient execution of queries containing method invocations. This is an interesting problem that is peculiar to object-oriented databases (and in general, to DBMSs supporting procedures or functions as part of the data model).
However, few solutions have been proposed so far and there is, moreover, the need for comprehensive analytical models.

Notes

1. Note that in GemStone, unlike other OODBMSs, attributes need not necessarily have a domain.
2. For the sake of homogeneity, we will denote the class domain Cn.An as class Cn+1.
3. A set containing class C itself and all classes in the inheritance hierarchy rooted at C is denoted as C'.
4. Note that if a class occurs at several points in a path, the class has a set of positions.
2 SPATIAL DATABASES

Many applications (such as computer-aided design (CAD), geographic information systems (GIS), computational geometry and computer vision) operate on spatial data. Generally speaking, spatial data are associated with spatial coordinates and extents, and include points, lines, polygons and volumetric objects. While it appears that spatial data can be modeled as a record with multiple attributes (each corresponding to a dimension of the spatial data), conventional database systems are unable to support spatial data processing effectively. First, spatial data are large in quantity, complex in structures and relationships, and often represent non-zero sized objects. Take GIS, a popular type of spatial database system, as an example. In such a system, the database is a collection of data objects over a particular multi-dimensional space. The spatial description of objects is typically extensive, ranging from a few hundred bytes in land information system (commonly known as LIS) applications to megabytes in natural resource applications. Moreover, the number of data objects ranges from tens of thousands to millions. Second, the retrieval process is typically based on spatial proximity, and employs complex spatial operators like intersection, adjacency, and containment.
E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
Such spatial operators are much more expensive to compute than the conventional relational join and select operators. This is due to the irregularity in the shape of spatial objects. For example, consider the intersection of two polyhedra. Besides the need to test all points of one polyhedron against the other, the result of the operation is not always a polyhedron but may sometimes consist of a set of polyhedra. Third, it is difficult to define a spatial ordering for spatial objects. The consequence of this is that conventional techniques (such as sort-merge techniques) that exploit ordering can no longer be employed for spatial operations. Efficient processing of queries manipulating spatial relationships relies upon auxiliary indexing structures. Due to the volume of the set of spatial data objects, it is highly inefficient to precompute and store spatial relationships among all the data objects (although there are some proposals that store precomputed spatial relationships [Lu and Han, 1992, Rotem, 1991]). Instead, spatial relationships are materialized dynamically during query processing. In order to find spatial objects efficiently based on proximity, it is essential to have an index over spatial locations. The underlying data structure must support efficient spatial operations, such as locating the neighbors of an object and identifying objects in a defined query region. In this chapter, we review some of the more promising spatial data structures that have been proposed in the literature. In particular, we focus on indexing structures designed for non-zero sized objects. The review of these indexes is organized in two steps: first, the structures are described; second, their strengths and weaknesses are highlighted. The readers are referred to [Nievergelt and Widmayer, 1997, Ooi et al., 1993] for a comprehensive survey on spatial indexing structures.
The rest of this chapter is organized as follows. In Section 2.1, we briefly discuss various issues related to spatial query processing. Section 2.2 presents a taxonomy of spatial indexing structures. In Section 2.3 to Section 2.6, we present representative indexing techniques that are based on binary tree structures, B-tree structures, hashing and space-filling techniques. Section 2.7 discusses the issues in evaluating the performance of spatial indexes and reviews the approaches adopted in the literature. Finally, we summarize in Section 2.8.

2.1 Query processing using approximations

Spatial data such as objects in spatial database systems, and roads and lakes in GIS, do not conform to any fixed shape. Furthermore, it is expensive to perform spatial operations (for example, intersection and containment) on their exact location and extent. Thus, some simpler structure (such as a bounding rectangle) that approximates the objects is usually coupled with a spatial index. Such bounding structures allow efficient proximity query processing by preserving the spatial identification and dynamically eliminating many potential tests efficiently. Consider the intersection operation. If two objects intersect, then their bounding structures intersect. Conversely, if the bounding structures of two objects are disjoint, then the two objects do not intersect. This property reduces the testing cost, since the test on the intersection of two polygons, or of a polygon and a sequence of line segments, is much more expensive than the test on the intersection of two bounding structures. By far, the most commonly used approximation is the container approach. In the container approach, the minimum bounding rectangle/circle (box/sphere), that is, the smallest rectangle/circle (box/sphere) that encloses the object, is used to represent the object, and only when the test on the container succeeds is the actual object examined. The bounding box (rectangle) is used throughout this chapter as the approximation technique for discussion purposes. A k-dimensional bounding box can be easily defined as a single-dimensional array of k entries: (I0, I1, ..., Ik-1), where Ii is a closed bounded interval [a, b] describing the extent of the spatial object along dimension i. Alternatively, the bounding box of an object can be represented by its centroid and its extensions in each of the k directions. Objects extended diagonally may be badly approximated by bounding boxes, and false matches may result. A false match occurs when the bounding boxes match but the actual objects do not match. If the approximation technique is very inefficient, yielding very rough approximations, additional page accesses will be incurred. More effective approximation methods include the convex hull [Preparata and Shamos, 1985] and the minimum bounding m-corner.
The covering polygons produced by these two methods are, however, not axis-parallel and hence incur more expensive testing. The construction cost of the approximations and the storage requirement are higher too. Decomposition of regions into convex cells has been proposed to improve object approximation [Gunther, 1988]. Likewise, an object may be approximated by a set of smaller rectangles/boxes. In the quad-tree tessellation approach [Abel and Smith, 1984], an object is decomposed into multiple sub-objects based on the quad-tree quadrants that contain them. The decomposition has the problem of having to store the object identity in multiple locations in an index. The problems of the redundancy of object identifiers and the cost of object reconstruction can be very severe if the decomposition process is not carefully controlled. They can be controlled to a certain extent by limiting the number of elements generated or by limiting the accuracy of the decomposition [Orenstein, 1990]. The object approximation and the spatial indexes supporting such concepts are used to eliminate objects that could not possibly contribute to the answer of
queries. This results in a multi-step spatial query processing strategy [Brinkhoff et al., 1994]:

1. The indexing structure is used to prune the search space to a set of candidate objects. This set is usually a superset of the answer.

2. Based on the approximations of the candidate objects, some of the false hits can be further filtered away. The effectiveness of this step depends on the approximation techniques.

3. Finally, the actual objects are examined to identify those that match the query.

Clearly, the multi-step strategy can effectively reduce the number of pages accessed and the amount of redundant data to be fetched and tested through the index mechanism, and reduce the computation time through the approximation mechanism. The commonly used conventional key-based range (associative) search, which retrieves all the data falling within the range of two specified values, is generalized to an intersection search. In other words, given a query region, the search finds all objects that intersect it. The intersection search can be easily used to implement point search and containment search. For point search, the query region is a point, and the search finds all objects that contain it. Containment search is a search for all objects that are strictly contained in a given query region, and it can be implemented by ignoring objects that fail such a condition in an intersection search. The search operation supported by an index can be used to facilitate a spatial selection or spatial join operation. While a spatial selection retrieves all objects of the same entity based on a spatial predicate, a spatial join is an operation that relates objects of two different entities based on a spatial predicate.
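The three steps above can be sketched in code. This is a toy, self-contained Python illustration, not from the book: the "index" is a deliberately crude grid-cell match, the approximations are MBRs, and all identifiers are invented.

```python
# Sketch of the multi-step (filter-and-refine) strategy: a coarse grid index
# yields candidates, MBRs filter some false hits, and only the survivors
# undergo the exact geometry test.

def intersects(a, b):
    """Axis-parallel rectangle intersection; rectangles are (x1, y1, x2, y2)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cell_of(box, size=10):
    """Grid cell of a box's lower-left corner (a deliberate simplification)."""
    return (box[0] // size, box[1] // size)

objects = {  # oid -> (MBR, exact vertex list)
    1: ((1, 1, 4, 3), [(1, 1), (4, 1), (4, 3)]),
    2: ((2, 2, 9, 6), [(2, 2), (9, 2), (6, 6)]),
    3: ((12, 12, 15, 14), [(12, 12), (15, 14)]),
    4: ((5, 5, 9, 9), [(5, 9), (9, 5), (9, 9)]),
}
query = (5, 5, 8, 8)

# Step 1: the index prunes the search space to candidates (a superset).
candidates = [oid for oid, (box, _) in objects.items()
              if cell_of(box) == cell_of(query) or intersects(box, query)]
# Step 2: the MBR approximations filter away some false hits.
survivors = [oid for oid in candidates if intersects(objects[oid][0], query)]
# Step 3: the exact objects are examined (here: any vertex inside the query).
answer = [oid for oid in survivors
          if any(query[0] <= x <= query[2] and query[1] <= y <= query[3]
                 for x, y in objects[oid][1])]

assert candidates == [1, 2, 4]   # object 1 is a step-1 false hit
assert survivors == [2, 4]       # object 4's MBR intersects, but the object does not
assert answer == [2]
```

Note how each step discards a false hit the previous, cheaper step let through, which is exactly why the strategy saves both page accesses and computation.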
2.2 A taxonomy of spatial indexes

Various types of data structures, such as B-trees [Bayer and McCreight, 1972, Comer, 1979], ISAM indexes, hashing and binary trees [Knuth, 1973], have been used as a means for efficient access, insertion and deletion of data in large databases. All these techniques are designed for indexing data based on primary keys. To use them for indexing data based on secondary keys, inverted indexes are introduced. However, this technique is not adequate for a database where range searching on secondary keys is a common operation. For this type of application, multi-dimensional structures, such as grid-files [Nievergelt et al., 1984], multi-dimensional B-trees [Kriegel, 1984, Ouksel and Scheuermann, 1981, Scheuermann and Ouksel, 1982], kd-trees [Bentley, 1975] and
quad-trees [Finkel and Bentley, 1974] were proposed to index multi-attribute data. Such indexing structures are known as point indexing structures, as they are designed to index data objects which are points in a multi-dimensional space. Spatial search is similar to non-spatial multi-key search in that coordinates may be mapped onto key attributes and the key values of each object represent a point in a k-dimensional space. However, spatial objects often cover irregular areas in multi-dimensional spaces and thus cannot be solely represented by point locations. Although techniques such as mapping regular regions to points in higher-dimensional spaces enable point indexing structures to index regions, such representations do not help support spatial operators such as intersection and containment. Based on existing classification techniques [Lomet, 1992, Seeger and Kriegel, 1988], the techniques used for adapting existing indexes into spatial indexes can be generally classified as follows:

The transformation approach. There are two categories of transformation approach:

• Parameter space indexing. Objects with n vertices in a k-dimensional space are mapped into points in an nk-dimensional space. For example, a two-dimensional rectangle described by the bottom left corner (x1, y1) and the top right corner (x2, y2) is represented as a point in a four-dimensional space, where each attribute is taken from a different dimension. After the transformation, points can be stored directly in existing point indexes. An advantage of such an approach is that there is no major alteration of the multi-dimensional base structure. The problem with the mapping scheme is that the spatial proximity between the k-dimensional objects may no longer be preserved when they are represented as points in an nk-dimensional space. Consequently, intersection search can be inefficient. Also, the complexity of the insertion operation typically increases with higher dimensionality.
• Mapping to single attribute space. The data space is partitioned into grid cells of the same size, which are then numbered according to some curve-filling method. A spatial object is then represented by a set of numbers or one-dimensional objects. These one-dimensional objects can be indexed using conventional indexes such as B+-trees.

The non-overlapping native space indexing approach. This category comprises two classes of techniques:
• Object duplication. A k-dimensional data space is partitioned into pairwise disjoint subspaces. These subspaces are then indexed. An object identifier is duplicated and stored in all the subspaces it intersects.

• Object clipping. This technique is similar to the object duplication approach. Instead of duplicating the identifier, an object is decomposed into several disjoint smaller objects so that each smaller sub-object is totally included in a subspace.

The most important property of object duplication or clipping is that the data structures used are straightforward extensions of the underlying point indexing structures. Also, both points and multi-dimensional non-zero sized objects can be stored together in one file without having to modify the structure. However, an obvious drawback is the duplication of objects, which requires extra storage and hence more expensive insertion and deletion procedures. Another limitation is that the density (the number of objects that contain a point) in a map space must be less than the page capacity (the maximum number of objects that can be stored in a page).

The overlapping native space indexing approach. The basic idea of this approach to indexing a spatial database is to hierarchically partition its data space into a manageable number of smaller subspaces. While a point object is totally included in an unpartitioned subspace, a non-zero sized object may extend over more than one subspace. Rather than supporting disjoint subspaces as in the non-overlapping space indexing approach, the overlapping native space indexing approach allows overlapping subspaces such that objects are totally included in only one of the subspaces. These subspaces are organized as a hierarchical index and spatial objects are indexed in their native space.
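The two non-overlapping techniques, duplication and clipping, can be sketched on a uniform grid partition. This is a simplified Python stand-in for the pairwise disjoint subspaces; all function names and the integer boundary handling are illustrative, not from the book.

```python
# Non-overlapping native space indexing on a uniform grid: duplication stores
# the object id in every intersecting cell; clipping stores the clipped pieces,
# each totally included in one cell.

def cells_covered(box, size):
    """All grid cells of side `size` that an integer box (x1, y1, x2, y2) intersects."""
    x1, y1, x2, y2 = box
    return [(cx, cy)
            for cx in range(x1 // size, x2 // size + 1)
            for cy in range(y1 // size, y2 // size + 1)]

def duplicate(oid, box, size, index):
    for cell in cells_covered(box, size):
        index.setdefault(cell, []).append(oid)          # id duplicated per cell

def clip(oid, box, size, index):
    x1, y1, x2, y2 = box
    for cx, cy in cells_covered(box, size):
        piece = (max(x1, cx * size), max(y1, cy * size),
                 min(x2, (cx + 1) * size - 1), min(y2, (cy + 1) * size - 1))
        index.setdefault((cx, cy), []).append((oid, piece))  # disjoint sub-objects

idx = {}
duplicate(7, (8, 8, 12, 12), 10, idx)    # a box straddling four 10x10 cells
assert sorted(idx) == [(0, 0), (0, 1), (1, 0), (1, 1)]   # id stored four times

pieces = {}
clip(7, (8, 8, 12, 12), 10, pieces)
assert pieces[(0, 0)] == [(7, (8, 8, 9, 9))]   # sub-object clipped to the cell
```

The extra storage and the duplicated insertions/deletions mentioned above are visible directly: one object produced four index entries in both variants.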
A major design criterion for indexes using the overlapping native space indexing approach is the minimization of both the overlap between bounding subspaces and the coverage of subspaces. A poorly designed partitioning strategy may lead to unnecessary traversal of multiple paths. Further, dynamic maintenance of effective bounding subspaces incurs high overhead during updates. A number of indexing structures use more than one extending technique. Since each extending method has its own weaknesses, the combination of two or more methods may help to compensate for each other's weaknesses. However, an often overlooked fact is that the use of more than one extending method may also produce a counter effect: inheriting the weaknesses of each method. Figure 2.1 shows the evolution of spatial indexing structures, adapted from [Lu and Ooi, 1993]. A solid arrow indicates a relationship between a new structure and the original structures that it is based upon. A dashed arrow
[Figure 2.1. Evolution of spatial index structures: a timeline (1984-1996) of structures such as the LSD-tree, GBD-tree, TV-tree, EXCELL, grid files, quad-tree based location keys, DOT, the X-tree and the filter-tree, derived from binary trees, B-trees and hashing.]
indicates a relationship between a new structure and the structures from which the techniques used in the new structure originated, even though some were proposed independently of the others. In the diagram, and also in the subsequent sections, the indexes are classified into four groups based on their base structures: namely, binary trees, B-trees, hashing, and space-filling methods. Most spatial indexing structures (such as R-trees, R*-trees and skd-trees) are nondeterministic in that different sequences of insertions result in different tree structures, and hence different performance, even though they have the same set of data. The insertion algorithm must be dynamic so that the performance of an index will not be dependent on the sequence of data insertion. During the design of a spatial index, the issues that need to be minimized are:

• The area of the covering rectangles maintained in internal nodes.

• The overlaps between covering rectangles, for indexes developed based on the overlapping native space indexing approach.

• The number of objects being duplicated, for indexes developed based on the non-overlapping native space indexing approach.

• The directory size and its height.

There is no straightforward solution that fulfills all the above conditions. The fulfillment of the above conditions by an index can generally ensure its efficiency, but this may not be true for all applications. The design of an index needs to take computational complexity into consideration as well, although this is a less dominant factor considering the increasing computational power of today's systems. Other factors that affect the performance of information retrieval as a whole include buffer design, buffer replacement strategies, space allocation on disks, and concurrency control methods.

2.3 Binary-tree based indexing
techniques

The binary search tree is a basic data structure for representing data items whose index values are ordered by some linear order. The idea of repetitively partitioning a data space has been adopted and generalized in many sophisticated indexes. In this section, we will examine spatial indexes that originated from the basic structure and concept of binary search trees.

2.3.1 The kd-tree

The kd-tree [Bentley, 1975], a k-dimensional binary search tree, was proposed by Bentley to index multi-attribute data. A node in the tree (see Figure 2.2) serves two purposes: representation of an actual data point and direction of a
search. A discriminator, whose value is between 0 and k-1 inclusive, is used to indicate the key on which the branching decision depends. A node P has two children, a left son LOSON(P) and a right son HISON(P). If the discriminator value of node P is the jth attribute (key), then the jth attribute of any node in LOSON(P) is less than the jth attribute of node P, and the jth attribute of any node in HISON(P) is greater than or equal to that of node P. This property enables the range along each dimension to be defined during a tree traversal such that the ranges are smaller in the lower levels of the tree.

[Figure 2.2. The organization of data in a kd-tree: (a) the planar representation; (b) the structure of a kd-tree, with discriminators alternating between the x-axis and the y-axis.]

Complications arise when an internal node is deleted. When an internal node is deleted, say Q, one of the nodes in the subtree whose root is Q must be obtained to replace Q. Suppose i is the discriminator of node Q; then the replacement must be either a node in the right subtree with the smallest ith attribute value in that subtree, or a node in the left subtree with the biggest ith attribute value. The replacement of a node may also cause successive replacements. To reduce the cost of deletion, a non-homogeneous kd-tree [Bentley, 1979b] was proposed. Unlike a homogeneous index, a non-homogeneous index does not store data in the internal nodes, and its internal nodes are used merely as a directory. When splitting an internal node, instead of selecting a data point, the non-homogeneous kd-tree selects an arbitrary hyperplane (a line for the two-dimensional space) to partition the data points into two groups having almost the same number of data points; all data points reside in the leaf nodes.
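The homogeneous kd-tree just described can be sketched in a few lines of Python. This is a minimal illustration, not the book's code; the sample points are invented, and the discriminator simply cycles through the dimensions level by level.

```python
# A minimal homogeneous kd-tree: each node stores a data point, and the
# discriminator indicates the dimension on which branching depends.

class Node:
    def __init__(self, point, disc):
        self.point, self.disc = point, disc
        self.loson = self.hison = None     # LOSON(P) and HISON(P)

def insert(root, point, k=2):
    if root is None:
        return Node(point, 0)
    node = root
    while True:
        # keys < discriminator value go to LOSON, keys >= go to HISON
        side = 'loson' if point[node.disc] < node.point[node.disc] else 'hison'
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(point, (node.disc + 1) % k))
            return root
        node = child

def range_search(node, lo, hi, out):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if node is None:
        return out
    d = node.disc
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    if lo[d] < node.point[d]:              # the query range reaches into LOSON
        range_search(node.loson, lo, hi, out)
    if hi[d] >= node.point[d]:             # ... and/or into HISON
        range_search(node.hison, lo, hi, out)
    return out

root = None
for p in [(40, 60), (10, 75), (30, 90), (80, 45), (25, 15), (70, 20)]:
    root = insert(root, p)
assert sorted(range_search(root, (0, 0), (50, 80), [])) == [(10, 75), (25, 15), (40, 60)]
```

The pruning in `range_search` is exactly the property stated above: the discriminator bounds the range along one dimension at each level, so whole subtrees outside the query range are never visited.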
The kd-tree has been the subject of intensive research over the past decade [Banerjee and Kim, 1986, Beckley et al., 1985a, Beckley et al., 1985b, Beckley
et al., 1985c, Bentley and Friedman, 1979, Bentley, 1979a, Chang and Fu, 1979, Eastman and Zemankova, 1982, Friedman et al., 1987, Lee and Wong, 1977, Matsuyama et al., 1984, Ohsawa and Sakauchi, 1983, Orenstein, 1982, Overmars and Leeuwen, 1982, Robinson, 1981, Rosenberg, 1985, Shamos and Bentley, 1978, Sharma and Rani, 1985]. Many variants have been proposed in the literature to improve its performance with respect to issues such as clustering, searching, storage efficiency and balancing.

2.3.2 The K-D-B-tree

To improve the paging capability of the kd-tree, the K-D-B-tree was proposed [Robinson, 1981]. The K-D-B-tree is essentially a combination of a kd-tree and a B-tree [Bayer and McCreight, 1972, Comer, 1979], and consists of two basic structures: region pages and point pages (see Figure 2.3). While point pages contain object identifiers, region pages store the descriptions of the subspaces in which the data points are stored and the pointers to descendant pages. Note that in a non-homogeneous kd-tree [Bentley, 1979b], a space is associated with each node: a global space for the root node, and an unpartitioned subspace for each leaf node. In the K-D-B-tree, these subspaces are explicitly stored in a region page. These subspaces (for example, S11, S12 and S13) are pairwise disjoint and together they span the rectangular subspace of the current region page (for example, S1), a subspace in the parent region page. During insertion of a new point into a full point page, a split will occur. The point page is split such that the two resultant point pages will contain almost the same number of data points. Note that a split of a point page requires an extra entry for the new point page; this entry will be inserted into the parent region page.
Therefore, the split of a point page may cause the parent region page to split as well, which may further ripple all the way up to the root; thus the tree is always perfectly height-balanced. When a region page is split, the entries are partitioned into two groups such that both have almost the same number of entries. A hyperplane is used to split the space of a region page into two subspaces, and this hyperplane may cut across the subspaces of some entries. Consequently, the subspaces that intersect the splitting hyperplane must also be split so that the new subspaces are totally contained in the resultant region pages. Therefore, the split may propagate downward as well. If the constraint of splitting a region page into two region pages containing about the same number of entries is not enforced, then downward propagation of the split may be avoided. The dimension for splitting and the splitting point are chosen such that both resultant pages have almost the same number of entries and the number of splits is minimized. However, there is no discussion on the selection of splitting points.
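A point-page split in this spirit can be sketched as follows. This toy Python heuristic only balances the two resulting pages; it deliberately ignores the second criterion (minimizing forced splits of subregions), whose selection rule the original paper leaves open, and all names are illustrative.

```python
# Sketch of a K-D-B-style point-page split: try each dimension, split at the
# median value, and keep the dimension giving the most balanced halves.

def split_point_page(points, k=2):
    best = None
    for d in range(k):
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        low = [p for p in points if p[d] < median]      # new "low" point page
        high = [p for p in points if p[d] >= median]    # new "high" point page
        imbalance = abs(len(low) - len(high))
        if best is None or imbalance < best[0]:
            best = (imbalance, d, median, low, high)
    _, d, median, low, high = best
    return d, median, low, high

d, median, low, high = split_point_page(
    [(1, 9), (2, 4), (5, 6), (7, 1), (8, 8), (9, 3)])
assert abs(len(low) - len(high)) <= 1 and len(low) + len(high) == 6
```

A full implementation would also insert the entry for the new page into the parent region page, possibly triggering the upward (and, for region pages, downward) split propagation described above.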
[Figure 2.3. The K-D-B-tree structure: (a) the planar partition (subspaces such as S1, S2, S11, S21 and S22); (b) a hierarchical K-D-B-tree structure.]

The upward propagation of a split will not cause the underflow of pages, but the downward propagation is detrimental to storage efficiency, because a page may contain less than the usual page threshold, typically half of the page capacity. To avoid unacceptably low storage utilization, local reorganization can be performed. For example, two or more pages whose data space forms a rectangular space and which have the same parent can be merged, followed by a resplit if the resultant page overflows. The K-D-B-tree has incorporated the pagination of the B-tree and the tree is height-balanced as a result. Nevertheless, poorer storage efficiency is the trade-off.

2.3.3 The hB-tree

In the K-D-B-tree, a region node is split by cutting the region with a plane, possibly cutting through some subregions as well. The child nodes whose space is cut must also invoke the splitting process, causing sparse nodes at lower levels. To overcome such a problem, a new multi-attribute index structure called the holey brick B-tree (the hB-tree) [Lomet and Salzberg, 1990a] allows the data space to be holey, enabling the removal of any data subspace from a
data space. The concept of holey bricks is not new: it has been used to improve the clustering of data in a variant of the kd-tree known as the BD-tree [Ohsawa and Sakauchi, 1983]. The hB-tree structure is based on the K-D-B-tree structure and hence preserves the height-balanced property. However, it allows the data space associated with a node to be non-rectangular, and it uses kd-trees for space representation in its internal nodes. In an hB-tree, the leaf nodes are known as data nodes and the internal nodes as index nodes. The data space of an index node is the union of its child node subspaces, which are obtained through recursive kd-tree partitioning.

[Figure 2.4. The hB-tree structure: (a) the internal structure of an hB-tree index node; (b) the resultant pages after a split.]

A k-dimensional data space represented by its boundaries requires 2k coordinates. To obtain a data space of interest to the search, half of the data subspaces in a node have to be searched on average, and for each data space, 2k comparisons are required. For m data spaces, we therefore need on average m·k comparisons. The m data subspaces derived through recursive kd-tree partitioning can be represented by a kd-tree with m - 1 kd-tree nodes. This requires one comparison at each internal node and 2k comparisons for the unpartitioned subspace. The average number of comparisons is much smaller than that of the boundary representation. The use of kd-trees therefore reduces the search time as well as the storage space requirement. Like conventional kd-trees, the internal nodes of the kd-tree structure in an hB-tree index node partition the search space recursively. Its leaf nodes reference
some index nodes of the hB-tree. However, multiple leaves of a kd-tree structure may refer to the same hB-tree index node (see Figure 2.4a), giving rise to the "holey brick" representation. As such, the hB-tree is not truly a tree. During a split, the kd-tree is split into two subtrees, with each having between 1/3 and 2/3 of the nodes. In order to achieve this, a subtree may have to be extracted from the original tree structure. This causes duplication of a portion of the tree close to the root in the parent index node. A leaf node of such a kd-tree references either an hB-tree data node, an index node, or a marker (ext in Figure 2.4b) indicating that a subtree has previously been extracted and is referenced from a higher-level index node. The deletion algorithm is not addressed in the paper. The hB-tree overcomes the problem of sparse nodes in the K-D-B-tree. However, this is achieved at the expense of more expensive node splitting and node deletion. The multiple references to an hB-tree node may cause a path to be traversed more than once. Of course, this can be avoided by checking the list of traversed hB-tree nodes. Deletion may result in the kd-tree being collapsed to remove the duplicated portion of kd-trees, followed by a resplit if necessary.

2.3.4 The skd-tree

Ooi et al. [Ooi et al., 1987, Ooi et al., 1991] developed an indexing structure called the spatial kd-tree (the skd-tree) in an attempt to avoid object duplication and object mapping. At each node of a kd-tree, a value (the discriminator value) is chosen in one of the dimensions to partition a k-dimensional space into two subspaces. The two resultant subspaces, HISON and LOSON, normally have almost the same number of data objects. Point objects are totally included in one of the two resultant subspaces, but non-zero sized objects may extend over into the other subspace.
To avoid the division of objects into, and the duplication of identifiers in, several subspaces, and yet to be able to retrieve all the wanted objects, a virtual subspace for each original subspace was introduced such that all objects are totally included in one of the two virtual subspaces [Ooi et al., 1987]. With this method, the placement of an object in a subspace is based solely upon the value of its centroid. Since a space is always divided into two, an additional value for each subspace is required: the maximum of the objects in the LOSON subspace (maxLOSON), and the minimum of the objects in the HISON subspace (minHISON), along the dimension defined by the discriminator. Thus, the structure of an internal node of the skd-tree consists of two child pointers, a discriminator (0 to k-1 for a k-dimensional space), a discriminator-value, and maxLOSON and minHISON along the dimension specified by the discriminator. The maximum range value
of LOSON (maxLOSON) is the nearest virtual line that bounds the data objects whose centroids are in the LOSON subspace, and the minimum range value of HISON (minHISON) is the nearest virtual line that bounds the data objects whose centroids are in the HISON subspace. Leaf nodes contain min-range and max-range (in place of maxLOSON and minHISON of an internal node), describing the minimum and maximum values of objects in the data page along the dimension specified by bound, and a pointer to the secondary page which contains the object bounding rectangles and identifiers. The minimum and maximum values could be kept for all k dimensions. However, for storage efficiency, the range along the one dimension that results in the smallest bounding rectangle is chosen. It has been shown [Ooi, 1990] that keeping the ranges along all k dimensions increases the height of the tree when it is stored as a multiway tree, and hence the improvement is fairly marginal. Figure 2.5 shows the structure of a two-dimensional skd-tree and illustrates the virtual boundary (dotted line), minHISON or maxLOSON, of each resultant subspace. An implicit rectangular space is associated with each node and is materialized during traversal. This rectangle is tested against the query region, and the subtree is examined if they intersect. Since the virtual boundary may sometimes bound the objects more tightly than the partitioning line, the intersection search takes advantage of the existing virtual boundary to prune the search space efficiently. To further exploit the virtual boundaries, a containment search, which retrieves all spatial objects contained in a given query rectangle, was proposed. During tree traversal, the algorithm always selects the boundaries that yield the smaller search space. The direct support of containment search is useful to operators like within and contain.
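A minimal sketch of an skd-tree internal node and the intersection search that prunes with the virtual boundaries may help fix the idea. The layout and names below are illustrative, not the authors' code; rectangles are assumed to be (lo, hi) corner-tuple pairs.

```python
# Illustrative sketch of an skd-tree node (assumed representation, not the
# original implementation). Internal nodes hold a discriminator dimension,
# the partitioning value, and the two virtual boundaries maxLOSON / minHISON.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)

@dataclass
class SkdNode:
    disc: int                      # discriminator dimension (0 .. k-1)
    value: float                   # discriminator value (partitioning line)
    max_loson: float               # virtual boundary bounding LOSON objects
    min_hison: float               # virtual boundary bounding HISON objects
    loson: Optional["SkdNode"] = None
    hison: Optional["SkdNode"] = None
    bucket: Optional[List[Tuple[Rect, str]]] = None  # leaf: (rect, object id)

def intersect_search(node: SkdNode, q_lo, q_hi, result: List[str]) -> None:
    """Collect ids of objects whose bounding rectangles intersect [q_lo, q_hi]."""
    if node.bucket is not None:                      # leaf: test each object
        for (lo, hi), oid in node.bucket:
            if all(l <= qh and h >= ql
                   for l, h, ql, qh in zip(lo, hi, q_lo, q_hi)):
                result.append(oid)
        return
    d = node.disc
    # Objects with centroids in LOSON never extend beyond max_loson, so the
    # subtree needs visiting only if the query reaches below that boundary
    # (and symmetrically for HISON with min_hison).
    if node.loson is not None and q_lo[d] <= node.max_loson:
        intersect_search(node.loson, q_lo, q_hi, result)
    if node.hison is not None and q_hi[d] >= node.min_hison:
        intersect_search(node.hison, q_lo, q_hi, result)
```

When the virtual boundary is tighter than the partitioning line, the test `q_lo[d] <= max_loson` prunes subtrees that the plain partitioning line would admit.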
The search rapidly eliminates all objects that are not totally contained in the query region. Inserting index records for new data objects is similar to insertion into a point kd-tree. As new index records are added to a bucket, the bucket is split if it overflows. At each node, the algorithm uses the centroid of the bounding rectangle of the new object to determine in which subspace the object will be placed, and updates the virtual boundary if necessary. To delete an object, the centroid of its bounding rectangle is used to determine where the object resides. The removal of an object may cause a bucket to underflow, and merging or reinsertion is then required. If the neighboring node is a leaf node, then the two buckets are merged and the resultant bucket is resplit if overflow occurs. Otherwise, the records are required to be inserted into the neighboring subtree, and the neighboring node is promoted to replace the parent node. The merging follows the principle of the buddy system [Nievergelt et al., 1984]: the region of two merged nodes is rectangular and a proper subspace derivable from discriminator values in parent nodes. The major problem with deletion occurs when an object that contributes to the boundary of a virtual space is deleted.

Figure 2.5. The structure of a spatial kd-tree. (a) A 2-d directory of the skd-tree. (b) A 2-d space coordinate representation.

A new, tighter boundary needs to replace the old boundary, which may not be as effective. The operation can be expensive, as several pages whose space is adjacent to the deleted boundary need to be searched. The operation cost can be reduced by periodically sweeping the subtrees that are affected by deletions. It should be noted that the delay in finding replacements does not result in any invalid answer. The directory of the skd-tree is stored in secondary memory. The bottom-up approach for binary tree paging [Cesarini and Soda, 1982] is modified to store the skd-tree as a multiway tree. When such a page splits, one of the subtrees
is migrated to an existing page that can accommodate the subtree or to a new page, and the root of the subtree is promoted to the parent page. It was shown that the containment search is insensitive to the different sizes and distributions of objects, and it is always more efficient than the intersection search due to a smaller search space [Ooi et al., 1991]. It can be noticed that the leaf nodes of the skd-tree take up about half of the storage requirement for the directory. The main objective of having such a layer of leaf nodes is to reduce the fetching of data pages. Experiments have been conducted to evaluate the performance of skd-trees with and without the leaf nodes, under different data distributions [Ooi, 1990]. The experiments show that for uniform distributions of spatial objects, the leaf nodes can reduce the page accesses. However, when the distributions are skewed, the extra layers are not effective and the large directory sizes incur more page reads than the modified skd-tree. The modified skd-tree, which has fewer nodes, saves up to 40% of the directory storage space.

2.3.5 The BD- and GBD-trees

The BD-tree [Ohsawa and Sakauchi, 1983], a variant of kd-trees, allows a more dynamic partitioning of space. Each non-leaf node in the BD-tree contains a variable-length string, called the discriminator zone (DZ) expression, consisting of 0s and 1s. The 0 means "<" and the 1 means "≥", with the leftmost digit corresponding to the first binary division, and the nth bit corresponding to the nth binary division. The string describes the left subspace while the right subspace is its complement. Each string uniquely describes a space. A data space whose DZ expression (for example, 0100) is the initial substring of a longer DZ expression (for example, 010001) encloses the data space of the latter. A BD-tree is different from a kd-tree in the following aspects.
One, the data space of a BD-tree node is not a hyper-rectangle; the use of complements makes the space holey. Two, unlike the conventional kd-tree, the use of DZ expressions enables rotation, achieving a greater degree of balancing. Three, the partition divides a space into two equal sized subspaces. Four, the discriminators are used cyclically so that each bit of a DZ expression can be correctly associated with a dimension. The BD-tree is expanded to a balanced multi-way tree called the GBD-tree (generalized BD-tree) [Ohsawa and Sakauchi, 1990]. In addition to a DZ expression, a bounding rectangle is used to describe a data space that bounds the objects whose centroids fall inside the region defined by the DZ expression. Centroids of objects are used to determine the placement of objects in the correct bucket. While a DZ expression is used to determine the position in the tree
structure where an entity is located based on its centroid, a bounding rectangle is used in intersection search. In an internal node, each entry describes a data space obtained through binary decomposition. The union of these data spaces forms the data space of the node. While the data spaces described by the entries' DZ expressions do not overlap, their associated bounding rectangles may overlap. During point search of an entity, an inclusion check of the DZ expression of the entity is performed against the DZ expression of a node. For the data space that includes the entity, its subtree is traversed. For the intersection search, the bounding rectangles stored in a node are used instead to select subtrees for traversal. When a leaf node overflows, it is split into two. A recursive binary decomposition on alternating axes is performed on the overflowed data space until a subspace contains at least 2(M+1)/3 entries, where M is the maximum number of entries a node can contain. While the smaller space has a new DZ expression, the other subspace takes the DZ expression of the space before splitting. We call such a space a complementary subspace. A new entry is inserted into the parent node and the affected bounding rectangles are re-adjusted accordingly. In an internal node splitting, the subspaces are checked in decreasing order of their sizes to find a data space that contains almost (M+1)/2 entries. A data space described by the DZ expression e1 contains the data space described by the DZ expression e2 if e1 forms the initial substring of e2. In the testing, all DZ expressions must be checked. The worst case is when a node is split into two nodes respectively having M entries and one entry. The DZ expression obtained is used as the DZ expression of a new node. The other new node, which reuses the original node, is assigned the DZ expression of the original space. When an entry is deleted, a node may underflow.
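The two DZ-expression operations used above are simple to state in code: containment is a string-prefix test, and the DZ expression of a point's cell follows from the cyclic equal-division scheme. A minimal sketch (the function names are ours, not the authors'):

```python
def dz_contains(e1: str, e2: str) -> bool:
    """The space described by DZ expression e1 contains that of e2
    iff e1 is an initial substring (prefix) of e2."""
    return e2.startswith(e1)

def dz_of_point(point, lo, hi, depth):
    """DZ expression of the cell containing `point` after `depth` binary
    divisions of the box [lo, hi], cycling through the dimensions.
    Bit 0 means the '<' half, bit 1 the '>=' half."""
    lo, hi = list(lo), list(hi)
    bits = []
    k = len(point)
    for i in range(depth):
        d = i % k                         # discriminators used cyclically
        mid = (lo[d] + hi[d]) / 2         # split into two equal subspaces
        if point[d] < mid:
            bits.append("0")
            hi[d] = mid
        else:
            bits.append("1")
            lo[d] = mid
    return "".join(bits)
```

For instance, `dz_contains("0100", "010001")` is true, matching the 0100 / 010001 example in the text.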
Like B-trees, tree collapsing is required. Conceptually, the GBD-tree is similar to the BANG file [Freeston, 1987]. The use of bounding rectangles can be applied to the BANG file. The GBD-tree has been shown to have better efficiency than the R-tree in terms of tree construction time for a small set of data [Ohsawa and Sakauchi, 1990].

2.3.6 The LSD-tree

As an improvement to the fixed size space partitioning of grid files, a binary tree, called the Local Split Decision tree (LSD-tree), that supports arbitrary split positions was proposed [Henrich et al., 1989a]. A split position can be chosen such that it is optimal with respect to the current cell. The directory of an LSD-tree is similar to that maintained by the kd-tree [Bentley, 1975]. Each node of the LSD-tree represents one split and stores the split dimension
(cf. the discriminator of kd-trees) and position (cf. the discriminator value of kd-trees), and each leaf node points to a data bucket. In an LSD-tree, the nodes in a directory T are divided into two directories: the internal directory and the external directory. The internal directory consists of a subtree that contains the root and is stored in main memory. The external directory consists of multiway trees and is stored in secondary memory. In an external directory page, the subtree is organized as a heap. When a directory page is split, the root node of that directory page is inserted into the directory T and the left and right subtrees are stored in two distinct directory pages. The main objective of the paging algorithm [Henrich et al., 1989b] is to ensure that the heights of the multiway trees differ by at most one directory page. The proposed paging strategy is similar to the binary paging strategy [Cesarini and Soda, 1982], although the latter makes no distinction between the external and internal directories. The major difference is that the internal directory is restructured such that the heights of the multiway trees in the external directory always differ by at most one page. To achieve this, nodes close to the boundary that separates the internal and external directories must be moved between these two directories. Note that the size of the internal directory depends on the allocated internal memory. Like kd-trees, rotation of the tree is not possible. If the data is very skewed, the property of height differences of at most one cannot be upheld. The deletion algorithm is not presented. We believe that the deletion algorithm of [Cesarini and Soda, 1982] can be applied here.

2.4 B-tree based indexing techniques

B+-trees have been widely used in data intensive systems to facilitate query retrieval.
The wide acceptance of the B+-tree stems from its elegant height-balanced characteristic, which makes it ideal for disk I/O, where data is transferred in units of a page. It has become an underlying structure for many new indexes. In this section, we discuss indexes based on the hierarchical structure of B+-trees.

2.4.1 The R-tree

The R-tree [Guttman, 1984] is a multi-dimensional generalization of the B-tree that preserves height-balance. Like the B-tree, node splitting and merging are required for inserting and deleting objects. The R-tree has received a great deal of attention due to its well defined structure and the fact that it is one of the earliest proposed tree structures for non-zero sized spatial object indexing. Many papers have used the R-tree as a model against which to measure the performance of their structures.
An entry in a leaf node consists of an object-identifier of the data object and a k-dimensional bounding rectangle which bounds its data objects. In a non-leaf node, an entry contains a child-pointer pointing to a lower level node in the R-tree and a bounding rectangle covering all the rectangles in the lower nodes of the subtree. Figure 2.6 illustrates the structure of an R-tree.

Figure 2.6. The structure of an R-tree. (a) A planar representation. (b) The directory of an R-tree.

In order to locate all objects which intersect a query rectangle, the search algorithm descends the tree from the root. The algorithm recursively traverses down the subtrees of bounding rectangles that intersect the query rectangle. When a leaf node is reached, bounding rectangles are tested against the query rectangle and their objects are fetched for testing whether they intersect the query rectangle. To insert an object, the tree is traversed and all the rectangles in the current non-leaf node are examined. The constraint of least coverage is employed to insert an object: the rectangle that needs the least enlargement to enclose the new object is selected; the one with the smallest area is chosen if more than
one rectangle meets the first criterion. The nodes in the subtree indexed by the selected entry are examined recursively. Once a leaf node is reached, a straightforward insertion is made if the leaf node is not full. However, the leaf node needs splitting if it overflows after the insertion is made. For each node that is traversed, the covering rectangle in the parent is readjusted to tightly bound the entries in the node. For a newly split node, an entry with a covering rectangle that is large enough to cover all the entries in the new node is inserted in the parent node if there is room in the parent node. Otherwise, the parent node is split and the process may propagate to the root. To remove an object, the tree is traversed and each entry of a non-leaf node is checked to determine if the object overlaps its covering rectangle. For each such entry, the entries in the child node are examined recursively. The deletion of an object may cause the leaf node to underflow. In this case, the node needs to be deleted and all the remaining entries of that node are reinserted from the root. The deletion of an entry may also cause further deletion of nodes in the upper levels. Thus, entries belonging to a deleted ith level node must be reinserted into the nodes at the ith level of the tree. Deletion of an object may change the bounding rectangles of entries in the ancestor nodes; hence readjustment of these entries is required. In searching, the decision to visit a subtree depends on whether the covering rectangle overlaps the query region. It is quite common for several covering rectangles in an internal node to overlap the query rectangle, resulting in the traversal of several subtrees. Therefore, the minimization of the overlaps of covering rectangles, as well as the coverage of these rectangles, is of primary importance in constructing the R-tree.
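The least-coverage rule used during insertion can be sketched in a few lines. The sketch below assumes 2-d rectangles represented as ((x1, y1), (x2, y2)) corner pairs; the helper names are illustrative.

```python
# Sketch of Guttman's least-coverage choice for R-tree insertion:
# pick the entry needing the least area enlargement; break ties by
# smallest area. Rectangles are ((x1, y1), (x2, y2)) pairs (an assumption).

def area(r):
    (x1, y1), (x2, y2) = r
    return (x2 - x1) * (y2 - y1)

def enlarge(r, s):
    """Smallest rectangle covering both r and s."""
    (ax1, ay1), (ax2, ay2) = r
    (bx1, by1), (bx2, by2) = s
    return ((min(ax1, bx1), min(ay1, by1)), (max(ax2, bx2), max(ay2, by2)))

def choose_entry(entries, new_rect):
    """Entry whose covering rectangle needs the least enlargement to
    enclose new_rect; ties resolved by the smaller current area."""
    return min(entries,
               key=lambda r: (area(enlarge(r, new_rect)) - area(r), area(r)))
```

The tuple key encodes the two-level criterion directly: `min` compares enlargement first and falls back to area only on a tie.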
The heuristic optimization criterion used in the R-tree is the minimization of the area of the covering rectangles of internal nodes. Two algorithms are involved in this minimization: the insertion algorithm and its node splitting algorithm. Of the two, the splitting algorithm affects the index efficiency more. Guttman [Guttman, 1984] presented and studied splitting algorithms with exponential, quadratic and linear cost, and showed that the performance of the quadratic and linear algorithms was comparatively similar. In a node splitting, the quadratic algorithm first locates the two entries that are furthest apart, that is, the pair of entries that would waste the largest area if they were put in the same group. These two rectangles are known as the seeds, and the pair chosen tends to be small relative to the others. Two groups are formed, each with one seed. For each remaining entry, the entry rectangle is used to calculate the area enlargement required in the covering rectangle of each group to include the entry. The difference of the two area enlargements is calculated, and the entry that has the maximum difference is selected as the next entry to be included into the group whose covering rectangle needs the least enlargement. As the selection
is mainly based on the minimal enlargement of covering rectangles, and the rectangle that has been enlarged before requires less expansion to include the next rectangle, it is quite often the case that a single covering rectangle is enlarged until the group has M - m + 1 rectangles (M is the maximum number of entries per node). The two resultant groups will respectively contain M - m + 1 and m rectangles. The linear algorithm chooses the first two objects based on the separation between the objects in relation to the width of the entire group along the same dimension. Greene proposed a slightly different splitting algorithm [Greene, 1989]. In her splitting algorithm, the two most distant rectangles are selected, and for each dimension the separation is calculated. Each separation is normalized by dividing it by the interval of the covering rectangle on the same dimension, instead of by the total width of the entire group [Guttman, 1984]. Along the dimension with the largest normalized separation, rectangles are ordered on the lower coordinate. The list is then divided into two groups, with the first (M + 1)/2 rectangles in the first group and the rest in the other.

2.4.2 The R*-tree

Minimization of both coverage and overlap is crucial to the performance of the R-tree. It is, however, impossible to minimize the two at the same time. A balancing criterion must be found such that a near optimum of both minimizations can produce the best result. Beckmann et al. introduced an additional optimization objective concerning the margin of the covering rectangles: squarish covering rectangles are preferred [Beckmann et al., 1990]. Since clustering rectangles with little variance of the lengths of the edges tends to reduce the area of the cluster's covering rectangle, a criterion that favors squarish covering rectangles is used in the insertion and splitting algorithms. This variant of the R-tree is referred to as the R*-tree.
In the leaf nodes of the R*-tree, a new record is inserted into the page whose entry covering rectangle, if enlarged, has the least overlap with the other covering rectangles. A tie is resolved by choosing the entry whose rectangle needs the least area enlargement. However, in the internal nodes, an entry whose covering rectangle needs the least area enlargement is chosen to include the new record, and a tie is resolved by choosing the entry with the smallest resultant area. The improvement is particularly significant when both the query rectangles and data rectangles are small, and when the data is non-uniformly distributed. In the R*-tree splitting algorithm, along each axis, the entries are sorted by the lower value, and also sorted by the upper value, of the entry rectangles. For each sort, M - 2m + 2 distributions of splits are considered; in the kth distribution (1 ≤ k ≤ M - 2m + 2), the first group contains the first m - 1 + k
entries and the other group contains the remaining M - m - k + 2 entries. For each split, the total area, the sum of the edges (the margin), and the overlap-area of the two new covering rectangles are used to determine the split. Note that not all three can be minimized at the same time. Three selection criteria were proposed, based on the minimum over one dimension, the minimum of the sum of the three values over one dimension or one sort, and the overall minimum. In the algorithm, the minimization of the edges is used. Dynamic hierarchical spatial indexes are sensitive to the order of insertion of the data. A tree may behave differently for the same data set under a different sequence of insertions. Data rectangles inserted previously may result in a bad split in the R-tree after some insertions. Hence it may be worthwhile to do some local reorganization, which is however expensive. The R-tree deletion algorithm provides reorganization of the tree to some extent, by forcing the entries in underflowed nodes to be reinserted from the root. The performance study shows that deletion and reinsertion can improve the R-tree performance quite significantly [Beckmann et al., 1990]. Using this idea of reinsertion, Beckmann et al. proposed a reinsertion algorithm for when a node overflows. The reinsertion algorithm sorts the entries in decreasing order of the distance between the centroid of the entry rectangle and the centroid of the covering rectangle, and reinserts the first p (a variable for tuning) entries. In some cases, the entries are reinserted back into the same node and hence a split is eventually necessary. The reinsertion increases the storage utilization, but it can be expensive when the tree is large.
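The enumeration of candidate distributions along one sorted axis can be sketched as follows. This is a simplified 2-d illustration with helper names of our own choosing; the published algorithm additionally repeats the enumeration for each axis and for both sort orders before applying the selection criteria.

```python
# Sketch of the R*-tree candidate-split enumeration for one sorted axis.
# Rectangles are ((x1, y1), (x2, y2)) pairs (an assumed representation).

def mbr(rects):
    """Minimum bounding rectangle of a group."""
    return (tuple(min(r[0][d] for r in rects) for d in range(2)),
            tuple(max(r[1][d] for r in rects) for d in range(2)))

def margin(r):
    return sum(r[1][d] - r[0][d] for d in range(2))

def area(r):
    return (r[1][0] - r[0][0]) * (r[1][1] - r[0][1])

def overlap(r, s):
    w = [min(r[1][d], s[1][d]) - max(r[0][d], s[0][d]) for d in range(2)]
    return w[0] * w[1] if w[0] > 0 and w[1] > 0 else 0.0

def distributions(sorted_rects, m):
    """Yield the M - 2m + 2 candidate splits of the M + 1 sorted entries:
    the kth puts the first m - 1 + k entries in one group (1 <= k <= M - 2m + 2),
    together with the margin sum, area sum and overlap of the two groups."""
    M = len(sorted_rects) - 1
    for k in range(1, M - 2 * m + 3):
        g1, g2 = sorted_rects[:m - 1 + k], sorted_rects[m - 1 + k:]
        b1, b2 = mbr(g1), mbr(g2)
        yield g1, g2, margin(b1) + margin(b2), area(b1) + area(b2), overlap(b1, b2)
```

Minimizing the margin sum over these candidates is the "minimization of the edges" criterion mentioned above.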
Experimental studies indicate that the R*-tree is more efficient than the other variants studied, and that the R-tree using the linear splitting algorithm is substantially less efficient than the one with the quadratic splitting algorithm [Beckmann et al., 1990].

2.4.3 The R+-tree

The R+-tree [Sellis et al., 1987] is a compromise between the R-tree and the K-D-B-tree [Robinson, 1981] and was proposed to overcome the problem of the overlapping covering rectangles of internal nodes of the R-tree. The R+-tree differs from the R-tree in the following constraints: nodes of an R+-tree are not guaranteed to be at least half filled; the entries of any internal node do not overlap; and an object identifier may be stored in more than one leaf node. The duplication of object identifiers leads to the non-overlapping of entries. In a search, the subtrees are examined only if the corresponding covering rectangles intersect the query region. The disjoint covering rectangles avoid the multiple search paths of the R-tree for point queries. For the space in Figure 2.7, only one path is traversed to search for all objects that contain point p7, whereas for the R-tree, two search paths exist. However, for certain query
rectangles, searching the R+-tree is more expensive than searching the R-tree. For example, suppose the query region is the left half of object r8. To retrieve all objects that intersect the query region using the R-tree, two leaf nodes have to be searched, through R5 and R8 respectively, incurring five page accesses. To evaluate the same query, three leaf nodes of the R+-tree have to be searched, through R6, R9, and R10 respectively, and a total of six page accesses is incurred.

Figure 2.7. The structure of an R+-tree. (a) A planar representation. (b) The directory of an R+-tree.

To insert an object, multiple paths may be traversed. At a node, the subtrees of all entries with covering rectangles that intersect the object bounding rectangle must be traversed. On reaching the leaf nodes, the object identifier is stored there; multiple leaf nodes may store the same object identifier.
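The multi-path descent just described can be sketched as follows. Nodes are plain dictionaries and overflow handling is deliberately omitted; this is an illustration of the traversal only, not the full insertion algorithm.

```python
# Sketch of R+-tree insertion descent: the object identifier is stored in
# every leaf whose covering rectangle intersects the object's bounding
# rectangle. Node layout (dicts) and names are illustrative assumptions.

def intersects(r, s):
    """2-d rectangles as ((x1, y1), (x2, y2)) pairs."""
    return all(r[0][d] <= s[1][d] and r[1][d] >= s[0][d] for d in range(2))

def rplus_insert(node, rect, oid):
    """Descend every subtree whose covering rectangle intersects `rect`;
    the identifier may therefore end up in several leaves."""
    if node.get("leaf"):
        node["entries"].append((rect, oid))
        return
    for cover, child in node["entries"]:     # internal: (cover rect, child)
        if intersects(cover, rect):
            rplus_insert(child, rect, oid)
```

An object straddling the boundary between two covering rectangles is stored in both corresponding leaves, which is exactly the duplication the R+-tree trades for disjoint internal entries.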
Three cases of insertion need to be handled with care [Gunther, 1988, Ooi, 1990]. The first is when an object is inserted into a node where the covering rectangles of all entries do not intersect the object bounding rectangle. The second is when the bounding rectangle of the new object only partially intersects the bounding rectangles of entries; this requires a bounding rectangle to be updated to include the new object bounding rectangle. Both cases must be handled properly such that the coverage of bounding rectangles and the duplication of objects are minimized. The third case is more serious, in that the covering rectangles of some entries can prevent each other from expanding to include the new object. In other words, some space ("dead space") within the current node cannot be covered by any of the covering rectangles of the entries in the node. If the new object occupies such a region, it cannot be fully covered by the entries. To avoid this situation, it is necessary to look ahead to ensure that no dead space will result when finding the entries to include an object. Alternatively, the criterion proposed by Guttman [Guttman, 1984] can be used to select the covering rectangles to include a new node. When a new object cannot be fully covered, one or more of the covering rectangles are split. This means that the split may cause the children of the entries to be split as well, which may further degrade the storage efficiency. During an insertion, if a leaf node is full and a split is necessary, the split attempts to reduce identifier duplication. Like the K-D-B-tree, the split of a leaf node may propagate upwards to the root of the tree, and the split of a non-leaf node may propagate downwards to the leaves. The split of a node involves finding a partitioning hyperplane to divide the original space into two.
The selection of a partitioning hyperplane was suggested to be based on the following four criteria: the clustering of entry rectangles, minimal total x- and y-displacement, minimal total space coverage of the two new subspaces, and minimal number of rectangle splits. While the first three criteria aim to reduce search cost by tightening the coverage, the fourth criterion confines the height expansion of the tree. The fourth criterion can only minimize the number of covering rectangles of the next lower level that must be split as a consequence; it cannot guarantee that the total number of rectangles being split is minimal. Note that all four criteria cannot possibly be satisfied at the same time. While the R+-tree overcomes the problem of the overlapping rectangles of the R-tree, it inherits some problems of the K-D-B-tree [Robinson, 1981]. Partitioning a covering rectangle may cause the covering rectangles in the descendant subtree to be partitioned as well. Frequent downward splits tend to partition already under-populated nodes, and hence the nodes in an R+-tree may contain fewer than M/2 entries. Object identifiers are duplicated in the leaf nodes; the extent of duplication is dependent on the spatial distribution and the size of
the objects. To delete an object, it is necessary to delete all identifiers that refer to that object. Deletion may necessitate major reorganization of the tree.

2.4.4 The BV-tree

The BV-tree, proposed by Freeston, is a generalization of the B-tree to higher dimensions [Freeston, 1995]. While the BV-tree guarantees that it specializes to (and hence preserves the properties of) a B-tree in the one-dimensional case, at higher dimensions it may not be height-balanced, and its storage utilization is reduced to no worse than 33% (instead of 50% in the B-tree). Despite foregoing these two properties, it is able to maintain logarithmic access and update times. As in the BANG file [Freeston, 1987], a subspace S is split into two regions S1 and S2 such that the boundary of S1 encloses that of S2. Each region is uniquely identified by a key, and the key is used to direct the search in the BV-tree. Although the physical boundaries of regions may be recursively nested, there is no correspondence between the level of nesting of a region and the index tree hierarchy which represents it. In fact, whenever a region r1 whose boundary directly encloses the boundary of a region r2 results from a split, r1 is "promoted" closer to the root. To facilitate searching correctly, the actual level to which r1 belongs (called a guard) is stored. Figure 2.8 illustrates a BV-tree. As shown in the figure, the boundary of region a0 encloses that of region b0, which in turn encloses the boundaries of regions c0, d0 and e0. In this example, region b0 has been promoted to the root as it serves as a guard for region b1.

Figure 2.8. The structure of a BV-tree. (a) A planar representation. (b) The BV-tree.

The search begins at the root and descends the tree. At each node, every entry is checked to identify a guard set that represents the regions that best
match the search region. Two types of entries can be found in the guard set: those that correspond to the set of guards of an unpromoted entry, and the best-match unpromoted entry that encloses the best-match guard. As the tree is descended from level h to level h-1, the guard sets found at levels h-1 and h are merged, in the process of which some entries may be pruned away. Once the leaf node is reached, the guard set contains the regions where the search region may be found. The data corresponding to the regions of the guard set are searched to answer the query. During insertion, a complication arises when a promoted region is to be split into two such that one region encloses higher-level regions while the other does not. In this case, the entry for the second region has to be demoted to its unpromoted position in the tree. Deletion may require merging and resplitting. This requires finding a region to merge with, and finding a way to split the merged region again.

2.5 Cell methods based on dynamic hashing

Both extendible hashing [Fagin et al., 1979] and linear hashing [Kriegel and Seeger, 1986, Larson, 1978] lend themselves to an adaptable cell method for organizing k-dimensional objects. The grid file [Nievergelt et al., 1984] and the EXtendible CELL (EXCELL) method [Tamminen, 1982] are extensions of dynamic hashed organizations incorporating a multi-dimensional file organization for multi-attribute point data. We shall restrict our discussion to the grid file and its variants.

2.5.1 The grid file

The grid file structure [Nievergelt et al., 1984] consists of two basic structures: k linear scales and a k-dimensional directory (see Figure 2.9). The fundamental idea is to partition a k-dimensional space according to an orthogonal grid. The grid on a k-dimensional data space is defined by the scales, which are represented by k one-dimensional arrays.
Each boundary in a scale forms a (k-1)-dimensional hyperplane that cuts the data space into two subspaces. The boundaries form k-dimensional unpartitioned rectangular subspaces, which are represented by a k-dimensional array known as the grid directory. The correspondence between directory entries and grid cells (blocks) is one-to-one. Each grid cell in the grid directory contains the address of a secondary page, the data page, where the data objects that are within the grid cell are stored. As the structure does not have the constraint that each grid cell must contain at least m objects, a data page is allowed to store objects from several grid cells as long as the union of these grid cells forms a rectangular region, which is known as the storage region. These regions are pairwise disjoint, and together they span the
data space. For most applications, the size of the directory dictates that it be stored on secondary storage; the scales, however, are much smaller and may be cached in main memory.

Figure 2.9. The grid file layout.

Like other tree structures, splitting and merging of data pages are respectively required during insertion and deletion. Insertion of an object entails determining the correct grid cell and fetching the corresponding page, followed by a simple insertion if the data page is not full. In the case where the page is full, a split is required. The split is simple if the storage region covers more than one grid cell and not all the data in the region fall within the same cell: the grid cells are allocated between the existing data page and a new page, with the data objects distributed accordingly. However, if the storage region covers only one grid cell, or all the data of a region fall within only one cell, then the grid has to be extended by a (k-1)-dimensional hyperplane that partitions the storage region into two subspaces. A new boundary is inserted into one of the k grid scales and, to maintain the one-to-one correspondence between the grid and the grid directory, a (k-1)-dimensional cross-section is added to the grid directory. The resulting two storage regions are disjoint and, to each region, a corresponding data page is attached. The objects stored in the overflowing page are distributed between the two pages, one new and one existing. Other
grid cells that are partitioned by the new hyperplane are unaffected, since both parts of each old grid cell will now share the same data page.

Deletions may cause the occupancy of a storage region to fall below an acceptable level, and these trigger merging operations. When the joint occupancy of a storage region whose records have been deleted and an adjacent storage region drops below a certain threshold, the data pages are merged into one. Based on the average bucket occupancy obtained from simulation studies, Nievergelt et al. [Nievergelt et al., 1984] suggested that 70% is an appropriate occupancy threshold for the resulting bucket. Two different methods were proposed for merging: the neighbor system and the buddy system. The neighbor system allows two data pages whose storage regions are adjacent to merge so long as the new storage region remains rectangular; this may lead to "dead space", where neighboring pages prevent any merging for a particular under-populated page. A more restrictive merging policy like the buddy system is required to prevent dead space. Under the buddy system, two pages can be merged provided their storage regions can be obtained from the next larger storage region using the splitting process. However, total elimination of dead space for a k-dimensional space is not always possible. The merging process may also make the boundary between the two old pages redundant when there are no storage regions adjacent to that boundary. In this case, the redundant boundary is removed from its scale, and the one-to-one correspondence is maintained by removing the redundant entries from the grid directory.

The grid file has also been proposed as a means for spatial indexing of non-point objects [Nievergelt and Hinrichs, 1985]. To index k-dimensional data objects, a mapping from the k-dimensional space to an nk-dimensional space where objects exist as points is necessary.
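The two-disk-access exact-match lookup supported by the scales and the grid directory can be sketched as follows. This is a minimal in-memory sketch: the nested Python list stands in for the disk-resident directory array, and the page names ("P0", "P4", ...) are invented placeholders for data-page addresses.

```python
import bisect

def grid_lookup(point, scales, directory):
    """Exact-match lookup in a grid file: one pass over the cached scales
    to locate the grid cell, then one directory access for the data page.

    scales    -- k sorted lists of partitioning boundaries, one per dimension
    directory -- k-dimensional nested list of data-page addresses
    """
    cell = directory
    for coord, scale in zip(point, scales):
        # Binary-search the scale for the interval containing the coordinate.
        cell = cell[bisect.bisect_right(scale, coord)]
    return cell

# Two dimensions: a boundary at x = 50 and boundaries at y = 30, 60
# give a 2 x 3 grid (in a real grid file, cells may share data pages).
scales = [[50], [30, 60]]
directory = [["P0", "P1", "P2"],   # x < 50
             ["P3", "P4", "P5"]]   # x >= 50
print(grid_lookup((70, 40), scales, directory))  # -> P4
```

In a disk-based implementation the directory access costs one page read, giving the two-access guarantee discussed below for the array-based directory.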
One disadvantage of the mapping scheme is that it is harder to perform directory splitting in the higher-dimensional space [Whang and Krishnamurthy, 1985]. To index a rectangle, it is represented as (cx, cy, dx, dy), where (cx, cy) is the centroid of the object and (dx, dy) are the extensions of the object from the centroid. The (cx, cy, dx, dy) representation causes objects to cluster close to the x-axis, while objects cluster along the line x = y under the (x1, x2, y1, y2) representation. For ease of grid partitioning, the former representation is therefore preferred. For an object (cx, cy, dx, dy) to intersect the query region (qcx, qcy, qdx, qdy), the following conditions must be satisfied:

cx - dx < qcx + qdx and cx + dx > qcx - qdx and
cy - dy < qcy + qdy and cy + dy > qcy - qdy
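The intersection condition translates directly into a predicate over the centroid/extension representation. A small illustrative sketch, with hypothetical coordinate values:

```python
def intersects(obj, query):
    """Test whether a rectangle in centroid/extension form (cx, cy, dx, dy)
    intersects the query region (qcx, qcy, qdx, qdy)."""
    cx, cy, dx, dy = obj
    qcx, qcy, qdx, qdy = query
    return (cx - dx < qcx + qdx and cx + dx > qcx - qdx and
            cy - dy < qcy + qdy and cy + dy > qcy - qdy)

# A 4x2 rectangle centred at (5, 5) against two 2x2 query windows:
print(intersects((5, 5, 2, 1), (7, 6, 1, 1)))    # -> True (they overlap)
print(intersects((5, 5, 2, 1), (10, 10, 1, 1)))  # -> False (disjoint)
```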
Consider Figure 2.10a, where rectangle q is the query rectangle. The intersection search region on the cx-dx hyperplane, the shaded region in Figure 2.10b, is obtained from the first two inequalities of the above intersection condition. Note that the search region can be very large if the global space is large and the largest rectangle extension along the x-axis is not bounded. In Figure 2.10b, the known upper bound, udx, on any rectangle extension along the x-axis reduces the search region to the enclosed shaded region. The same argument applies to the other coordinate. Objects that fall in both search regions satisfy the intersection condition.

Figure 2.10. Intersection search region in the grid file: (a) object distribution; (b) search regions on the cx-dx hyperplane; (c) search regions on the cy-dy hyperplane.

The mapping of regions from a k-dimensional space to points in an nk-dimensional space undesirably changes the spatial neighborhood properties. Regions that are spatially close in the k-dimensional space may be far apart when they are represented as points in the nk-dimensional space. Consequently, the intersection search may not be efficient.

2.5.2 The R-file

The grid file structure was originally designed to guarantee two disk accesses for exact match queries, one to access the directory and the other to access the data page. The "two disk access" property can only be ensured if the directory is stored as an array and all grid cells are of the same size. However, with such an implementation, the size of the directory is doubled whenever a new boundary is introduced.
Most of these directory entries correspond to empty grid cells that do not contain any data objects. Simulation results [Nievergelt et al., 1984] indicate that the size of the directory grows approximately linearly with the size of the file. To alleviate this problem, multi-level directories [Blanken et al., 1990, Hinrichs, 1985, Hutflesz et al., 1990, Freeston, 1987, Whang and Krishnamurthy, 1985], in which grid cells are organized in a hierarchical structure,
have been suggested. We shall present the R-file approach, which is designed for non-zero sized objects.

In the R-file [Hutflesz et al., 1990], cells are partitioned using the partitioning strategy of the grid file, and a cell is split when it overflows. In order for cells to tightly contain the spatial objects, cells are partitioned recursively by repeated halving until the smallest cell that encloses the spatial objects is obtained. Spatial objects that are totally contained in a cell are stored in its corresponding data page, and those that intersect the partitioning line are stored in the original cell. If the number of spatial objects that intersect a partitioning line is more than can be stored in a data page, partitioning lines along the other dimensions are used. If all records lie on the cross point of the partitioning lines, they cannot be separated by any partitioning line, and in such a case a chain of buckets is used. After a split, the original cell and the two new cells overlap; to keep the directory small, empty cells are not maintained. Both the original and the new cells have almost the same number of spatial objects after a split. Figure 2.11 illustrates a case in point. Even so, a high number of cells will be inspected for intersection queries, especially the original large cells. The fact that spatial objects stored in the original unpartitioned cells tend to intersect the partitioning lines of those cells suggests a clustering property of these objects. In order to make intersection search more efficient, two extra values that bound the objects in the partitioning dimension are kept with the original cells. Due to the overlapping cells, the directory is potentially large. To avoid storing the cell boundaries, a z-ordering scheme [Orenstein, 1986] is used to number the cells. With such a scheme, cells are partitioned cyclically.
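At its core, the z-ordering used to number cells is bit interleaving of the cell coordinates. The sketch below shows plain two-dimensional bit interleaving for same-sized cells; the R-file's actual numbering must additionally distinguish cells of different sizes, which this simplified version omits.

```python
def z_order(x, y, bits=8):
    """Z-order (Morton) number of a grid cell: interleave the bits of the
    two cell coordinates, x supplying the even bit positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bit i of x -> bit 2i of z
        z |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y -> bit 2i+1 of z
    return z

# Tracing the four cells of a 2x2 grid yields the Z-shaped ordering:
print([z_order(x, y) for y in (0, 1) for x in (0, 1)])  # -> [0, 1, 2, 3]
```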
For each cell, the directory stores the cell number, the bounding interval, and the data bucket reference. Experiments conducted in [Hutflesz et al., 1990] strongly indicate that the bounding information leads to substantial savings in page accesses.

2.5.3 PLOP-hashing

In [Kriegel and Seeger, 1988], the grid file was extended for the storage of non-zero sized objects. The method is a multi-dimensional dynamic hashing scheme based on Piecewise Linear Order Preserving (PLOP) hashing. Like the grid file, the data space is partitioned by an orthogonal grid. However, instead of using k arrays to store the scales that define the partitioning hyperplanes, k binary trees are used to represent the linear scales. Each internal node of a binary tree stores a (k-1)-dimensional partitioning hyperplane. Each leaf node of a binary tree is associated with a k-dimensional subspace (a slice), in which the interval along its associated axis is a sub-interval and the other k-1 intervals assume the intervals of the global space. Each slice is addressed by an index i stored in its leaf node. To each cell, a page is allocated to store all points that fall in the
unpartitioned subspace. From the indexes stored in the k binary trees, the address of a page can be computed. Adopting a bounding scheme similar to that of the skd-tree, two extra values are stored in a leaf node to bound the objects whose centroids fall in the corresponding slice, along the axis with which the binary tree is associated. Hence, an object is inserted into the grid cell that contains its centroid. The regions defined by the two extra values may overlap, and they are used for intersection search.

Figure 2.11. The R-file: (a) original space; (b) first bucket; (c) second and third buckets; (d) fourth bucket.

The file organizations based on hashing are generally designed for multi-dimensional point data. To use them for spatial indexing, the mapping of objects from a k-dimensional space to an nk-dimensional space, or the duplication of object identifiers, is generally required. Indexing in a parameter space is not efficient for general spatial query retrievals [Guttman, 1984, Whang and Krishnamurthy, 1985].
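The PLOP-hashing computation of a page address from the k slice indexes is not spelled out in the text; one plausible scheme, shown here as an assumption rather than the authors' exact formula, is a row-major (mixed-radix) mapping over the grid of cells.

```python
def page_address(slice_indexes, slices_per_dim):
    """Map the k slice indexes (one from each binary tree) to a linear
    page number, row-major over the grid of cells."""
    addr = 0
    for i, n in zip(slice_indexes, slices_per_dim):
        addr = addr * n + i
    return addr

# A 2-d grid with 4 slices along x and 3 along y: cell (2, 1) -> page 7.
print(page_address((2, 1), (4, 3)))  # -> 7
```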
2.6 Spatial objects ordering

Existing DBMSs support efficient one-dimensional indexes and provide fast access to one-dimensional data. If multi-dimensional objects can be converted to one-dimensional objects, such indexes can be used directly without alteration. The mapping functions used must preserve the proximity between data well enough to yield reasonably good spatial search. The idea is to assign a number to each representative grid cell in a space, and these numbers are then used to obtain a representative number for the spatial objects. Techniques for ordering multi-dimensional objects using single-dimensional values have been proposed. These include the Peano curve [Morton, 1966], locational keys [Abel and Smith, 1983], Z-ordering [Orenstein and Merrett, 1984], the Hilbert curve [Faloutsos and Roseman, 1989], and gray ordering [Faloutsos, 1988].

We discuss the method based on locational keys proposed by Abel and Smith [Abel and Smith, 1983]. A space is recursively divided into four equal-sized subspaces, forming a hierarchy of quadrants. To each subspace, a unique numeric key of base 5 is attached. All objects falling within a given subspace are assigned the subspace's key. The key k for a subspace at level h (> 1) can be derived from the key k' of its ancestor subspace by the following formula:

k = k' + 5^(m-h)       if k is the SW son of k'
k = k' + 2 * 5^(m-h)   if k is the NW son of k'
k = k' + 3 * 5^(m-h)   if k is the SE son of k'
k = k' + 4 * 5^(m-h)   if k is the NE son of k'

Here m is an arbitrary maximum number of levels of decomposition, which is at least h. The global space has 5^m as its key. Figure 2.12 illustrates an example of key assignment (base 5), where the maximum level of decomposition is 4. One can notice that, when the locational keys of the same level are traced, the ordering is a form of N- or Z-ordering.
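The key derivation can be exercised directly. The sketch below follows the formula above, with quadrant codes 1-4 for the SW, NW, SE and NE sons.

```python
def child_key(parent_key, quadrant, h, m):
    """Locational key of a level-h subspace from its parent's key.

    quadrant: 1 = SW, 2 = NW, 3 = SE, 4 = NE (as in the formula above);
    m is the maximum decomposition depth, the global space having key 5**m.
    """
    return parent_key + quadrant * 5 ** (m - h)

m = 4
root = 5 ** m                       # 625: key of the global space
sw = child_key(root, 1, 1, m)       # its SW son at level 1
nw_of_sw = child_key(sw, 2, 2, m)   # that son's NW son at level 2
print(root, sw, nw_of_sw)  # -> 625 750 800
```

Read in base 5, these keys are the digit strings 10000, 11000 and 11200, so a key directly encodes the path of quadrant choices from the root.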
To assign a key to a rectangle, the smallest block that completely covers the rectangle is used. An inherent problem of such an assignment is that an object's bounding rectangle may be very much smaller than the associated quadrant (as a consequence of the bounding rectangle spanning one or more subspace divisions). To alleviate this problem, a decomposition technique [Abel and Smith, 1984] is used, in which a rectangle may be represented by up to four adjacent quadrants. Rectangles B and C in Figure 2.12b illustrate the cases where one and two quadrants are used: key 1300 for rectangle B, and keys 1422 and 1424 for rectangle C. By associating each rectangle with a collection of quadrants, a better approximation of a rectangle is achieved. This form of representation requires an object identifier to be stored in multiple locations.
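Finding the smallest covering quadrant for a rectangle can be sketched by descending the quadrant hierarchy until the rectangle spans a division line. The coordinate convention below (unit square, y growing northwards) is our assumption for illustration, not taken from the paper.

```python
def covering_key(rect, m):
    """Locational key (base 5) of the smallest quadrant fully covering rect.

    rect is (x1, y1, x2, y2) inside the unit square; m is the maximum
    decomposition depth, so the root (whole space) has key 5 ** m.
    """
    x1, y1, x2, y2 = rect
    key, x0, y0, size = 5 ** m, 0.0, 0.0, 1.0
    for h in range(1, m + 1):
        half = size / 2
        if x2 <= x0 + half and y2 <= y0 + half:      # fits in SW son
            quadrant = 1
        elif x2 <= x0 + half and y1 >= y0 + half:    # fits in NW son
            quadrant = 2
            y0 += half
        elif x1 >= x0 + half and y2 <= y0 + half:    # fits in SE son
            quadrant = 3
            x0 += half
        elif x1 >= x0 + half and y1 >= y0 + half:    # fits in NE son
            quadrant = 4
            x0 += half
            y0 += half
        else:
            break  # the rectangle spans a division line: stop descending
        key += quadrant * 5 ** (m - h)
        size = half
    return key

# A small rectangle deep in the south-west corner, with m = 4:
print(covering_key((0.1, 0.1, 0.2, 0.2), 4))  # -> 775 ("11100" in base 5)
```

A rectangle straddling the first division (e.g. one centred on the middle of the space) is assigned the root key, which is exactly the over-approximation problem the decomposition technique addresses.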
Figure 2.12. Ordering based on locational keys: (a) assignment of locational keys; (b) assignment of covering nodes.

However, even if this approach is adopted, the size of the representative quadrant may still be much larger than the size of the object's bounding rectangle. A B+-tree is used to index the objects based on their associated locational keys. For an intersection search, all quadrants that intersect the query region have to be scanned. The major advantage of the use of locational keys is that B+-tree structures are widely supported by conventional DBMSs.

2.7 Comparative evaluation

In this section, we briefly summarize some comparative studies that have been conducted in the literature.

Greene evaluated the performance of R-trees and R+-trees [Greene, 1989]. The comparison found that the R+-tree requires many more splits for large data objects, but fewer splits for smaller data objects. For a uniform distribution of square rectangles that fully covers the map space, 30% of the objects are duplicated. Interestingly, the results show that for the case where the coverage is 100% and the objects are long and narrow along the x-axis, the duplication decreases. This is likely due to the better grouping achieved along the x-axis. In general, the query efficiency tests show that R+-trees perform better for smaller objects and slightly worse for larger objects. The study in fact exhibits a similar pattern of results to that of the kd-trees extended using the overlapping and non-overlapping approaches [Ooi, 1990].
Ooi et al. [Ooi et al., 1991] compared the performance of the skd-tree and the R-tree. The results indicate that the skd-tree is a more efficient structure than the R-tree, with nearly the same storage requirement. The containment search provided by the skd-tree is more efficient than its intersection search and is less sensitive to skewed data.

In [Hoel and Samet, 1992], Hoel and Samet conducted a qualitative comparative study of the performance of three spatial indexes, namely the R*-tree, the R+-tree, and the PMR quadtree [Nelson and Samet, 1987], on large line segment databases. Spatial testing on line segments was conducted. The queries include finding all line segments incident at a given point, the line segments incident at the other endpoint of a line segment through a given point, the nearest line segments to a given point, the line segments whose MBRs contain a given point, and all line segments intersecting a given rectangular window. In their implementation, the execution time of query retrieval is the prime objective, which is sometimes achieved at the expense of somewhat more storage space. The difference in performance is not very great, although the PMR quadtree has a slight edge over the other two, and the R+-tree is slightly better than the R*-tree because of its disjoint decomposition of line segments. The R+-tree required considerably more space than the other two structures. However, the study did not result in claims of convincing superiority for any of the three tested indexes. This could be due to the use of line segments, which are much simpler than non-zero sized and irregularly shaped objects.

In [Ooi, 1990], the efficiency of three extending methods was studied using a family of kd-trees, namely the skd-tree [Ooi et al., 1987], the Matsuyama kd-tree [Matsuyama et al., 1984], and the 4d-tree [Banerjee and Kim, 1986].
Databases of 12,000 objects were generated with different distributions of object sizes and object locations. The average data density used is 3; however, for very skewed object placements, the data density at certain locations could be very high. The study shows that the Matsuyama kd-tree, which adopts the non-overlapping native space indexing approach, performs efficiently in terms of page accesses for small objects. As the object sizes become bigger, its performance degrades. The 4d-tree is the least efficient structure. Its nodes store less information than those of the skd-tree, which accounts for a smaller directory size, but intersection search is not supported efficiently because of its inability to prune the search space effectively.

In [Papadias et al., 1995], the topological relationships of meet, overlap, inside, covered-by, covers, contains, and disjoint between MBRs were studied. The efficiency of the R-tree, R+-tree, and R*-tree was then studied using three databases of 10,000 objects with different sizes of MBRs, and 100 queries. For small MBRs (less than 0.02% of the map area) and medium MBRs (less than 0.1% of the map area), R*-trees and R+-trees outperform the R-tree, with the
R+-tree slightly more efficient than the R*-tree. However, for large MBRs (less than 0.5% of the map area), the R+-tree becomes less efficient than the other two due to the additional levels caused by duplication. The R+-tree does not work well for high data density [Greene, 1989, Papadias et al., 1995].

We also set out to investigate the performance of the R-tree and R*-tree for high-dimensional data. We implemented both structures in C on a Sun SPARC workstation running SunOS 5.5. The size of a disk page used for both trees is 4 Kbytes. The quadratic cost splitting algorithm [Guttman, 1984] is adopted for the R-tree, and the quadratic cost version of evaluating the overlap of a given node is also implemented for the R*-tree. To deal with paging, a priority-based page replacement strategy that adopts a least-useful policy is employed [Chan et al., 1992]. A page is useful if it will be referenced again in the traversal; otherwise, it is useless. The strategy favors useless pages at the higher levels of the tree, and useful pages at the lower levels of the tree. We conducted our experimental study on a real data set consisting of Fourier points in high-dimensional space (2, 4, 8 and 16 dimensions) derived from the contours of industrial parts. The database used is the same one employed in [Berchtold et al., 1996], except that we extracted a subset of 1 million objects only. Figure 2.13 shows some representative results, which are largely consistent with previous work. First, as expected, the R*-tree is more space efficient than the R-tree (see Figure 2.13a). Second, the R*-tree's insertion cost is larger than that of the R-tree, and as the number of dimensions increases, the relative difference widens. This is consistent with the result in [Beckmann et al., 1990]. For point query retrievals, we performed 1000 queries and used the average number of disk accesses as the metric.
The 1000 points are randomly selected from the respective test data for each dimensionality. We observe that when the number of dimensions is small (see Figure 2.13c), both the R*-tree and the R-tree perform equally well (with the R*-tree slightly better). This result is again consistent with the findings in [Papadias et al., 1995] for large databases. However, as the number of dimensions increases, the R*-tree requires more disk accesses than the R-tree during retrieval. We also evaluated 1000 range queries, and the result is shown in Figure 2.13d. The result confirms the observation that the R*-tree outperforms the R-tree only at low dimensions, and is inferior to the R-tree at higher dimensions. Finally, from the results, we note that both the R-tree and the R*-tree do not scale well with the number of dimensions.

2.8 Summary

We have reviewed a number of indexes that are suitable for indexing non-zero sized objects in spatial database systems. These have been categorized based on their extending methods and base structures. We have also discussed
Figure 2.13. Comparison of R-tree and R*-tree: (a) storage cost; (b) insertion cost; (c) point query cost; (d) range query cost.
the strengths and weaknesses of these techniques. Despite the large body of work, we believe the area will remain a very fruitful and challenging one for the next decade, with several promising research directions.

First, there is clearly a lack of benchmarks for evaluating spatial indexes. This can be attributed to the many factors that need to be considered in evaluating a spatial index. Concerning the data, spatial data varies widely in size, spatial objects come in irregular shapes, and objects are not uniformly distributed in the data space. Furthermore, queries range from simple point queries to complex spatial join operations that come in different flavors (intersection, containment and proximity). Designing a suite of benchmarks is an important issue that cannot be ignored.

Second, as pointed out, the evaluation of spatial indexes has been rather limited. Most performance studies used the R-tree as the basis for comparison. Furthermore, most of the work used synthetic data. We believe that more extensive and comprehensive performance studies using real data sets will be necessary and useful for practitioners as well as developers.

Third, the scalability (in terms of the number of dimensions of the data space) of existing indexes has not been adequately addressed. Most of the work is restricted to two-dimensional space. Recent work by Berchtold et al. [Berchtold et al., 1996] addressed the scalability of indexes with respect to the number of dimensions, and showed that the R*-tree does not scale well; rather, it degenerates drastically. The same paper also shows that the TV-tree [Lin et al., 1995] can perform poorly as the number of dimensions increases. While the X-tree [Berchtold et al., 1996] appears to be a promising scalable index, we believe that designing scalable high-dimensional indexes will be highly exciting and rewarding.
3 IMAGE DATABASES

Images have always been an essential and effective medium for presenting visual data. With advances in today's computer technologies, it is not surprising that in many applications much of the data is images. In medical applications, images such as X-rays, magnetic resonance images and computer tomography images are frequently generated and used to support clinical decision making. In geographic information systems, maps, satellite images, demographics and even tourist information are often processed, analyzed and archived. In police department criminal databases, images like fingerprints and pictures of criminals are kept to facilitate the identification of suspects. Even in offices, information may arrive in many different forms (memos, documents, and faxes) that can be digitized electronically and stored as images.

Traditional database management systems, which have been effective in managing structured data, are unable to provide satisfactory performance for images, which are non-alphanumeric and unstructured. The growing need for image information systems has led to the design and implementation of image database systems [Chang and Fu, 1980, Chang and Hsu, 1992, Kunii, 1989, Knuth and Wegner, 1992, Nagy, 1985, Ogle and Stonebraker, 1995, Tamura and Yokoya, 1984].

E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
In this chapter, we focus on content-based retrieval techniques, that is, techniques that retrieve images based on their visual properties such as texture, color and the shape of objects. In particular, we look at the critical issue of speedily finding the correct images in a large image database system based on image features. For a large collection of images, sequentially comparing image features is time-consuming and impractical, if not impossible. Instead, access methods that exploit the image features to narrow the search space are necessary.

We begin our discussion by looking at what constitutes an image database system. Following that, in Section 3.2, we shall discuss some of the issues involved in the design of a content-based index. In the same section, we also review indexing mechanisms that can be used to support content-based retrievals. In Section 3.3, we provide a taxonomy of existing image indexes. The taxonomy is based on the image features used for indexing. Following that, we present four indexes that facilitate speedy retrieval of images based on color-spatial information. In Section 3.4, we examine three hierarchical indexes that integrate multiple existing indexes into a single structure, and in Section 3.5, we present a signature-based technique. Finally, we conclude with a speculation on future trends in Section 3.6.

3.1 Image database systems

An image database system must deal with both structured and unstructured data. Furthermore, an image database system also distinguishes itself by the following additional functionalities:

• Feature extraction. In order to organize the images and their associated information, it is necessary for the system to understand the contents of the images. Thus, the system must be able to analyze an image to extract key features such as the shape of the objects in an image, its color components and its texture.

• Feature-based indexing.
Traditional database systems index their data by key attributes, which are usually numeric or fixed-length text data. For image database systems, the system must build indexes based on the extracted features. Such feature-based indexes can then be used to facilitate efficient search of a large collection of images and other related information based on the features of the images.

• Content-based retrievals. Image database systems should support a wide range of queries. In particular, queries that involve the contents of an image, in words/text or pictorial form, are important and crucial.
• A measure of similarity. Since content-based queries are usually inexact, the system requires a measure to capture what we humans perceive as similarity between two images. However, as the notion of similarity does not necessarily mean correctness, the similarity measure must be carefully designed not to exclude any relevant images, while at the same time minimizing the number of irrelevant images in the results.

Figure 3.1. Architecture of an image database system: a preprocessing module (image input/scanner, feature extraction, and index/database update) feeds the feature/image database, which a query module (interactive query formulation, runtime feature extraction, feature matching, browsing and feedback, and a concurrency control and recovery manager) uses to return retrieved images.

Figure 3.1 shows the (generic) architecture of an image database system. Images are preprocessed to extract the key features used for searching. The images and the feature indexes are then stored in the database. During retrieval, features are extracted from the query image and matched against those stored, to retrieve images that are similar to it. As a consequence of the need to retrieve images based on similarity, the user interface will usually incorporate browsing and feedback mechanisms to facilitate the reformulation of queries to improve accuracy. Like traditional database systems, concurrency control and recovery managers are also critical components of an image database system.

Supporting a fully functional image database system is a difficult problem and embraces different technologies such as image processing, user interface design, and database management. In fact, early systems are largely attribute-based or free-text-based and hardly have any real content-based support. For attribute-based systems, images are treated as binary large objects (BLOBs).
A conventional DBMS, extended with the capability to handle BLOBs, can be used to manage the images. Access to the unstructured images is achieved through the structured attributes of the images. Hence, no special effort is required to design the organization techniques, indexing mechanisms (such as B+-trees and inverted files) and query processing methods of the systems. However, this approach is not capable of handling the more user-friendly content-based queries.

The free-text-based approach applies the concepts of document retrieval techniques to provide "content-based" functionality by manually describing the image and treating the image description as that of a document. Image access is done through the accompanying image description. For example, for the query "Retrieve all images that show a girl skating in an ice rink", the description "a girl skating in an ice rink" is used to retrieve the images. The system attempts to match this description with those of the images stored in the database. Indexing methods that can be used include signature file access methods, inverted file access methods and direct (or sequential) file access methods. Besides being unable to facilitate true content-based queries, the free-text-based approach has other limitations: a free-text description of an image is highly variable, due to the ambiguities of the natural language used to annotate images and the different possible interpretations of the image; an image description is usually incomplete, since an image is semantically richer than a text description; and the vocabularies of the person creating the index and the user, or even of different users, may not match. As such, the effectiveness of this approach is fairly limited. The readers are referred to Chapter 5 for an in-depth discussion of text indexing techniques.
3.2 Indexing issues and basic mechanisms

3.2.1 Key issues in content-based index design

Designing an access method for an image database system is more complex than for a traditional database system. This is because the features to be indexed (hereafter referred to as indexing features) are usually unstructured. Three key issues that must be addressed in designing an index structure for content-based image retrieval are:

• Determine a representation for the indexing feature.

• Determine a similarity measure between two images based on their representations.

• Determine an appropriate index organization.
For the first issue, a suitable representation must be determined and used to represent the indexing feature. Some of the desirable properties of a representation include:

• Exactness. For a representation to be useful, it has to capture the essential details of the indexing feature.

• Space efficiency. The representation should keep the storage cost low. To this end, approximate representations rather than exact representations are often used. For example, instead of representing the shape of an object, its bounding box can be used. As another example, grouping colors that are perceptually similar can reduce the number of colors that need to be maintained by the system without sacrificing retrieval accuracy.

• Computationally inexpensive similarity matching. It should be easier and faster to compute the similarity between the representations than between the features themselves. In general, computing the degree of similarity between approximate representations is less computationally intensive. For example, computing the intersection of two polygons is more costly than computing the intersection of the two rectangles that represent them.

• Preservation of the similarity between the features. Two features that are similar should remain so under their representations.

• Automatic extraction. The representation should be automatically extracted, rather than manually generated.

• Insensitivity to noise, distortion and rotation. Any noise or distortion should not affect the representation drastically. In other words, two features of the same image, one without noise and the other distorted by some noise, should be represented in a similar (if not identical) way. Similarly, the representation of a feature should be the same regardless of whether the image has been rotated.

It is hard to find an effective representation with all the desirable properties. In fact, some of the above properties conflict.
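The cost argument for bounding boxes can be made concrete: an axis-aligned rectangle overlap test takes only four comparisons, so it can serve as a cheap filter before any exact polygon test. The sketch below is our own illustration of this filter-and-refine idea, not an algorithm from the text:

```python
def boxes_intersect(a, b):
    """Constant-time overlap test for axis-aligned boxes given as
    (xmin, ymin, xmax, ymax) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_candidates(query_box, object_boxes):
    """Filter step of filter-and-refine: only objects whose boxes overlap
    the query box need the costly exact polygon-intersection test."""
    return [oid for oid, box in object_boxes.items()
            if boxes_intersect(query_box, box)]
```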
For example, representing the color of an image as a vector (color histogram), which has all the above properties, has been shown to be less effective than a representation that also captures spatial information. However, the latter representation of color incurs more storage, and is more sensitive to the orientation of the image. Before moving on, we would like to look at two methods that can be used to represent image features coarsely. These methods have the advantages of space efficiency as well as reducing the dimensionality of the indexes (for vector-based representations). They can be categorized as follows:
• Partitioning. This method partitions an image space into a fixed-size grid. Each cell is assigned a label and can be used to approximate the size of an object or the spatial location of a feature. For example, the set of cells that contains an object serves as an indication of the size of the object. As another example, the location of an object can be determined by the position of the cell it is in.

• Grouping. This method combines several components of a feature into groups, and represents the image feature in terms of the groups instead of the large number of components. For example, the basic color feature can have over 100 different colors, but these can be grouped into a small number of groups based on the fact that many colors are perceived to be similar by humans. As another example, the shape of an object can be described by a small number of primitives such as lines and arcs.

A coarse representation can be used as a quick means of pruning away irrelevant images; a finer representation is usually necessary in order to restrict the set of potential candidate images to a manageable size. The second issue follows from the first. The similarity measure between the indexing features of two images, say S1, may no longer be appropriate on the representations. Thus, an appropriate similarity measure on the representations, say S2, has to be derived. The main criterion for such a similarity measure is that two features that are similar under S1 should remain so under S2. In fact, since the representations may be approximate, we expect the number of images that are similar to a query image under S2 to be larger than that under S1. There are several alternatives for determining the similarity between two features through their representations:

• Exact match.
In this approach, the representation of an image feature is usually coarse, in the sense that images with similar features will be mapped to the same representation. As a result, an exact match on the representation can be used to search for similar features.

• Approximate match. Under this approach, the degree of similarity between the image representations is computed based on some approximation technique. One advantage of this category is that the image representation can be exact. Where approximate representations are used, we can expect more irrelevant images to be retrieved as well.

Finally, an appropriate index organization should be determined to organize the representations in such a manner that the similarity measure can be supported efficiently. Other important criteria for the selection of an index structure include storage efficiency and maintenance (update) overhead. To a certain extent, the
representation and similarity measure determine the index structure. For example, if the image feature is represented as a vector, and the similarity measure is the Euclidean distance, then a natural choice is a multi-dimensional point access method. Here, the vector is mapped to a point in a multi-dimensional space, and a region search can be used to search for similar images in the multi-dimensional space. On the other hand, if the image features are represented as rectangles in the image space, then a spatial access method may be employed. In fact, as we shall see in Section 3.3, most of the image indexes are based on existing techniques. As such, we shall review some of these techniques before proceeding to look at the taxonomy.

3.2.2 Basic indexing schemes

Spatial access methods. Spatial access methods are file structures used to organize large collections of multi-dimensional points or geometric objects to facilitate efficient range or nearest neighbor searches. It turns out that we can easily exploit such techniques to speed up the retrieval of images. The basic idea is to extract k image features from each image, thus mapping images into points in a k-dimensional feature space. Once this is done, any spatial access method can be used as the index, and similarity queries will then correspond to nearest neighbor or range searches. As an example, consider the color feature. In general, the color feature can be represented as a k-tuple for a system that supports k colors, where the values of the tuple of an image are the percentages of the colors in the image. Many spatial access methods have been proposed in the literature.
These include methods that transform geometric objects into points in a higher-dimensional space, such as the grid file [Hinrichs and Nievergelt, 1983]; methods that linearize spatial data, such as quad-trees [Gargantini, 1982] and "z-ordering" [Orenstein, 1986]; and methods that are based on trees, such as the family of R-trees [Guttman, 1984]. However, most of these methods suffer from the so-called "high-dimensionality curse", that is, these techniques perform no better than sequential scanning as the number of dimensions becomes sufficiently large [Faloutsos et al., 1994]. For example, for R-trees, performance begins to degrade drastically as the dimensionality reaches 20 and above. We refer the reader to Chapter 2 for a survey of spatial access methods.

Inverted file. In an inverted file index, an inverted list is created for each distinct key (indexed feature). The inverted list essentially consists of a list of pointers to the objects that contain features that are similar to the indexed feature. Given an image feature, the inverted file is scanned, and all images with features that are similar to it can thus be retrieved speedily. However,
the inverted file method incurs high storage overhead and is also expensive to update. Some recent work has been done to address the storage problem [Witten et al., 1994, Moffat and Zobel, 1996].

Signature file. The signature file access method is an efficient access method for objects that can be characterized by a set of descriptors, making it suitable for indexing unstructured data such as textual documents (characterized by a set of keywords) and images (characterized by a set of semantic objects or colors). Each descriptor of an image can be represented as a string of bits, and an image signature can be obtained by superimposing (inclusive-OR) all the descriptors of the image. The signatures of all images can then be maintained in a file called the signature file. During query retrieval, the descriptors of the query image are coded into a signature, and the signature file is then used as a filtering mechanism to eliminate most of the unqualifying data so that only a portion of the data file needs to be accessed. The retrieval performance, however, can be hampered by a high false drop probability (due to irrelevant images' signatures matching the query image). Variations of the signature file access method have been proposed to improve its retrieval efficiency. These include the single-level signature file [Roberts, 1979], the multi-level signature file [Sacks-Davis et al., 1987], and the partitioning approach [Lee and Leng, 1989].

3.3 A taxonomy of image indexes

Existing image indexing mechanisms can be classified based on the image features used for indexing. For each image feature, further classifications can be made with respect to the semantic representations used for the feature. A different type of semantic representation entails a different indexing method. In this section, we provide a taxonomy of image indexing schemes based on such classifications.
These schemes have been reported in the literature. For some features, other schemes which may also be applicable but have not been reported are excluded from our discussion. The taxonomy is summarized in Figure 3.2.

3.3.1 Shape feature

The shape feature is extremely useful for image database systems like an X-ray system or a criminal picture identification system. In an X-ray system, queries like "Retrieve all kidney X-rays with a kidney stone of this shape" are very common. For a criminal picture system, we expect queries like "Retrieve all criminals with a round face shape". The example shape, the shape of a kidney stone in the first case, and round in the second, can be supplied using an example image.
[Figure 3.2 (tree diagram): for each image feature (shape, semantic objects, spatial relationship, texture, color) the figure lists its representations (e.g., rectangular cover, geometric properties, similarity against representative objects, object signatures, 2-D string, Tamura features, color histogram, color-spatial) and the corresponding index structures (multi-dimensional indexes, inverted files, signature files, multi-level signature files, sequential files, the two-level B+-tree, the three-tier color index and the Sequenced Multi-Attribute Tree).]

Figure 3.2. A taxonomy of image indexing schemes.
Shape features can be represented using boundary information by means of 16 primitive shape features. Each primitive feature is either a line or an arc, with a starting point, an ending point, and so on. Moreover, each primitive feature can be denoted by a distinct character. Thus, the boundary information can be compactly stored as a one-dimensional string [Jea and Lee, 1990]. The shape features of a shape boundary can then be represented by substrings of the one-dimensional string. This simple representation allows the exploitation of existing efficient string matching algorithms. Since objects with the same shape will be encoded in the same manner, exact string matching is performed instead. To index the string representation, an inverted file is used.

A closely related work by Mehrotra and Gary [Mehrotra and Gary, 1993] used a set of structural components to represent the shape boundary. These components are modeled as an ordered set of interest points such as locally maximal curvature points or vertices of the polygonal approximation. A shape feature can be obtained by fixing the number of points to be used to represent the shape feature. The feature is then mapped into a point in a multi-dimensional space, where the dimension is given by the number of points used to represent the shape. The similarity measure can then be given by the Euclidean distance between pairs of points in the multi-dimensional space. A multi-dimensional point access method is used for indexing the shape feature.

In [Jagadish, 1991], a collection of rectangles that forms a rectangular cover of the shape is used. Since shapes vary widely from object to object, the number of rectangles can be very large. To reduce the storage requirement, at most k rectangles in the cover are used to represent the shape.
The k rectangles picked must capture the most important features of the shape "sequentially", that is, the k rectangles form a sequence. As each rectangle is represented by two pairs of coordinates, and there are at most k rectangles, the shape feature can be easily mapped into a point in a 4k-dimensional space. Thus, a multi-dimensional point access method can be readily used for indexing the shape feature. Similarity retrieval based on Euclidean distance is performed using a region search query.

Shape can also be represented based on the concept of mathematical morphology [Korn et al., 1996, Maragos and Schafer, 1986, Zhou and Venetsanopoulos, 1988], which employs a primitive shape that interacts with an image to extract useful information about its geometrical and topological structure. A (2M+1)-element vector, called the size distribution of a shape [Serra, 1988], can be used to store the measurements of the area of an image at (2M+1) different scales. The pattern spectrum [Maragos, 1989] turns out to be a compact representation that captures the same information. The advantage of the scheme is that it is essentially invariant to rotation and translation, and can highlight differences at several scales. In [Korn et al., 1996], the pattern spectrum is first employed to
capture the shape information of an image (in the domain of a tumor database). The information is then mapped into the (2M+1)-element vector of the size distribution so that a multi-dimensional point index can be employed to index the shape information. While similarity retrieval is essentially a nearest neighbor search, the paper also presented a distance function, the max-granulometric distance, that guarantees no false dismissals.

Numerical vectors have also been employed to model shape. These include using the coefficients of the 2-D Discrete Fourier Transform or the Discrete Wavelet Transform [Mallat, 1989], as well as the first few moments of inertia [Faloutsos et al., 1994, Flickner et al., 1995]. These techniques usually map the shape feature into a multi-dimensional point access method and use the Euclidean distance for similarity retrieval. Alternatively, the shape features can be represented by the geometric properties of the image, such as shape factors (for example, the ratio of height to width), mesh features, moment features and curved line features. In this case, the inverted file has been used for indexing. For a system that is based on the shape feature, unless the images have very distinct shapes, the performance may suffer. As such, shape is usually employed in specialized domains.

3.3.2 Semantic objects

If objects within an image are prominent and can be easily recognized, retrieval can be achieved based on the objects. Queries can be evaluated by matching the list of objects of a query image against the list of objects of images in the database. Two methods have been adopted in the literature:

• An object in an image may be analyzed to determine its degree of similarity against a set of distinct objects. This degree of similarity is represented as a belief interval (bi) [Rabitti and Stanchev, 1989] that indicates how closely an image object matches the representative object used in the system.
An inverted file is used to maintain, for each distinct object, a list of (bi, ptr) pairs, where ptr is a pointer to an image that contains an object that resembles the indexed object with a belief interval of bi. In this way, given a query image object, one first determines the corresponding distinct object it belongs to, from which one can obtain all objects that are similar to it. By sorting the list in non-ascending order, the system can control the degree of similarity desired.

• An object may also be represented by an object signature. An image signature is obtained by superimposing all the object signatures of the objects in the image [Rabitti and Savino, 1991]. The signature file access method can then be used to speed up the retrieval process. A query image's set of signatures can be obtained, and its image signature is first used to prune away images that are irrelevant. Candidate images are then further examined by comparing their object signatures against those of the query image.

The object-based approach is, however, limited by current image analysis techniques. Unless objects are very well defined, it still requires substantial human intervention in order to ensure that the objects are correctly extracted.

3.3.3 Spatial relationship

In an object-based system, a query image with a ball above a box may also result in images with a ball next to a box, or a box above a ball, being retrieved. A more discriminating way to retrieve images is to facilitate more precise querying that specifies both the semantic objects in the images and the spatial relationships between the objects. As an example, consider the query "Retrieve all paintings with a house and a tree on its left". Here, the house and tree are the objects, while "to the left" is a spatial relationship between the two. In [Chang et al., 1987, Chang et al., 1988], a semantic representation for spatial relationships using a two-dimensional string (2-D string) was proposed. An image is first preprocessed to obtain the symbols that represent the objects it contains. The 2-D string representation is then a projection of the symbols along the x-axis and the y-axis, and consists of a pair of one-dimensional strings (1-D strings), each representing the ordering and spatial relationships of the objects along the projected axis. For example, consider an image with three objects such that O1 is to the left of O2, which is to the left of O3. The projection on the x-axis results in the 1-D string O1 < O2 < O3, where "<" is a spatial operator that denotes "to the west or to the south of". In [Chang et al., 1987], only three spatial operators are used: "=" to mean "at the same spatial location as", ":"
to represent "in the same grid cell as", and "<" as explained. During query processing, the 2-D string representation of the query image is obtained and compared against those in the database. Similarity retrieval is supported using an exact representation and an approximate matching algorithm. Variations and extensions of the 2-D string have been explored [Chang et al., 1989, Lee and Hsu, 1990, Costagliola et al., 1992, Lee et al., 1992]. In particular, a multi-level signature file access method has been adopted as follows. An image can be partitioned into an M x N grid. For each object, an M x N bit object signature can be obtained by setting bit (i-1)·M + j to 1 if the object occurs in cell (i,j); otherwise the bit is cleared. An image signature can then be obtained by superimposing the object signatures. Querying is performed by determining the object and image signatures of the query image, and using them to filter the images to be retrieved.
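The grid-based signature scheme just described can be sketched as follows; this is a simplified illustration with hypothetical helper names (the cited papers differ in details such as bit numbering):

```python
M, N = 4, 4  # grid dimensions (illustrative)

def object_signature(cells):
    """Build an M*N-bit signature for one object. `cells` is a set of
    1-indexed (i, j) grid cells the object occupies; bit (i-1)*M + j is
    set for each occupied cell, as in the text (shifted to 0-based)."""
    sig = 0
    for (i, j) in cells:
        sig |= 1 << ((i - 1) * M + j - 1)
    return sig

def image_signature(object_sigs):
    """Superimpose (inclusive-OR) all object signatures of an image."""
    sig = 0
    for s in object_sigs:
        sig |= s
    return sig

def may_match(query_sig, image_sig):
    """Signature filter: an image survives only if every query bit is set.
    Survivors may still be false drops and must be examined further."""
    return (query_sig & image_sig) == query_sig
```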
The effectiveness of exploiting spatial relationships, as already mentioned, can be drastically affected by the orientation of the images, since the relationships between objects may no longer be preserved.

3.3.4 Texture

Texture is an important property that can be used as a cue for image retrieval. In particular, because it can be extracted from both gray-level images and color images, it can be used in many applications. However, the extraction of texture information is a computationally intensive operation. One of the most popular texture representations is the Tamura features [Tamura et al., 1978]. While texture can be captured by six basic computational forms, coarseness, contrast, directionality, linelikeness, regularity and roughness, it has been shown that the first three suffice to discriminate between texture differences in images. As such, these three forms (coarseness, contrast and directionality) have been widely used in texture recognition. These three components are briefly summarized here:

• Coarseness. The coarseness component measures the scale of the texture (for example, pebbles versus boulders). When two patterns differ only in scale, the magnified one is considered to be coarser. For patterns with different structures, those that have a larger element size or fewer element repetitions are perceived to be coarser by the human eye. Coarseness can be computed using moving windows of different sizes. The essence of the method adopted in [Tamura et al., 1978] is to pick the coarsest texture as the best size. For every region in an image, its coarseness is represented by the largest best-size texture, Sbest. The coarseness of the image can then be obtained by taking the average of Sbest over the image.

• Contrast. The contrast component can be thought of as representing the quality of the image. A good quality image is one that is sharp in contrast, while a low quality image is blurred.
The human eye can easily discriminate between a sharp image and a blurred one. As an image's contrast can be varied by stretching or shrinking its gray scale, the intensity of each pixel of an image can be multiplied by a positive constant to derive different contrast values. The contrast can then be obtained as a function of the variance of the gray-level histogram [Tamura et al., 1978].

• Directionality. Directionality describes whether an image has a favored direction (like grass) or whether it is isotropic (like a smooth object such as glass). The human eye can easily differentiate a directional pattern from one that is non-directional. In [Tamura et al., 1978], the degree of directionality is calculated using a histogram of local edge probabilities against
their directional angle. Although this measure does not categorize images as directional or non-directional, this histogram representation can sufficiently capture the global features of the images, such as long lines and simple curves.

Clearly, texture can be modeled as a 3-tuple (coarseness, contrast, directionality). Moreover, since two images are alike if their coarseness, contrast and directionality are similar, the Euclidean distance can be used as a measure of the degree of similarity between images. To speed up the retrieval process, the texture feature can be represented as a point in a 3-dimensional space, with region search being used to prune the search space. There are other representations of texture, such as the Simultaneous Autoregressive (SAR) model and the Wold features [Francos et al., 1993]. Both methods also represent texture as a vector of numbers, and compare images based on the Euclidean distance. As such, a multi-dimensional indexing mechanism can be used to index these texture features as well.

3.3.5 Color

A natural way to retrieve colorful images is to retrieve them by color. The color composition of an image is a global property which does not require knowledge of the component objects of an image. Moreover, color distribution is independent of view and resolution, and color recognition can be carried out automatically without human intervention. A semantic representation for color is a color histogram that captures the color composition of images [Swain, 1993]. Using the RGB color space, the histogram comprises a set of "bins", each representing a color that is obtained by a range of red, green and blue values. The number of pixels of an image falling into each of these bins can be obtained by counting the pixels with the corresponding color. The histogram is then normalized by dividing its entries by the total number of pixels of the image.
The normalized histogram is size-independent, which enables images of different sizes to be compared meaningfully. The degree of similarity between two images is determined by the extent of the intersection between the histograms. Query by visual example is possible by matching the histograms. Object recognition is also achieved by using the color composition of the object. However, to support indexing using color histograms, a multi-dimensional indexing method is necessary, and the number of dimensions required is of very high order (it equals the number of distinct colors to be supported). The color histogram of an image is mapped into a point in the multi-dimensional space, and a region query can be performed to find matching images. However, it has become clear that color alone is not sufficient to characterize an image. For example, consider two images: one with the top half blue and
bottom half red, while the other's top left and bottom right quadrants are red and its bottom left and top right quadrants are blue. Although these two images are similar in color composition, they are entirely different to a human observer. This is because the ways the colors are clustered, and the positions of the clusters, are very different in the two images. As such, several recent studies have proposed integrating color and its spatial distribution to facilitate image retrieval [Chua et al., 1997, Gong et al., 1995, Hsu et al., 1995, Lu et al., 1994, Ooi et al., 1997]. Most of the indexing mechanisms proposed for color-spatial information are multi-layered: the two-level B+-tree [Gong et al., 1995], the three-tier color index [Lu et al., 1994] and the Sequenced Multi-Attribute Tree (SMAT) [Ooi et al., 1997]. An exception to this trend is based on the signature file approach [Chua et al., 1997].

3.4 Color-spatial hierarchical indexes

In this section, we describe three indexes that have been proposed to integrate color and spatial information for image retrieval. All these schemes are hierarchical indexes in that multiple indexing mechanisms are integrated to form a single index structure. The search process begins at the top level index and moves down to the lowest level index, traversing along the path that satisfies the search criterion.

3.4.1 Two-level B+-tree structure

In [Gong et al., 1995], the color-spatial information of an image is modeled by splitting the image into 9 equal sub-areas (3 x 3), and the color information within each sub-area is represented by a color histogram. In this way, by matching the corresponding color histograms of two images, one can obtain a more accurate similarity (in terms of color-spatial information) between the two images than with the traditional histogram-based approach. Although the color histogram is a multi-dimensional representation, Gong et al.
cleverly mapped it into a numerical key. This not only turns the computationally intensive matching process into simple numerical-key comparisons, it also facilitates the exploitation of existing single-dimensional indexing structures such as the B+-tree. As a result, a two-level B+-tree structure was proposed to speed up the retrieval process. We shall first look at the retrieval technique, followed by the transformation of the color histogram into a numerical key, before proceeding to examine the index structure.

The retrieval technique. Given an image, it is first processed to extract its 9 color histograms. Each histogram is then mapped into two levels of information. The first level describes the composition of colors corresponding to the
histogram of the region. However, instead of using the full set of colors (which is very large), the colors are grouped into just 11 "bins". The grouping of colors is based on the observation that some colors are perceived to be similar by humans. This is accomplished in two steps:

• The RGB color space is transformed into Munsell's HVC color space [Miyahara and Yoshida, 1989]. This is necessary because it is not possible to determine the similarity between two colors in the RGB color space. Instead, the HVC color space describes colors in terms of hue (the color type), value (brightness) and chroma (saturation), and perceptual differences can be determined by geometric distances.

• The HVC color space is grouped coarsely into 11 bins, each of which can be distinguished from the others as a distinct color by subjective perception. The grouping is based on the argument that two images with the same visual content but taken under minor differences in illuminating conditions should not be considered as different images.

Furthermore, instead of the traditional approach of using the normalized pixel count to represent the proportion of a group, each group is assigned a range which bounds the percentage of pixels in the image with colors of the group. A total of 9 disjoint ranges are predetermined and used: [0,5), [5,15), [15,25), ..., [65,75), [75,100]. Because of the groupings, two histograms are considered to be similar if all the corresponding ranges of the 11 bins are the same. This simplifies the histogram matching process, but the coarse grouping increases the probability of retrieving irrelevant images, and of missing relevant images whose color composition falls into neighboring ranges. The second level of information contains the average H, average V, and average C values of all the 11 histogram bins.
As in the color composition, the H, V and C values are grouped into 9, 4 and 4 groups respectively, with intervals of 40°, 2.5 and 7.5. This level is used as a secondary similarity measure to complement the histogram metrics in order to reduce the number of irrelevant images retrieved. During query retrieval, the query image is processed to extract its 9 histograms. For each histogram, the two levels of information are obtained from the sample query. The level 1 information is used to prune away dissimilar images, and candidate images are further examined and compared on their H, V and C group values.

The index: Two-level B+-tree structure. The above retrieval mechanism has the nice property that only exact matches need to be performed: two histograms are similar if they have the same range values for the 11 histogram
bins, and for each pair of bins, the groups for the H, V and C values are the same. As such, the authors proposed that the first level information be mapped into a composite key with 12 attributes: the first attribute indicates the histogram region (1 of the 9 regions), and each of the other 11 attributes corresponds to one histogram bin and has a value that indicates its range (note that instead of keeping the range, since the set of ranges is predetermined, fixed and disjoint, a range is represented by a number). Similarly, the second level information is mapped into a 34-attribute composite key: the first attribute represents the histogram region, and the other 33 attributes are split into 11 groups of 3 attributes, each group for a histogram bin, with one attribute for the group number of the H value, one for the group number of the V value, and one for that of the C value.

[Figure: Level 1 is a B+-tree on the normalized pixel count; Level 2 consists of B+-trees on the average H, V and C values.]

Figure 3.3. The two-level B+-tree structure.
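The construction of the first-level composite key can be sketched as follows; the range table follows the text, while the encoding itself is our own illustration rather than the authors' exact scheme:

```python
# The 9 predetermined, disjoint percentage ranges from the text; a range is
# represented by its position in this table rather than by its boundaries.
RANGES = [(0, 5), (5, 15), (15, 25), (25, 35), (35, 45),
          (45, 55), (55, 65), (65, 75), (75, 100)]

def range_number(pct):
    """Map a bin's pixel percentage to its predetermined range number."""
    for k, (lo, hi) in enumerate(RANGES):
        if lo <= pct < hi or (hi == 100 and pct == 100):
            return k
    raise ValueError(f"percentage out of range: {pct}")

def level1_key(region, bin_percentages):
    """12-attribute composite key: the histogram region (1..9) followed by
    the range numbers of the 11 color bins."""
    assert 1 <= region <= 9 and len(bin_percentages) == 11
    return (region,) + tuple(range_number(p) for p in bin_percentages)
```

Because the key is a fixed tuple of small integers, two region histograms match exactly when their keys are equal, which is what allows an ordinary B+-tree to do the histogram matching.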
A two-level B+-tree can then be exploited to speed up the retrieval process; Figure 3.3 shows the structure. The top level index is a B+-tree built on the 12-attribute key, and is used to facilitate the histogram matching process. Each entry in the leaf nodes of this level is associated with an independent B+-tree that is built on the 34-attribute key. This second level tree is devised to facilitate the comparison of the average H, V and C values. Internal nodes store the maximum values of their child nodes in order to direct the search. Since images with the same histogram configuration will have the same first part of the key, they can be found in the same leaf node of the top level tree, and hence in the same second level tree associated with that leaf node. Thus the images in the second level tree will be fetched only if matching at both levels is successful.

3.4.2 Three-tier color index

To handle speedy image retrieval based on the positional information of color, Lu, Ooi and Tan proposed a three-tier color index [Lu et al., 1994]. While layers 1 and 2 prune away irrelevant images based on colors, layer 3 matches images based on their color positions as well. We shall first look at layers 1 and 3 individually and their motivations before presenting the index structure as a whole. The second layer is the R-tree structure.

Layer 1: Dominant color classification. The first layer is the dominant color classification. For each image, a fixed number of dominant colors is extracted. The dominant colors are those with the largest pixel counts. Based on the dominant colors, the image can be assigned to a partition. In this way, images with the same dominant colors can be found in the same partition. The underlying assumption is that images with the same dominant colors tend to be more similar than images that match on the less dominant colors.
Thus, during the image retrieval process, only a few partitions with similar sets of dominant colors need to be examined, while the other partitions with different dominant colors can be ignored. Let k denote the number of dominant colors. Then the number of classes is given by:

number of classes = nCk = n! / ((n - k)! k!)

where n is the number of colors supported in the system. Figure 3.4 illustrates this layer when k = 3.

Layer 3: Multi-level color histogram. The third layer is a complete quadtree structure, called the multi-level color histogram, used to capture spatial
distribution of colors. The basic idea is to capture the set of histograms for an image by recursively decomposing the image. For an image, its multi-level color histogram comprises several levels. The top level (root) of the tree corresponds to a histogram that gives the color composition of the entire image. The second level consists of four histograms that represent the color composition of the top left, top right, bottom left and bottom right quadrants of the image respectively. At the next level, we have the set of histograms that are obtained from further splitting each quadrant of the image into four equal parts, where each histogram is a description of the color content of each smaller part. This process is repeated for the number of levels desired. In general, at the ith level, the image is subdivided into 4^(i-1) regular regions, and each region has its own histogram to describe its color composition. For example, in Figure 3.4, the third layer is a 3-level color histogram. With multi-level color histograms, since every level captures the color composition of the entire image, any level can be used to compute the similarity between two images. For a level, the degree of similarity is given by the sum of the intersections of the corresponding pairs of histograms at the level. In other words, at the ith level, the similarity value is computed as follows:

S_i = (1 / 4^(i-1)) Σ_{j=1}^{4^(i-1)} Σ_{k=1}^{m} min(NH_k^j(Q), NH_k^j(D))

where m is the number of colors supported by the system, Q and D are the query and database images, and NH_k^j(I) is the normalized pixel count of the kth color in the jth histogram of image I. As the lower levels of the tree reflect more closely the color composition and distribution of the image, it is clear that the similarity value decreases as the tree is traversed downwards. This observation leads to a filtering mechanism during image retrieval.
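The level similarity S_i above amounts to a histogram intersection averaged over the 4^(i-1) regions of level i. A minimal sketch (our own naming; each level is a list of per-region histograms, each a list of normalized pixel counts indexed by color):

```python
def level_similarity(q_hists, d_hists, i):
    """Similarity at level i of the multi-level color histogram: the sum of
    histogram intersections over the 4**(i - 1) regions, averaged."""
    regions = 4 ** (i - 1)
    total = 0.0
    for j in range(regions):
        for qk, dk in zip(q_hists[j], d_hists[j]):
            total += min(qk, dk)  # intersection of the jth pair of histograms
    return total / regions
```

Since each region's histogram is normalized, S_i lies in [0, 1] and can be compared against a per-level threshold during the filtering described below.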
During query processing, the query image and the database images are compared based on their color histograms. The top-level histograms are first compared. If they match within some threshold value, the next level will be searched and compared, and so on. Only when the threshold value at the leaf level is met will the image be retrieved. The target image will be "discarded" if the similarity value fails to meet the threshold at any level of the tree. As it costs less to compute the similarity value at the higher levels of the tree, a significant amount of processing time may be saved and unnecessary accesses to irrelevant images can be minimized.

The index: Three-tier color index. Figure 3.4 shows the three-tier color index, which employs three levels of pruning to speed up retrieval. The first layer is the dominant color classification. It allows us to prune away images
belonging to classes that would never satisfy the query, narrowing the search space to some classes. Layer 2 is a multi-dimensional R-tree structure to further prune away images within the candidate partitions that are not relevant. This is achieved as follows. For each partition, an R-tree is used to organize the images within the class based on the proportion of the dominant colors in the images. Since the dominant colors are sufficient to discriminate between images, the dimensionality required is relatively small. Thus, images that are similar will be spatially close to one another, and a region query will be able to restrict the search to the relevant images within the partition. Finally, the last layer, which is the multi-level color histogram, compares the histograms of the query image with those of the remaining potential candidate images. Images that fail the test need not be retrieved. Thus, we can see that the three-tier color index can minimize accesses to the image collection to only those images that are most likely to satisfy the query.

3.4.3 SMAT: A height-balanced color-spatial index

In the two color-spatial approaches presented above, the spatial distribution of colors is coarsely captured by the various histograms. There is no indication of how the color is distributed in the image space within each region represented by a histogram. Another problem with the two approaches is that though the individual tree structures (B+-tree, R-tree, Dominant Color Classification) employed in the respective layers are height-balanced, the entire hierarchical index structure may not be so. For example, in the two-level B+-tree structure, if the database images are skewed such that many images have similar color compositions, then a small number of the B+-trees at the second layer will be much larger (and taller) than the rest. Retrieving these images will result in longer access times.
The same scenario holds for the three-tier color index. Resolving this problem calls for a new notion of height-balancing, and for new height-balanced index structures to be developed. In this section, we look at a height-balanced color-spatial index developed by Ooi et al. [Ooi et al., 1997]. We shall describe the representation of the color-spatial information, the algorithm to extract it and the retrieval technique before looking at the proposed hierarchical index structure.

Representing the color-spatial information. It has been observed that humans are prone to focus on large patches of colors, rather than on small patches that are scattered around [Beck, 1967, Treisman and Paterson, 1980]. The resultant effect is that given two images, they will appear to be similar
Tier 1: Dominant Color Classification. Tier 3: Multi-Level Color Histogram.

Figure 3.4. The three-tier color index.
if both of them have large patches (referred to as clusters) of similar colors at roughly the same locations in the images. For example, Figure 3.5 shows three images and the corresponding eight largest clusters, sorted in descending order. These clusters have been extracted using the proposed color-spatial technique to be discussed shortly. From the cluster representation of image A (Figure 3.5(b)), it can be seen that several clusters contain color 4 (pink). The cluster representation of image B (Figure 3.5(d)) also shows that there are dominant clusters containing color 4 (pink) that fall in the same region and intersect those clusters in image A. Hence, the two images are "similar" in terms of color and spatial information. Similarly, based on the cluster representation in Figure 3.5(f), it is clear that image C is different from the other two images since there is no common color and location between them. Based on this observation, [Ooi et al., 1997] represented the color-spatial information of an image as a set of single-colored clusters in the image space, and these clusters are used to facilitate image retrieval.

Extracting the color-spatial information. To extract the color and spatial information, a heuristic similar to the one adopted in [Hsu et al., 1995] was employed. The heuristic, which comprises three phases, represents the color-spatial information as a set of k single-colored regions, for some predetermined value k which is expected to be small. In the first phase, a set of k representative colors of an image is selected. The colors selected are those with the largest pixel counts in the image. This set of colors is called the dominant colors. In the second phase, a set of clusters for each of the dominant colors is determined. The algorithm adopted is based on the maximum entropy discretization method [Chiu and Kolodziejczak, 1986].
Briefly, for each color selected in the first phase, the maximum entropy discretization algorithm is applied to the image space to extract the spatial information of the color. Initially, the entire image is regarded as one whole region. In the first pass, the image is partitioned into four regions, and the process is repeated on the four regions recursively. For each region, an evaluation criterion is used to determine whether further partitioning is needed. The result of applying the algorithm is a set of representative regions for each selected color. Each region is represented as a rectangle within the image space. At the end of phase two, a large set of single-colored clusters has been derived. In phase three, these clusters are ranked (regardless of color) in descending order of their sizes (areas of the rectangles). The k largest clusters are picked as the dominant clusters to be used as the color-spatial information of the image.
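The recursive partitioning can be sketched as follows. This is a simplified illustration, not the algorithm of [Ooi et al., 1997]: a plain coverage test stands in for the maximum entropy discretization criterion, and all names and thresholds are ours.

```python
def extract_clusters(mask, rect, min_side=8, coverage=0.5):
    """Recursively partition `rect` = (x0, y0, x1, y1) over a binary color
    `mask` (mask[y][x] == 1 where the pixel has the color) and return the
    rectangles that are dense in that color."""
    x0, y0, x1, y1 = rect
    area = (x1 - x0) * (y1 - y0)
    if area == 0:
        return []
    count = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    if count / area >= coverage:
        return [rect]                      # dense enough: keep as one cluster
    if count == 0 or (x1 - x0) <= min_side or (y1 - y0) <= min_side:
        return []                          # empty, or too small to split further
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
    quads = [(x0, y0, xm, ym), (xm, y0, x1, ym),
             (x0, ym, xm, y1), (xm, ym, x1, y1)]
    return [r for q in quads for r in extract_clusters(mask, q, min_side, coverage)]
```

Running this for every dominant color, pooling the resulting rectangles and keeping the k largest by area yields the color-spatial representation described above.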
(a) Image A and (b) its 8 largest clusters; (c) Image B and (d) its 8 largest clusters; (e) Image C and (f) its 8 largest clusters. Each cluster is listed with its dominant color and its Xmin, Ymin, Xmax, Ymax coordinates and area.

Figure 3.5. Three images and their 8 largest clusters.
The similarity function used for image retrieval computes the degree of overlap between the rectangles of the source and target images. Two rectangles overlap only if they have the same color and they intersect in the image space; the degree of overlap is given by the number of pixels intersected. The retrieval process using the color-spatial information is as follows. The image database is initially preprocessed to determine the clusters (color-spatial information) of the images. Given a sample query image, its k clusters are first extracted. The color-spatial information of each image in the database is then compared with that of the query image using the similarity function described above. The images can then be ranked based on the percentage of overlap, retrieved and displayed in that order.

The index: Sequenced multi-attribute tree. Even though the approach restricts the number of clusters per image to k, the number of cluster comparisons to be performed is still very large, about O(N · k^2) where N is the number of images in the database. Since only a small number of images is likely to match the sample image, a large number of unnecessary comparisons are performed. To minimize the expensive comparisons, an index structure, the Sequenced Multi-Attribute Tree (SMAT), is proposed. SMAT is based on three observations on the similarity function of the color-spatial approach:

• Color must be matched before the spatial property, as color is deemed a more important feature.

• If two clusters of two images share the same spatial property but different color content, then the two clusters will not contribute to the similarity function.

• If two clusters of two images share the same color but have non-overlapping spatial properties, then the two clusters will also not contribute to the similarity function.
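The overlap test above can be sketched directly (a minimal illustration; the Cluster record and function names are ours):

```python
from collections import namedtuple

Cluster = namedtuple("Cluster", "color xmin ymin xmax ymax")

def overlap(c1, c2):
    """Pixels in the intersection of two clusters; zero when the colors
    differ or the rectangles do not intersect."""
    if c1.color != c2.color:
        return 0
    w = min(c1.xmax, c2.xmax) - max(c1.xmin, c2.xmin)
    h = min(c1.ymax, c2.ymax) - max(c1.ymin, c2.ymin)
    return w * h if w > 0 and h > 0 else 0

def similarity(query_clusters, image_clusters):
    """Total overlap between the cluster sets of two images."""
    return sum(overlap(q, c) for q in query_clusters for c in image_clusters)
```

Comparing every query cluster against every cluster of every image is exactly the O(N · k^2) cost that SMAT is designed to avoid.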
SMAT is a multi-tier tree structure, where each layer corresponds to an indexing attribute. For example, the top layer can be based on color, the second is based on color percentage or size of the cluster, and the last is based on spatial property. Each layer can be constructed using any indexing mechanism. For example, the top layer can be implemented using a single dimensional indexing structure such as the B+-tree [Comer, 1979]. On the other hand, the lowest layer can employ a multi-dimensional indexing structure like the R-tree [Guttman, 1984]. Except for the lowest level, entries in the leaf nodes of all levels point to the roots of the trees in the next level. Only the leaf nodes of the lowest level tree contain pointers to the image data. Thus, SMAT essentially consists of multiple trees integrated together in a hierarchical manner. To
reach the lowest layer of the SMAT, where the actual images are pointed to, the query must satisfy the conditions relating to the discriminating keys in all the higher layers. Any condition violated in any layer will terminate the search path prematurely. In [Ooi et al., 1997], a variation of the R-tree structure [Guttman, 1984] was employed to implement a 2-tier SMAT structure. Figure 3.6 shows the structural view of the SMAT structure implemented. The first layer discriminates clusters based on color. Since color is a single-dimensional attribute, the R-tree used at this layer is a single-dimensional R-tree (1-D R-tree). Each entry has a color range that defines the data space of the subtree pointed to by its child pointer. The color ranges of internal nodes do not overlap, unless they are exactly the same range. This occurs only when the data is very skewed. Entries of the leaf nodes of the first layer R-tree are of the form (color-range, BR, PTR), where BR defines the spatial bounding rectangle which contains all the clusters' color rectangles within the image space, and PTR points to an R-tree of the next layer. Spatial information is required at the leaf node for balancing purposes. Suppose, for a given color range, the next layer R-tree pointed to by PTR outgrows the others and the next split involves its root node (PTR). By splitting such a node, the height of SMAT will increase. To enable some form of balancing, the node is split according to the splitting strategy adopted at the second layer, but the entry is inserted into the leaf node of the first layer instead. In other words, two entries with the same color range (at the first layer) are created, but with different bounding rectangles. The second layer is based on the spatial information of the clusters. Each entry of an internal node contains a rectangle that defines its child node's data space and a pointer pointing to the subtree.
The second layer R-tree is like a 2-dimensional spatial R-tree structure. For the leaf nodes, entries are of the form (color, coordinates, PTR). The color attribute contains the color of the cluster, the coordinates attribute contains the four coordinates of the cluster, and PTR is a pointer to the address in the database that contains the image data (see Figure 3.6). The image data contains the ID of the image, and the colors and coordinates of the k dominant clusters. This information is used in computing the similarity function (we shall see how when we discuss the matching algorithm).

Matching and searching a SMAT. The matching algorithm retrieves images that are similar to a sample image. Given a sample image, the algorithm extracts k dominant clusters. For each of the clusters extracted, it determines the set of images that are similar to it. This is done by traversing SMAT to determine the clusters that match the clusters of the sample image. It suffices to know that the search algorithm returns a list of pointers to a file that con-
Level 1: 1-D R-tree (color discriminator). Level 2: 2-D R-tree (spatial discriminator).

Figure 3.6. The SMAT structure.
tains information on potential matching images. Recall that this information includes the image id and the (color, cluster) pairs of the image. From this information, the algorithm proceeds to compute the similarity value of the sample image and the candidate image, and ranks the candidate image accordingly. Since it is possible that other clusters of the sample image may also match the same candidate image at a later iteration, the image ids are maintained in a hash table to avoid subsequent comparisons and retrieval. Finally, all the images can be retrieved based on the image ids. The search algorithm of a SMAT structure is fairly straightforward, and follows from the way an R-tree is searched. The algorithm descends the 1-D R-tree from the root, and at each internal node, entries are checked. For each color range that contains the search color, the subtree is searched. When a leaf node is reached, the color of the search cluster is used to check for any entries whose color range contains the color. For all color ranges that qualify, their spatial bounding rectangles are checked to see if they intersect the search cluster. For qualified entries, the search continues to the corresponding 2-D R-trees at the next layer. While the traversal of the 1-D R-tree often leads to a distinct path (unless there are duplicates), more than one subtree under the 2-D R-tree may need to be searched. Nevertheless, the search algorithm can eliminate irrelevant clusters of the indexed images and examine only clusters near the search area.

Inserting color clusters into SMAT. Inserting image clusters into a SMAT raises some interesting issues concerning the growth of the tree. The first issue concerns the initial loading of SMAT. In this case, the tree is not "mature" in the sense that not all layers may have been constructed. The question of when SMAT grows from one layer to the next arises. The second issue deals with the height-balancing of SMAT.
While the R-tree is height-balanced, SMAT may not be fully height-balanced, as images may be inserted towards one end of the SMAT. The strategy adopted lets SMAT grow downward until some criterion is met, and grow upward when height-imbalance occurs. Initially, the heights of all the layers are predetermined. For a SMAT structure with k layers, L1, L2, ..., Lk, let the predetermined height for layer Li be hi. Note that hi, for all i ∈ [1, k], changes dynamically as SMAT grows. During initial loading, SMAT is not fully developed, and so hi is used to guide the growth of layer Li downward as follows: layer Li+1 will appear only if all the nodes along the path leading to the leaf node of layer Li in which the new record is to be inserted are full, and the length of the path has reached hi. This is to ensure that the height of the SMAT is maintained and not increased further unless necessary. To illustrate, consider the 1-D R-tree in Figure 3.6. Suppose leaf node 1 is full and h1 is set
to 2, and a new cluster is to be inserted into leaf node 1. If node 3 is full, instead of allowing the 1-D R-tree to grow, the tree grows downward by creating the next layer tree, and the record is inserted there. On the other hand, if node 3 is not full, then creating the next layer would undoubtedly increase the height of the search path by one. Instead, leaf node 1 should be split as normal. Once all the layers of SMAT are developed, the issue of height-balancing becomes a concern since it affects the retrieval time of SMAT. Although the R-tree is height-balanced, SMAT may not be so. This happens especially if there are a lot of clusters of a particular color. Thus, there is no guarantee that all the trees in the second layer index will grow and shrink at the same rate. This means that it is possible that a particular tree in a level may grow much faster than the other trees in the same level, causing the SMAT to be skewed to one side. That is to say, the basic SMAT structure can only be locally balanced but not globally height-balanced. Since SMAT is a multi-tier structure, the concept of height-balance is slightly different from that of a single-structure index. A SMAT structure is height-balanced if the following two conditions are met:

• Each tree structure within a layer is height-balanced.

• The difference in the heights of trees within a layer, say Li, is at most ei for some predetermined ei for each layer.

Figure 3.7 illustrates a height-balanced tree. As can be seen, in the worst case, the difference in height between trees within a k-layer SMAT is Σ_{i=2}^{k} ei. To keep SMAT height-balanced, the upper layers are allowed to grow once the lowest layer has been established. The minimum height of the trees at each layer is maintained. If there is an increase in the height of a tree (at a layer) as a result of an insertion, the new height of the tree is compared against the minimum height at that layer.
If the difference between the two is above a certain predetermined threshold, then rebalancing is activated. Rebalancing is performed as follows. Let the layer where rebalancing is needed be Li, and its parent layer be Li-1. Let the root of the tree that causes height imbalance at Li be Ri, and the leaf node of Li-1 that points to Ri be LNi. Let the entry in LNi that points to Ri be Iold. The information at Ri is used to insert a new entry, Inew, into LNi. Iold is set to point to the left child of Ri, and Inew is set to point to the right child of Ri. Ri can then be removed. Note that the corresponding bounding information in Iold needs to be updated too. The insertion algorithm that SMAT adopts within a tree is similar to that used in R-trees in that new clusters are added to the leaves, nodes that overflow are split, and splits are propagated up the tree. The splitting algorithm adopted is based on the quadratic-cost algorithm of the R-tree by Guttman [Guttman, 1984].
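The two balance conditions above can be expressed as a simple check. This sketch is ours, assuming the heights of the trees are collected per layer:

```python
def is_height_balanced(layer_heights, e):
    """SMAT balance test. Each individual tree is assumed height-balanced
    (an R-tree/B+-tree property), so only the second condition is checked:
    within layer i, tree heights may differ by at most e[i]."""
    return all(max(heights) - min(heights) <= e[i]
               for i, heights in enumerate(layer_heights))
```

For example, with two layers whose trees have heights [3, 3] and [2, 4] and bounds e = [0, 2], the structure is balanced; growing one second-layer tree to height 5 would trigger rebalancing.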
Figure 3.7. A height-balanced SMAT (layers 1 to 4, each layer i shown with its minimum height hi and bound hi + ei).

The algorithm attempts to find a small-area split, but is not guaranteed to find one with the smallest area possible. There is, however, the additional task of handling height-balancing.

3.5 Signature-based color-spatial retrieval

In this section, we present a signature-based color-spatial retrieval technique [Chua et al., 1997]. The mechanism involves several components, and we discuss each of them in a subsection. First, the color-spatial information has to be extracted and represented. Next, we describe the retrieval process that is based on the color-spatial information. In particular, the retrieval process requires a measure to compute the similarity between two images (in terms of their color-spatial representation). We also discuss an approach which incorporates the concept of perceptually similar colors and weighting of colors.

3.5.1 Representing the color-spatial information

The proposed color-spatial approach partitions each image into a grid of m x n cells of equal size. Figure 3.8 shows an example of an image being partitioned into a 4 x 8 grid. Instead of obtaining the color-spatial information at pixel level, the colors that can be used to represent a cell are determined. This is done as follows. For a given color, each cell is examined to determine the percentage of the total number of pixels in the cell having that color. If this
percentage is greater than a pre-defined threshold value, then the cell is said to be represented by that color. This approach is equivalent to applying the maximum entropy discretization algorithm [Chiu and Kolodziejczak, 1986] under the assumption of uniform color distribution. Note that, depending on the threshold value, a cell may have no color representative or it may have more than one representative.

(Empty cell: does not satisfy the threshold; filled cell: satisfies the threshold.)

Figure 3.8. An image partitioned into a 4 x 8 grid.

For the approach to be practical and useful, several issues have to be addressed. First, the number of colors can be very large, resulting in a large set of color-spatial information. This is resolved by restricting the number of colors for an image to a set of C colors (called the dominant colors) of the image. C is expected to be small as most images are usually dominated by a few colors. To select the C dominant colors, the heuristic employed in [Hsu et al., 1995] is adapted. It works as follows. Two color histograms, Hi and Hc, representing the color composition of the entire image and the center of the image, are obtained. First, Ci (Ci < C) colors that have the largest number of pixels in Hi are picked. Next, the Ci colors picked are eliminated from consideration when the remaining Ce (= C - Ci) colors are to be picked. The Ce colors are obtained from the remaining colors with the largest number of pixels in Hc. While the first set of colors represents the background colors, the second set represents the object colors (based on the inherent assumption that objects usually appear in the center of an image).
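A minimal sketch of this two-step selection (our naming; histograms are dicts mapping a color to its pixel count):

```python
from collections import Counter

def dominant_colors(whole_hist, center_hist, c_i, c_e):
    """Pick c_i background colors from the whole-image histogram, then c_e
    object colors from the center histogram, excluding those already picked."""
    background = [color for color, _ in Counter(whole_hist).most_common(c_i)]
    remaining = {color: count for color, count in center_hist.items()
                 if color not in background}
    objects = [color for color, _ in Counter(remaining).most_common(c_e)]
    return background, objects
```

Excluding the already-picked background colors before ranking the center histogram is what reduces the chance that a dominant background color is mistaken for an object color.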
Unlike the algorithm in [Hsu et al., 1995], where the background and the object colors are selected alternately, the modification is to reduce the probability that the most dominant color in the center of the image (representing the object) is in fact one of the dominant background colors. This is based on the observation that a significant portion of the center region of an image can be covered by the background colors. The second issue concerns the representation of the color-spatial information. It turns out that the proposed approach has a very nice property - given a
color, a cell is either represented or not represented by it. As such, each cell can be represented by a bit: if the cell satisfies the threshold value, the bit is set; otherwise, it is cleared. Hence, for each color, a bitstream (called the color signature) that captures the spatial distribution of that color is obtained. In the color signature, bit (i · n + j) corresponds to cell (i, j). Referring to Figure 3.8 again, suppose a color qualifies to be the representative of cells 0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30 and 31; its corresponding 32-bit color signature will be 10001111001000110000000001100011. Given an image with k colors, there will be k color signatures. These color signatures can be superimposed (bitwise logical-OR) to obtain an image signature.

3.5.2 The retrieval process

From the human perception point of view, two images are perceived to be alike if the color compositions of the two images are similar, and the distributions of the colors in the images are similar. Under the signature-based representation of color information, the above two points can be translated into the following two conditions to facilitate efficient retrieval:

• The images have the same representative sets of colors.

• The signatures representing both images are similar in that they may only differ in some of the bits. This only requires a simple operation (logical AND) to compute the intersection between two images for a particular color.

We discuss in the next few subsections several similarity measures that have been used [Chua et al., 1997] to indicate the similarity between two images based on their signatures.

Basic similarity function. For the signature-based color-spatial approach, recall that each bit in a signature represents a particular cell in the image. Let Qi and Di denote the signatures of color i for a query image Q and a database image D respectively.
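The signature construction can be reproduced directly; the worked example from the text serves as a check (function names are ours):

```python
def color_signature(cells, total_cells=32):
    """One bit per grid cell in row-major order; a bit is set when the cell
    is represented by the color."""
    return ''.join('1' if c in cells else '0' for c in range(total_cells))

def image_signature(signatures):
    """Superimpose (bitwise logical-OR) the color signatures of an image."""
    return ''.join('1' if any(s[i] == '1' for s in signatures) else '0'
                   for i in range(len(signatures[0])))

sig = color_signature({0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30, 31})
# sig == "10001111001000110000000001100011", matching the example in the text
```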
Then, the two images have color i at the same particular region (cell) if and only if the corresponding bits in both signatures are set; otherwise the two images are not similar at the region. Let the representative color sets of Q and D be CQ and CD respectively. Then, the similarity measure, SIMbasic, between Q and D for a color i ∈ CQ can be determined as:

SIMbasic(Q, D, i) = BitSet(Qi & Di) / BitSet(Qi)   if color i ∈ CD
                  = 0                              otherwise          (3.1)

where BitSet(BS) denotes the number of bits in the bitstream BS that are set, and '&' represents the bitwise logical-AND operation. Now, if a large part of
cells in Q has the same color as that in D, then the similarity computed will be close to 1. The similarity measure between two images Q and D is then given by:

SIMbasic(Q, D) = Σ_{i ∈ CQ} SIMbasic(Q, D, i)          (3.2)

Similarity function with perceptually similar colors. Because of the effectiveness of using perceptually similar colors [Niblack et al., 1993], Chua et al. also incorporated the contributions of perceptually similar colors in their similarity measure. To determine the degree of similarity between two colors, the method proposed by Ioka [Ioka, 1989] was adopted. The method first transforms colors in the RGB space to the CIE (Commission Internationale de l'Eclairage) L*u*v* space, and the similarity between two colors can be measured from the Euclidean distance between the colors in the CIE L*u*v* space. The Euclidean distance between two colors, i and j, in the L*u*v* space is computed as:

D(i, j) = sqrt((Li - Lj)^2 + (ui - uj)^2 + (vi - vj)^2)          (3.3)

Let M denote the number of L*u*v* colors the system can support. The degree of similarity between two colors, i and j, is given by:

SIM(i, j) = 0                            if D(i, j) > p × Dmax
          = 1 - D(i, j) / (p × Dmax)     otherwise                (3.4)

where Dmax = max D(i, j), i ≠ j, 1 ≤ i, j ≤ M, and p is a predetermined threshold value between 0 and 1 (in our study, we have arbitrarily set p to 0.2). Essentially, p × Dmax represents the tolerance within which two colors are considered to be similar. If SIM(i, j) > 0, then color i is said to be perceptually similar to color j, and vice versa. The larger the value of SIM(i, j), the more similar the two colors are. If SIM(i, j) = 0, it means that the two colors are not perceived to be similar. The similarity values computed for all pairs of colors are stored in an M x M matrix, called the color similarity matrix (denoted SM), where entry (i, j) corresponds to the value of SIM(i, j).
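Equations 3.3 and 3.4 can be sketched as follows (a minimal illustration; the function names are ours):

```python
import math

def luv_distance(c1, c2):
    """Euclidean distance between two colors given as (L*, u*, v*) triples
    (Equation 3.3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def color_similarity(d, d_max, p=0.2):
    """Equation 3.4: zero beyond the tolerance p * d_max, otherwise
    decreasing linearly from 1 as the distance grows."""
    return 0.0 if d > p * d_max else 1.0 - d / (p * d_max)
```

Precomputing color_similarity for all color pairs yields the M x M color similarity matrix SM used during retrieval.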
SM is stored in a flat file and is used frequently during the retrieval process to determine the similarity between two colors. Under the signature approach, the contribution of the perceptually similar colors of color i for query image Q and database image D is computed as follows:

SIM_percept(Q, D, i) = Σ_{j ∈ S_p} [BitSet(Q_i ∧ D_j) / BitSet(Q_i)] × SM(i, j)   (3.5)
where S_p is the set of colors that are perceptually similar to color i, as derived from the color similarity matrix SM; SM(i, j) denotes the (i, j) entry of matrix SM. To take the contributions of perceptually similar colors into consideration, Equations 3.1 and 3.5 can be combined to obtain the perceived similarity between two signatures on color i as follows:

SIM_color-spatial(Q, D, i) = SIM_basic(Q, D, i) + SIM_percept(Q, D, i)   (3.6)

Thus, the similarity measure for query image Q and database image D is the sum of the similarities for each color in the representative set C_Q of image Q, and is given as follows:

SIM_color-spatial(Q, D) = Σ_{i ∈ C_Q} SIM_color-spatial(Q, D, i)   (3.7)

Weighted similarity function. In the above similarity measure, all the dominant colors are implicitly assigned the same weight. However, in some applications, it may be desirable to give the object colors a higher weight. This is particularly useful when the object is at the center and the user is only interested in retrieving images containing similar objects at similar locations. The authors also proposed a weighted similarity measure, given as follows:

SIM_weighted(Q, D) = Σ_{i ∈ C_i} SIM_color-spatial(Q, D, i) + wt × Σ_{i ∈ C_c} SIM_color-spatial(Q, D, i)   (3.8)

where C_i and C_c are the sets of background and object colors of Q respectively, and wt (> 1) is the weight given to the object colors. A weight greater than 1 is assigned to the object colors so as to rank more highly those images whose object colors are similar to the query image's.

3.6 Summary

In this chapter, we have surveyed content-based indexing mechanisms for image database systems. We have looked at various methods of representing and organizing image features such as color, shape and texture in order to facilitate speedy retrieval of images, and at how similarity retrievals can be supported.
In particular, we have provided a more in-depth discussion of color-spatial techniques that exploit colors as well as their spatial distribution for image retrieval. As images will continue to play an important role in many applications, we believe the need for efficient and effective retrieval techniques and access
methods will increase. While we have seen much work done in recent years, much remains to be explored in this field. In what follows, we outline several promising areas (not meant to be exhaustive) that require further research.

Performance evaluation

This chapter has presented a representative set of indexes for content-based image retrieval. Unlike other related areas such as spatial databases, the number of indexes proposed to facilitate speedy retrieval of images is still very small. This is probably because content-based image retrieval has largely been studied by researchers in the pattern recognition and imaging communities, whose focus has been on extracting and understanding features of the image content, and on studying the retrieval effectiveness of the features (rather than on efficiency issues). It is not surprising, then, that the indexes discussed have not been extensively evaluated. Apart from [Ooi et al., 1997], which reported a preliminary performance comparison demonstrating that SMAT outperforms the R-tree in most cases, most of the other works have only compared against the sequential scanning approach. We believe that a comparative study is not only necessary but will be useful for application designers and practitioners in picking the best method for their applications. It will also help researchers design better indexes that overcome the weaknesses and preserve the strengths of existing techniques. Another aspect of performance study, which is applicable to indexes in general, is the issue of scalability. Again, most of the existing work has been performed on small databases. How well such indexes will scale remains unclear until they have been put to the test. Readers are referred to [Zobel et al., 1996] for some guidelines on the comparative performance study of indexing techniques.

More on access methods

The focus of this chapter has been on content-based access methods.
There are many other content-based retrieval techniques that have been proposed in the literature [Aslandogan et al., 1995, Chua et al., 1994, Gudivada and Raghavan, 1995, Hirata et al., 1996, Iannizzotto et al., 1996, Nabil et al., 1996] and shown to be effective (in terms of recall and precision). These works, however, have not addressed the issue of speedy retrieval. Designing efficient access methods for these promising methods will make them more practical and useful. Another promising direction is to further explore color and its spatial distribution. One issue is to exploit colors that are perceptually similar. For example, out of the 16.7 million shades of color displayable on a 24-bit color monitor, the human eye can only differentiate up to 350,000 shades. As such, colors that are perceived to be similar should contribute to the comparison of color similarity. While some work has been done in this direction [Chua et al., 1997, Niblack et al., 1993], perceptually similar colors are considered in the computation of the degree of similarity, rather than being modeled in the feature representation. We believe the latter can be more effective in pruning the search space. Another issue is to exploit texture and color for the segmentation of an image space. Indexing of clusters based on both texture and color may be more effective.

Concurrent access and distributed indexing

Traditionally, image retrieval systems have been used for archival systems that are usually static, in that the images are rarely updated. As such, the issue of supporting concurrent accesses is not critical. Instead, in such applications, the access methods should be designed to exploit this static characteristic. However, as multimedia applications proliferate, we expect to see more real-time applications as well as applications running in parallel or distributed environments. In both cases, existing techniques will have to be extended to support concurrent accesses. Some techniques have been developed for centralized systems [Bayer and Schkolnick, 1977, Sagiv, 1986, Ng and Kameda, 1993] as well as for parallel and distributed environments [Achyutuni et al., 1996, Kroll and Widmayer, 1994, Litwin et al., 1993b, Tsay and Li, 1994]. But we believe more research tailored to image data, especially data involving hierarchical structures, is needed.

Integration and optimization

The retrieval results of an image database system are usually not very precise. The effectiveness of using the content of an image for retrieval depends very much on the image representation and the similarity measure. It has been reported that using colors and textures can achieve a retrieval effectiveness of up to 60% in recall and precision [Chua et al., 1996].
Furthermore, different retrieval models based on different combinations of visual attributes and text descriptions achieve almost similar levels of retrieval effectiveness. Moreover, each model is able to retrieve a different subset of the relevant images. This is because each image feature captures only a part of the image's semantics. The problems then include selecting an "optimal" set of image features that best fits an application, as well as developing techniques that can integrate them to achieve optimal results. One promising method is to use content-based techniques as the basis, but also to exploit the semantic meanings of the images and queries to support concept-based queries. Such techniques are known as semantic-based retrieval techniques. Typically, some form of knowledge base is required, rendering such techniques domain-specific. In [Chua et al., 1996],
the domain knowledge is supplied by users as part of a query. The query is modeled as a hierarchy of concepts through a concept specification language. Concepts are defined in terms of multiple content attributes of the images, such as text, colors and textures. Each concept has three components: its name, its relationships with other concepts, and rules for its identification within the images' contents. In answering queries, the respective indexes are used to speed up the retrievals for concepts at the leaves of the hierarchy, and their results are combined based on the defined hierarchy of concepts. More studies are certainly needed along this direction.
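A minimal sketch of such a concept hierarchy is given below. The concept names, the attribute dictionary and the averaging used to combine sub-concept scores are all illustrative assumptions; the actual combination in [Chua et al., 1996] follows the rules of the concept specification language:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Concept:
    """A node in a query concept hierarchy: a name, relationships to
    sub-concepts, and an identification rule evaluated against an
    image's content attributes (text, colors, textures, ...)."""
    name: str
    children: List["Concept"] = field(default_factory=list)
    # Rule maps an image's attribute dict to a match score in [0, 1].
    rule: Callable[[Dict[str, float]], float] = lambda image: 0.0

    def score(self, image: Dict[str, float]) -> float:
        if not self.children:
            # Leaf concept: answered directly by an index-backed rule.
            return self.rule(image)
        # Internal concept: combine the children's scores (here: average,
        # purely as a placeholder combination strategy).
        return sum(c.score(image) for c in self.children) / len(self.children)
```

For example, a hypothetical "beach" concept built from "sea" and "sand" leaf concepts scores an image by averaging the two leaf rules, mirroring how leaf-level index results are combined up the hierarchy.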
4 TEMPORAL DATABASES

Apart from primary keys and attributes that rarely change, many attributes evolve and take new values over time. For example, in an employee relation, employees' titles may change as they take on new responsibilities, as will their salaries as a result of promotions or increments. Traditionally, when data is updated, its old copy is discarded and only the most recent version is captured. Conventional databases that have been designed to capture only the most recent data are known as snapshot databases. With the increasing awareness of the value of the history of data, maintenance of old versions of records becomes an important feature of database systems. In an enterprise, the history of data is useful not only for control purposes, but also for mining new knowledge to expand its business or to move on to a new frontier. Historical data is increasingly becoming an integral part of corporate databases despite its maintenance cost. In such databases, versions of records are kept and the database grows as time progresses. Data is retrieved based on the time for which it is valid or recorded. Databases that support the storage and manipulation of time-varying data are known as temporal databases. In a temporal database, the temporal data is modeled as collections of line segments. These line segments have a begin time, an end time, a time-invariant

E. Bertino et al., Indexing Techniques for Advanced Database Systems. © Kluwer Academic Publishers 1997
attribute, and a time-varying attribute. Temporal data can be either valid time or transaction time data. Valid time represents the time interval during which the database fact is true in the modeled world, whereas transaction time is the time when a transaction is committed. A less commonly used time is user-defined time, and more than one user-defined time is allowed. A database that supports transaction time may be visualized as a sequence of relations indexed by time and is referred to as a rollback database. The database can be rolled back to a previous state. Here the rollback database is distinguished from the traditional snapshot database, where temporal attributes are not supported and no rollback facility is provided. A database that supports valid time records a history of the enterprise being modeled as it is currently known. Unlike rollback databases, these historical databases allow retroactive changes to be made to the database as errors are identified. A database that supports both time dimensions is known as a bitemporal database. Whereas a rollback database views records as being valid at some time as of that time, and a historical database always views records as being valid at some moment as of now, a bitemporal database makes it possible to view records as being valid at some moment relative to some other moment. One of the challenges for temporal databases is to support efficient query retrieval based on time and key. To support temporal queries efficiently, a temporal index that indexes and manipulates data based on temporal relationships is required. Like most indexing structures, the desirable properties of a temporal index include efficient usage of disk space and speedy evaluation of queries. Valid time intervals of a time-invariant object can overlap, but each interval is usually closed.
On the other hand, transaction time intervals of a time-invariant object do not overlap, and the last interval is usually not closed. Both properties present unique problems for the design of time indexes. In this chapter, we briefly discuss the characteristics of temporal applications, temporal queries, and various promising structures for indexing temporal relations. We also report on an evaluation of some of the indexing mechanisms to provide insights into their relative performance.

4.1 Temporal databases

In this section, we briefly describe some of the terms and data types used in temporal databases. For a complete list of terms and their definitions, please refer to [Jensen, 1994]. An instant is a time point on an underlying time dimension. In the discussions that follow, we use 0 to mark the beginning of time, and time point to mean an instant on the discrete time axis. A time interval [Ts, Te] is the time between two time points, Ts and Te, where Ts ≤ Te, with the inclusion of the
end time. Note that the closed-range representation is equivalent to the half-open representation, since [Ts, Te] = [Ts, Te + 1). A chronon is a non-decomposable time interval of some fixed minimal duration. In some applications, chronons have been used to represent an interval. A span or time span is a directed duration of time; it is a length of time with no specific starting and ending time points. The lifespan of a record is the time over which it is defined. The lifespan of a version (tuple) of a record is the time during which it is defined with certain time-varying key values. For indexing structures that support time intervals, start time and version lifespan are two parameters that may affect their query and storage efficiency.

4.1.1 Transaction time relations

Transaction time refers to the time when a new value is posted to the database by a transaction [Jensen, 1994]. For example, suppose a transaction time relation is created at time Ti, so that Ti is the transaction time value for all the tuples inserted at the creation of the relation. The lifespan of these tuples is [Ti, NOW]. The right end of the lifespan is open at this time, and can be assumed to have the value NOW to indicate a progressing time span. At time Tj, when a new version of an existing record is inserted, the lifespan of the new version is [Tj, NOW], and that of the previous version becomes [Ti, Tj). Transaction times, which are system generated, follow the serialization order of transactions and hence are monotonically increasing. As such, a transaction time database can be rolled back to some previous state along its transaction time dimension. There are two representations for transaction time intervals. One approach is to model transaction time as an interval [Snodgrass, 1987]; the other is to model transaction time using a time point [Jensen et al., 1991, Lomet and Salzberg, 1989, Nascimento, 1996].
The latter approach implicitly models an interval by using the time when a new version is inserted as the start of its transaction time, and the time point immediately before the insertion of the next version as its transaction end time. In what follows, we shall use the single-time-point representation to model transaction time. However, explicit representation of transaction time intervals is often used for performance reasons. To illustrate the concept of temporal relations, we use a tourist relation that keeps track of the movement of tourists in order to study the tourism industry. The relation has a time-invariant attribute, pid, and a time-varying attribute, city. At time 0, the relation is created and the transaction time value for the current tuples is 0 (Table 4.1). The lifespan of these tuples is [0, NOW]. At time 3, the tuple with pid = p1 is updated; the new city value is Los Angeles (Table 4.2).
Table 4.1. A tourist transaction time relation at time 0.

tuple  pid  city        Tt
t1     p1   New York    0
t2     p2   Washington  0
t3     p3   New York    0

Table 4.2. The tourist transaction time relation at time 3.

tuple  pid  city         Tt
t1     p1   New York     0
t2     p2   Washington   0
t3     p3   New York     0
t4     p1   Los Angeles  3
t5     p6   Seattle      3

To keep the history, a new tuple t4 is inserted. Thus, the lifespan of t1 is [0, 3) and the lifespan of t4 is [3, NOW]. In a transaction time relation, there are no retroactive updates (updates that are valid in the past) or predictive updates (updates that will be valid in the future). Each transaction is committed immediately with the current transaction time. For instance, if at time 2 the city for p1 changed to Seattle, this update cannot be committed at time 3. If a tuple will be updated at time 4, this update cannot be reflected in Table 4.2, because predictive updates are not supported in a transaction time relation. Note that time intervals that are still valid at the present time point are not closed; in other words, their end time progresses with the current time.

4.1.2 Valid time relations

The transaction time dimension only represents the history of transactions; it does not model real-world activity. We need a time dimension that models the history of an enterprise, such that the database can be rolled back to the right time-slice with respect to the enterprise's activity. Valid time is the time when a fact is true. In a valid time relation, a time interval [Ts, Te] is used to indicate when the tuple is true. Valid time intervals are usually supplied by the user, and each
new tuple is inserted into the relation with its associated valid time interval. A time-invariant key can have different versions with overlapping valid times, provided the temporal attributes of these versions are different. Time intervals that progress with the current time are open. Since valid times are usually determined by users, new tuples often have closed intervals that end before or after the current time NOW. Tables 4.3 and 4.4 show the valid time relation of tourist.

Table 4.3. The tourist valid time relation at time 0.

tuple  pid  city        Ts  Te
t1     p1   New York    0   3
t2     p2   Washington  0   NOW
t3     p3   New York    0   NOW

Table 4.4. The tourist valid time relation at time 3.

tuple  pid  city         Ts  Te
t1     p1   New York     0   3
t6     p1   Seattle      2   3
t2     p2   Washington   0   NOW
t3     p3   New York     0   NOW
t4     p1   Los Angeles  3   NOW
t5     p6   Seattle      3   6
t7     p5   Washington   4   6

At time 0, the tuples are inserted with their valid time ranges. Assume that in the period [2, 3] the city for p1 is changed from New York to Seattle, and that from time 3 it is changed again to Los Angeles. The relation in Table 4.4 represents these updates. Note also that the valid time relation in Table 4.4 can capture proactive insertions; for example, tuple t7, which has the valid time interval [4, 6], appears in the relation at time 3. Unlike a transaction time relation, a valid time relation supports retroactive and predictive updates. If an error is discovered in an older version of a record, it is modified with the correct value, the old value being substituted by the new one. Hence it is not possible to roll back to the past as in the transaction time database.
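The behavior of Table 4.4 can be mimicked with a small valid-time relation sketch. Representing NOW by an infinite sentinel is our implementation convenience, not the book's notation, and the class and method names are illustrative:

```python
NOW = float("inf")  # stand-in for the progressing end time NOW

class ValidTimeRelation:
    """Minimal valid-time relation: each tuple carries a user-supplied
    valid interval [ts, te]. Retroactive and predictive intervals are
    allowed, and versions of the same key may overlap in valid time."""

    def __init__(self):
        self.tuples = []                       # (pid, city, ts, te)

    def insert(self, pid, city, ts, te=NOW):
        self.tuples.append((pid, city, ts, te))

    def valid_at(self, t):
        """Point time-slice: versions whose valid interval contains t."""
        return [(pid, city) for pid, city, ts, te in self.tuples
                if ts <= t <= te]
```

Loading the tuples of Table 4.4 at time 3, the insertion of t7 with interval [4, 6] is a proactive insertion, and t6 with interval [2, 3] a retroactive one; a point time-slice at instant 2 then returns both the New York and the Seattle versions of p1, exactly the overlap the text permits.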
4.1.3 Bitemporal relations

In some applications, both the transaction time and the valid time must be modeled. This facilitates queries for records that are valid at some valid time point as of some transaction time point. A relation that supports both times is known as a bitemporal relation, which has exactly one system-supported valid time and exactly one system-supported transaction time. Table 4.5 illustrates the tourist bitemporal relation at time 0.

Table 4.5. The tourist bitemporal relation at time 0.

tuple  pid  city        Ts  Te   Tt
t1     p1   New York    0   3    0
t2     p2   Washington  0   NOW  0
t3     p3   New York    0   NOW  0

Table 4.6. The tourist bitemporal relation at time 5.

tuple  pid  city         Ts  Te   Tt
t1     p1   New York     0   3    0
t6     p1   Seattle      2   3    3
t2     p2   Washington   0   NOW  0
t3     p3   New York     0   NOW  0
t4     p1   Los Angeles  3   NOW  3
t5     p6   Seattle      3   6    3
t7     p5   Washington   4   6    3
t8     p5   Washington   5   8    5

From Table 4.6, note that tuples t7 and t8, with the same pid and city values, bear overlapping valid times [Ts, Te]. This is possible because the two tuple versions have different transaction time values; in a valid time relation, this situation cannot be represented. Like a valid time relation, the bitemporal relation supports retroactive and predictive versioning.
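A sketch of querying such a relation follows, assuming the append-only bitemporal tuples of Table 4.6 with the single-time-point representation of transaction time: rolling back to transaction time tt then amounts to ignoring tuples posted after tt, and the valid-time predicate is applied to what remains. The function name is our own:

```python
NOW = float("inf")  # open-ended valid time

# Bitemporal tuples of Table 4.6: (tuple_id, pid, city, Ts, Te, Tt)
tourist = [
    ("t1", "p1", "New York",    0, 3,   0),
    ("t6", "p1", "Seattle",     2, 3,   3),
    ("t2", "p2", "Washington",  0, NOW, 0),
    ("t3", "p3", "New York",    0, NOW, 0),
    ("t4", "p1", "Los Angeles", 3, NOW, 3),
    ("t5", "p6", "Seattle",     3, 6,   3),
    ("t7", "p5", "Washington",  4, 6,   3),
    ("t8", "p5", "Washington",  5, 8,   5),
]

def valid_as_of(rel, tv, tt):
    """Tuples valid at instant tv as of transaction time tt.

    Because the stored relation is append-only, its state as of tt is
    exactly the tuples posted no later than tt (r[5] <= tt); the valid
    time filter (r[3] <= tv <= r[4]) is then applied to that state."""
    return [r for r in rel if r[5] <= tt and r[3] <= tv <= r[4]]
```

For instance, as of transaction time 4, tuple t8 (posted at time 5) is invisible, so a query at valid instant 5 sees only t7 among the p5 versions.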
4.2 Temporal queries

Various types of queries for temporal databases have been discussed in the literature [Gunadhi and Segev, 1993, Salzberg, 1994, Shen et al., 1994]. Like any other application, temporal indexing structures must be able to support a common set of simple and frequently used queries efficiently. In this section, we describe a set of common temporal queries. These queries should be used to benchmark the efficiency of a temporal index. We use the tourist relation shown in Table 4.7 as a running example in the discussion that follows. We assume that the time granularity for this application is one day for both valid and transaction time. Consider the first tuple: the object with pid p1 is in New York from day 0 to day 2 inclusive. Its transaction time starts at day 1 and ends when there is an update to the tuple. A set of canonical queries was initially proposed by Salzberg [Salzberg, 1994]. We extend this set of queries by further classifying the temporal queries of each query type based on the search predicate: intersection, inclusion, containment or point. Such finer classification can provide insight into the effectiveness of the indexes for different kinds of search predicates. For queries that involve only one time and one key, the key can be either a time-invariant attribute or a time-varying attribute, and the time can be either valid time or transaction time. Single-time-dimensional queries are more meaningful for valid time databases; they can nevertheless be applied to transaction time, where the search remains the same although the semantics of time may be different. The following constitutes the common set of temporal queries:

1. Time-slice queries. Find all versions valid during the given time interval [Ts, Te]. For a valid time database, the answer is a list of tuples whose valid times fall within the query time interval.
For a transaction time database, the answers are snapshots during the query time interval, and hence the predicate "as of" is used for transaction time. Based on the search operation on the temporal index, time-slice queries can be further classified as:

• Intersection queries. Given a time interval [Ts, Te], retrieve all the versions whose time intervals intersect it. For example, a valid time query to find all tourists who are in the US during the interval [3, 7] would return 9 tuples: t2, t3, t4, t5, t6, t7, t10, t12 and t14.

• Inclusion queries. Given a time interval [Ts, Te], retrieve all the versions whose valid time intervals are included in it. For example, the query "Find all tourists who stay in a city between day 3 and day 7" would return 2 tuples: t5 and t10.
• Containment queries. Given a time interval [Ts, Te], retrieve all the versions whose valid time intervals contain it. For example, the query "Find all tourists who stay in a city from day 3 to day 5" would result in 5 tuples: t3, t4, t7, t10 and t14.

• Point queries. Given a specific time point t (instant), retrieve all the versions whose valid intervals contain the time point. Point queries can be viewed as the special case of intersection or containment queries where the time interval [Ts, Te] is reduced to a single time instant T. For example, the query "Find all tourists who are in the US on day 1" would result in 3 tuples: t1, t3 and t4.

2. Key-range time-slice queries. Find all tuples in a given key range [ks, ke] that are valid during the given time interval [Ts, Te]. This is a conjunction of keys and time. Like the time-slice query, the time-slice part of the query can assume any of the predicates described above. For example, the query to find all tourists who are in New York during the interval [3, 7] is a key-range time-slice query with an intersection predicate. The result of the query is now 2 tuples instead: t3 and t6. As another example, the query "Retrieve all tourists who are in cities with names beginning in the range [D, N] on day 1" would be a point key-range time-slice query that results in 3 tuples: t1, t3 and t4. The key-range time-slice query is an exact-match query if both ranges are reduced to single values; that is, find the versions of the record with key k at time t. An example of this category is "Find all tourists who visited New York on day 1", which results in the tuples t1 and t3.

3. Key queries. Find all the historical versions of the records in the given key range [ks, ke]. Such a query is a pure key-range query over the whole lifespan. For example, the query "Find all tourists who visited New York" is a past-versions query.
This query will return the tuples t1, t3, t6, t9 and t11.

4. Bitemporal time-slice queries. Find all versions that are valid during the given time interval [Ts, Te] as of a given transaction time Tt.

5. Bitemporal key-range time-slice queries. Find all versions in the given key range [ks, ke] that are valid during the given time interval [Ts, Te] as of a given transaction time Tt.

To answer time-slice queries, the index must be able to support retrieval based on time. Key-range time-slice queries require the search to be based on both keys and line segments. To support valid time, an index must support dynamic addition, deletion and update of data on the time dimension, and
support time that is beyond the current time. In other words, retroactive and proactive updates are required. An index that has been designed for valid time can be easily extended to transaction time, even though a transaction database can be thought of as an evolving collection of objects. The major differences are that delete operations are not required for transaction time databases, and that time increases dynamically at one end as it progresses. However, it is much more difficult to extend a transaction time index to index valid time data, since transaction time indexes are designed around the fact that transaction times do not overlap, and this property is quite often built into the index. Further, some transaction time indexes are specifically designed for intervals that are always appended at the current time, and do not support retroactive updates and proactive insertions.

Table 4.7. A tourist relation for running examples.

tuple  pid  city           period     trans_time
t1     p1   New York       [0, 2]     1
t2     p2   Washington     [5, now]   1
t3     p3   New York       [0, 6]     1
t4     p4   Detroit        [0, 7]     2
t5     p5   Washington     [4, 6]     2
t6     p5   New York       [7, now]   3
t7     p6   Seattle        [3, now]   3
t8     p4   Washington     [10, now]  3
t9     p3   New York       [12, now]  3
t10    p1   Los Angeles    [3, 6]     3
t11    p7   New York       [14, now]  4
t12    p1   Detroit        [7, 9]     4
t13    p1   Detroit        [10, 12]   5
t14    p9   Los Angeles    [3, 8]     6
t15    p1   San Francisco  [13, now]  6

4.3 Temporal indexes

Without considering the semantics of time, temporal data can be indexed as line segments based on start time, end time, or the whole interval, together with the time-varying or time-invariant attribute. Indexing structures based on start time or end time are straightforward and structurally similar to
existing indexes such as the B+-tree [Comer, 1979]. Such an index is not efficient for answering queries that involve a time slice, since no information on the data space is captured in the index: to search for the time intervals intersecting a given interval, a large portion of the leaf nodes has to be scanned. To alleviate this problem, temporal data can be duplicated in every data bucket whose data space of time intervals it intersects. However, duplication increases the storage cost and the height of the index, which affects the query cost. Alternatively, temporal data can be indexed directly as line segments, or mapped into point data and indexed using multi-dimensional indexes. As such, most temporal indexes proposed so far are mainly based on the conventional B+-tree and on spatial indexes like the R-tree [Guttman, 1984]. In this section, we review several promising indexes for temporal data: the Time-Split B-tree [Lomet and Salzberg, 1989, Lomet and Salzberg, 1990b, Lomet and Salzberg, 1993], the Time Index [Elmasri et al., 1990], the Append-Only tree [Gunadhi and Segev, 1993], the R-tree [Guttman, 1984], the Time-Polygon tree [Shen et al., 1994], the Interval B-tree [Ang and Tan, 1995], and the B+-tree with Linearized Order [Goh et al., 1996]. Where necessary, we also discuss the extensions that have to be incorporated for such indexes to facilitate retrieval by both the key and time dimensions.

4.3.1 B-tree based indexes

The Time-Split B-tree. The Time-Split B-Tree (TSB-tree) [Lomet and Salzberg, 1989, Lomet and Salzberg, 1990b, Lomet and Salzberg, 1993] is a variant of the Write-Once B-Tree (WOBT) [Easton, 1986]. The TSB-tree is one of the first temporal indexes to support search based on a key attribute and transaction time.
An internal node contains entries of the form <att-value, trans-time, Ptr>, where att-value is the time-invariant attribute value of a record, trans-time is the timestamp of the record, and Ptr is a pointer to a child node [Lomet and Salzberg, 1989]. Searching algorithms are affected by how a node is split and by the information it captures about its data space; therefore, we shall begin by looking at the splitting strategy. In the TSB-tree, two types of node splits are supported: key-value splits and time splits. A key split is similar to a node split in a conventional B+-tree, where a partition is made based on a key value. A TSB-tree after a key split is shown in Figure 4.1. For a time split, an appropriate time is selected to partition a node into two. Unlike a key split, all record entries that persist through the split time are replicated in the new node, which stores the entries with times greater than the split time. Figure 4.2 shows TSB-tree time splitting, in which the record <p1, Detroit, 4> is duplicated in both the historical node and the new node. If the number of different attribute values in a node is more than ⌊M/2⌋ (M is
the maximum number of entries in a node), a key split is performed; otherwise the node is split based on time. If no split time can be used other than the lowest time value among the index items, a key split is executed instead of a time split.

Figure 4.1. A key split of a leaf node in the TSB-tree based on p3.

To search based on key and time, the index keys and times of internal nodes are used, respectively, to guide the search. With data replication, data whose time intersects the data space defined in an index entry is properly contained in its subtree, and this enables fast pruning of the search space. The TSB-tree can only support transaction time, in the sense that the times of the same invariant key must be strictly increasing; in other words, there is no time overlap among the versions of a record. When a record is updated, the existing record becomes a historical record, and a new version of the record is inserted. The TSB-tree can answer all the basic queries on transaction time and time-invariant key. The major problem of the TSB-tree is that data replication can be severe, which may affect its storage requirements and query performance. As noted, the index cannot be used for valid time data.

The Time Index. Elmasri et al. [Elmasri et al., 1990] proposed the time index to provide access to temporal data valid in a given time interval. The technique duplicates the data at some selected time intervals and indexes it using a B+-tree-like structure. Duplications not only incur additional cost
in insertion and deletion, but also degrade the space utilization and query efficiency. In the worst case, where all intervals start at different instants but end at the same instant, the storage cost is of order O(n²). As for querying, reporting all intersections with a long interval likewise requires order O(n²) time, since most of the buckets need to be searched.

Figure 4.2. Time splitting in the TSB-tree: record <p9, Los Angeles, 6> is inserted, with T=5 chosen as the split time.

To reduce the number of duplications, an incremental scheme is adopted in which only the leading buckets keep all their ids, whereas the others maintain only starting or ending instants [Elmasri et al., 1990]. Figure 4.3 depicts the time index constructed using the most current snapshot of the tourist relation in Table 4.7. In the figure, the "+" and "-" signs indicate the starting and ending instants of an interval respectively. The number of duplications is reduced, but there are still many duplications for tuples with long intervals. To search from an instant onward, all the leading id buckets belonging to the same leaf node have to be read and checked. For instance, the query "Find all persons who were in the United States from day 4 to day 6" can be answered by locating indexing point 4 and reconstructing the list of valid tuples from the leading bucket and the subsequent entries up to indexing point 6. To insert or delete a long time interval, the number of leading id buckets to be read and updated can be high, on the order of O(n). The time index is thus likely to be efficient only for short query intervals and short time intervals.
For long data intervals, the amount of duplication can be significant.
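The incremental scheme lends itself to a short sketch. The following is a hypothetical in-memory model (the entries, tuple ids and times are invented for illustration, not taken from the tourist relation): a leading bucket stores the full set of ids valid at its indexing point, while subsequent entries store only the ids that start or end there, and reconstructing a snapshot means replaying increments from the nearest preceding leading bucket.

```python
# Hypothetical model of the time index's incremental scheme.  Each entry
# is (time_point, kind, payload): a "lead" entry carries the full set of
# tuple ids valid at that point; an "incr" entry carries (added, removed)
# sets relative to the previous indexing point.
entries = [
    (0, "lead", {"t1", "t3", "t4"}),
    (3, "incr", ({"t7", "t10"}, set())),
    (5, "incr", ({"t2"}, {"t1"})),
]

def valid_at(entries, t):
    """Reconstruct the set of tuple ids valid at time t."""
    current = set()
    for point, kind, payload in entries:
        if point > t:
            break
        if kind == "lead":
            current = set(payload)          # full snapshot: restart here
        else:
            added, removed = payload
            current = (current | added) - removed
    return current

# Tuples valid on day 4: the leading bucket at point 0 plus the
# increment at point 3.
print(sorted(valid_at(entries, 4)))
```

The cost noted in the text is visible here: inserting a long interval touches every leading bucket it spans, since each such bucket must list all ids valid at its point.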
Figure 4.3. The time index constructed from the tourist relation.

This will affect query efficiency, as the tree becomes taller and the number of leaf nodes increases. In addition, index support is provided for only a single notion of time (in this case, valid time), and it is not clear how the structure can be naturally extended to support temporal queries involving both transaction and valid time.

Elmasri et al. [Elmasri et al., 1990] also suggested that their time index can be appended to regular indexes to facilitate the processing of historical queries involving other, non-temporal search conditions. For example, if queries such as "Find all persons who entered the United States via LA and remained from day 4 to day 6" are expected on a regular basis, they may be supported by attaching a time index structure to each leaf entry of a B+-tree constructed for the attribute city. Answering the above query involves traversing the B+-tree to identify the leaf entry corresponding to the attribute value "LA", followed by an interval search on the time index found there. However, this approach may not be scalable, since the number of time indexes will grow exorbitantly large in any nontrivial database.

The Append-Only tree. The Append-Only tree (AP-tree), introduced by Gunadhi and Segev [Gunadhi and Segev, 1993], is a straightforward extension of the B+-tree for indexing append-only valid time data. In an AP-tree, the leaf nodes contain all the start times of a temporal relation. In a non-leaf node, the pointer associated with each time value points to a child node in which that time value is the smallest value (this rule does not apply to the first child of each index node). The AP-tree is illustrated in Figure 4.4.
Since both the update of an existing record and the insertion of a new version only cause incremental appends to the database, every insertion into the AP-tree
will always be performed directly at the rightmost leaf node. All the subtrees but the rightmost one of the AP-tree are 100% full. When the rightmost leaf node is full, the node is not split; instead, a new rightmost leaf node is created and attached to the most appropriate ancestor node. The AP-tree may therefore not be height-balanced. One such example is shown in Figure 4.5.

Figure 4.4. An AP-tree structure of order 3 (t1, t3 and t4 represent tuples with Ts = 0; t7 and t10 represent tuples with Ts = 3).

The AP-tree structure is simple and small, in the sense that it maintains no additional information about its data space. However, searching for a record can be fairly inefficient. To search for a record whose interval falls within a given time interval, as in a time-slice query, the end time of the search interval is used to locate the leaf node containing the record whose start time is just before the search end time; from that node, the leaf nodes to its left are scanned. To answer queries involving both key and time-slice, a two-level index tree called the nested ST-tree (NST) was proposed. The first level of an NST is a B+-tree that indexes key values, and the second level is an AP-tree that indexes the temporal data of records with the same key value. In the B+-tree, each leaf node entry has two pointers: one points to the current version of the record with this key, and the other points to the root node of the AP subtree. A query involving only a key value can directly access the most recent version of the record through the B+-tree. Figure 4.6 shows the structure of the NST. A similar index structure was also proposed to index a time-varying attribute and time. Since the temporal attribute is not unique, the qualified tuples will have overlapping associated time intervals.
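The time-slice search just described, descending on the query end time and then scanning leaf entries leftwards, can be sketched as follows. The sketch is hypothetical: a sorted Python list stands in for the leaf level, `bisect` stands in for the B+-tree descent, and the end times, which in the real structure come from the tuples themselves (the AP-tree indexes only start times), are kept in a side table.

```python
import bisect

# Hypothetical sketch of an AP-tree time-slice search.  A sorted list of
# (start_time, tuple_id) pairs stands in for the leaf level; the tuple
# ids and times are invented for illustration.
leaves = [(0, "t1"), (0, "t3"), (0, "t4"), (3, "t7"), (3, "t10"), (6, "t9")]
end_times = {"t1": 4, "t3": 6, "t4": 7, "t7": 9, "t10": 6, "t9": 20}

def time_slice(leaves, end_times, q_start, q_end):
    """Return ids of tuples whose interval intersects [q_start, q_end]."""
    # Position just past the last entry with start time <= q_end;
    # in the real structure the B+-tree descent finds this leaf.
    hi = bisect.bisect_right([s for s, _ in leaves], q_end)
    result = []
    # Scan leftwards from that position.  Every candidate starts no
    # later than q_end, so it intersects iff it ends at or after q_start.
    for start, tid in reversed(leaves[:hi]):
        if end_times[tid] >= q_start:
            result.append(tid)
    return result
```

The inefficiency noted in the text shows up in the leftward scan: a long-lived old tuple can qualify, so the scan cannot stop early and may visit many leaves.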
Figure 4.5. Appends in the AP-tree: (a) insertion of start time 12 into a full AP-tree; (b) insertion of start times 13 and 14.

The AP-tree supports only monotonic appends with increasing time values, so the variety of update operations it can handle is limited. The basic AP-tree itself can support queries involving only a time-slice, and even then the search is not especially efficient; a more expensive structure such as the NST has to be used to answer key-time queries. Clearly, for pure time-slice queries it is more efficient to use the AP-tree than the NST, whereas for key-range time-slice queries and past-version queries the NST is superior. We use the term AP-tree to refer to either of them; the context determines which structure is meant.

The Interval B-tree. The Interval B-tree [Ang and Tan, 1995], based on the interval tree [Edelsbrunner, 1983], was proposed for indexing valid time intervals. The underlying structure of the interval B-tree is a B+-tree constructed from the end points of the valid time intervals.

The interval B-tree consists of three structures: a primary structure, a secondary structure and a tertiary structure. The primary structure is a B+-tree which is
used to index the end points of the valid time intervals. Initially, it has one empty leaf node, and new intervals are inserted into this leaf node. When the node overflows, a parent node is created, and the middle value of the points, say m, is promoted into the newly created index node. The valid time intervals that fall to the left of m are placed in the left leaf bucket, and those falling to the right of it are placed in the right leaf bucket. Intervals spanning m are stored in a secondary structure attached to m in the index node.

Figure 4.6. A nested ST-tree structure.

Figure 4.7 shows the interval B-tree after inserting tuples t1, t2, t3 and t4 of Table 4.7, assuming a bucket capacity of 3. When t4 is inserted, the leaf bucket overflows, and 6, the middle value of {0, 0, 5, 6, 7, now}, is chosen as the item for the index node. The tuple t1 is stored in the left child of the new index node, while t2, t3 and t4 are in the secondary structure of index item 6. At this moment, the right leaf bucket is empty because no intervals fall to the right of 6.
Figure 4.7. An interval B-tree after inserting t1, t2, t3 and t4.

After the creation of the first index node, any further interval insertion proceeds from the root node of the primary structure. If an interval spans an index item, it is attached to the secondary structure of that item. A long valid time interval may span several index items; however, it should be attached to only one of them. The rule is as follows. The items in an index node are maintained as a binary search tree called the tertiary structure: the first item that entered the index node is the root of the binary search tree, and subsequent items with smaller (larger) values go into the left (right) subtree. In this binary search tree, the first item found to be spanned by the valid time interval is the one that holds it. Figure 4.8 shows the insertion of the remaining tuples of Table 4.7. After insertion, the root of the binary tree in the tertiary structure is 6. Suppose we have a tuple t16 with time interval [5, 15] to insert. Although the interval covers both 6 and 12 in the index node, the tuple is attached to 6, since 6 is encountered first in the binary tree of the tertiary structure. The efficiency of the index depends heavily on the distribution of the data and the values picked as index items: a poor choice of index values may cause most intervals to be stored in the secondary structures, resulting in a small B+-tree with large secondary structures.
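The attachment rule can be sketched as follows; this is a hypothetical illustration of the tertiary-structure walk (class and function names are ours), not the authors' implementation.

```python
# Hypothetical sketch of the interval B-tree rule for placing an interval
# in an index node.  The node's items form a binary search tree in
# insertion order (the tertiary structure); an interval is attached to
# the first item encountered that it spans.
class Item:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.secondary = []        # intervals attached to this item

def bst_insert(root, value):
    if root is None:
        return Item(value)
    if value < root.value:
        root.left = bst_insert(root.left, value)
    else:
        root.right = bst_insert(root.right, value)
    return root

def attach(root, start, end):
    """Attach [start, end] to the first spanning item found from the root."""
    node = root
    while node is not None:
        if start <= node.value <= end:      # interval spans this item
            node.secondary.append((start, end))
            return node.value
        # Not spanning: the interval lies entirely on one side.
        node = node.left if end < node.value else node.right
    return None                              # no item spanned

# Index items 6 and 12 entered in that order, so 6 is the BST root;
# the interval [5, 15] of the text's t16 spans both but is met by 6 first.
root = bst_insert(bst_insert(None, 6), 12)
attach(root, 5, 15)
```

Running the sketch on the text's example attaches [5, 15] to item 6, matching the behaviour described for t16.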
Figure 4.8. The interval B-tree after insertion of all tuples.

B+-tree with Linear Order. Temporal data can also be linearized so that the B+-tree structure can be employed without any modification. Goh et al. [Goh et al., 1996] adopted this approach, which involves three steps: mapping the temporal data into a two-dimensional space, linearizing the points, and building a B+-tree on the ordered points. In the first step, the temporal data is mapped into points in a triangular two-dimensional space: a time interval [Ts, Te] is transformed into a point (Ts, Te - Ts). Figure 4.9 illustrates this transformation for the tourist relation. The x-axis denotes the discrete time points in the interval [0, now], and the y-axis represents the time duration of a tuple. The points on the line named the time frontier represent tuples with ending time now; the time frontier moves dynamically with the progress of time. In the second step, the points in the two-dimensional space are mapped to a one-dimensional space by defining a linear order on them. Given two points, P1(x1, y1) and P2(x2, y2), the paper proposes three linear orders:

• D(iagonal)-order (<D). P1 <D P2 iff (a) (x1 + y1) < (x2 + y2); or (b) (x1 + y1) = (x2 + y2) and x1 < x2.

• V(ertical)-order (<V). P1 <V P2 iff (a) x2 + y2 = now and x1 < x2; or (b) x1 + y1 ≠ now and x2 + y2 ≠ now and x1 < x2; or (c) x1 + y1 ≠ now and x2 + y2 ≠ now and x1 = x2 and y1 < y2.

• H(orizontal)-order (<H). P1 <H P2 iff (a) x2 + y2 = now and y1 < y2; or (b) x1 + y1 ≠ now and x2 + y2 ≠ now and y1 < y2; or (c) x1 + y1 ≠ now and x2 + y2 ≠ now and y1 = y2 and x1 < x2.

Figure 4.9. Spatial representation of the tourist relation.

Figure 4.10. The three orderings for points in the two-dimensional space: (a) D-order; (b) V-order; (c) H-order.

Figure 4.10 provides a graphic representation of the three linear orders defined above. By linearizing the points using any of these orders, we can construct a B+-tree on the temporal data. For instance, if we order the points of the tourist relation using the D-order, the resulting B+-tree structure is depicted in Figure 4.11. A temporal query can be mapped to a spatial search on the two-dimensional space, which in turn can be translated into a range search on the linear space defined by the ordering relation. For example, consider the query "Find all persons who left the United States on or after day 5." This query can be handled efficiently by traversing the D-order B+-tree and retrieving all points in the interval [(0, 5), (14, 0)]. However, not all temporal queries can be handled efficiently using the D-order. For example, for the query "List all persons who entered the United States on or before day 5", the D-order performs poorly while the V-order is superior. The paper therefore suggests that different indexes, constructed using different ordering relations, be used to support the various types of queries.

Figure 4.11. Organizing the spatial representation of the tourist relation using a B+-tree and linearizing using the D-order.

The main advantage of this method is the ease with which the indexing scheme can be implemented using existing DBMSs. The performance analysis shows that it is more efficient than the time index in terms of both storage utilization and query efficiency. However, the index is more suitable for valid times, which are mostly closed intervals; for data with open intervals, expensive reorganization is necessary.

4.3.2 Spatial index based indexing methods

The R-tree. Unlike spatial applications, where non-spatial data are usually stored and indexed separately from spatial data, temporal attribute data such as a time-invariant key or time-varying key are indexed together with the temporal data. The time dimension can be viewed as one of the dimensions in a multi-dimensional space and indexed using existing methods [Rotem and Segev, 1987]. In this section, we discuss how the R-tree [Guttman, 1984] can be used to index temporal data. The R-tree is a multi-dimensional generalization of the B-tree that preserves the height-balance property; a detailed description of the R-tree can be found in Chapter 2.

For temporal applications, to index temporal data and its key, the R-tree can be implemented as a two-dimensional R-tree (2-D R-tree) or a three-dimensional R-tree (3-D R-tree). To use a 2-D R-tree, time intervals [Ts, Te] are treated as line segments in a two-dimensional space, with keys on the other dimension. To index temporal data using a 3-D R-tree, the time intervals and keys are mapped into points (key, Ts, Te) in a three-dimensional space. Figure 4.12 shows examples of data partitioning for the tourist relation (see Table 4.7). Both implementations can handle the pure time, key-time and pure key queries of the query set. For the 2-D R-tree, all searches are performed as intersection searches. For the 3-D R-tree, search intervals must be mapped into search regions in the triangular space. Figure 4.13 shows the query regions on the time dimension for the four search operations. As an example, consider the intersection search with query time interval [QTs, QTe]. For an interval in the database to intersect the query interval, it must not end before QTs nor start after QTe; thus no record with end time less than QTs, and no record with start time after QTe, needs to be examined. This gives the query region indicated by the shaded portion of the figure. It is important to note that the R-tree cannot directly handle intervals with open end-time. An entry in an internal node of the R-tree contains an MBR that describes the data space of its child node.
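The four time-dimension searches of Figure 4.13 reduce to simple predicates on a stored interval [ts, te]; the shaded query regions are just the point sets satisfying each predicate. A minimal sketch (the function names are ours, not from the original):

```python
# Predicates behind the four query regions of Figure 4.13, for a data
# interval [ts, te] and a query interval [qts, qte] or instant qt.
def intersects(ts, te, qts, qte):
    # Some instant is shared: the interval neither ends before the
    # query start nor starts after the query end.
    return ts <= qte and te >= qts

def included(ts, te, qts, qte):
    # The data interval lies wholly inside the query interval.
    return qts <= ts and te <= qte

def contains(ts, te, qts, qte):
    # The data interval wholly covers the query interval.
    return ts <= qts and qte <= te

def point(ts, te, qt):
    # The data interval covers the query instant.
    return ts <= qt <= te
```

Note that intersection cannot be decided from the end points alone being inside the query interval: an interval that strictly contains the query has neither end point inside it, which is why the predicate is phrased as two bound checks.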
When data intervals are not closed, the MBR cannot be defined properly, and this affects the splitting algorithm, which uses space coverage to distribute the data into two groups. It is possible to use the current time, or the largest time resulting from proactive insertion, as an estimate during node splitting and data insertion.

One characteristic of temporal databases is that historical data is stored for a long time and no deletion of past data is allowed. The size of the database grows as time progresses, and so do its indexes. Kolovson and Stonebraker proposed variants of the R-tree to index historical data [Kolovson, 1993, Kolovson and Stonebraker, 1991]. The R-tree is used to index time intervals on one dimension and a non-temporal attribute on the other. Three variants that store some of the nodes on optical disk were proposed. The first variant (MD-RT) maintains the whole R-tree-based index structure on magnetic disk; there is no migration from magnetic disk to optical disk.

Figure 4.12. Space partitioning in the R-tree: (a) tuples represented as lines in two-dimensional space; (b) tuples represented as points in three-dimensional space.

Figure 4.13. Query regions for the R-tree on the time dimension: (a) intersection search; (b) inclusion search; (c) containment search; (d) point search.

The second variant (MD/OD-RT-1) keeps the R-tree and its root node on magnetic disk, and moves the left-most leaf nodes to optical disk when the size of the R-tree index reaches a pre-defined size. All internal nodes, except the root node, whose child nodes are entirely on optical disk are recursively vacuumed to the optical disk.

The third variant (MD/OD-RT-2) maintains two R-trees, both rooted on magnetic disk. The first resides entirely on magnetic disk, whereas the second stores its root node on magnetic disk and its lower-level nodes on optical disk. When the size of the first R-tree reaches the expected size, all the nodes below its root node are moved to optical disk. Meanwhile, the references of the first R-tree's root node are inserted into the proper position of
the second R-tree. New records are inserted into the first R-tree, while search operations are performed on both R-trees.

The data is stored in the leaf nodes, and nodes overlap in their data space for long intervals. When the interval data have a non-uniform length distribution, the overlap between bounding rectangles can be quite severe because of a few long intervals. To handle this shortcoming, the Segment R-tree (SR-tree) [Kolovson and Stonebraker, 1991, Kolovson, 1993] was proposed. The SR-tree stores interval records in both non-leaf and leaf nodes. An interval I is stored in the highest-level node N of the tree such that I spans at least one of the intervals represented by N's child nodes. If an interval spans the region covered by a node but extends beyond the boundary of its parent node, it is cut into a spanning portion and one or more remnant portions, which are stored in separate parts of the index structure. Figure 4.14 shows a case in point, in which line segment P spans C and extends beyond A's boundary.

Figure 4.14. An SR-tree with spanning portion and remnant portion.

An improved version of the SR-tree, called the Skeleton SR-tree, was proposed to pre-partition the entire domain of the interval data into several sub-regions, based on an estimate of the number of data records and an approximation of the distribution of intervals. The overlap between the data spaces of leaf nodes is thereby reduced. Such an estimate may be easy to derive for applications (for example, video rental) that have little variation in version lifespan; for applications with a wide variance of interval lifespans, the pre-partitioning is not effective.

The Time-Polygon index. The Time-Polygon index (TP-index) was proposed to index valid time databases [Shen et al., 1994]. Like the B+-tree with linear order, the TP-index maps a time interval [Ts, Te] into a point [Ts, Te - Ts] in a triangular two-dimensional space. However, the triangular temporal space is partitioned into groups such that each group is a cluster of data points suited to a certain search pattern. Partitioning along the x- and y-dimensions, and parallel to the time frontier, produces the five polygonal shapes shown in Figure 4.15. The polygons used in the TP-index are not minimum bounding polygons; they are derived through recursive partitioning, and can easily be merged when the tree collapses. The structure of the TP-index is like that of an R-tree. Figure 4.16 shows the partition of the temporal space and the TP-tree structure for the tourist relation. To support proactive additions of records (for example, Tn in Figure 4.16(a)), a virtual time frontier that assumes the largest Te (Tmax) has to be introduced, and partitions adjacent to the time frontier have to be extended outward.

Figure 4.15. The five polygon shapes in the TP-tree (A-shape, B-shape, C-shape, D-shape, E-shape).

The TP-index was designed solely to index valid time and handle time-slice queries. To enable the TP-index to support a time-invariant key, it was extended to index data in a three-dimensional space [Jiang et al., 1996]. In this data space, the x-axis and y-axis keep the same definitions as before, while the z-axis denotes the key values of the data points (see Figure 4.17).
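The interval-to-point mapping shared by the linear-order B+-tree and the TP-index is easy to sketch; under it, the intersection condition becomes a pair of half-plane constraints in the triangular space. A hypothetical sketch (the numbers are invented for illustration):

```python
def to_point(ts, te):
    """Map interval [Ts, Te] to the point (x, y) = (Ts, Te - Ts)."""
    return ts, te - ts

def intersects_region(x, y, qts, qte):
    """True if the mapped interval intersects the query interval.

    In the (x, y) space, x is the start time and x + y the end time,
    so the intersection query region is x <= QTe and x + y >= QTs.
    """
    return x <= qte and x + y >= qts

# Interval [3, 8] becomes the point (3, 5); it intersects query [4, 6].
x, y = to_point(3, 8)
print(intersects_region(x, y, 4, 6))   # True
```

Because x + y is the end time, open intervals (end time now) sit exactly on the time frontier, which is why both structures must treat that line specially as time advances.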
Initially, data points are bounded in the three-dimensional temporal space. When overflow occurs, the data points are partitioned into groups such that each group can be stored in one data page, and the partitions must cluster the data points in a way suited to temporal search patterns. There are three partitions for the TP-tree: the y-partition introduces a plane parallel to the x-z plane (called the y-plane); the time-partition introduces a plane parallel to the time frontier (called the time-plane); and the key-partition introduces a plane parallel to the x-y plane (called the key-plane).

Figure 4.16. A TP-tree for the tourist relation: (a) partitioning of the temporal space; (b) the TP-tree structure.

Figure 4.17. A three-dimensional spatial rendition of the TP-tree.

The y-partition and time-partition for the different bounding polygons are similar to those described in [Shen et al., 1994]. Note that after a key-partition, the shapes of the resulting bounding polygons are the same as before the partitioning. Searching based on time is similar to that proposed in [Shen et al., 1994], where the search time intervals must be mapped into appropriate query regions. The query regions for the various search operations on the time dimension are shown in Figure 4.18. For example, consider the query interval [QTs, QTe] for an inclusion search. Since all matching intervals must start no earlier than QTs, intervals that start before QTs are excluded; similarly, since the query interval ends at QTe, all intervals that end after QTe are excluded. The resulting query region is the shaded region shown in the figure.

4.3.3 Methods for bi-temporal databases

Until recently, most research on temporal indexing has addressed the indexing problem along only one of the two time dimensions. Kumar, Tsotras and Faloutsos [Kumar et al., 1995] proposed two access methods, the Bitemporal
  • 148. 140 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS y (a) Intersection search QTs QTs QTe TmaxX Y QTe - QTs QTs QTe Tmax X (b) Inclusion search y QTe QTs QTe Tmax X y QTs QTs Tmaxx (c) Containment search (d) Point search Figure 4.18. Query regions for the TP-tree. Interval Tree and Dual R-trees, for indexing both transaction and valid time dimensions. The Bitemporal Interval Tree makes use ofInterval Tree [Edelsbrunner, 1983] to index a finite set U that contains V valid time points. An interval tree consists of a full binary tree and a number of doublely-linked lists. The V time points are in the leaf, and each internal node contains the middle value of its two immediate children. If the starting point of an interval falls in the left subtree of an internal node and the ending point falls in the right subtree, the interval is stored in the doublely-linked lists associated to this internal node. The left and right lists contain the starting and ending points respectively. In the Bitemporal Interval Tree, the lists are transformed into "conceptual" lists of pages to facilitate the splitting policies of the MVBT [Becker et aI.,
  • 149. TEMPORAL DATABASES 141 1993] so as to answer bitemporal pure-time-slice (BPT) query. By elaborately pagenating the whole indexing structure, the index can answer BPT query in o(10gb V +10gb n +a) I/O operations. The authors also proposed a method that employs two R-trees (2-R) to divide bitemporal records on transaction time. This method aims to eliminate the large overlapping of the mix of rectangles with known ending transaction time and those extending to now. A front R-tree indexes the records whose transaction time is up to now, whereas a back R-tree indexes the records whose transaction time lifespan is closed. 7 6 5 4 3 2 I ,--- t2 I tl 13 12345678 T transaction time (a) Original representation of the time dimensions T'--''--''--1'--1---'---'--1.--1. _ " Vg .", ~ > 7 6 5 4 3 2 t3 I 12345678 "g :=! ..> 7 6 5 4 3 2 T I transaction time 12345678 12 transaction time (b) lhe back R-tree (c) the fronl R-tree Figure 4.19. The two R-tree method. In Figure 4.19(a), there are three records in the bitemporal space. Records tl and t2 have open transaction time lifespan, and the transaction time of t3 is closed at time 3. Note that the three records overlap along the transaction time axis. To avoid this kind of overlapping so as to improve the performance of the R-tree, the dual R-tree method keeps records with closed transaction time range, that is t3, in the back R-tree (Figure 4.19(b)) and records with open transaction time range, that is il and t2, in the front R-tree (Figure 4.19(c)).
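The 2-R partitioning and its two-part search can be sketched as follows. This is a hypothetical stand-in: flat lists take the place of the two R-trees, and the record layout is invented, since the point here is the routing of records and queries, not the tree mechanics.

```python
# Hypothetical sketch of the two-R-tree (2-R) organisation.
# Each record: (id, tx_start, tx_end_or_None, v_start, v_end), where a
# tx_end of None means the transaction-time lifespan is still open.
back = []    # closed transaction-time lifespan: rectangles
front = []   # open transaction-time lifespan (extends to "now")

def insert(rec):
    (back if rec[2] is not None else front).append(rec)

def bitemporal_query(tx, vt, now):
    """Ids of records current at transaction time tx and valid time vt."""
    hits = [r for r in back
            if r[1] <= tx <= r[2] and r[3] <= vt <= r[4]]
    # Front-tree records extend to "now" along transaction time, so
    # only the start bound and the valid-time interval are checked.
    hits += [r for r in front
             if r[1] <= tx <= now and r[3] <= vt <= r[4]]
    return [r[0] for r in hits]

insert(("t1", 1, None, 2, 5))   # still current: routed to the front tree
insert(("t3", 2, 3, 1, 7))      # closed at transaction time 3: back tree
```

The search over the front list shows why the front R-tree needs the slightly more expensive algorithm mentioned below: its records are intervals with one transaction-time bound supplied by the query itself.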
In the front R-tree, a bitemporal record can be represented as an interval line parallel to the valid time axis; as a result, the overlap is reduced. A bitemporal query is answered by two searches, one for rectangles in the back R-tree and the other for intervals in the front R-tree. The front R-tree needs a slightly more expensive search algorithm because of the open intervals.

While it is difficult to extend index structures such as the AP-tree and TSB-tree to bitemporal indexing, the R-tree and the TP-tree can be extended with additional dimensions. For example, a 5-D R-tree or TP-tree could be used to index a time-invariant key, transaction time intervals and valid time intervals. However, such an extension entails redesigning more complex node splitting and query retrieval algorithms, and as the number of dimensions increases, spatial indexes may not perform as well.

4.4 Experimental study

Indexes are data structures that quickly identify the locations at which indexed data items are stored; they are therefore used to speed up query evaluation algorithms. The properties desired of an index include efficient storage utilization and efficient query retrieval: the use of disk space should be economical, which indirectly determines the query efficiency of the index, and the index must be able to answer the basic queries efficiently. In addition, index construction and update costs should not be too high, although they are often treated as less important selection factors.

Various performance studies have been conducted. The TP-index was shown to be superior to the Time Index for valid time databases [Shen et al., 1994]. This result is expected, as replication in the Time Index can be severe and results in a much bigger tree. The Interval B-tree was shown to be more efficient than the Time Index and the R-tree [Ang and Tan, 1995].
It is argued that the query efficiency of the interval tree is of the order O(log n + F), where F is the time for reporting intersections.

4.4.1 Implementation of index and buffer management

Four indexes, the TSB-tree, AP-tree, 2-D R-tree and TP-tree, were implemented in C on a SUN SPARC workstation. In this section, we restrict ourselves to the study of indexes built on a time-invariant key and transaction time. For a large collection of temporal data (such as one million versions), the index size can become fairly large, and it is unlikely that the entire index fits in memory. Instead, some index pages will be paged out as the tree is traversed and have to be re-fetched later when they are re-referenced. To reduce page re-fetching, a priority-based buffer replacement strategy [Chan et al., 1992] is used. The strategy employs the least useful (LUF) policy and is designed around the way an index is traversed. For a fair comparison, the replacement algorithm was extended for the two-level NST index structure. Under the strategy, priorities are assigned to index pages: an index page is useful if it will be referenced again in a traversal of the index structure, and useless in the current traversal otherwise. Useful pages have higher priorities than useless pages. As the main concern of the work is to minimize the effect of page re-fetching on the performance comparison, the buffer size was fixed at 32 pages, which is sufficient for traversing trees of up to 5 levels.

4.4.2 Data and query sets

The data sets employed in the study were generated using an extended version of the Time-Integrated Testbed of the Department of Computer Science, University of Arizona. The temporal relations were generated using Poisson distributions with different mean values for arrival time (the start time of an interval) and version lifespan. Each database contains 1,000,000 versions. The time-invariant attribute is uniformly distributed over [1, 10000], and the number of versions per key is randomly determined. For each version, the time-varying attribute value is uniformly distributed over [1, 100000]. For each set of mean arrival and duration times, the data is generated with a constraint that simulates transaction time: the data is generated in one go and pre-sorted on start time, and each tuple is then inserted into the index. By doing so, we did not have to modify the existing R-tree splitting algorithm. This is not ideal, as the latest versions of transaction time data give rise to open rather than closed intervals; however, apart from the R-tree, the presence or absence of open intervals does not affect the other three indexes.
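The generation procedure can be sketched roughly as follows. This is a hypothetical stand-in for the extended Time-Integrated Testbed, whose exact parameters are not given here: a Poisson arrival process is simulated via its exponential inter-arrival times, lifespans are drawn exponentially with the stated mean, and the attribute domains follow the text.

```python
import random

# Hypothetical sketch of version generation: Poisson-process arrivals
# (exponential inter-arrival times) and exponentially distributed
# lifespans; all parameter names and the seed are ours.
def generate_versions(n, mean_arrival, mean_duration, seed=42):
    random.seed(seed)
    versions, clock = [], 0.0
    for _ in range(n):
        clock += random.expovariate(1.0 / mean_arrival)   # next start time
        start = clock
        end = start + random.expovariate(1.0 / mean_duration)
        key = random.randint(1, 10000)          # time-invariant attribute
        attr = random.randint(1, 100000)        # time-varying attribute
        versions.append((key, attr, start, end))
    # Transaction-time behaviour requires insertion in start-time order,
    # which the monotonically advancing clock already guarantees.
    return versions

data = generate_versions(1000, mean_arrival=5, mean_duration=200)
```

With mean inter-arrival 5 and mean duration 200, each interval overlaps many of its neighbours on average, which is the clustering effect the experiments below attribute to short inter-arrival times.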
Among the basic queries, we shall look at just two: time-slice intersection queries and key-range time-slice intersection queries. Being more general, an intersection query is expected to yield more results than the inclusion, containment and point queries. Each set of queries contains 100 queries with different keys and time ranges. The keys are randomly picked from the key domain (that is, [1, 10000]). Where there is a key-range search, a predetermined fixed range is used to determine the end of the range. The starting time of each time range is generated using the Poisson distribution, together with a fixed range length. Should the ending time exceed the current time, the ending time is set to the current time.
4.4.3 Some experimental results on indexing invariant keys and transaction time We report on some experimental results on the performance of the indexes built on the time-invariant key and transaction time. For time-slice intersection queries, the mean inter-arrival time is fixed at λ = 5, and the mean duration times are fixed at μ = 200, 500, 1000. For key-range time-slice intersection queries, the key range is fixed at 1000 (10% of the domain) when the effect of time range is studied, and the time range is fixed at 15000 when the effect of key range is studied. On time-slice intersection query. Figures 4.20a and b show the performance of the TSB-tree, AP-tree, TP-tree and (2-D) R-tree for time-slice intersection search queries under mean inter-arrival times of 2 and 5, and a fixed mean duration time of 200. Figure 4.21 shows the effect of longer lifespan on the four indexes. The performance of all four indexes is affected by the search time range used in the query: the longer the search range, the worse the performance. Comparison of the results summarized in Figures 4.20 and 4.21 reveals that while the mean duration time has little effect on a few indexes, the inter-arrival time has a significant effect on the performance of most indexes. Longer mean inter-arrival time means less overlap in time intervals. For indexes such as the TSB-tree and TP-tree, shorter inter-arrival times mean that the time intervals of different keys are clustered closely, so the same search range intersects more intervals and hence more pages are accessed. The performance of the 2-D R-tree and the AP-tree is affected by the duration of time intervals. For the R-tree, which indexes time intervals as line segments, the degraded performance is due to the fact that the minimum bounding rectangles (MBRs) in the internal nodes overlap more for longer line segments. For the AP-tree, the opposite effect is observed.
Two factors contribute to this. First, recall that the data set is non-overlapping for each key value. Second, a longer duration essentially "stretches" the lifespan of the relation. As a result, the number of nodes to be scanned by the AP-tree is smaller for longer durations for the same query range. It is clear that the TSB-tree performs the best. This can be attributed to the fact that the TSB-tree has a high degree of data clustering in both the key and time dimensions. On the contrary, the AP-tree is inferior to all the other techniques; its page accesses exceed 2500 pages! This is because, to search for the intervals intersecting the query interval [Ts, Te] in the AP-tree, a leaf node is first determined using Te. All leaf nodes on its right, which contain intervals whose start time is larger than Te, are ignored. Leaf nodes on its left must be searched.
Figure 4.20. Effect of arrival rate on time-slice intersection query: pages accessed versus query time interval for the TSB-tree, R-tree, TP-tree and AP-tree. (a) (λ, μ) = (2, 200); (b) (λ, μ) = (5, 200).
Figure 4.21. Effect of longer lifespan on time-slice intersection query, (λ, μ) = (5, 500). On key-range time-slice intersection query. The results for the key-range time-slice intersection queries are very similar to those for the time-slice queries. Here, we present the results when (λ, μ) has the value (5, 200). In order to see the effect of key range, the query time range is kept constant; similarly, to see the effect of the query time range, the key range is fixed. Figure 4.22a shows the result when the key range is fixed at 1000, while Figure 4.22b looks at the effect of varying the key range when the time range is fixed at 15000 time units. As in the time-slice query results, it can be observed that the AP-tree is more expensive than the others due to its two-level structure. With such a structure, each AP-tree in the second level of the nested structure is small, and many such small trees must be searched. It can also be seen that the AP-tree is more sensitive to the key range than to the time range (see Figure 4.22b). This is logical, since the first level of the nested structure is the B+-tree for keys, and the key range determines the number of AP-trees in the second level that need to be searched. As the key range increases, the performance deteriorates, whereas for a fixed time range, the average number of leaf nodes that need to be searched does not differ greatly. The TSB-tree retains its good performance in the key-range time-slice query because of its high degree of data clustering in both key and time dimensions.
Figure 4.22. Performance of intersection search in key-range time-slice query, (λ, μ) = (5, 200): pages accessed for the TSB-tree, AP-tree, R-tree and TP-tree. (a) Key range = 1000, varying time interval; (b) Time range = 15000, varying key range.
To answer past versions queries efficiently, it is important to cluster data by the time-invariant key in an indexing structure. By linking all the past versions of a given key together, the best performance for this query can be expected. However, although the TSB-tree, AP-tree, R-trees and TP-tree do have some data clustering by key, none of them provides an explicit method to link the historical versions of a given key. Hence, a search based on the key is required. Among these four indexes, the AP-tree is likely to be the most efficient for the past versions query: for each key that satisfies the search condition, the whole second-level AP-tree is retrieved for all the versions. 4.5 Summary In this chapter, we have surveyed a number of promising temporal indexes. Many of these indexes were proposed either for valid time or transaction time databases. Researchers have only recently started to work on indexing in bitemporal databases. For transaction time databases, the TSB-tree approach is very efficient, as it manages to keep the volume of I/O accesses low and uses tight bounding intervals to support fast search. However, it cannot handle disjoint intervals (or overlapping intervals) that may be present in valid time databases. Direct application of B-trees, such as the AP-tree indexing on a single time point (starting or ending), is efficient in terms of storage space but is not efficient for any search that involves intervals. Its inefficiency is due to the fact that no information on the actual data space in the child nodes is captured for pruning the search space. Hence, a simple time-slice search requires scanning a large proportion of the leaf nodes. Spatial indexes such as the R-tree can be used for indexing both transaction times and valid times.
To index open intervals that move with the current time NOW, splitting algorithms that split nodes based on the area of the data space must be re-designed to handle the situation where one side of the MBR moves with time. The R-tree can be used to index temporal data as line segments or points. As indicated by the experiments, the performance of the R-tree indexing lines is not as good as that of the TP-tree. However, should the lines be mapped into points, its efficiency should become comparable to that of the TP-tree. As in other applications, data distribution affects the performance of temporal indexes. For bitemporal databases, different distributions may exist for the time-invariant keys, the time-varying keys, the number of versions per key, the arrival of new time-invariant keys and, for each key, the arrival of the next transaction-time and valid-time versions, as well as for the relationship between the two times, such as whether they are strongly bound [Jensen and Snodgrass, 1994]. Generally, the distribution of the time-invariant key is likely to be dependent on the application, where keys can be mapped into some sequential order. Likewise, the distribution of the time-varying key is fairly application-dependent; some may be in increasing order (for example, salary) while others are likely to be more random. The arrival of new keys and the arrival of new versions tend to follow a Poisson distribution.
5 TEXT DATABASES Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate document management. Perhaps the key feature distinguishing text databases from other kinds of database is the way in which they are accessed. Queries to conventional databases are exact logical expressions used to satisfy information needs such as "how many accounts have a negative balance" or "which students are enrolled in computer science". In contrast, queries to text databases are used to satisfy inexact information needs such as "what is the economic impact of recycling" or "what factors led to George Bush's loss in the 1992 presidential election". This inexactness is not because users are unable to express their needs precisely; it is because the needs deal with imprecise real-world concepts that cannot be described in a formal system. That is, it is usually not possible to translate such information needs into a logical query expression that will fetch only the documents that are answers; an information need and its answers are not mathematically related. E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
Thus there is no exact mechanism for determining whether a document is an answer; instead, queries to text databases are used to identify documents that are likely to be pertinent to the query, that is, likely to be relevant. These documents may even contradict each other; commentators may disagree as to why Bush lost the election, for example. Thus document databases must be designed to answer informal queries and produce the most likely answers. The study of techniques for identifying documents that are relevant to an information need is known as information retrieval. Since answers have only a loose, informal correspondence to queries, it follows that the performance of query evaluation techniques is not just a consequence of how fast they are or how economical they are with system resources. It is also necessary to consider how good they are at identifying relevant documents, that is, their effectiveness. The effectiveness of query evaluation techniques can be formally measured by the proportion of retrieved documents that are relevant and by the proportion of the relevant documents that are retrieved; determination of relevance must be made by a human assessor. (It follows that experiments in information retrieval are expensive, and tend to rely on standard document collections and query sets for which relevance judgments have been made.) Text databases can also be used for more traditional forms of access to data. For example, in a database of newspaper articles, each document will include the article's text, but will also include information such as authorship, date of creation, and so on. A possible entry in a database of correspondence is shown in Figure 5.1. Fields such as date could be queried in conventional ways and do not require exotic query evaluation methods. It is the use of informal querying that makes information retrieval systems different from other kinds of database.
In this chapter we describe the ways in which text databases might be accessed, kinds of queries, index structures to support these queries, and query evaluation techniques. 5.1 Querying text databases Simple text engines are familiar to anyone who uses the document repositories available via the Web. These engines can be used to find information about, say, some individual (to find their home page, perhaps) or to search for research papers on a given topic. Typical queries are a list of keywords that the user guesses will identify the desired information; the system responds with a list of hits, some of which are relevant and some of which are (in the context of the query) obviously junk. Based on information retrieval theory, the better systems use query evaluation techniques that return relatively few irrelevant documents.
From: Albert Einstein Sender address: Old Grove Rd, Nassau Point, Peconic, Long Island To: F.D. Roosevelt, President of the United States Recipient address: White House, Washington D.C. Date: 2nd August 1939 Sir: Some recent work by E. Fermi and L. Szilard, which has been communicated to me in manuscript, leads me to expect that the element uranium may be turned into a new and important source of energy in the immediate future. Certain aspects of the situation seem to call for watchfulness and, if necessary, quick action on the part of the administration. I believe, therefore, that it is my duty to bring to your attention the following facts and recommendations. In the course of the last four months it has been made probable-through the work of Joliot in France as well as Fermi and Szilard in America-that it may become possible to set up nuclear chain reactions in a large mass of uranium, by which vast amounts of power and large quantities of new radium-like elements would be generated. Now it appears almost certain that this could be achieved in the immediate future. This new phenomenon would also lead to the construction of bombs, and it is conceivable-though much less certain-that extremely powerful bombs of a new type may thus be constructed ... Figure 5.1. Example entry in a correspondence database. At the most abstract level, text databases are like conventional databases: given a query, each entry in the database is compared to the query to determine whether it is an answer. To allow this process to be efficient, a data structure known as an index is used. Central to effective information retrieval is the ability to use all the terms (that is, words) in a document to compare it to a query. That is, it is necessary to index every term in every document.
It is possible to automatically select a subset of the words in a document to represent its content and to index these words only, or to manually assign descriptive words or subject categories. However, automatic selection of keywords is in general not successful; and, perhaps surprisingly, automatic indexing of all words gives more effective retrieval than does manual indexing [Salton, 1989]. Moreover, the cost of manual indexing for a realistically-sized database
is prohibitive. Thus searches on document databases use content (the full text of each document) rather than descriptors of some kind. 5.1.1 Boolean queries There are two principal approaches to querying text databases: Boolean and ranked. Boolean query languages were for many years the choice for commercial information retrieval systems. The basic concept is straightforward: queries are Boolean expressions in which the atoms are words, combined with Boolean operators. For example the query uranium AND ( (nuclear AND energy) OR (atomic AND bomb) ) could be used to retrieve the example document in Figure 5.1. Such queries are effectively equivalent to conventional database queries (and, as we discuss below, are evaluated in a similar way) but it is not easy for a typical user to translate an information need into a Boolean query. Making good use of Boolean information retrieval systems requires professional information providers who are experts at interpreting user requests and translating them into formal queries. There are several ways in which Boolean query languages for text retrieval can be extended to give the potential for better effectiveness. One extension of particular value to English text is stemming or suffixing. In its simplest form, suffixing allows partial match on strings, so that for example the query term bomb* would match any word starting with the string bomb. This allows users to match variant forms of the same word, such as bomb, bombs, bombing, bombardier, and so on. Alternatively, automatic stemmers can be used; these are algorithms that recognize the standard suffixes used in English (such as -ed, -es, -ation, and -ness) and remove them prior to indexing [Harman, 1991, Lovins, 1968, Porter, 1980]. Stemming is a form of word normalization; another, basic form is case conversion. Another language extension is to allow querying on word proximity, and in particular adjacency.
In the query above, there was no requirement that nuclear and energy be near each other in the text. If it is specified in the query that they must be proximate or adjacent, then it is more likely that retrieved documents will contain these words as a phrase. The Boolean query languages used in commercial text databases, such as the ISO standard 8777 or Common Command Language, allow the user to require that two words be located within any fixed number of word positions of each other. Well-designed interfaces can also help to improve effectiveness, for example by providing access to an online thesaurus that can be used to expand
the query. Such extensions, however, have no impact on the underlying query evaluation mechanism. 5.1.2 Ranked queries The other principal approach to text retrieval is ranking, in which a query is an expression in natural language or a list of keywords; each document is compared to the query and assigned a numerical similarity; and the documents with the highest similarity values are retrieved for presentation to the user. In contrast to Boolean queries, there is no precise delineation between answers and non-answers; potentially every document in the database has a non-zero similarity, but only the first few documents presented for viewing (or, in the case of information filtering [Belkin and Croft, 1992], those above a chosen threshold) are seen by the user. There is a probabilistic assumption that the highest-ranked documents are those most likely to be relevant; thus as the user moves through the list of ranked documents the density of relevant documents should diminish. In many contexts ranked queries are simply lists of keywords, but in others they may be substantial blocks of text. For example, the abstract of a paper, or even a whole paper, could be used as a query to find other papers with a similar topic; experiments with ranking have shown that longer queries are better at identifying relevant documents. Thus a typical query might be a list of keywords such as nuclear atomic energy power or a natural language description such as Relevant documents will discuss the use of nuclear or atomic energy as a power source. The functions used to score documents with respect to queries are known as similarity measures. Many years of information retrieval experiments, with both small document collections and databases of gigabytes of text, have identified several families of effective similarity measures.
(These experiments have also shown that ranking is typically more effective than Boolean retrieval, even for queries formulated by an expert.) We do not survey similarity measures in this chapter, but instead focus illustratively on one: the cosine measure. This measure is one of the most effective and has proven successful across a wide range of databases, and it is interesting because it makes use of at least as much index information as other effective similarity measures. Discussion of the cosine measure thus allows us to explain what information an index must store. Intuitively, we would like a document and query to be regarded as similar if: most of the query terms occur in the document; they are frequent in the
document; the density of these words in the document is high; and some allowance is made for the "importance" of words, where one would usually regard a word such as uranium as more discriminating (and therefore more important) than a word such as the. Mathematically these concepts can be captured as follows. The cosine similarity of a document d and query q can be computed as

C(q, d) = ( Σ_{t ∈ q ∧ d} w_{q,t} · w_{d,t} ) / (W_q · W_d)

where w_{x,t} is the importance of word t in x and W_x is the length of x. In this formulation of the cosine measure it can be seen that the numerator is high if important words (that is, high w_{x,t} words) are in both query and document, and that division by length ensures that C(q, d) is high only if the document is dense with query terms. Thus, given two documents containing the same query terms with the same frequencies, the shorter of the two will have the higher similarity. Word importance is an abstract concept, but in practical ranking it is effectively captured by the formulations

w_{q,t} = (log f_{q,t} + 1) · (log(N/f_t) + 1)   and   w_{d,t} = log f_{d,t} + 1

Here f_{x,t} is the frequency of occurrence of t in x, that is, the number of times term t occurs in document or query x, and there are N documents in the database of which f_t contain t. Thus a word that is rare in the collection (that is, has a high inverse document frequency) or frequent in either query or document attracts a high weight. The lengths are usually computed as

W_x = sqrt( Σ_t w_{x,t}² )

so that length is essentially a function of the number of distinct words. Note that for a given query W_q is a constant and thus has no impact on the ranking and is not calculated. In principle, then, query evaluation for a query q consists of computing the similarity C(q, d) for every document d in the database, then returning to the user the documents with highest similarity.
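As a concrete illustration, the ranking process just described can be sketched as a toy in-memory ranker. Whitespace tokenization, natural logarithms, and all names here are illustrative assumptions, not part of the original formulation.

```python
import math
from collections import Counter

def weights(freqs, idf=None):
    """w = log f + 1, multiplied by the inverse document frequency
    factor (log(N/f_t) + 1) when ranking a query."""
    w = {}
    for t, f in freqs.items():
        w[t] = math.log(f) + 1
        if idf is not None:
            w[t] *= idf.get(t, 0.0)   # terms absent from the collection score 0
    return w

def cosine_rank(query, docs):
    """Score every document against the query with the cosine measure
    and return (similarity, docno) pairs, highest similarity first."""
    N = len(docs)
    doc_freqs = [Counter(d.split()) for d in docs]
    df = Counter(t for fr in doc_freqs for t in fr)     # f_t per term
    idf = {t: math.log(N / df[t]) + 1 for t in df}
    wq = weights(Counter(query.split()), idf)
    scores = []
    for i, fr in enumerate(doc_freqs):
        wd = weights(fr)
        Wd = math.sqrt(sum(v * v for v in wd.values()))  # document length W_d
        num = sum(wq[t] * wd[t] for t in wq if t in wd)
        scores.append((num / Wd if Wd else 0.0, i))      # W_q omitted: constant
    return sorted(scores, reverse=True)

docs = ["uranium nuclear energy uranium", "the cat sat", "atomic bomb energy"]
ranking = cosine_rank("nuclear atomic energy power", docs)
```

Note that, exactly as the text observes, W_q is dropped from the denominator: it rescales every score identically and cannot change the ordering.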
As with queries to traditional databases, it is valuable to try to improve a ranked query before evaluating it, by removing noise and transforming it into a better description of the information need. In particular, stopwords are usually removed; these are frequent, non-discriminating words such as the, and closed-class or function words such as however that carry no meaning. Elimination
of stopwords has little impact on effectiveness but is important for efficiency, because these words are so common. After stopping, the query above might be transformed to Relevant documents discuss nuclear atomic energy power source Stemming is as valuable for ranking as it is for Boolean queries, for example yielding relev document discus nuclear atom energ power source for the query above. Elementary natural language techniques can also prove valuable; such techniques include recognition and deletion of key phrases, such as "we discuss" or "in this paper", and recognition of proper names and aliases, so that for example "USA" and "United States" are indexed together. However, while such techniques change the set of terms available for indexing, they do not change the methods used to construct an index or to retrieve documents. For further information on ranking and information retrieval, there are several good textbooks [Frakes and Baeza-Yates, 1992, Salton, 1989, Salton and McGill, 1983, van Rijsbergen, 1979, Witten et al., 1994]. Recent research developments in the area are presented in special issues of Communications of the ACM [Fox, 1995] and Information Processing and Management [Harman, 1995a]. 5.1.3 Indexing needs The needs of querying determine the kinds of information that must be held in an index. For both Boolean and ranked queries, the index must store every distinct word occurring in the database and, for each word, the documents the word occurs in. To support proximity queries, the index must store the positions at which each word occurs in each document; ordinal word numbers are more useful than byte positions. To support ranked queries, the index must store the frequency of each word in each document. As we discuss later, richer kinds of queries may require information about document structure.
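The stopping and stemming transformations shown above can be sketched as follows. The stoplist and suffix list here are small illustrative stand-ins (a real system would use a larger stoplist and a proper stemmer such as Porter's algorithm), chosen so that the example query reduces exactly as in the text.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "will", "is", "as", "or", "be", "use"}
SUFFIXES = ["ation", "ness", "ing", "ant", "ed", "es", "ic", "s", "y"]

def stop_and_stem(text):
    """Case conversion, stopword removal, then crude longest-first
    suffix stripping, keeping stems of at least four letters."""
    words = re.findall(r"[a-z]+", text.lower())
    out = []
    for w in words:
        if w in STOPWORDS:
            continue
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) - len(suf) >= 4:
                w = w[: -len(suf)]
                break
        out.append(w)
    return out

q = stop_and_stem("Relevant documents will discuss the use of nuclear "
                  "or atomic energy as a power source")
# q reproduces the stemmed query from the text:
# relev document discus nuclear atom energ power source
```

The minimum stem length is a common safeguard in suffix strippers: without it, words such as "as" or "source" would be mangled by the short suffixes.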
In the following sections we describe index structures that have proved successful for text databases, then explain query evaluation techniques that use these structures. 5.2 Indexing 5.2.1 Inverted indexes An index is a data structure for supporting a query evaluation technique. The most commonly used structures for indexing text databases are inverted indexes,
a family of structures that can be readily adapted to each of the kinds of querying discussed above. Figure 5.2. Arrangement of a simple inverted file (lexicon, inverted lists, mapping table, documents). Inverted indexes are well-established (they have been used in commercial text retrieval systems since before 1970) and in recent years refinements to inverted indexing have dramatically improved performance. In outline an inverted index is extremely simple, consisting of a lexicon of the distinct words to be indexed and, for each word, an inverted list of information about that word. The lexicon must be organized to allow fast search for a given word, and each list should allow rapid processing to identify matching documents. Thus in the most basic case the lexicon could be stored as an array of words and each list as an array of ordinal document numbers. A mapping table, also stored as an array, can then be used to map from document numbers to matching documents. This arrangement is illustrated in Figure 5.2. For example, each of the three query terms nuclear, energy, and uranium has an entry in the lexicon (found, say, by binary search in the array) and a corresponding pointer to its inverted list. Each list contains the document
number 12; the twelfth position in the mapping table thus points to a document containing all of the query terms. 5.2.2 Search structures For conventional databases, design of the search structure is crucial to performance. For text databases, the major bottleneck is usually the fetching and processing of the inverted lists, and any structure that allows reasonably fast access to the distinct words of the database is likely to be satisfactory. A typical arrangement would be to use a B-tree in which internal nodes contain words and pointers to children, and external leaves contain words, pointers to inverted lists, and, for each word, the number of documents in the database containing the word. For many text databases such a B-tree could easily be held in memory, but the arrangement is also effective if space considerations force B-tree nodes out to disk. Use of a B-tree means that the words can be accessed in lexicographic order, allowing users to scan the lexicon and placing words with the same root but variant suffixes together. If the lexicon is not too large it is feasible to scan it for the strings that match a given pattern. Other search structures have been proposed for lexicons, but none offers any clear advantage, while the logarithmic worst-case performance and good space utilization of B-trees make them a desirable choice. As a concrete example, consider the database consisting of the 3 Gb of text used in the first three years of the ongoing TREC information retrieval experiment [Harman, 1992, Harman, 1995b]. This database contains just over 1,000,000 documents and, coincidentally, just over 1,000,000 distinct words at an average of about 9 characters each. There are around 480 × 10⁶ word occurrences in total or, discounting repetitions of words within a document, 220 × 10⁶ word-document pairs.
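A toy version of the arrangement of Figure 5.2, with the lexicon as a sorted array searched by binary search, inverted lists of ordinal document numbers, and a mapping table, might look like this (all names are illustrative):

```python
from bisect import bisect_left

class InvertedIndex:
    """Minimal inverted file: sorted lexicon, per-word inverted lists
    of ordinal document numbers, and a mapping table to documents."""
    def __init__(self, docs):
        self.mapping = list(docs)                     # mapping table
        postings = {}
        for docno, text in enumerate(docs, start=1):  # ordinal doc numbers
            for word in set(text.lower().split()):
                postings.setdefault(word, []).append(docno)
        self.lexicon = sorted(postings)               # array of words
        self.lists = [postings[w] for w in self.lexicon]

    def lookup(self, word):
        i = bisect_left(self.lexicon, word)           # binary search
        if i < len(self.lexicon) and self.lexicon[i] == word:
            return self.lists[i]
        return []

    def conjunctive(self, words):
        """Document numbers containing all query terms (Boolean AND)."""
        sets = [set(self.lookup(w)) for w in words]
        return sorted(set.intersection(*sets)) if sets else []

idx = InvertedIndex([
    "the element uranium as an energy source",
    "nuclear energy and uranium bombs",
])
hits = idx.conjunctive(["uranium", "energy"])
```

A Boolean conjunction thus reduces to intersecting the inverted lists of the query terms, exactly the evaluation strategy sketched for the nuclear/energy/uranium example above.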
(Note that figures of this kind are to a certain extent dependent on how words are defined: whether punctuation such as apostrophes are part of words or delimit them, for example, or whether words are distinguished by case.) Thus the complete TREC lexicon can be stored in the leaves of a B-tree of around 20 to 24 megabytes, given 9 bytes for each word, 4 bytes each for a count and a pointer, and making an allowance for space wastage. Assuming a block size of 8 kilobytes, and therefore a branching factor of 2⁸ to 2⁹, the total space for all internal nodes of the B-tree would occupy no more than 128 kilobytes, and thus even in the worst case only the leaves need be held on disk. In a basic representation the inverted lists would contain 220 × 10⁶ document identifiers of four bytes each, or a little under 1 gigabyte in total. This high ratio of inverted list size to lexicon size is typical of text databases, and is the reason that, in
contrast to other database applications, inverted lists are not stored directly in the B-tree: their size would prohibit scanning of the lexicon. 5.2.3 Inverted lists A basic inverted list consists of a series of document identifiers, as illustrated in Figure 5.2. But such a list does not support the kinds of queries discussed above; ranking requires word frequencies and proximity requires word positions. Addition of frequency information to a list is straightforward: each document identifier is followed by a frequency count for that word in that document. Addition of word positions is only a little more difficult, but can add considerably to index size: each document identifier is followed by a frequency count f, then by f ordinal word positions. Thus the inverted list for uranium might be 3:1(61), 10:2(14,106), 12:1(9), 29:4(22,36,98,202), ... representing that the word uranium occurs in document 3 once, at position 61; in document 10 twice, at positions 14 and 106; in document 12 once, at position 9; and so on. The punctuation is of course only for the benefit of the reader; the list is stored as the sequence 3 1 61 10 2 14 106 12 1 9 29 4 22 36 98 202 ... For the 3 gigabytes of TREC data discussed above, the index would contain 220 × 10⁶ document identifiers, 220 × 10⁶ frequencies, and 480 × 10⁶ positions. Query processing (explained in detail below) involves retrieving the inverted list corresponding to each term in the query, then processing the list to extract document numbers and, if necessary, frequencies and positions.
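The flat sequence layout just described can be captured by a pair of encode/decode routines (a sketch; the function names are illustrative):

```python
def encode_postings(postings):
    """Flatten [(doc, [positions]), ...] into the sequence
    doc f pos1 ... posf doc f ... described in the text."""
    out = []
    for doc, positions in postings:
        out.append(doc)
        out.append(len(positions))   # frequency count f
        out.extend(positions)        # f ordinal word positions
    return out

def decode_postings(seq):
    """Inverse of encode_postings: the count f after each document
    identifier says how many positions to consume next."""
    postings, i = [], 0
    while i < len(seq):
        doc, f = seq[i], seq[i + 1]
        postings.append((doc, seq[i + 2 : i + 2 + f]))
        i += 2 + f
    return postings

uranium = [(3, [61]), (10, [14, 106]), (12, [9]), (29, [22, 36, 98, 202])]
flat = encode_postings(uranium)
# flat == [3, 1, 61, 10, 2, 14, 106, 12, 1, 9, 29, 4, 22, 36, 98, 202]
```

The embedded frequency counts are what make the punctuation-free sequence unambiguous: a reader always knows whether the next integer is a document number, a count, or a position.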
A typical query term occurs in up to 1% of the stored documents, and may occur in many more, so in a larger collection the typical retrieved inverted list will contain thousands or tens of thousands of document identifiers. Fetching and processing of these lists is the major bottleneck in query evaluation, and any improvement can yield big reductions in query evaluation time. The first issue to address is the physical layout of the inverted lists on disk. The two costs of accessing data from disk are the head-positioning time (seek and latency) and the per-bit transfer cost. A programmer cannot directly improve transfer costs, which on current desktop machines allow transmission of approximately 10 megabytes per second. But repositioning of the disk head can be largely avoided by storing each inverted list contiguously, or as close to contiguously as the operating system will allow. A contiguous file can be fetched around ten times faster than a file of 8 kilobyte blocks randomly scattered on a
disk, so dramatic gains can result from storing each inverted list so that it can be fetched with a single read operation. Experimental results have shown that, despite "interference" by the underlying file system, such as organizing files into randomly-placed blocks and employing header blocks to locate the parts of the file, the various optimizations used by operating systems allow large files to be fetched at close to the maximum dictated by the transfer rate.

In some early implementations of inverted files, each list was stored as a linked list with one node per document, resulting in both appalling performance, allowing only a few kilobytes to be fetched each second, and large inverted files, because of the additional requirement for pointers. It was implementations such as these that gave inverted files a reputation for inefficiency; a related problem was that use of linked lists discouraged programmers from maintaining inverted lists in sorted order, thus adding further to query evaluation costs. However, the strategy of storing inverted lists contiguously does present problems for update. These issues are considered further below.

Even with inverted lists stored contiguously they have significant space requirements: in a simple implementation, 4 bytes for each word occurrence (for the in-document position) and a further 8 bytes (for the document number and frequency) for each word-document pair, giving approximately 4 gigabytes for the 3 gigabyte collection described above. It is clearly desirable that this space be reduced, not only to conserve disk usage but because reduction in size cuts transfer costs and thus, potentially at least, reduces query evaluation times. As a simple first step to reducing size we could question our assumptions: why, for example, have 4 bytes for the document number?
For around 1,000,000 documents, 20 bits is adequate, increasing the complexity of processing the inverted list but reducing size significantly. Similarly, 4 bytes is excessive for a frequency or a word position. Space can also be saved by applying a stoplist, that is, not indexing the common words that contribute most to index size. Such ad hoc approaches, however, will at best halve the size of the index, to perhaps 70% of the size of the indexed data.

Much greater reductions in size, that is, compression, result from more principled methods for efficient representation of integers [Bell et al., 1993, Bookstein et al., 1992, Choueka et al., 1988, Moffat and Zobel, 1996, Witten et al., 1994]. We assume in the following discussion that the numbers to be compressed are positive integers only, but it is straightforward to adapt these coding schemes to embrace zero and negative numbers.

One simple family of representations is the Elias codes [Elias, 1975]. The Elias codes represent integers in a variable number of bits, and contiguous sequences of Elias codes are uniquely decodable. The basic code is unary, in which each number x is represented by a string of x bits (x - 1 one-bits followed by a zero). For example, below are some numbers in decimal and their equivalent in unary.
x        unary
1        0
2        10
3        110
20       11111111111111111110
7, 3, 6  1111110110111110

In the last line is shown a sequence of numbers; although no punctuation is given the sequence can be separated into its constituent numbers, that is, the sequence is uniquely decodable, an essential property for any such compression scheme. Unary is not particularly efficient for large numbers ("large" in this context means "about 4"), but it provides the first step in the Elias family. The next step is the gamma code, in which each number x is factored as 2^(p-1) + d. For example, 1 = 2^(1-1) + 0 and 20 = 2^(5-1) + 4. Storing p in unary, using p bits, and d in binary, using p - 1 bits, gives another uniquely decodable representation. (In all but the last line of the following table a comma is used to separate the unary and binary parts of each gamma code, but no such separator is required in practice.)

x        gamma
1        0,
2        10,0
3        10,1
20       11110,0100
7, 3, 6  1101110111010

The gamma code for a natural number x requires 2⌊log2 x⌋ + 1 bits, so that (decimal) 1,000,000 requires 39 bits. The next Elias code is delta, in which x is factored as for gamma but p is represented using gamma rather than unary.

x        delta
1        0,
2        100,0
3        100,1
20       11001,0100
7, 3, 6  10111100110110

Using delta, 1,000,000 is represented in 29 bits; as we discuss below this saving can, in conjunction with other manipulations, yield excellent compression. Another family of representations is the Golomb codes [Golomb, 1966, Gallager and Van Voorhis, 1975]. These codes are of particular interest because, as
we discuss below, for this application they yield optimal whole-bit compression.4 In the Golomb codes a single integer parameter b is used to model the distribution of values to be represented; this value can be approximated as b ≈ 0.69 x (average x). Given b, the number x is factored as 1 + (k - 1) x b + d where 0 ≤ d < b. The value k is represented in unary and d in binary; but since b may not be a power of 2 the number of bits used to represent d can vary between ⌊log2 b⌋ and ⌈log2 b⌉. Computing r = ⌈log2 b⌉ and g = 2^r - b, the value d is encoded in r - 1 bits if d < g and as d + g in r bits otherwise. For example, suppose b is 11, so that r is 4 and g is 5. Then the numbers 1 to 5 are represented by the sequence of codes 0,000 to 0,100 (where the range of suffixes is 0 to 4, represented in 3 bits each) and 6 to 11 are represented by 0,1010 to 0,1111 (for suffixes 5 to 10, in 4 bits each). The codes are uniquely decodable and, as for all such codes, every sequence of bits is a valid code.

Variable-bit coding is a necessary tool for compression of inverted lists. However, applying variable-bit codes to inverted lists in their raw form does not yield particularly good compression; for example, the average document number only requires one or two bits fewer than the maximum number, and as the examples above show the coding schemes do not directly result in significant reductions in size.

A simple property of inverted lists provides the basis for much greater compression. Most of the numbers stored in inverted lists, the document numbers and the positions, are strictly increasing, so by taking the difference between adjacent numbers of the same kind the values to be stored become much smaller. Our example inverted list can be written as

3:1(61), 10-3:2(14,106-14), 12-10:1(9), 29-12:4(22,36-22,98-36,202-98), ...

that is,

3:1(61), 7:2(14,92), 2:1(9), 17:4(22,14,62,104), ...
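The gamma and Golomb codes, and their application to the d-gaps of a document-number list, can be sketched as follows. This is a sketch of ours: codes are built as strings of '0'/'1' characters for readability rather than packed bits, and the function names are illustrative.

```python
def gamma(x):
    """Elias gamma: factor x as 2^(p-1) + d; store p in unary, d in p-1 bits."""
    p = x.bit_length()
    d = x - (1 << (p - 1))
    unary = "1" * (p - 1) + "0"
    return unary + (format(d, "0{}b".format(p - 1)) if p > 1 else "")

def golomb(x, b):
    """Golomb code: x = 1 + (k-1)*b + d; k in unary, d in truncated binary."""
    k = (x - 1) // b + 1
    d = (x - 1) % b
    r = max(1, (b - 1).bit_length())      # r = ceil(log2 b)
    g = (1 << r) - b
    unary = "1" * (k - 1) + "0"
    if d < g:                             # short suffix: r - 1 bits
        return unary + format(d, "0{}b".format(r - 1))
    return unary + format(d + g, "0{}b".format(r))   # long suffix: r bits

# The document numbers of the example list, stored as d-gaps and Golomb-coded:
docs = [3, 10, 12, 29]
gaps = [docs[0]] + [y - x for x, y in zip(docs, docs[1:])]   # [3, 7, 2, 17]
bitstring = "".join(golomb(gap, 11) for gap in gaps)
```

With b = 11 this reproduces the worked example in the text: r is 4, g is 5, and `golomb(6, 11)` yields the code written there as 0,1010.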
Considering for the moment just the document numbers, the sequence resulting from taking differences can be modeled as the outcome of a Bernoulli process, for which the Golomb codes are an optimal representation [Bell et al., 1993]. An inverted index consisting of a lexicon and, for each indexed word, an inverted list of Golomb-coded
document numbers occupies under 10% of the size of the indexed data. For the 3 gigabyte database discussed above such an inverted index requires about 190 megabytes. Delta codes can also be used, at a small loss of compression efficiency. Using gamma codes for frequencies and delta codes for word positions, an inverted file typically occupies about 22% of the size of the indexed data, or under 700 megabytes in our practical example: one sixth of the space required for the uncompressed index.

This space saving does come at a cost: the processing effort required to decode inverted lists. However, on current desktop machines the time spent in decompression is more than offset by the time saved in data transfer [Moffat and Zobel, 1996], and in new architectures the gap between processor speed and disk transfer rates is continuing to widen, favoring the use of compression. Thus inverted file compression saves both space and time. Further refinements to the representation of inverted files are discussed in Section 5.3. Although the successful application of compression to inverted files is fairly recent, compression is already used in several commercial text database systems and some of the Internet search engines. The public-domain MG text database system was developed to demonstrate the application of compression to this domain [Bell et al., 1995, Witten et al., 1994].

5.2.4 Index construction

There are several possible approaches to index construction for text databases, which can be broadly classified as either one-pass or two-pass, that is, according to the number of times the text is inspected during index construction. We first outline the possibilities, then describe two of the more efficient methods in detail. The concept of indexing has often been described as "inversion": provision of access to records according to content.
Inversion is often implemented as a sorting process, and indeed a common algorithm given in textbooks for generating an inverted file is as follows:

1. For each document d in the collection and each word t in d, write a pair (t, d) to a file.
2. Sort the file with t as the primary sort key and d as the secondary sort key.

This algorithm is, however, almost absurdly wasteful: the document numbers are already sorted, but sorting algorithms will gain little advantage from this partial ordering. Moreover, the volume of index information dictates an expensive external sort. Better solutions use a dynamic structure containing the distinct words in the database, where each node in the structure points to a dynamic list of
the document numbers containing that word. Initially the word structure is empty; as documents are processed new words are added, and for existing words new document numbers are added to the words' lists of occurrences (together with the positions of the word in each document). However, in a naive implementation the costs will still be high because of the difficulty of maintaining structures of words and lists without frequent disk accesses. Minimizing the use of disk is the key to fast index construction.

There are two fast index construction methods, both of which use a dedicated in-memory buffer as a temporary store. In the first method, shown in Figure 5.3, the buffer is used to store complete partial indexes and the database is processed in a single pass.

1. While the internal buffer is not full, get documents; for each document d, extract the distinct words and, for each word t,
   (a) If t has already occurred in a previous document, add d to t's document list.
   (b) Otherwise add t to the structure of distinct words and create a document list for t containing d.
2. When the internal buffer is full, write it to disk to give a partial index, with the inverted lists stored according to word order. Clear the buffer and return to step 1.
3. Merge the partial indexes to give the final inverted file.

Figure 5.3. Single-pass index construction algorithm using temporary files.

Note that compression is as useful during indexing as it is in the finished index: if the partial indexes are constructed and stored compressed, more documents can be indexed before the internal buffer is filled, and less temporary space is required for the partial indexes. The main disadvantage of this method in practice is the use of temporary space for the partial indexes, which will exceed the size of the final index because the indexed words must be repeated between files; and further space is required for merging.
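The single-pass scheme of Figure 5.3 might be sketched as follows. This is a simplified illustration of ours: the buffer limit is counted in postings rather than in (compressed) bytes, the partial indexes are kept in memory rather than written to temporary files, and positions and frequencies are omitted.

```python
import heapq
from collections import defaultdict

def build_index(documents, buffer_limit):
    """Invert documents in one pass, flushing the buffer as sorted runs."""
    runs, buffer, pending = [], defaultdict(list), 0
    for doc_id, text in enumerate(documents, start=1):
        for word in set(text.split()):
            buffer[word].append(doc_id)      # add d to t's document list
            pending += 1
        if pending >= buffer_limit:          # "buffer full": emit a partial index
            runs.append(sorted(buffer.items()))
            buffer, pending = defaultdict(list), 0
    if buffer:
        runs.append(sorted(buffer.items()))
    # Merge the partial indexes: concatenating the lists for a word shared
    # between runs preserves document order, since later runs hold larger ids.
    index = {}
    for word, docs in heapq.merge(*runs):
        index.setdefault(word, []).extend(docs)
    return index
```

With `build_index(["a b", "b c", "a c"], buffer_limit=4)` the buffer is flushed once mid-collection and the merge combines the two runs into one list per word.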
Note that given a fixed-size internal buffer the asymptotic cost of the merging grows more quickly than does the volume of data to be indexed. This is not usually a problem in practice because, at least historically, growth in database size has been matched by improvements in technology, but the single-pass algorithm is not suitable for "huge" databases. The alternative efficient method, however, has neither of these problems. This method is outlined in Figure 5.4. Given memory for a complete lexicon
and for a fixed buffer to be used as a temporary store, a text database can be rapidly indexed in two passes using no temporary disk space at all [Witten et al., 1994]. In this method, the first pass is used to construct the lexicon and a skeleton for the complete index. The skeleton is progressively filled in during the second pass, by writing the contents of the buffer when it becomes full; note that each writing of the buffer requires only a single pass through the disk, thus minimizing disk head movement.

1. Extract the distinct words from each document, and for each word count the number of documents in which it appears. (Additional statistics are required if word positions are to be stored.)
2. Use the complete lexicon and occurrence counts to create an empty, template inverted index, to be progressively filled in during the second pass. The template index contains each distinct word and, for each word, contiguous space for the word's document list.
3. Initialize the second pass by creating, in the internal buffer, an empty document list for each term in the lexicon.
4. While the internal buffer is not full, get documents; for each document d, extract the distinct words and, for each word t, add d to t's document list.
5. When the buffer is full, write the partial index into the appropriate parts of the template index, clear the document lists, and go to step 4.

Figure 5.4. Two-pass index construction algorithm.

Both methods are highly efficient in practice, indexing about half a gigabyte of text per hour on a large desktop machine. Indeed the principal costs tend not to be the indexing itself but the auxiliary processes such as the parser for extraction of words from each document.

5.2.5 Index update

Compared to records in conventional databases, each record in a text database contains a large number of items to be indexed, usually hundreds and often thousands or more.
Index update is therefore expensive: insertion of a single record involves changing the inverted list of every word occurring in that record. These changes can increase the length of the inverted lists, so that (if stored contiguously) they may no longer fit at their current location on disk,
and update therefore also involves moving lists to allow for such increases. The cost of update is the most significant technical difficulty faced in the implementation of a text database system. In this section we describe approaches to the update of indexes for text databases, principally considering record insertions, as these are by far the most common update operation on text databases: in contrast to conventional databases, in which every record in a table may be modified daily by operations such as "add interest to every account balance", there are no bulk updates, and a great many text databases are used to store streams of incoming data such as newspaper articles, court transcripts, and completed documents of one kind or another.

There is no single clever strategy that dramatically reduces update costs (which, for similar reasons, are also a problem for the alternative technology of signature files). There are however several strategies for ameliorating update costs: by using temporary space, by trading update time against query evaluation time, and by deferring the availability of new documents. We now outline some of these strategies.

Updating the index as each record is inserted is costly, but the per-record cost rapidly diminishes if insertions are batched, say into groups of R records, and all of the corresponding index updates handled at once. Such aggregation of updates is effective because records share many words (in particular the common words, whose inverted lists are the most expensive to access and update), and because the changes to the inverted lists can be handled in order of appearance on disk, minimizing head movement: net seek time will be almost unchanged compared to updating the inverted file for a single record. Varying R trades the per-record cost of update against the delay until the record becomes available.
In some environments, for example, it may be quite reasonable to process all insertions overnight, in which case the amortized update cost is negligible but the database will be unavailable while the index is modified. In other environments, the downtime and the delay in availability of new records are unacceptable. However, simple variants of the batching strategy can still be used. For example, if the new records are not indexed immediately that does not mean that they are unavailable; they can be held in a pool that is exhaustively searched during evaluation of each query.5 If this pool is large enough that exhaustive searching is an unreasonable expense, the pool can be treated as a mini-database and indexed accordingly.

Once we grant the existence of a pool index, further cost ameliorations are possible. In particular, the main index can be updated on the fly, with each inverted list updated as the opportunity arises: when that inverted list is fetched as part of query evaluation, for example, or when a moment of inactivity allows the machine to schedule the update.
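The pool strategy can be sketched as follows. The class and method names are illustrative, not from any described system; this simple version assumes newly inserted document identifiers are larger than those already indexed, so appending during the batched merge preserves sorted lists.

```python
class PooledIndex:
    """Main inverted index plus a pool of unindexed new records."""

    def __init__(self):
        self.index = {}      # word -> sorted list of doc ids (the main index)
        self.pool = {}       # doc id -> raw text, not yet indexed

    def insert(self, doc_id, text):
        # New records go into the pool; the index is untouched.
        self.pool[doc_id] = text

    def merge_pool(self):
        # The batched update: index the whole pool at once.
        for doc_id, text in sorted(self.pool.items()):
            for word in set(text.split()):
                self.index.setdefault(word, []).append(doc_id)
        self.pool.clear()

    def lookup(self, word):
        # Index lookup, plus an exhaustive scan of the (small) pool.
        matches = list(self.index.get(word, []))
        matches += [d for d, text in self.pool.items() if word in text.split()]
        return sorted(matches)
```

Records are thus queryable immediately after insertion, while the expensive inverted-list changes are deferred until `merge_pool` runs.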
A further amelioration is to consider the organization of each inverted list on disk. Contiguous storage is clearly preferable for fast query evaluation, but does not allow the fastest update, for the reasons discussed above. However, it does allow reasonable update. A simple free-list of available space can be used to maintain the index, for example, typically resulting in space utilization of around 67%: an unfortunate increase in index size, but not a disaster given the small initial size. An alternative is to carve each list into blocks in some way. Here again there is a trade-off, since long blocks are highly wasteful of space (the average inverted list is kilobytes long but the median is only tens of bytes) while short blocks are in effect a linked list. One approach that has been suggested is to use a linked list of blocks, each one twice the length of its predecessor [Faloutsos and Jagadish, 1992]. However, if applied to all the lists this solution does not reduce storage costs and increases query evaluation costs. To see why, consider how the individual blocks must be allocated. Either each block size must be stored in a separate file or blocks must be managed within a single file via a scheme such as the buddy system; in either case significant head movements are required to fetch a single inverted list. Moreover, in either case the trailing block in each list will be only partially used, giving average space utilization of 75%. In the presence of update some of the blocks of each size will be unused, further reducing space utilization. Thus the scheme uses only slightly less space than contiguous storage but adversely impacts query evaluation.
The volume of data read and written during update is reduced (in both cases the whole list must be read; in the contiguous case, if there is no room for expansion the whole list must be written elsewhere, whereas in the blocked case only the end of the list must be written), but more separate disk accesses are required for the blocked lists. A practical compromise is to partition only the longest lists into fixed- or variable-length blocks, and use conventional space management strategies for the rest so that these lists are stored contiguously. A block size that reflects the organization of the underlying file system is likely to give good performance. Note that maintaining the contents of a contiguous list in sorted order is not a significant overhead: even if updates (as opposed to insertions) are frequent, the cost of inserting a number into an array in memory is dwarfed by the cost of reading or writing the array to disk, and maintaining sorted order significantly reduces the cost of query evaluation.

5.2.6 Signature files

Our presentation of inverted files has been rather clear-cut, specifying exactly how text should be indexed, with only limited options for variations that might
improve performance. We are able to present the material in this way because, currently at least, the technology is fairly settled. There is no competing methodology for indexing text that efficiently supports evaluation of query types such as ranking and proximity. Inverted files have not always held such a position, however. An alternative technology for more limited applications is signature files.

In signature files, each record is represented by a fixed-length bitstring, or signature [Pfaltz et al., 1980]. The words in the record are hashed to decide which bits are set to 1; a record is probabilistically likely to contain a given word if all the bits in its signature that correspond to that word are set. As in all hash-based methods an explicit vocabulary is not required. Naive query evaluation requires inspection of all the signatures. However, only those bit positions corresponding to the query terms need to be inspected, so, by transposing the array of signatures into an array of bitslices, rapid evaluation of conjunctive queries is possible [Roberts, 1979]. Further improvements can be obtained by organizing the slices into a multi-level structure [Kent et al., 1990, Sacks-Davis et al., 1987]. Once likely matches are identified these records must be retrieved and post-processed to verify whether they contain the query terms.

Signature files are well-suited to many of the older text database applications, which featured fixed-length documents such as abstracts; machines with small memories and large numbers of users; and simple Boolean and adjacency queries. Compared to the traditional linked-list inverted files, signature files are rather smaller and give significantly better evaluation times.
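A bitsliced signature file along these lines might look as follows. This is a toy sketch: the signature width, the number of bits set per word, and the use of Python's built-in `hash` are all illustrative choices, and a real system would verify candidates against the records themselves.

```python
W = 64          # signature width in bits (illustrative)
K = 3           # bits set per word (illustrative)

def word_bits(word):
    """Bit positions a word sets in a signature."""
    return {hash((word, i)) % W for i in range(K)}

def build_slices(documents):
    """Transpose document signatures into W bitslices, one int per slice."""
    slices = [0] * W                     # bit d of slice j: doc d sets bit j
    for doc_id, text in enumerate(documents):
        for word in set(text.split()):
            for j in word_bits(word):
                slices[j] |= 1 << doc_id
    return slices

def candidates(slices, query_words, n_docs):
    """AND together only the slices selected by the query terms."""
    mask = (1 << n_docs) - 1             # start with every document
    for word in query_words:
        for j in word_bits(word):
            mask &= slices[j]
    # Probable matches only: false matches are possible, false misses are not.
    return [d for d in range(n_docs) if mask >> d & 1]
```

Only K slices per query term are touched, which is what makes conjunctive queries cheap; the price is the post-processing step to eliminate false matches.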
However, signatures are not effective for current text applications, partly because they are poor at indexing databases whose records vary dramatically in length, and partly because they do not provide efficient evaluation mechanisms for the rich query paradigms that users now expect of text databases, including not only ranked and proximity queries but the structure-based querying discussed below. Moreover, they are not as compact as current inverted file implementations, which radically improve on the implementations of only a few years ago [Zobel et al., 1992, Zobel et al., 1995a].

5.3 Query evaluation

5.3.1 Boolean queries

Boolean query evaluation is, conceptually, a straightforward application of elementary algorithms. Assuming the inverted lists are stored in sorted order (and neglecting for the moment queries involving phrases or proximity), each operation is a simple linear merge of two sorted lists, with intersection for AND and union for OR. The temporary space required to represent the result of the merge is at most one slot for each document in the database.
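The two merges are standard linear list merges and can be sketched directly (function names are ours):

```python
def and_merge(xs, ys):
    """Intersection of two sorted document-number lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i]); i += 1; j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out

def or_merge(xs, ys):
    """Union of two sorted document-number lists."""
    out, i, j = [], 0, 0
    while i < len(xs) or j < len(ys):
        if j == len(ys) or (i < len(xs) and xs[i] < ys[j]):
            out.append(xs[i]); i += 1
        elif i == len(xs) or ys[j] < xs[i]:
            out.append(ys[j]); j += 1
        else:                      # equal: emit once, advance both
            out.append(xs[i]); i += 1; j += 1
    return out
```

Both run in time linear in the combined list lengths, which is why keeping inverted lists in sorted order matters so much for Boolean evaluation.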
Evaluation is only made slightly more complex by the introduction of proximity queries. An intersecting merge is used to find the documents containing the words that must be proximate; then a comparison of positions is used to check that the words are appropriately close within the documents. Note that the word positions should be represented as ordinal word occurrences rather than byte positions, or it is not possible to reliably identify whether two words are actually adjacent.

5.3.2 Ranked queries

The principle of ranking was sketched out above: a similarity measure such as the cosine is used to allocate a numerical score to each document in the collection with respect to the query, then the documents with the highest scores are retrieved for presentation to the user. In this section we explain how an index can be used to rapidly compute the scores for the highest-ranked documents. Reformulating the cosine measure as

C(q,d) = ( Σ over t in q and d of Sq,d,t ) / (Wq x Wd)

where Sq,d,t = wq,t x wd,t, it can be seen that, for any document d, the value Sq,d,t is non-zero only if t occurs in q, that is, if t is a query term. The numerator Σ Sq,d,t can be computed considering only query terms; thus all the information required to compute the numerators is available in an inverted file. (For the remainder of this discussion we assume that each inverted list consists of (d, fd,t) document-number and frequency pairs, and that position information is either not stored or is ignored by the ranking process.) The query length Wq is unnecessary, but the document lengths Wd must be precomputed and stored in a separate structure; with efficient representations these lengths can be stored in a few bits each [Moffat et al., 1994]. Using the inverted file, the cosine similarity of a document d and query q can be computed as in the elementary ranking algorithm in Figure 5.5.
An array of accumulators is used to store, for each document in the database, the running total of the partial sum Σ Sq,d,t. For a typical database and query, once index processing is complete a reasonable fraction of the accumulators will be non-zero. These accumulators are then normalized by the document lengths, and a partial sort such as a heapsort is used to identify the k documents with the highest cosine values.

1. Create an array A of accumulators, one for each document d in the database, and for each d initialize Ad ← 0.
2. For each term t in the query,
   (a) Compute the term weight wq,t.
   (b) Retrieve the inverted list for t from disk.
   (c) For each entry (d, fd,t) in the inverted list, compute wd,t and set Ad ← Ad + Sq,d,t.
3. Divide each non-zero accumulator Ad by the document length Wd.
4. Identify the k highest accumulator values (where k is the number of documents to be presented to the user) and retrieve the corresponding documents.

Figure 5.5. Elementary ranking algorithm using an array of accumulators.

The elementary ranking algorithm provides reasonable performance, and indeed has been employed in many practical information retrieval systems. However, it has significant costs that in many environments are unacceptable, particularly for larger document collections. First, ranked queries are often expressed in natural language, and therefore contain a large number of query terms; from the point of view of effectiveness this is beneficial, because increasing the number of query terms can significantly improve the likelihood that the query will locate relevant documents. Second, some of the query terms may occur in a good fraction of the records in the database. The inverted lists for these query terms must be retrieved and processed in full, and some of them may be long. Third, the array of accumulators, which contains a floating point value for each document in the database, is accessed frequently and randomly and hence must be stored in memory; and a separate array is required for each simultaneous query. Fourth, the array of document lengths must be either held in memory or fetched in full for each query.6

In combination, there is substantial use of disk traffic, for inverted list retrieval; memory, for accumulators and document lengths; and processor time, for decompression, accumulator update, and accumulator normalization. We need to consider ways to reduce all these costs. An observation that allows savings in all of these resources is that a total ranking is unnecessary: in response to a given query users are only interested in a tiny subset of the document collection.
Thus it is not necessary to compute the similarity of every document. Using simple heuristics, several of which are discussed below, it is straightforward to drastically prune the number of accumulators required without degrading retrieval effectiveness. (However, note that two methods can highly rank completely different documents. That
is, maintenance of effectiveness does not imply that the same documents are fetched, but only that the same proportion of fetched documents are relevant.) Once the number of accumulators is reduced, index reorganizations can be used to reduce the other resource requirements.

A straightforward approach to reducing the number of accumulators is to restrict their number to some fixed value Amax where Amax << N, the number of documents. In simple versions of such algorithms [Moffat and Zobel, 1996], query terms are processed in order of decreasing importance as measured by their inverse document frequency; each (d, fd,t) pair is decoded, and d, if not previously encountered, is only allocated an accumulator if the limit Amax has not yet been met. Thereafter only existing accumulators can be updated, and (d, fd,t) pairs referring to other documents are ignored. Thus only documents containing rare (high inverse document frequency) terms are allocated accumulators, on the heuristic assumption that documents without such terms are unlikely to be relevant.

Experimentally there was no impact on effectiveness with Amax set so that only around 2% of the documents have an accumulator, reducing memory requirements by about a factor of 15 (although there is only one-fiftieth of the number of accumulators, each accumulator now requires a document number and is stored in a sparse data structure), and eliminating some of the computational requirement for accumulator update. Since most of the (d, fd,t) pairs in each inverted list are no longer used, particularly in the long inverted lists of common terms, the decompression of these pairs is wasted effort. Most of the decompression can be avoided by introducing a small amount of internal structure into each inverted list to allow the unused (d, fd,t) pairs to be skipped, slightly increasing disk traffic but halving processing costs.
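The accumulator-limiting strategy can be sketched as follows. This is an illustration of ours, with a deliberately simplified weight Sq,d,t = fd,t x idf(t), document frequency standing in for inverse document frequency ordering, and no document-length normalization.

```python
import heapq
import math

def ranked_query(query_terms, index, n_docs, a_max, k):
    """index: term -> list of (doc, f_dt) pairs; returns top-k (score, doc)."""
    # Process terms rarest first, i.e. in decreasing inverse document frequency.
    terms = sorted(query_terms, key=lambda t: len(index.get(t, [])))
    acc = {}                              # sparse accumulator structure
    for t in terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))   # simplified weight
        for doc, f in postings:
            if doc in acc:                # existing accumulators may be updated
                acc[doc] += f * idf
            elif len(acc) < a_max:        # only rare terms create accumulators
                acc[doc] = f * idf
    return heapq.nlargest(k, ((s, d) for d, s in acc.items()))
```

With a_max well below the collection size, documents matching only common terms never obtain an accumulator and their postings are simply skipped.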
This internal structure can also be used to accelerate Boolean query processing. With these improvements the remaining important bottleneck in processing is the disk traffic.

An alternative method further reduces processing costs and also reduces disk traffic [Persin et al., 1996]. The basic idea is that, by only allowing sufficiently large Sq,d,t values to create an accumulator, the number of accumulators will be reduced. The principle underlying "sufficiently large" is that, because accumulator values grow as inverted lists are processed and because Sq,d,t values tend to diminish if inverted lists are processed in decreasing order of inverse document frequency, the effect of adding further Sq,d,t terms to the accumulators is increasingly marginal: the additions not only are unlikely to bring new documents into the top k but cannot even significantly perturb the ranking. By comparing each Sq,d,t value to two current thresholds (one to check whether the value should be considered at all and one to check whether it warrants a new accumulator), small Sq,d,t values can be filtered and the number of accumulators restricted.
The thresholds are increased as inverted lists are processed. This method, like the skipping method, drastically reduces memory requirements without degrading retrieval effectiveness, but it requires two parameters to control the degree of filtering.

If the inverted files are designed appropriately, disk traffic can also be dramatically reduced. The principle of the index design is that inverted lists are sorted by within-document frequency rather than by document number. For example, consider the inverted list

(5,3)(9,2)(12,2)(16,5)(21,1)(25,2)(32,4),

representing that the term being indexed occurs three times in document 5, twice in document 9, and so on. If the list is ordered first by decreasing within-document frequency, with a secondary sort by document number, then it becomes

(16,5)(32,4)(5,3)(9,2)(12,2)(25,2)(21,1).

With this ordering, all of the sufficiently large Sq,d,t values in each inverted list are at the start; once a small Sq,d,t value is reached, fetching and processing of that inverted list can terminate. In the experiments of Persin et al. this allowed a five-fold reduction in disk traffic and processing time.

A potential drawback of this reorganization of inverted lists is that the document numbers are no longer sorted, so that the compression strategy described above is not strictly applicable. However, a straightforward modification of it yields equally good compression. First, the frequencies are stored in decreasing order, so the duplicate frequencies are redundant and can be omitted. Second, in practice most of the frequencies are either 1 or 2, and compressing the sorted document numbers of a given frequency yields good space saving. Overall, frequency sorting slightly reduces index size.

Another alternative, also based on frequency-sorted inverted lists, is to interleave the processing of the inverted lists rather than process them sequentially [Persin, 1996].
In the query evaluation methods described above, each inverted list is processed sequentially from the beginning until either the list is exhausted or the frequencies are judged to be sufficiently small that they will not affect the ranking; once processing of an inverted list is complete, it is not revisited. But consider two terms t and t' occurring in documents d and d' respectively. Even if t is rarer than t' and has higher inverse document frequency, so that t's inverted list is processed first, it may well be that Sq,d,t is less than Sq,d',t' if t is much less frequent in d than t' is in d'. It follows that, if we are to observe the principle that high Sq,d,t values should be processed first, it is inappropriate to process the whole of the inverted list for t before commencing the list for t'.
1. Create an empty set of accumulators.
2. For each term t in the query, identify the highest within-document frequency fd,t for that term and compute the partial similarity Sq,d,t.
3. While the largest unprocessed Sq,d,t value is sufficiently large,
(a) Find the query term t with the largest unprocessed Sq,d,t value.
(b) If there is an accumulator Ad present in the set of accumulators, set Ad ← Ad + Sq,d,t.
(c) Otherwise, if the number of accumulators is less than Amax, create a new accumulator Ad and set Ad ← Sq,d,t.
(d) Compute the next highest Sq,d,t value for t.
4. Divide each accumulator Ad by the document length Wd.
5. Identify the k highest accumulator values and retrieve the corresponding documents.

Figure 5.6. Interleaved ranking algorithm using limited accumulators

In interleaved ranking, processing consists of considering the partial similarity values Sq,d,t in order of strictly non-increasing magnitude, independent of the inverted lists in which they occur. Efficiency gains result from two heuristics: limiting the number of accumulators so that only the larger Sq,d,t values can create an accumulator; and stopping when the next greatest Sq,d,t value is sufficiently small and is unlikely to affect the relative order of the highest ranked documents. Whether an Sq,d,t value is "sufficiently small" can be heuristically determined by examining the current accumulator values. An alternative approach is to explicitly bound the time required to evaluate a query, and terminate processing when the time bound is reached. Such processing is supported by frequency-sorted indexes, in which the highest frequencies in each list (and thus the highest Sq,d,t values in each list) are at the start, and (d, fd,t) values can be retrieved from each list in decreasing order. Interleaved query evaluation is shown in Figure 5.6.
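The steps of Figure 5.6 can be sketched with a heap that always yields the largest unprocessed partial similarity. The frequency-sorted lists, the similarity form f_dt * idf[t], and the fixed stopping threshold s_min are simplifying assumptions for illustration.

```python
# A sketch of the interleaved ranking algorithm of Figure 5.6, using a heap to
# find the largest unprocessed S_q,d,t value across all lists. It assumes a
# frequency-sorted index (each list ordered by decreasing f_dt) and the
# similarity form f_dt * idf[t]; both are simplifications for illustration.
import heapq

def interleaved_rank(query_terms, lists, idf, doc_len, a_max, s_min, k):
    heap = []                               # entries: (-S_q,d,t, term, position)
    for t in query_terms:
        if lists.get(t):
            d, f = lists[t][0]
            heapq.heappush(heap, (-f * idf[t], t, 0))
    acc = {}                                # limited set of accumulators
    while heap:
        neg_s, t, i = heapq.heappop(heap)
        s = -neg_s
        if s < s_min:                       # largest remaining value too small: stop
            break
        d, f = lists[t][i]
        if d in acc:
            acc[d] += s
        elif len(acc) < a_max:              # only while accumulators remain
            acc[d] = s
        if i + 1 < len(lists[t]):           # next highest S_q,d,t value for t
            d2, f2 = lists[t][i + 1]
            heapq.heappush(heap, (-f2 * idf[t], t, i + 1))
    ranked = sorted(((a / doc_len[d], d) for d, a in acc.items()), reverse=True)
    return [d for _, d in ranked[:k]]
```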
The main potential disadvantage of interleaved ranking is that inverted lists are fetched on demand, piecemeal, rather than with a single read. Fetching the whole list at once incurs the overhead of retrieving unnecessary data, while fetching the list as needed can incur the overhead of unnecessary disk activity. In practice, however, the problem does not appear to be significant: in most cases
all of the required (d, fd,t) pairs are in the first few kilobytes of each inverted list, so fetching a single disk block from the start of each list is sufficient [Brown, 1995]. Moreover, in some cases not even the first block is required; if the maximum fd,t value for each term is held with the term in the lexicon, it is possible to identify that, for some terms, no Sq,d,t value will be sufficiently large. These are not the only possible approaches for improving the basic ranking algorithm. Elimination of stopwords can be used to reduce computation costs. However, it is sometimes difficult to determine the correct set of stopwords for a particular document collection. For example, in a database of articles from the Wall Street Journal within the TREC collection, the word "text" (not a particularly common word in English) is encountered in every document in the collection. Other proposals have been based on dynamic stopping conditions. One is that the number of accumulators be limited by considering only documents that contain a term with a sufficiently high inverse document frequency [Harman and Candela, 1990]. Another possible stopping condition is to reduce the number of (d, fd,t) pairs by computing an upper bound for the similarity of the current document being considered, and ignoring Sq,d,t if the computed upper bound is smaller than the weight of the least important document in the set of answers [Lucarella, 1988]. The efficiency of the basic ranking algorithm can also be improved using the assumption that only the k top-ranked documents are to be retrieved [Buckley and Lewit, 1985]. In this method, query processing is terminated when the upper bound of the similarity of the (k+1)th document becomes less than the similarity of the kth document. However, these schemes do not provide the dramatic improvements given by the methods discussed above.
5.4 Refinements to text databases

5.4.1 Structure and fields

Traditional text retrieval systems regard each document as an unstructured sequence or bag of words. However, documents consist of fields such as titles, sections, and paragraphs. These components often conform to a hierarchical structure that can be represented by a formal schema such as an SGML document type definition [Goldfarb, 1990]. Compared to traditional database applications, text objects conforming to the same schema can vary widely in both structure and size. Consider, for example, a collection of documents relating to the technical details of the products of a manufacturing company. These documents might include memoranda, engineering reports, and surveys of technical literature, all written to conform to the company's official proforma. They might also include other memoranda written by office staff without reference to the official forms, letters that have little structure in common with either of the other classes of memoranda, documents from external sources, and so on. Yet all these documents must be searched as a single collection. The lack of uniformity among the documents in a single collection makes indexing and retrieval more complex than if the documents had uniform structure and size. We illustrate structure by considering a collection of documents in which markup (such as SGML tags) is included in the text to represent the structural information. Consider for example the document in Figure 5.7, which is a letter consisting of a head and a body. The head consists of three fields, from, to, and date, and the body consists of a number of sentences. Each structural unit is delimited by a start tag and an end tag. For example, a sentence starts with a <sentence> tag and ends with a </sentence> tag. The document forms a simple tree, in which the text is in the leaves and each structural unit is a node. Structured documents can be queried in the traditional way, as if they were no more than a sequence of words, but query languages can take advantage of the structure to provide more effective retrieval.

<letter>
<head><from>Mark Twain</from>
<to>W. D. Howells</to>
<date>15 June 1872</date>
</head>
<body><sentence> Friend Howells </sentence>
<sentence> Could you tell me how I could get a copy of your portrait as published in Hearth & Home? </sentence>
<sentence> I hear so much talk about it as being among the finest works of art which have yet appeared in that journal, that I feel a strong desire to see it. </sentence>
<sentence> Is it suitable for framing? </sentence>
...
</body>
</letter>

Figure 5.7. SGML document illustrating hierarchical structure.
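The tree structure described above can be illustrated with a toy parser. This assumes well-formed, properly nested tags only; a real SGML system would use a full parser driven by the document type definition.

```python
# A toy illustration of how markup such as that in Figure 5.7 forms a tree,
# with structural units as internal nodes and text at the leaves. Assumes
# well-formed, properly nested <tag>...</tag> pairs, which real SGML does
# not guarantee without a DTD-driven parser.
import re

def parse(marked_up_text):
    root = {"tag": "root", "children": []}
    stack = [root]
    for token in re.split(r"(<[^>]+>)", marked_up_text):
        if not token.strip():
            continue
        if token.startswith("</"):
            stack.pop()                      # end tag: close the current unit
        elif token.startswith("<"):
            node = {"tag": token[1:-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)               # start tag: open a new unit
        else:
            stack[-1]["children"].append(token.strip())   # text leaf
    return root

letter = ("<letter><head><from>Mark Twain</from></head>"
          "<body><sentence>Friend Howells</sentence></body></letter>")
tree = parse(letter)
```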
A simple example of a query involving structure is

find documents with a chapter whose title contains the phrase "metal fatigue"

If such queries are to be evaluated efficiently, they require support from indexing mechanisms. One possibility is to use conventional relational or object-oriented database technology to store and index the leaf elements of the hierarchical
structure, and maintain the relationships between these leaf elements and the higher-level elements of the document structure in other relations (or object classes). Join operations can then be used to reconstruct the original documents or document components. The problem with using such technology is that a large number of database objects may be required to store the information from a single document, so that it is expensive both to search across the document and to retrieve it for presentation. For these reasons specialized indexing techniques for structured documents have been developed. Perhaps the simplest method for supporting structure is to index the documents and process queries as for unstructured documents, so that the result of query resolution is a set of documents that potentially match the query; these documents can then be filtered to remove false matches. As a general principle it is always possible to trade the size and complexity of indexes against post-retrieval processing on fetched documents: there is a tradeoff between the amount of information in the index and the number of false matches that must be filtered out at query time, and indeed for just about any class of data and index type it is possible to conceive of queries that cannot be completely resolved using the index. It is often the case, however, that the addition of a relatively small amount of information to an index can greatly reduce the number of false matches to process; consider how adding positional information eliminates the need to check whether query terms are adjacent in retrieved documents. Moreover, the cost of query evaluation via inverted lists of known length is usually much more predictable than the cost of processing an (unknown) number of false matches. We therefore consider query evaluation techniques that involve increased index complexity and reduced post-retrieval processing.
One approach is to encode document structure in the index. For each document containing a given word, rather than storing the document number and the ordinal positions at which the word occurs, it is possible to store, say, the document number; the chapter number within the document; the paragraph within the chapter; and finally the position within the paragraph. Indexes for hierarchically structured documents require that considerably more information be stored for each word occurrence, but the magnitudes of the numbers involved are rather smaller, the "take difference and encode" compression strategies can be applied, and there is plenty of scope to remove redundancy: if a word occurs twice in a document, the document number is only stored once; if it occurs twice in a chapter, the chapter number is only stored once; and so on. Experiments have shown that, compressed, the size of such an index roughly doubles compared to storing ordinal word positions, from about 22% of the data size to 44% of the data size [Thom et al., 1995]. The resulting indexes allow much more powerful queries to be evaluated directly, without recourse to false-match checking.
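The redundancy removal just described can be sketched by grouping occurrence paths on shared prefixes. The four-level document/chapter/paragraph/word hierarchy is one assumed layout, not a fixed scheme.

```python
# A sketch of the redundancy removal described above: occurrences of one term
# are stored as (document, chapter, paragraph, word) paths, and grouping on
# shared prefixes means a repeated document or chapter number is kept only
# once. The four-level hierarchy is an assumed layout for illustration.

def group_positions(paths):
    """paths: iterable of (doc, chapter, paragraph, word) tuples.
    Returns nested dicts: doc -> chapter -> paragraph -> [word positions]."""
    index = {}
    for doc, chap, para, word in sorted(paths):
        index.setdefault(doc, {}) \
             .setdefault(chap, {}) \
             .setdefault(para, []) \
             .append(word)
    return index

occurrences = [(7, 1, 2, 14), (7, 1, 2, 30), (7, 3, 1, 6), (9, 2, 5, 11)]
index = group_positions(occurrences)
# document 7 and chapter 1 are each stored once though they cover several
# occurrences; "take difference and encode" can then compress each level
```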
Rather than encode the structural information within the inverted indexes, another approach is to maintain simple word-position indexes for each term in the database and record the structural information in separate indexes. In order to represent the positions of the words and the markup symbols, the words in each document are given consecutive integer numbers and the markup symbols are given intermediate rational numbers. Thus, for example, a certain word might occur at position 66, the start tag for a paragraph at position 53.5, and the end tag at position 69.1, from which it can be deduced that the word occurs in the paragraph. The positions between a start tag and the corresponding end tag constitute an interval. Evaluating Boolean queries with conventional text indexes involves merging the inverted lists of the query terms. In contrast, the processing of structural queries involves merging inverted lists of word positions and inverted lists of intervals. For example, processing the query find sentences containing "fatigue" involves merging the inverted list of word positions for the term "fatigue" and the inverted list of intervals for the tag sentence to identify the set of intervals containing the word. An approach to querying on structure based on text intervals was formalized as the GCL (Generalized Concordance Lists) model [Clarke et al., 1995]. The GCL model includes an algebra that incorporates operators to eliminate intervals that wholly contain (or are wholly contained in) other intervals. These operators are important for efficient query processing. GCL evolved from two earlier structured text retrieval languages developed at the University of Waterloo [Burkowski, 1992, Gonnet and Tompa, 1987], one of which, the Pat text searching system, was developed for use with the New Oxford English Dictionary. Dao et al.
[Dao et al., 1996] extended the GCL model to manage recursive structures (such as lists within lists). Compared to the approach of incorporating document structure within the inverted indexes, the GCL model and its variants have two important advantages: queries on structure only (such as "find documents containing lists") can be evaluated efficiently using the interval index; and the GCL model does not require that the document structure be hierarchical. On the other hand, it is expensive to create and manipulate inverted lists of commonly occurring tags (such as section or paragraph) that are contained in every document, so that, for hierarchical document collections, incorporating document structure within the inverted index is likely to have performance advantages. For example, a simple query to find sentences containing two given terms only requires, with a hierarchical index, that the inverted lists for the query terms be retrieved and processed; with the interval approach it is also necessary to fetch and process the inverted list of sentence tags.
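The core merge of word positions against tag intervals can be sketched as follows. Both inputs are assumed sorted, and integer positions stand in for the word/tag numbering described above.

```python
# A minimal interval merge in the spirit of the GCL model: given the intervals
# of a tag (here, sentences) and the positions of a term, report the intervals
# containing at least one occurrence. Inputs are assumed sorted; integer
# positions stand in for the word/tag numbering described above.
import bisect

def containing_intervals(intervals, positions):
    hits = []
    for start, end in intervals:
        i = bisect.bisect_left(positions, start)   # first position >= start
        if i < len(positions) and positions[i] <= end:
            hits.append((start, end))
    return hits

sentences = [(1, 10), (11, 25), (26, 40)]       # intervals for <sentence> tags
fatigue = [7, 30]                               # word positions of "fatigue"
print(containing_intervals(sentences, fatigue))
# → [(1, 10), (26, 40)]
```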
5.4.2 Pattern matching

Standard query languages for text databases include pattern matching constructs such as wildcard characters and other forms of partial specification of query terms. In particular, in both ranking and Boolean queries users often use query terms such as comput* to match all words starting with the letters comput, and more general patterns may also be used. A common approach is to scan the lexicon to find all terms that satisfy the pattern matching construct and then retrieve all the corresponding inverted lists. Since the lexicon is ordered, prefix queries, where patterns are of the form X*, can be evaluated efficiently since, with a lexicon structure such as a B-tree, all possible matching terms are stored contiguously. However, other pattern queries can require a linear scan of the whole lexicon. The problem, in a large lexicon, is to rapidly find all terms matching the specified pattern. A standard solution is to use a trie or a suffix tree [Morrison, 1968, Gonnet and Baeza-Yates, 1991], which indexes every substring in the lexicon. Tries provide extremely fast access to substrings but have a serious drawback in this application: the need for random access means that they must be held in main memory, and at typically eight to ten times the size of the indexed lexicon this means that, for TREC, up to 100 megabytes of memory is required. Unless speed is the only constraint, smaller structures are preferable. One alternative is to use a permuted dictionary [Bratley and Choueka, 1982, Gonnet and Baeza-Yates, 1991] containing all possible rotations of each word in the lexicon, so that, for example, the word range would contribute the original form |range and the rotations range|, ange|r, nge|ra, ge|ran, and e|rang, where | indicates the beginning of a word. The resulting set of strings is then sorted lexicographically.
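The rotation construction can be sketched directly from the book's example; each pattern class then reduces to a prefix search on the sorted rotations.

```python
# Building the permuted lexicon described above. As in the book's example,
# '|' marks the beginning of a word; every rotation of '|' + word is kept and
# the whole set is sorted, so each pattern class reduces to a prefix search:
# X* searches for rotations starting with "|X", *X for "X|", *X* for "X",
# and X*Y for "Y|X".

def permuted_lexicon(words):
    entries = []
    for w in words:
        s = "|" + w
        for i in range(len(s)):
            entries.append((s[i:] + s[:i], w))   # one rotation per character
    return sorted(entries)

lex = permuted_lexicon(["range", "ranger", "orange"])
# evaluate *nge* : binary search for rotations with prefix "nge"
matches = sorted({w for rotation, w in lex if rotation.startswith("nge")})
print(matches)
# → ['orange', 'range', 'ranger']
```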
Using this mechanism, all patterns of the form X*, *X, *X*, and X*Y can be rapidly processed by binary search on the permuted lexicon. The permuted lexicon can be implemented as an array of pointers, one to each character of the original lexicon, or about four times the size of the indexed data. Update of the structure is fairly slow. Another approach is to index the lexicon with compressed inverted files [Zobel et al., 1993]. The lexicon is treated as a database that can be accessed using an index of fixed-length substrings of length n, or n-grams. To retrieve strings that match a pattern, all of the n-grams in the pattern are extracted; the words in the lexicon that contain these substrings are identified via the index; and these words are checked against the pattern to eliminate false matches. This approach provides general pattern matching at a smaller overhead, with indexes of around the same size as the indexed data; matching is significantly slower than with the methods discussed above but still much faster than exhaustive
search. A related approach is to index n-grams with signature files [Owolabi and McGregor, 1988], which can have similar performance for short strings.

5.4.3 Phonetic matching

Pattern matching is not the only kind of string matching of value for text databases. Another kind of matching is by similarity of sound: to identify strings that, if voiced, may have the same pronunciation. Such matching is of particular value for databases of names; consider for example a telephone directory enquiry line. To provide such matching it is necessary to have a mechanism for determining whether two strings may sound alike (that is, a similarity measure) and, if matching is to be fast, an indexing technique. Thus phonetic matching is a form of ranking. Many phonetic similarity measures have been proposed. The best known (and oldest) is the Soundex algorithm [Hall and Dowling, 1980, Kukich, 1992] and its derivatives, in which strings are reduced to simple codes and are deemed to sound alike if they have the same encoding. Despite the popularity of Soundex, however, it is not an effective phonetic matching method. Far better matching is given by lexicographic methods such as n-gram similarities, which use the number of n-grams in common between two strings; edit distances, which use the number of changes required to transform one string into another; and phonetically-based edit distances, which make allowance for the similarity of pronunciation of the characters involved [Zobel and Dart, 1995, Zobel and Dart, 1996]. An n-gram index can be used to accelerate matching, by selecting the strings that have short sequences of characters in common with the query string, to be subsequently checked directly by the similarity measure. The speed-up available from such indexes is limited, however, because typically 10% of the strings are selected by the index as candidates.
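The edit distance named above, the basis of the phonetically-weighted variants, can be sketched with the standard dynamic program:

```python
# Edit (Levenshtein) distance, one of the lexicographic measures named above:
# the number of single-character insertions, deletions, and substitutions
# needed to turn one string into another. Phonetically-based variants weight
# the substitution cost by similarity of pronunciation.

def edit_distance(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]

print(edit_distance("meyer", "meier"))   # → 1
```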
5.4.4 Passage retrieval

Documents in text databases can be extremely large; one of the documents in the TREC collection, for example, is considerably longer than Tolstoy's War and Peace. Retrieval of smaller units of information than whole documents has several advantages: it reduces disk traffic; small units are more likely to be useful to the user; and they may represent blocks of relevant material from otherwise irrelevant text. Such smaller units, or passages, could be logical units such as sections or series of paragraphs, or might simply be any contiguous sequence of words. Passages can be used to determine the most relevant documents in a collection, on the principle that it is better to identify as relevant a document that
contains at least one short passage of text with a high number of query terms rather than a document with the query terms spread thinly across its whole length. Experiments with the TREC collection and other databases show that use of passages can significantly improve effectiveness [Callan, 1994, Hearst and Plaunt, 1993, Kaszkiel and Zobel, 1997, Knaus et al., 1995, Mittendorf and Schauble, 1994, Salton et al., 1993, Wilkinson, 1994, Zobel et al., 1995b]. Use of passages does increase the cost of ranking, because more distinct items must be ranked, but the various techniques described earlier for reducing the cost of ranking are as applicable to passages as they are to whole documents.

5.4.5 Query expansion and combination of evidence

Improvement of effectiveness, that is, finding similarity measures that are better at identifying relevant documents, is a principal goal of research in information retrieval. Passage retrieval is one approach to improving effectiveness. Two other approaches of importance are query expansion and combination of evidence. The longer a query, the more likely it is to be effective. It follows that it can be helpful to introduce further query terms, that is, to expand the query. One such approach is thesaural expansion, in which either users are encouraged to add new query terms drawn from a thesaurus or such terms are added automatically. Another approach is relevance feedback: after some documents have been returned as matches, the user can indicate which of these are relevant; the system can then automatically extract likely additional query terms from these documents and use them to identify further matches. A recent innovation is automatic query expansion, in which, based on the statistical observation that the most highly-ranked documents have a reasonable likelihood of relevance, these documents are assumed to be relevant and used as sources of further query terms.
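Automatic expansion can be sketched as follows; selecting terms by raw frequency is a simplification of the weighted term-selection rules used in practice.

```python
# A sketch of automatic (pseudo-relevance) query expansion: assume the top
# ranked documents are relevant and add their most frequent non-query terms
# to the query. Raw-frequency selection is a simplification of the weighted
# selection used in practice.
from collections import Counter

def expand_query(query, top_documents, extra_terms=3):
    """top_documents: list of documents, each represented as a list of terms."""
    counts = Counter(t for doc in top_documents for t in doc if t not in query)
    return list(query) + [t for t, _ in counts.most_common(extra_terms)]

top_docs = [["metal", "fatigue", "crack", "stress", "crack"],
            ["fatigue", "stress", "fracture", "crack"]]
print(expand_query(["metal", "fatigue"], top_docs))
# → ['metal', 'fatigue', 'crack', 'stress', 'fracture']
```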
All of these methods can improve performance, with relevance feedback in particular proving successful [Salton, 1989]. A curious feature of document retrieval is that different approaches to measuring similarity can give very different rankings, and yet be equally effective. That is, different measures identify different documents, because they use different forms of evidence to construe relevance. This property can be exploited by explicitly combining the similarities from different measures, which frequently leads to improved effectiveness [Fox and Shaw, 1993].

5.5 Summary

We have reviewed querying and indexing for text databases. Since queries to text databases are inherently approximate, text querying paradigms must be judged by their effectiveness, that is, whether they allow users to readily locate
relevant documents. Research in information retrieval has identified statistical ranking techniques, based on similarity measures, that can be used for effective querying. The task of text query evaluation is to compute these measures efficiently, or to efficiently compute heuristic approximations to these measures that allow faster response without compromising effectiveness. The last decade has seen vast improvements in text query evaluation and text indexes. First, compression has been successfully applied to inverted files, reducing the space requirements of an index with full positional information to less than 25% of that of the indexed data, or less than 10% for an index with only the document-level information required for ranking. This compares very favorably with the space required for traditional inverted file or signature file implementations. Use of compression has no impact on overall query evaluation time, since the additional processing costs are offset by savings in disk traffic. Also, compression makes possible new efficient index construction techniques. Second, improved algorithms have led to further dramatic reductions in the costs of text query evaluation, and in particular of ranking, giving savings in memory requirements, processing costs, and disk traffic. Currently, however, the needs of document database systems are rapidly changing, driven by the rapid expansion of the Web and the growing use of intranets and corporate databases. We have described some of the new requirements for text databases, including the need to index and retrieve documents according to structure and the need to identify relevant passages within text collections. Improved retrieval methodologies are being proposed and consequently there is a need to support new evaluation modes such as query expansion and combination of evidence. These improvements are not yet well understood, and before they can be used in practice new indexing and query evaluation techniques are required. Future research in text database indexing will have to meet the demands of these advanced kinds of querying.

Notes

1. The ongoing TREC text retrieval experiment, involving participants from around the world, is an NIST-funded initiative that provides queries, large test collections, and blind evaluation of ranking techniques. Prior to TREC the cost of relevance judgments had restricted ranking experiments to toy collections of a few thousand documents.

2. Some of the online search engines, such as AltaVista, report the number of occurrences of each query term. Currently (the start of 1997) these numbers often run up to a million or so, against a database of around ten million records, showing that meaningful query terms can indeed occur in a large fraction of the database.

3. Note, however, that text databases are free of some of the costs of traditional databases. Although text database index processing can seem exorbitantly expensive in comparison to the cost of processing a query against, say, a file of bank account records, there is no equivalent in the text domain to the concept of join. All queries are to the same table and query evaluation has linear asymptotic complexity.
4. Fractional-bit codes such as those produced by arithmetic coding require less space, but are not appropriate for this application because they give relatively slow decompression.

5. The effectiveness of solutions of this kind depends on the overall design of the database system. Most current text database systems are implemented as some form of client-server architecture, with the data and server resident on one machine and, to simplify locking, with a single server process handling all queries and updates (perhaps via multiple threads) and communicating with multiple clients.

6. The array of document lengths is not strictly necessary. Instead of storing each document frequency as fd,t and storing the Wd values separately, it would be possible to store normalized frequencies fd,t/Wd in the inverted lists and dispense with the Wd array. However, such normalization is incompatible with compression and on balance degrades overall query evaluation time because of the increased disk traffic. Note that the array of Wd values can be compacted to a few bits per entry without loss of effectiveness [Moffat et al., 1994].
6 EMERGING APPLICATIONS

E. Bertino et al., Indexing Techniques for Advanced Database Systems. © Kluwer Academic Publishers 1997.

Because performance is a crucial issue in database systems, indexing techniques have always been an area of intense research and development. Advances in indexing techniques are primarily driven by the need to support different data models, such as the object-oriented data model, and different data types, such as image and text data. However, advances in computer architectures may also require significant extensions to traditional indexing techniques. Such extensions are required to fully exploit the performance potential of new architectures, as in the case of parallel architectures, or to cope with limited computing resources, as in the case of mobile computing systems. New application areas also play an important role in dictating extensions to indexing techniques and in offering wider contexts in which traditional techniques can be used. In this chapter we cover a number of additional topics, some of which are in an early stage of research. We first discuss extensions to index organizations required by advances in computer system architectures. In particular, in Section 6.1 we discuss indexing techniques for parallel and distributed database systems. We outline the main issues and present two techniques, based on B-trees and hashing, respectively. In Section 6.2 we discuss indexing techniques
for databases on mobile computing systems. In this section, we first briefly describe a reference architecture for mobile computing systems and then discuss two indexing approaches. Following those two sections, we focus on extensions required by new application areas. In particular, Section 6.3 and Section 6.4 discuss indexing issues for data warehousing systems and for the Web, respectively. Data warehousing and the Web are currently "hot" areas in the database field and have interesting requirements with respect to indexing organizations. We then conclude this chapter by discussing in Section 6.5 indexing techniques for constraint databases. Constraint databases are able to store and manipulate infinite relations and are therefore particularly suited to applications such as spatial and temporal applications.

6.1 Indexing techniques for parallel and distributed databases

Parallel and distributed systems represent a relevant architectural approach to efficiently supporting mission-critical applications requiring fast processing of very large amounts of data. The availability of fast networks, like 10 Mb/sec Ethernet or 100 Mb/sec to 1 Gb/sec Ultranet [Litwin et al., 1993a], makes it possible to process large volumes of data in parallel without any communication bottleneck. In a distributed or parallel database system, a set-oriented database object such as a relation may be horizontally partitioned and each partition stored at a database node. Such a node is called a store node for the data object [Choy and Mohan, 1996] and the number of nodes storing partitions of the data object is called the partitioning degree. Data are accessed from application programs and users residing on client nodes. A client node may or may not reside on the same physical node as a store node.
A query addressed to a given data object can be executed in parallel over the partitions into which the data object has been decomposed, thus achieving substantial performance improvements. In practice, however, efficient parallel query processing entails many issues, such as parallel join execution techniques, optimal processor allocation, and suitable indexing techniques. In particular, if indexing techniques are not designed properly, they may undermine the performance gains of parallel processing. Data structures for distributed and parallel database systems should satisfy several requirements [Litwin et al., 1993a]. Data structures should gracefully scale up with the partitioning degree. The addition of a new store node to a data object should not require extensive reorganization of the data structure. There should be no central node through which searches and updates to the data structure must go. Therefore, no central directories or similar notions should exist. Finally, maintenance operations on the data structure, like insertions or deletions, should not require updates to the client nodes.
In the remainder of this section, we present two data structures. The first is based on organizing the access structure on two levels. Given a query, the topmost global level is used to detect the nodes where data relevant to the query are stored; the lowest local level of the access structure is used to retrieve the actual data satisfying the query. There is one local level of the data structure for each partition node of the indexed data object. The second data structure is a distributed extension of the well-known linear hashing technique [Litwin, 1980]. This data structure does not require any global component. A query is sent by the client issuing the query to the store node that, according to the information the client has, contains the required data. If the data are not found at that store node, the query is forwarded by that node to the appropriate store node.

6.1.1 Two-tier indexing technique

Two simple approaches to indexing data in a distributed database can be devised based, respectively, on the notions of local index and global index [Choy and Mohan, 1996]. Under the first approach, a separate local index is maintained at each store node of a given data object. Therefore, each local index is maintained for the respective partition like a conventional index on a non-partitioned object. This approach requires a number of local indexes equal to the number of partitions. A key lookup requires sending the key value to all the local indexes to perform local searches. This approach is therefore convenient when qualifying records are found in most partitions. If, however, qualifying records are only found in a small fraction of partitions, this approach is very inefficient and in particular does not scale up for large numbers of partitions.
The main advantages of this approach are that no centralized structure exists, and updates are efficient because an update to a record in a partition only involves modifications to the local index associated with the partition. Under the global index approach, a single, centralized index exists that indexes all records in all partitions. This approach requires globally unique record identifiers (RIDs) to be stored in the index entries. Indeed, two different records in two different partitions may happen to have the same (local) RID and therefore, at a global level, a mechanism to uniquely identify such records must be in place. A simple approach is to concatenate each local RID with the partition identifier [Choy and Mohan, 1996]. The global index can be stored at any node and may be partitioned. The global approach allows the direct identification, without requiring useless local searches, of the records having a given key value. However, it has several disadvantages. First, remote updates are required whenever a partition is modified. Remote updates are expensive because of the two-phase commit protocols that must be applied whenever distributed transactions are performed. Second, a remote shared lock must be acquired on the index, whenever a partition is read, to ensure serializability. Third, the global index approach is not efficient for complex queries requiring the intersection or union of lists of RIDs returned by searches on different global indexes, if these global indexes are located at different sites. In such a case, long lists of RIDs must be exchanged among sites. Storing all the global indexes at the same site would not be a viable solution: the site storing all the global indexes would become a hot spot, thus reducing parallelism. An alternative approach, called the two-tier index, has been proposed [Choy and Mohan, 1996] to combine the advantages of the above two approaches. Under the two-tier index approach, a local index is maintained for each partition. An additional coarse global index is superimposed on the local indexes. Such a global index keeps, for each key value, the identifiers of the partitions storing records with that key value. The coarse global index is, however, optional. Its allocation may or may not be required by the database administrator, depending on the query patterns. The coarse global index may be located at any site and may be partitioned. An important requirement is that the overall index structure should be maintained consistent with respect to the indexed objects. Therefore, updates to any of the local indexes have to be propagated, if needed, to the coarse global index. However, compared to the global index approach, the two-tier index approach is much more efficient with respect to updates. Whenever a record having a key value v is removed from a partition, the coarse global index needs to be modified only if the removed record is the last one in its partition having v as key value.
By contrast, if other records with key value v are stored in the partition, the coarse global index need not be modified. Of course, the local index needs to be modified in both cases. Insertions are handled according to the same principle. Whenever a new record is inserted into a partition, the coarse global index needs to be modified only if the newly inserted record has a key value which is not already in the local index. Algorithms for efficient maintenance operations and locking protocols have also been proposed [Choy and Mohan, 1996]. With respect to query performance, the two-tier index approach has the same advantage as the global index approach. The coarse global index allows the direct identification of the partitions containing records with the searched key value. Then, the search is routed to the identified partitions, where the local indexes are searched to determine the records containing the key value. However, unlike the global index approach, the two-tier approach maximizes the opportunity for parallelism. Once the partitions are identified from the coarse global index, the search can be performed in parallel on the local indexes of
the identified partitions. In addition, the two-tier approach provides more opportunities for optimization. For example, if a search condition is not very selective with respect to the number of partitions, the coarse global index can be bypassed and the search request simply broadcasted to all the local indexes (as in the local index approach). It has been shown that the two-tier index represents a versatile and scalable indexing technique for use in distributed database systems [Choy and Mohan, 1996]. Many issues are still open to investigation. In particular, the two-tier index structure can be extended to a multi-tier index structure, where the index organization consists of more than two levels. Query optimization strategies and cost models need to be developed and analyzed.

6.1.2 Distributed linear hashing

The distributed linear hashing technique, also called LH*, has been proposed in a precise architectural framework. Basically, the availability of very fast networks makes it more efficient to retrieve data from the RAM of another processor than from a local disk [Litwin et al., 1993a]. A system consisting of hundreds, or even thousands, of processors interconnected by a fast network would be able to provide a large, distributed RAM store adequate for large amounts of data. By exploiting parallelism in query execution, such a system would be much more efficient than systems based on more traditional architectures. Such an architecture may be highly dynamic, with new nodes added as more storage is required. Therefore, there is a need for access structures for use in systems with a very large number of nodes, hundreds or thousands, able to scale gracefully. A given file, in such a system, may be shared by several clients. Clients may issue both retrieval and update operations. Distributed linear hashing has been proposed with the goal of addressing the above requirements.
An important feature of this organization is that it does not require any centralized directory and is rather efficient. It has been proved [Litwin et al., 1993a] that retrieval of a data item given its key value usually requires two messages, and four in the worst case. In the remainder of this section, we first briefly review the linear hashing technique and then discuss distributed linear hashing in more detail.

Linear hashing. Linear hashing organizes a file into a collection of buckets. The number of buckets increases linearly as the number of data items in the file grows. In particular, whenever a bucket b overflows, an additional bucket is allocated. Because of the dynamic bucket allocation, the hash function must be dynamically modified to be able to address the newly allocated buckets as well. Therefore, as in other hashing techniques, different hashing functions need to be
used because more bits of the hashed value are used as the address space grows. In particular, linear hashing uses the two functions h_i and h_{i+1}, i = 0, 1, 2, .... Function h_i generates addresses in the range (0, N × 2^i − 1), where N is the number of buckets that are initially allocated (N can also be equal to 1). A commonly used function [Litwin et al., 1993a] is:

h_i(C) = C mod (N × 2^i)

where C is the key value. Each bucket has a parameter called the bucket level, denoting which hash function, between h_i and h_{i+1}, must be used to address the bucket. Whenever a bucket overflows, a new bucket is added and a split operation is performed. However, the bucket which is split is usually not the bucket which generated the overflow. Rather, another bucket is split. The bucket to split is determined by a special parameter n, called the split pointer. Once the split is performed, the split pointer is properly modified. It always denotes the leftmost bucket which uses function h_i. Once a bucket is split, the bucket level of the two buckets involved in the split is incremented by one, thus replacing function h_i with h_{i+1} for these two buckets. Consider the example in Figure 6.1(a), adapted from [Litwin et al., 1993a]. In the example, we assume that N = 1. Suppose that the key value 145 is added. The insertion of such a key results in an overflow for the second bucket and in the addition of a third bucket. However, the bucket which is split is not the second one; it is the first one. Figure 6.1(b) illustrates the structure after the insertion and splitting. Note that a special overflow bucket is added to the second bucket to store the record with key value 145. Because n is equal to 0, the first bucket is split; the hash function to use for the first and third buckets (the newly allocated one) is h_2. Figure 6.1(c) illustrates the organization after the insertion of records with key values 6, 12, 360, and 18.
Those insertions do not cause any overflow. Suppose now that a record with key value 7 is inserted. Such an insertion results in an overflow for bucket 1. Because n is equal to 1, bucket 1 is split. Figure 6.1(d) illustrates the resulting organization. Note that the hash functions to use for the second and fourth buckets are now h_2. Because all buckets have the same local level, that is, 2, the split pointer is assigned 0. Retrieval of a record, given its key, is very efficient. It is performed according to the following simple algorithm (A1). Let C be the key to be searched; then

a ← h_i(C);
if a < n then a ← h_{i+1}(C).    (A1)
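The addressing rule (A1) and the split-pointer mechanics can be sketched as follows. This is an illustrative Python rendering with N = 1 and a hypothetical fixed bucket capacity; for simplicity, overflow records are kept in the same list rather than in separate overflow buckets.

```python
class LinearHashFile:
    """Sketch of linear hashing with N = 1 initial bucket and a
    hypothetical fixed bucket capacity."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # records per bucket before an overflow
        self.i = 0                # file level: h_i and h_{i+1} are in use
        self.n = 0                # split pointer: leftmost bucket using h_i
        self.buckets = [[]]       # N = 1 initial bucket

    def h(self, level, key):
        # h_i(C) = C mod (N * 2^i), with N = 1
        return key % (1 << level)

    def address(self, key):
        # algorithm (A1)
        a = self.h(self.i, key)
        if a < self.n:            # bucket a has already been split
            a = self.h(self.i + 1, key)
        return a

    def insert(self, key):
        b = self.address(key)
        self.buckets[b].append(key)   # overflow records stay in the same list
        if len(self.buckets[b]) > self.capacity:
            self.split()

    def split(self):
        # split the bucket denoted by n, not necessarily the one that
        # overflowed; the new bucket gets address n + 2^i
        old, self.buckets[self.n] = self.buckets[self.n], []
        self.buckets.append([])
        for k in old:
            self.buckets[self.h(self.i + 1, k)].append(k)
        self.n += 1
        if self.n >= (1 << self.i):   # every bucket is now at level i + 1
            self.n, self.i = 0, self.i + 1
```

Note that the bucket appended by split() receives index n + 2^i, exactly the address that h_{i+1} assigns to the records moved out of bucket n.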
Figure 6.1. Organization of a file under linear hashing.

Basically, the second step checks whether the bucket, obtained by applying function h_i to the key, has already been split. If so, the function h_{i+1} is to be used. The index i or i + 1 used for a bucket is the bucket level, whereas i + 1 is the file level.

LH*. In the distributed version of linear hashing, each bucket of the distributed file is actually the RAM of a node in the system. Therefore, the hash function returns identifiers of store nodes. Note that LH* could also be used if the data were stored on the disks of the various nodes rather than in RAM. However, LH* is particularly suited for systems with a very large number of nodes, as is the case when using RAM for storing a (large) database. Data stored at the various nodes are directly manipulated by clients. A client can perform searches or updates. Whenever a client issues an operation, for example a search, the first step to perform is the address calculation to determine the store node affected by the operation. Calculating such addresses requires, according to algorithm (A1), that the client be aware of the up-to-date values of n and i. Satisfying such constraints in an environment where there is a large number of clients and store nodes is quite difficult. Propagating those values, whenever they change, is not feasible given the large number of clients. Therefore, LH* does not require that clients have a consistent view of i and n. Rather, each client may have its own view of such parameters, and therefore each client may have an image of the file that may differ from the actual file.
Also, the image of a file that a client has may differ from the images other clients have. We denote by i' and n' the view that a client has of the file parameters i and n. The basic principle of LH* is to let a client use its own local parameters for computing the identifier of the node affected by the operation the client wishes to perform on the file. Therefore, the address calculation is performed
using algorithm (A1), with the difference that the client's local parameters are used. That is, the address is computed in terms of parameters i' and n' instead of i and n. The request is then forwarded to the store node whose address is returned by the address calculation step. Because a client may not have correct values for the file parameters, the store node may not be the correct one. An addressing error thus arises. In order to handle such an error, another basic principle is that each store node performs its own address calculation; this step is called the server address calculation. Note that each store node knows the level of the bucket it stores; however, it does not know the current value of n. The server address calculation is thus performed according to the following algorithm (A2). Let C be the key to be searched, let a be the address of store node s, and let j be the level of the bucket stored at s; then

a' ← h_j(C);
if a' ≠ a then
    a'' ← h_{j−1}(C);
    if a'' > a and a'' < a' then a' ← a''.    (A2)

The address a' returned by the above algorithm is the address of the store node to which the request should be forwarded if an addressing error has occurred. Therefore, whenever a store node receives a request, it performs its own address calculation. If the calculated address is its own address, the address calculated by the client is the correct one (therefore, the client has an up-to-date image of the file). If not, the server forwards the request to the store node whose address has been returned by the server address calculation, according to the above algorithm. The recipient of the forwarded operation checks the address again, by performing the server address calculation once more, and may perhaps forward the request to a third store node. It has, however, been formally proved [Litwin et al., 1993a] that the third recipient is the final one.
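Algorithms (A1) and (A2) can be rendered as two small functions. The sketch below is illustrative Python; it assumes N = 1, so that h_j(C) = C mod 2^j, and follows the text's notation (a is the receiving node's address, j the level of the bucket it stores).

```python
def h(level, key):
    # h_i(C) = C mod (N * 2^i), with N = 1
    return key % (1 << level)

def client_address(key, i_prime, n_prime):
    """Client-side address calculation (A1), using the client's possibly
    out-of-date image (i', n') of the file parameters."""
    a = h(i_prime, key)
    if a < n_prime:
        a = h(i_prime + 1, key)
    return a

def server_address(key, a, j):
    """Server-side address calculation (A2). Returns the node to which the
    request should be forwarded (a itself if no addressing error occurred)."""
    a1 = h(j, key)
    if a1 != a:
        a2 = h(j - 1, key)
        # avoid forwarding beyond the currently allocated store nodes
        if a < a2 < a1:
            a1 = a2
    return a1
```

Replaying the scenario of Figure 6.2(b), a client with i' = n' = 0 computes address 0; server_address(7, 0, 2) forwards the request to node 1, and server_address(7, 1, 2) forwards it again to node 3, which is the final recipient.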
Therefore, delivering the request to the correct store node requires forwarding the request at most twice. As a final step, a client image adjustment is performed by the store node first contacted by the client, if an addressing error occurred. The store node simply returns to the client its own values for i and n, so that the client image becomes closer to the actual image. To illustrate, consider the example in Figure 6.2(a). The example includes a client having 0 as the value for both n' and i'. Suppose that the client wishes to insert a new record with key value 7. The client address calculation returns 0 as the store node. The request is then sent to store node 0. Such store node
performs the address calculation according to algorithm (A2). The first step of the calculation returns 3 (as can be easily verified by computing 7 mod 4). Note, however, that sending the request to store node 3 would result in an error because there is no such store node. The check performed by the other steps of the algorithm prevents such a situation by generating the address of store node 1 (by applying function h_{j−1}). The request is then forwarded to store node 1. Store node 1 again performs the calculation. The calculation returns 1 and the record can therefore be inserted at store node 1.

Figure 6.2. Message exchanges in distributed linear hashing when performing insertion of a new key.

To illustrate a situation where two forwards are performed, consider the example in Figure 6.2(b), where four store nodes are allocated and each store node has a local level equal to 2. As in the above case, the request is forwarded from store node 0 to store node 1. Store node 1 performs the address calculation, which returns 3. The request is then forwarded again to store node 3, where the key is finally stored. Whenever an overflow occurs at one store node, a split operation must be performed. As for linear hashing, the store node to split is not necessarily the one where the overflow occurs. To determine the store node to split, the values of n and i must be known. One of the proposed approaches to splitting [Litwin
et al., 1993a] is based on maintaining such information at a fixed store node called the split coordinator. Whenever an overflow occurs at a store node, that node notifies the coordinator, which then starts the splitting of the proper node and calculates the new values for n and i as follows:

n ← n + 1;
if n ≥ 2^i then n ← 0, i ← i + 1.

Retrieval in LH* is extremely efficient. It takes a minimum of two messages (one for sending the request and the other for receiving the reply) and a maximum of four. The worst case, with a cost of four messages, arises when two forward messages are required. Extensive simulation experiments have shown, however, that the average performance is very close to the optimal performance. Other indexing techniques have also been proposed, as variations of the same principles of LH*, to support order-preserving indexing [Litwin et al., 1994] and multi-attribute indexing [Litwin and Neimat, 1996].

6.2 Indexing issues in mobile computing

Cellular communications, wireless LANs, radio links, and satellite services are rapidly expanding technologies. Such technologies will make it possible for mobile users to access information independently of their actual locations. Mobile computing refers to this new emerging technology extending computer networks to deal with mobile hosts, which retain their network connections even while moving. This kind of computation is expected to be very useful for mail-enabled applications, by which, using personal communicators, users will be able to receive and send electronic mail from any location, as well as be alerted about certain predefined conditions (such as a train being late or traffic conditions on a given route), irrespective of time and location [Imielinski and Badrinath, 1994]. The typical architecture of a mobile network (see Figure 6.3) consists of two distinct sets of entities: mobile hosts (MHs) and fixed hosts (FHs).
Some of the fixed hosts, called Mobile Support Stations (MSSs), are equipped with a wireless interface. By using such a wireless interface, a MSS is able to communicate with the MHs residing in the same cell. A cell is the area in which the signal sent by a MSS can be received by MHs. The diameter of a cell, as well as the available bandwidth, may vary according to the specific wireless technology. For example, the diameter of a cell spans from a few meters for infrared technology to 1 or 2 miles for radio or satellite networks. With respect to the bandwidth, LANs using infrared technology have transfer rates of the order of 1-2 Mb/sec, whereas WANs have poorer performance [Lee, 1989, Salomone, 1995]. The message sent by a MSS is broadcasted within a cell. The MHs filter the messages according to their destination address. On the other hand, MHs
located in the same cell can communicate only by sending messages to the MSS associated with that cell. MSSs are connected to the other FHs through a fixed network, used to support communication among cells. The fixed network is static, whereas the wireless network is mobile, since MHs may change their position (and therefore the cell in which they reside) over time.

Figure 6.3. Reference architecture of a mobile network.

MSSs provide commonly used application software, so that a mobile user can download the software from the closest MSS and run it on the palmtop, or execute it remotely on the MSS. Each MH is associated with a specific MSS, called the Home MSS. A Home MSS for a MH maintains specific information about the MH itself, such as the user profile, logic files, access rights, and user private files. The association between a MH and a MSS is replicated through the network. Additionally, a user may register as a visitor under some other MSSs. Thus, a MSS is responsible for keeping track of the addresses of the users who are currently residing in the cell supervised by the MSS itself. MHs can be classified into dumb terminals or walkstations [Imielinski and Badrinath, 1994]. In the first case, they are diskless hosts (such as, for instance,
palmtops) with reduced memory and computing capabilities. Walkstations are comparable to classical workstations, and can both receive and send messages on the wireless network. In any case, MHs are not usually connected to any direct power source; they run on small batteries and communicate on narrow-bandwidth wireless channels. The communication channel between a MSS and MHs consists of a downlink, by which information flows from the MSS to the MHs, and an uplink, by which information flows from the MHs to the MSS. In general, information can be acquired by a MH under two different modes:

• Interactive/On-demand: The client requests a piece of data on the uplink channel and the MSS responds by sending these data to the client on the downlink channel.

• Data broadcasting: Periodic broadcasting of data is performed by the MSS on the downlink channel. This type of communication is unidirectional. The MHs do not send any specific data requests to the MSS. Rather, they filter data coming from the downlink channel, according to user-specified filters.

In general, combined solutions are used. However, the most frequently demanded items will be periodically broadcasted, creating a sort of storage on the air [Imielinski et al., 1994a]. The main advantage of data broadcasting is that it scales well when the number of MHs grows, as its cost is independent of the number of MHs. The on-demand mode should be used for data items that are seldom required. The main problem of broadcasting is related to energy consumption. Indeed, MHs are in general powered by a battery. The lifetime of a battery is very short and is expected to increase only 20% over the next 10 years [Sheng et al., 1992]. When a MH is listening to the channel, the CPU must be in active mode to examine data packets. This operation is very expensive from an energy point of view, because often only a few data packets are of interest to a particular MH.
It is therefore important for the MH to run under two different modes:

• Doze mode: The MH is not disconnected from the network, but it is not active.

• Active mode: The MH performs its usual activities; when the MH is listening to the channel, it must be in active mode.

Clearly, an important issue is how to switch from doze mode to active mode in a clever way, so that energy dissipation is reduced without incurring a loss of information. Indeed, if a MH is in doze mode when the information of interest is being broadcasted, that information is lost by the MH.
Figure 6.4. MH and MSS interaction.

Approaches to reduce energy dissipation are therefore important for several reasons. First of all, they make it possible to use smaller and less powerful batteries to run the same applications for the same time. Moreover, the same batteries can also run for a longer time, resulting in a monetary saving. In order to develop such efficient solutions, allowing MHs to switch from doze mode to active mode and vice versa in a timely manner, indexing approaches have been proposed. In the next subsection, the general issues related to the development of an index structure for data broadcasting are described, whereas Subsection 6.2.2 illustrates some specific indexing data structures. The discussion follows the approaches presented in [Imielinski et al., 1994a].

6.2.1 A general index structure for broadcasted data

We assume, without loss of generality, that broadcasted data consist of a number of records identified by a key. Each MSS periodically broadcasts the file containing such data on the downlink channel (also called the broadcast channel). Clients receive the broadcasted data and filter them. Filtering is performed by a simple pattern-matching operation against the key value. Thus, clients remain in doze mode most of the time and tune in periodically to the broadcast channel, to download the required data (see Figure 6.4). To provide selective tuning, the server must broadcast, together with the data, a directory that indicates the points in time on the broadcast channel at which particular records are broadcasted. The first issue to address is how MHs access the directory. Two solutions are possible:

1. MHs cache a copy of the directory. This solution has several disadvantages. First of all, when MHs change the cell where they reside, the cached directory may no longer be valid and the cache must be refreshed.
This problem, together with the fact that broadcasted data can change between successive broadcasts, with a consequent change of the directory, may generate excessive traffic between clients and the server. Moreover, if many different files are broadcasted on different channels, the storage occupancy at the clients may become too high, and storage in MHs is usually a scarce resource.
2. The directory is broadcasted in the form of an index on the broadcast channel. This solution has several advantages. When the index is not used, the client, in order to filter the required data records, has to tune into the channel, on the average, for half the time it takes to broadcast the file. This is not acceptable, because the MH, in order to tune into the channel, must be in active mode, thus consuming scarce battery resources. Broadcasting the directory together with the data allows the MH to selectively tune into the channel, becoming active only when data of interest are being broadcasted.

Figure 6.5. A general organization for broadcasted data.

Because of the above reasons, broadcasting the directory together with the data is the preferred solution. It is usually assumed that only one channel exists. Multiple channels always correspond to a single channel with capacity equivalent to the combined capacity of the corresponding channels. Figure 6.5 shows a general organization for broadcasted data (including the directory). Each broadcasted version of the file, together with all the interleaved index information, is called a bcast. A bcast consists of a certain number of buckets, each representing the smallest unit that can be read by a MH (thus, a bucket is equivalent to the notion of a block for disk organizations). Pointers to specific buckets are specified as an offset from the bucket containing the pointer to the bucket to which the pointer points. The time to get the data pointed to by an offset s is given by (s − 1) × T, where T is the time to broadcast a bucket. Figure 6.6 shows the general protocol for retrieving broadcasted data:

1. The MH tunes into the channel and looks for the offset pointing to the next index bucket. During this operation, the MH must be in active mode.
A common assumption is that each bucket contains the offset to the next index bucket. Thus, this step requires only one bucket access. Let n be the determined offset.
2. The MH switches to doze mode until time (n − 1) × T. At that time, the MH tunes into the channel (thus, it is again in active mode) and, following a chain of pointers, determines the offset m, corresponding to the first bucket containing data of interest (with respect to the considered key value).

3. The MH switches to doze mode until time (m − 1) × T. At that time, the MH tunes into the channel (thus, it is again in active mode) and retrieves the data of interest.

Figure 6.6. The general protocol for retrieving broadcasted data.

In general, no new indexing structures are required to implement the previous protocol. Rather, existing data structures can be extended to efficiently support the new data organization. The main issues are therefore how to define efficient data organizations, that is, how data and index buckets must be interleaved, and which parameters to use in order to compare different data organizations. The considered parameters are the following:

• Access time: It is the average duration from the instant in which a client wants to access records with a specific key value to the instant when all the required records have been downloaded by the client. The access time is based on the following two parameters:

  Probe time: The duration from the instant in which a client wants to access records with a specific key value to the instant when the nearest index information related to the relevant data is obtained by the client.

  Bcast wait: The duration from the point when the index information related to the relevant data is encountered to the point when all the required records have been downloaded.
Note that if one parameter is reduced, the other increases.

• Tuning time: It is the time spent by a client listening to the channel. It thus measures the time during which the client is in active mode and therefore determines the power consumed by the client to retrieve the relevant data.

The use of a directory reduces the tuning time, increasing at the same time the access time. It is therefore important to determine a good bucket interleaving in order to obtain a good trade-off between access time (thus reducing the time the client has to wait for relevant data) and tuning time (thus reducing battery consumption). With respect to disk organizations, the tuning time corresponds to the access time, in terms of block accesses. However, the tuning time is fixed for each bucket, whereas the disk access time depends on the position of the head. There is no disk parameter corresponding to the access time. Finally, we recall that other indexing techniques, based on hash functions, have also been proposed [Imielinski et al., 1994b]. However, in the remainder of this chapter we do not consider such techniques.

6.2.2 Specific solutions to indexing broadcasted data

With respect to the general data organization proposed in Subsection 6.2.1, several specific indexing approaches have been proposed. In the following, we survey some of these approaches [Imielinski et al., 1994a, Imielinski et al., 1994b]. With respect to how parameters are chosen, index organizations can be classified into configurable indexes and non-configurable indexes. In the latter case, parameter values are fixed. In the former case, the organizations are parameterized: by changing the parameter values, the trade-off between the costs changes. This makes it possible to use the same organization to satisfy different user requirements. Index organizations can also be classified into clustered and non-clustered organizations.
In the first case, all records with the same value for the key attribute are stored consecutively in the file. Non-clustered organizations are often obtained from clustered organizations, by decomposing the file into clustered subcomponents. For this reason, in the following, we do not consider organizations for non-clustered files.

Non-configurable indexing. Non-configurable index organizations can be classified according to their behavior with respect to access and tuning time. An optimal strategy with respect to the access time can be obtained simply by not broadcasting the directory. On the other hand, an optimal strategy
with respect to the tuning time is obtained by broadcasting the complete index at the beginning of the bcast. Since in practice both access and tuning time are of interest, the above algorithms have only theoretical significance. Several intermediate solutions have therefore been devised. The (1,m) indexing [Imielinski et al., 1994a] is an index allocation method in which the complete index is broadcasted m times during a bcast (see Figure 6.7). All buckets have an offset to the beginning of the next index segment. The first bucket of each index segment has a tuple containing, in the first field, the attribute value of the record that was broadcasted last and, in the second field, an offset pointing to the beginning of the next bcast.

Figure 6.7. Bcast organization in the (1,m) indexing method.

The main problem of the (1,m) index organization is related to the replication of the index buckets. Distributed indexing [Imielinski et al., 1994a] is a technique in which the index is partially replicated (see Figure 6.8). Indeed, there is no need to replicate the complete index between successive data blocks. Rather, it is sufficient to make available only the portion of the index related to the data buckets which follow it. Thus, the distributed index, with respect to the (1,m) index, interleaves data buckets with the relevant index buckets only. Several distributed indices can be defined by changing the degree of replication [Imielinski et al., 1994a].

Figure 6.8. Bcast organization in the distributed indexing method.

The distributed index guarantees performance comparable to that of the optimal algorithms, with respect to both the access time and the tuning time.
Figure 6.9. Bcast organization in the flexible indexing method.

The (1,m) index has a good tuning time. However, due to the index replication, the access time is high.

Configurable indexing. Configurable index organizations are parameterized in such a way that, depending on the values of the parameters, the ratio between the access and tuning time can be modified. The first configurable index that has been proposed is called flexible indexing [Imielinski et al., 1994b]. In such an organization, data records are assumed to be sorted in ascending (or descending) order and the data file is divided into p data segments. It is assumed that each bucket contains the offset to the beginning of the next data segment. Depending on the chosen value for p, the trade-off between access time and tuning time changes. The first bucket of each data segment contains a control part, consisting of the control index, as well as some data records (see Figure 6.9). The control index is a binary index which helps locate the data buckets containing records with a given key value. Each index entry is a pair, consisting of a key value and an offset to a data bucket.

The control index is divided into two parts, the binary control index and the local index. The binary control index supports searches for keys preceding the ones stored in the current data segment and in the following ones. It contains ⌈log2 i⌉ tuples, where i is the number of data segments following the one under consideration. The first tuple of the binary control index consists of
the key of the first data record in the current data bucket and an offset to the beginning of the next bcast. The k-th tuple (k ≥ 2) consists of the key of the first data record of the (⌊i/2^(k-1)⌋+1)-th data segment, followed by the offset to the first data bucket of that data segment.

The local index supports searches inside the data segment in which it is contained. It consists of m tuples, where m is a parameter which depends on several factors, including the number of tuples a bucket can hold. The local index partitions the data segment into m+1 subsegments. Each tuple contains the key of the first data record of a subsegment and the offset to the first data bucket of that subsegment.

The access protocol is the following:

1. First, the offset of the next data segment is retrieved and the MH switches to doze mode.

2. The MH tunes in again at the beginning of the designated next data segment and performs the following steps:

• If the search key k is lower than the value contained in the first field of the first tuple of the binary control index, the MH switches to doze mode, waiting for the offset specified by the tuple, and again executes step (2).

• If the previous condition is not satisfied, the MH scans the other tuples of the binary control index, from top to bottom, until it reaches a tuple whose key value is lower than k. If such a tuple is reached, the MH switches to doze mode, waiting for the offset specified by the tuple, and again executes step (2).

• If the previous condition is not satisfied, the MH scans the local index to determine whether records with key value k are contained in the current data segment. If this search succeeds, the offset is used to determine the bucket in the current data subsegment from which the retrieval of the data records starts. The retrieval terminates when the last bucket of the searched subsegment is reached.
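The halving behavior of the binary control index can be sketched as follows (a simplified, hypothetical helper: it assumes the first key of every data segment is known and ignores the bucket layout; `next_hop` returning the current segment means the local index takes over):

```python
import bisect

def next_hop(key, seg_first_keys, current):
    """Decide which data segment the MH dozes to next.  seg_first_keys
    holds the first key of each of the p segments in broadcast order
    (ascending); returning `current` means: search the local index."""
    target = bisect.bisect_right(seg_first_keys, key) - 1
    if target < 0:
        target = 0              # key precedes everything: retry at the
                                # first segment of the next bcast
    if target == current:
        return current
    # only ceil(log2 i) forward pointers exist, so one hop at most
    # halves the number of remaining segments
    p = len(seg_first_keys)
    dist = (target - current) % p
    hop = 1
    while hop * 2 <= dist:
        hop *= 2
    return (current + hop) % p

# reaching segment 6 (key 65) from segment 0 takes at most log2(8) hops:
keys = [0, 10, 20, 30, 40, 50, 60, 70]
pos, hops = 0, 0
while (nxt := next_hop(65, keys, pos)) != pos:
    pos, hops = nxt, hops + 1
```

Each doze-and-tune iteration of step (2) corresponds to one call of `next_hop`, so the number of probes grows logarithmically in the number of segments skipped.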
6.3 Indexing techniques for data warehousing systems

Recent years have witnessed an increasing interest in database systems able to support efficient on-line analytical processing (OLAP). OLAP is a crucial element of decision support systems, in that essential decisions are often taken on the basis of information extracted from very large amounts of data. In most cases, such data are stored in different, possibly heterogeneous, databases. Examples of typical queries are [Chauduri and Dayal, 1996]:
• What are the sales volumes by region and product category for the last year?

• How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?

Because the requirements of OLAP applications are quite different from those of traditional, transaction-oriented applications, specialized systems, known as data warehousing systems, have been developed to effectively support these applications. A data warehouse is a large, special-purpose database containing data integrated from a number of independent sources and supporting users in analyzing the data for patterns and anomalies [O'Neil and Quass, 1997]. With respect to traditional database systems, historical data, and not only current data values, must be stored in a data warehouse. Moreover, data are updated off-line and therefore no transactional issues are relevant here. By contrast, typical OLAP queries are rather complex, often involving several joins and aggregation operations. OLAP queries are in most cases "ad-hoc" queries, as opposed to the repetitive transactions typical of traditional applications. It is therefore important to develop sophisticated, complex indexing techniques to provide adequate performance, also exploiting the fact that the update cost of indexing structures is not a crucial problem.

A possible approach to efficiently processing OLAP queries is to use materialization techniques to precompute queries. This approach has the main inconvenience that precomputing all possible queries along all possible dimensions is not feasible, especially if there is a very large number of dynamically varying selection predicates. Therefore, even though more frequent queries may be precalculated, techniques are required to efficiently execute non-precalculated queries.
In the remainder of this section, we first briefly review logical data organizations in data warehousing systems and exemplify typical OLAP queries. We then discuss a number of techniques supporting efficient query execution in data warehousing systems. Some of those techniques, namely the join index and the domain index, were initially developed for traditional DBMSs. They have, however, recently found a relevant application scope in data warehousing systems. Other techniques, namely bitmap and projection indexes, have been specifically developed for data warehousing systems. Some of them have been incorporated in commercial systems [Edelstein, 1995, French, 1995]. Another relevant technique, which we do not discuss here, is the bit-sliced index, whose aim is the efficient computation of aggregate functions. We refer the reader to [O'Neil and Quass, 1997] for a description of that technique.
6.3.1 Logical data organization

In a data warehouse, data are often organized according to a star schema approach. Under this approach, for each group of related data there exists a central fact table, also called the detail table, and several dimension tables. The fact table is usually very large, whereas each dimension table is usually smaller. Every tuple (fact) in the fact table references a tuple in each of the dimension tables, and may have additional attributes. References from the fact table to the dimension tables are modeled through the usual mechanism of external keys (foreign keys). Therefore, each tuple in the fact table is related to one tuple from each of the dimension tables. Vice versa, each tuple from a dimension table may be related to more than one tuple in the fact table. Dimension tables may, in turn, be organized into several levels. A data warehouse may contain additional summary tables containing pre-computed aggregate information.

As an example, consider a (classical) example of data concerning product sales [O'Neil and Quass, 1997]. Such data are organized around a central fact table, called Sales, and the following dimension tables: Time, containing information about the dates of the sales; Product, containing information on the products sold; and finally, Customer, containing information about the customers involved in the sales. The schema is graphically represented in Figure 6.10. Alternative schema organization approaches exist, including the snowflake schema and the fact constellation schema [Chauduri and Dayal, 1996]. The following discussion is, however, quite independent of the specific schema approach adopted.

Many typical OLAP queries are based on placing restrictions on the dimension tables that result in restrictions on the tuples of the fact table. As an example, consider the query asking for all sales of products with price higher than $50,000, from customers residing in California, during July 1996.
This type of query is often referred to as a star-join query because it involves the join of a central fact table with several dimension tables. Another important characteristic of OLAP queries is that aggregates must often be computed on the results of a star-join query, and aggregate functions may also be involved in selecting relevant groups of tuples. An example of a query including aggregate calculation is the query asking for the total dollar sales that were made for a brand of products during the past 4 weeks to customers residing in New England [O'Neil and Quass, 1997].

6.3.2 Join index and domain index

The join index technique [Valduriez, 1987] aims at optimizing relational joins by precalculating them. This technique is optimal when the update frequency
is low. Because in OLAP applications joins are very frequent and the update frequency is low, the join index technique can be profitably used here.

Figure 6.10. An example of a star-schema database with a central fact table (SALES) and several dimension tables.

There are several variations of the join index. The basic one is the binary join index, which is formally defined as follows: given two tables R and S, and attributes A and B, respectively from R and S, a binary equijoin index is

BJI = {(ri, sk) | ri.A = sk.B}

where ri (sk) denotes the row identifier (RID) of a tuple of R (S), and ri.A (sk.B) denotes the value of attribute A (B) of the tuple whose RID is ri (sk). Note that comparison operators different from equality can be used in a join index. However, because most joins in OLAP queries are equijoins on external keys, we restrict our discussion to the binary join index. Moreover, in some variants of the join index technique, the primary key values of tuples in one table can be used instead of the RIDs of these tuples. A BJI can be implemented as a binary relation, and two copies may be kept, one clustered on RIDs of R and the other clustered on RIDs of S. A BJI may also include the actual values of the join columns, thus resulting in a set of triples {(ri.A, ri, sk) | ri.A = sk.B}. This alternative is useful when, given a value of the join column, the tuples from R and from S that join with that value must be determined.
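A BJI and its use can be sketched as follows (hypothetical in-memory layout: tables are dicts from RIDs to attribute dicts, standing in for the two disk-clustered copies a real system would keep):

```python
def binary_join_index(R, S, a, b):
    """Precompute BJI = sorted {(ri, sk) | R[ri][a] == S[sk][b]}."""
    by_value = {}
    for sk, srow in S.items():
        by_value.setdefault(srow[b], []).append(sk)
    return sorted((ri, sk)
                  for ri, rrow in R.items()
                  for sk in by_value.get(rrow[a], []))

def join_via_index(R, S, bji):
    """With the BJI in place, the equijoin degenerates to RID lookups."""
    return [(R[ri], S[sk]) for ri, sk in bji]
```

At query time no join computation is performed: the precomputed RID pairs are simply dereferenced, which is why the technique pays off when joins are frequent and updates are rare.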
Join indexes are particularly suited to relating a tuple from a given dimension table to all the tuples in the fact table. For example, suppose that a join index is allocated on relations Sales and Customer for the join predicate Customer.customer_id = Sales.customer_id. Such a join index would list, for each tuple of relation Customer (that is, for each customer), the RIDs of the tuples of Sales verifying the join predicate (that is, the sales of the customer). Join indexes may also be extended to support precomputed joins along several dimensions [Chauduri and Dayal, 1996].

Another relevant generalization of the join index notion is represented by the domain index. A domain index is defined on a domain (for example, the zip code) and it may index tuples from several tables. It associates with a value of the domain the RIDs of the tuples, from all the indexed tables, having this value in the indexed column. Therefore, a domain index may support equality joins among any number of tables in the set of indexed tables.

6.3.3 Bitmap index

In a traditional index, each key value is associated with the list of RIDs of the tuples having this value for the indexed column. RID lists can be quite long. Moreover, when using multiple indexes for the same table, intersection, union or complement operations must be performed on such lists. Therefore, alternative, more efficient implementations of RID lists are relevant. The notion of a bitmap index has been proposed as an efficient implementation of RID lists. Basically, the idea is to represent the list of RIDs associated with a key value through a vector of bits. Such a vector, usually referred to as a bitmap, has a number of elements equal to the number of tuples in the indexed table. Each tuple in the indexed table is assigned a distinct, unique bit position in the bitmap; this position is called the ordinal number of the tuple in the relation.
Different tuples have different bit positions, that is, different ordinal numbers. The ith element of the bitmap associated with a key value is equal to 1 if the tuple whose ordinal number is i has this value for the indexed column; it is equal to 0 otherwise. Figure 6.11 presents an example of a bitmap index entry for an index allocated on the column package_type of relation Product. Because the Product relation has 150 tuples, the bitmap consists of 150 bits. Consider the entry related to key value A: the bitmap contains 1 in position 1 to denote that the tuple whose ordinal number is 001 has this value for the indexed column. By contrast, the bitmap contains 0 in position 2 to denote that the tuple whose ordinal number is 002 does not have this value for the indexed column.

The bitmap representation is very efficient when the number of key values in the indexed column is low (as an example, consider a column sex of a table
Person having only two values: Female and Male) [O'Neil and Quass, 1997].

Figure 6.11. An example of a bitmap index entry.

In such a case, the number of 0's in each bitmap is not high. By contrast, when the number of values in the indexed column is very high, the number of 1's in each bitmap is quite low, thus resulting in sparsely populated bitmaps. Compression techniques must then be used. The main advantage of bitmaps is that they yield a significant improvement in processing time, because operations such as intersection, union and complement of RID lists can be performed very efficiently using bit arithmetic. Operations required to compute aggregate functions, typically counting the number of RIDs in a list, are also performed very efficiently on bitmaps. Another important advantage of bitmaps is that they are suitable for parallel implementation [O'Neil and Quass, 1997].

Note that the bitmap representation can be combined with the join index technique, thus resulting in a bitmap join index [O'Neil and Graefe, 1995]. An entry in a bitmap join index, allocated on a fact table and a dimension table, will associate the RID of a tuple t from the dimension table with the bitmap of
the tuples in the fact table that join with t. Figure 6.12 presents an example of a bitmap join index.

Figure 6.12. An example of a bitmap join index entry.

6.3.4 Projection index

The projection index is an access structure whose aim is to reduce the cost of projections. The basic idea of this technique is as follows. Consider a column C of a table T. A projection index on C consists of a vector having a number of elements equal to the cardinality of T. The ith element of the vector contains the value of C for the ith tuple of T. This technique is thus based, as is the bitmap representation, on assigning ordinal numbers to tuples in tables. Determining the value of column C for a tuple, given the ordinal number of
this tuple, is very efficient. It only requires accessing the ith entry of the vector. When the values have a fixed length, the secondary storage page containing the relevant vector entry is determined by a simple offset calculation. Such a calculation is a function of the number of vector entries that can be stored per page and the ordinal number of the tuple. When the values have varying lengths, alternative approaches are possible. A maximum length can be fixed for the values. Alternatively, a B-tree can be used, having as key values the ordinal numbers of tuples and associating with each ordinal number the corresponding value of column C. Figure 6.13 presents an example of a projection index.

Figure 6.13. An example of a projection index.

Projection indexes are very useful when very few columns of the fact table must be returned by the query and the tuples of the fact table are very large or not well clustered. For typical OLAP queries, projection indexes are best used in combination with bitmap join indexes. Recall that a typical query restricts the tuples in the fact table through selections on the dimension tables. The ordinal numbers of the fact tuples satisfying the restrictions on the dimension tables are retrieved from the bitmap join indexes. By using these ordinal numbers, projection indexes can then be accessed to perform the actual projection. Note that the actual tuples of the fact table need not be accessed at all.
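The combination just described can be sketched end-to-end (hypothetical table and function names; a Python integer stands in for each bit vector):

```python
def bitmap_index(rows, column):
    """key value -> bitmap; bit i is set iff the tuple whose ordinal
    number is i has that value in the indexed column."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

def projection_index(rows, column):
    """ordinal number -> value of `column` (a plain vector)."""
    return [row[column] for row in rows]

sales = [                         # fact table, in ordinal order
    {"product_id": 120, "unit_sales": 50},
    {"product_id": 122, "unit_sales": 20},
    {"product_id": 120, "unit_sales": 30},
    {"product_id": 130, "unit_sales": 70},
]
by_product = bitmap_index(sales, "product_id")
units = projection_index(sales, "unit_sales")

# select tuples for product 120 OR 122 with bit arithmetic only:
hits = by_product[120] | by_product[122]
ordinals = [i for i in range(len(sales)) if hits >> i & 1]
selected = [units[i] for i in ordinals]   # projection, no fact access
```

The union of the two RID lists is a single bitwise OR, and the projection step reads only the compact vector of `unit_sales` values, so the wide fact tuples themselves are never touched.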
6.4 Indexing techniques for the Web

In the past five years, the World Wide Web has completely reshaped the world of communication, computing and information exchange. By introducing graphical user interfaces and an intuitively simple concept of navigation,
the Web facilitated access to the Internet, which for about ten years had been restricted to a few universities and research laboratories. The appearance of advanced navigation tools like Netscape and Microsoft Explorer made it easy for everyone on the Internet to roam, browse and contribute to the Web information space.

With the rapid explosion of the amount of data available through the Internet, locating and retrieving relevant information becomes more difficult. To facilitate the retrieval of information, many Internet providers (for example, stock markets, private companies, universities) offer users the possibility of using so-called search engines, which facilitate the search process. Search engines offer a simple interface for query formulation and refinement, and a wide range of search options and result reporting.

Moreover, with the growth of data on the Web, a number of special services have appeared on the Internet whose major goal is searching through many different information sources. Even the raw information they return to users becomes the starting point for the retrieval of relevant information (for example, e-mail addresses, phone numbers, Frequently Asked Questions files). Popular general-purpose searching tools, such as Altavista (http://www.altavista.com/), Webcrawler (http://www.webcrawler.com), InfoSeek (http://www.infoseek.com/) and Excite (http://www.excite.com/), have become indispensable in the toolkit of everybody working with Internet information sources. Internet technology poses some specific requirements on these tools, both in terms of time and space. Some indexing techniques used in standard text databases were adapted to meet those requirements.
Also, several new approaches were developed to overcome some limitations of the standard techniques. In the remainder of this section we present a short overview and classification of the indexing methods used in some Internet information systems, such as WAIS, Gopher and Archie, which became popular in the late 80s and early 90s. Then we discuss some problems related to search engines on the Web. We conclude the section with a brief overview of the main ideas underlying Internet spiders, which combine indexing and navigation techniques on the Web.

6.4.1 WAIS, Gopher, Archie, Whois++

The importance of searching the information available through the Internet was realized by the Internet community from the very first years. Searching and retrieval tools grew in both quantity and quality together with the growth of the Internet itself. Such popular tools as Archie, Gopher, Whois and WAIS [Bowman et al., 1994, Cheong, 1996] represented a good starting point for a new generation of Internet searching tools. Archie is a tool which searches for relevant information in a distributed collection of FTP sites.2 Gopher is a distributed information system which makes available hierarchical campus-
wide data collections and provides a simple text search interface. Whois (and its advanced version Whois++) is a popular tool to query Internet sources about people and other entities (for example, domains, networks, and hosts). WAIS (Wide Area Information Server) is a distributed service with a simple natural-language interface for looking up information in Internet databases.

The indexing techniques used in those tools are quite different. In particular, the various tools can be classified into three groups [Bowman et al., 1994] depending on the amount of information which is included in the indexes.

The first group includes tools which have very space-efficient indexes, but only represent the names of the files or menus they index. For example, Archie and Veronica index the file and menu names of FTP and Gopher servers. Because these indexes are very compact, a single index is able to support advanced forms of search. Yet, the range of queries that can be supported by these systems is limited to file names only, and content-based searches are possible only when the names happen to reflect some of the contents.

The second group includes systems providing full-text indexing of data located at individual sites. For example, a WAIS index records every keyword in a set of documents located at a single site. Similar indexes are available for individual Gopher and WWW servers.

The third group includes systems adopting solutions which are a compromise between the approaches adopted by the systems in the other two groups. Systems in the third group represent some of the contents of the objects they index, based on selection procedures for including important keywords or excluding less important keywords. For example, Whois++ indexes templates that are manually constructed by site administrators wishing to describe the resources at their sites.
6.4.2 Search engines

The two main types of search against text files are based on sequential searching and inverted indexes. Sequential search works well only when the search is limited to a small area. Most pattern-based search tools, like Unix's grep, use sequential search. Inverted indexes (see Chapter 5 for an extensive presentation) are a common tool in information retrieval systems [Frakes and Baeza-Yates, 1992]. An inverted index stores in a table all word occurrences in the set of indexed documents and indexes the table using a hash method or a B-tree structure. Inverted indexes are very efficient with respect to query evaluation but have a storage occupancy which, in the worst case, may equal the size of the original text. To reduce the size of the table storing the word occurrences, advanced inverted indexes use the trie indexing method [Mehlhorn and Tsakalidis, 1990], which stores together words with common
initial characters (like "call" and "capture"). Moreover, the use of various compression methods allows the index size to be reduced to 10%-30% of the text size (see Chapter 5).

Another drawback of standard inverted indexes is that their basic data structure requires the exact spelling of the words in the query. Any misspelling (for example, when typing "Bhattacharya" or "Clemençon") would result in an empty result set. To find the correct spelling, users must try different possibilities by hand, which is frustrating and time consuming. An example of a search engine which allows word misspelling is Glimpse [Manber and Wu, 1994]. Glimpse is based on the agrep search program [Wu and Manber, 1992], which is similar in use to Unix's grep. Essentially, Glimpse is a hybrid between the sequential search and inverted index techniques. It is index-based but uses sequential search (the agrep program) for approximate matching when the search area is small. To handle possible word misspellings, it allows a specified number of errors, which can be insertions, deletions or substitutions of characters in a word. It also supports wild cards, regular expressions and Boolean queries like OR and AND. In most cases, Glimpse requires a very small index, 2%-4% of the original text. However, the cost of the combination of indexing and sequential search is a longer response time. For most queries, the search in Glimpse takes 3-15 seconds. Such a response time is unacceptable for classical database applications but is quite tolerable in most personal applications, like navigation through the Web.

Intensive development of different techniques for indexing Web documents has resulted in the appearance of a number of advanced search engines. They offer a wide list of features for query formulation and provide a small index size along with fast response time.
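A minimal sketch of this hybrid (hypothetical function names; the sequential scan over the vocabulary stands in for agrep's approximate matching):

```python
def build_inverted_index(docs):
    """word -> sorted list of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def edit_distance(a, b):
    """Insertions, deletions and substitutions (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lookup(index, word, errors=0):
    """Exact lookup is a hash probe; with errors > 0 fall back to a
    Glimpse-style sequential scan of the (small) vocabulary."""
    if errors == 0:
        return index.get(word, [])
    hits = set()
    for w, ids in index.items():
        if edit_distance(w, word) <= errors:
            hits.update(ids)
    return sorted(hits)
```

The exact path costs one hash probe; the approximate path is linear in the vocabulary size, which mirrors Glimpse's trade of a tiny index for a longer response time.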
However, building metasearchers which provide unified query interfaces to multiple search engines is still a hard task. This is because most search engines are largely incompatible. They propose different query languages and use secret algorithms for ranking documents, which makes it hard to merge data from different sources. Moreover, they do not export enough information about a source's contents, which could be helpful for better query evaluation. All these problems have led to the Stanford protocol proposal for Internet retrieval and search (STARTS) [Gravano et al., 1997]. This proposal is a group effort involving 11 companies and organizations. The protocol addresses and analyzes metasearch requirements and describes the facilities that a source needs to provide in order to help a metasearcher. If implemented, STARTS can significantly streamline the implementation of metasearchers, as well as enhance the functionality they can offer.
6.4.3 Internet spiders

Users usually navigate through the Web to find information and resources by following hypertext links. As the Web continues to grow, users may need to traverse more and more links to locate what they are looking for. Indexing tools like search engines only help when searching a single site or a predefined set of sites. Therefore, a new family of programs, often called Web robots or spiders, has been developed with the aim of providing more powerful search facilities. Web spiders combine browsing and indexing [Cheong, 1996]. They traverse the Web space by following hypertext links and retrieve and index new Web documents. The most well-known Internet spiders are the WWW Worm, WebCrawler and Harvest.

The World Wide Web Worm (http://wwww.cs.colorado.com/wwww/) was the first widely used Internet spider. It navigates through Web pages and builds an index of the titles and hypertext links of over 100,000 Web documents. It provides users with a search interface. Similarly to the systems in the first group in our classification, the WWW Worm does not index the content of documents.

WebCrawler (http://www.webcrawler.com/) is a resource discovery tool which is able to speedily search for resources on the Web. It is able to build indexes on Web documents and to automatically navigate on demand. WebCrawler uses an incomplete breadth-first traversal to create an index (on both titles and data content) and relies on an automatic navigation mechanism to find the rest of the information.

The Harvest project [Bowman et al., 1995] addresses the problem of how to make effective use of Web information in the face of rapid growth in data volume, user base and data diversity. One of the Harvest goals is to coordinate retrieval of information among a number of agents.
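The traverse-and-index loop shared by these spiders can be sketched as follows (an in-memory graph stands in for HTTP fetching; names are hypothetical):

```python
from collections import deque

def crawl(pages, start, limit=100):
    """Breadth-first spider over an in-memory 'web': pages maps
    url -> (title, [linked urls]).  Builds a WWW-Worm-style index of
    titles only; a real spider would fetch each page over HTTP and
    could index full content, WebCrawler-style."""
    index, seen, frontier = {}, {start}, deque([start])
    while frontier and len(index) < limit:
        url = frontier.popleft()
        title, links = pages[url]
        index[url] = title                 # index the retrieved page
        for link in links:                 # follow hypertext links
            if link in pages and link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```

Stopping at `limit` pages gives the "incomplete breadth-first traversal" behavior: the index covers a prefix of the reachable Web, and navigation finds the rest on demand.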
Harvest provides a very efficient means of gathering and distributing index information and supports the construction of very different types of indexes customized to each particular information collection. In addition, Harvest provides caching and replication support and uses Glimpse as a search engine.

6.5 Indexing techniques for constraint databases

The main idea of constraint languages is to state a set of relations (constraints) among a set of objects in a given domain. It is the task of the constraint satisfaction system (or constraint solver) to find a solution satisfying these relations. An example of a constraint is F = 1.8C + 32, where C and F are respectively the Celsius and Fahrenheit temperatures. The constraint defines the relation existing between F and C. Constraints have been used for different purposes; for example, they have been successfully integrated with logic programming
[Jaffar and Lassez, 1987]. The constraint programming paradigm is fully declarative, since it specifies computations by specifying how these computations are constrained. Moreover, it is very attractive, as constraints often represent the communication language of several high-level applications.

Even if constraints have been used in several fields, only recently has this paradigm been used in databases. Traditionally, constraints have been used to express conditions on the semantic correctness of data. Those constraints are usually referred to as semantic integrity constraints. Integrity constraints have no computational implications. Indeed, they are not used to execute queries (even if they can be used to improve execution performance); they are only used to check the database validity.

Constraints intended in a broader sense have lately been used in database systems. Constraints can be added to relational database systems at different levels [Kanellakis et al., 1995]. At the data level, they finitely represent infinite relational tuples. Different logical theories can be used to model different information. For example, the constraint X < 2 ∧ Y > 3, where X and Y are integer variables, represents the infinite set of tuples having the X attribute lower than 2 and the Y attribute greater than 3. A quantifier-free conjunction of constraints is called a generalized tuple, and the possibly infinite set of relational tuples it represents is called the extension of the generalized tuple. A finite set of generalized tuples is called a generalized relation. Thus, a generalized relation represents a possibly infinite set of relational tuples, obtained as the union of the extensions of the generalized tuples contained in the relation. A generalized database is a set of generalized relations.
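A generalized tuple and its extension test can be sketched as follows (hypothetical representation: a conjunction is a list of (variable, operator, constant) constraints):

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">": operator.gt, ">=": operator.ge}

def satisfies(point, gen_tuple):
    """Is the relational tuple `point` (a dict of variable bindings)
    in the extension of the generalized tuple, i.e. does it satisfy
    every constraint of the conjunction?"""
    return all(OPS[op](point[var], c) for var, op, c in gen_tuple)

def in_relation(point, gen_relation):
    """A generalized relation denotes the union of the extensions of
    its generalized tuples."""
    return any(satisfies(point, t) for t in gen_relation)

# the generalized tuple X < 2 AND Y > 3 from the text:
t = [("X", "<", 2), ("Y", ">", 3)]
```

The finite list `t` stands for the infinite set of integer tuples it denotes; membership is decided by evaluating the conjunction, never by enumerating the extension.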
When constraints are used to retrieve data, they restrict the search space of the computation, increasing the expressive power of simple relational languages by allowing arithmetic computations. Constraints are a powerful mechanism for modeling spatial [Paredaens, 1995, Paredaens et al., 1994] and temporal concepts [Kabanza et al., 1990, Koubarakis, 1994], where infinite information must often be represented. Consider for example a spatial database consisting of a set of rectangles in the plane. A possible representation of this database in the relational model is to have a relation R containing a tuple of the form (n, a, b, c, d) for each rectangle, where n is the name of the rectangle with corners (a, b), (a, d), (c, b) and (c, d). In the generalized relational model, rectangles can be represented by generalized tuples of the form (Z = n) ∧ (a ≤ X ≤ c) ∧ (b ≤ Y ≤ d), where X and Y are real variables. The latter representation is more suitable for a larger class of operations. Figure 6.14 shows the rectangles representing the extensions of the generalized tuples contained in a generalized relation r1 (white) and in a generalized relation r2 (shaded). r1 contains the following generalized tuples:
Figure 6.14. Relation r1 (white) and r2 (shaded).

r1,1 : 1 ≤ X ≤ 4 ∧ 1 ≤ Y ≤ 2
r1,2 : 2 ≤ X ≤ 7 ∧ 2 ≤ Y ≤ 3
r1,3 : 3 ≤ X ≤ 6 ∧ −1 ≤ Y ≤ 1.5.

r2 contains the following tuples:

r2,1 : −3 ≤ X ≤ −1 ∧ 1 ≤ Y ≤ 3
r2,2 : 5 ≤ X ≤ 6 ∧ −3 ≤ Y ≤ 0.

Usually, spatial data are represented using the linear constraint theory. Linear constraints have the form p(X1, ..., Xn) θ 0, where p is a linear polynomial with real coefficients in the variables X1, ..., Xn and θ ∈ {=, ≠, ≤, <, ≥, >}. This class of constraints is of particular interest. Indeed, a wide range of applications use linear polynomials. Moreover, linear polynomials have been investigated in various fields (linear programming, computational geometry) and therefore several techniques have been developed to deal with them [Lassez, 1990]. From a temporal perspective, constraints are very useful for representing situations that repeat infinitely in time; for example, a train leaving each day at the same time. In such cases, dense-order constraints are often used. Dense-order constraints are all the formulas of the form X θ Y or X θ c, where X, Y are variables, c is a constant and θ ∈ {=, ≠, ≤, <, ≥, >}. The domain D is a countably infinite set (for example, the rational numbers) with a binary relation which is a dense linear order.

It has been recognized [Kanellakis et al., 1995] that the integration of constraints in traditional databases must not compromise the efficiency of the system. In particular, constraint query languages should preserve all the good features of relational languages. For example, they should be closed and bottom-up evaluable. Constraint databases should also preserve the efficiency of relational databases. Thus, data structures for querying and updating constraint databases must be developed, with time and space complexities comparable to those of data structures for relational databases. The complexity of the various operations is expressed in terms of input-output (I/O) operations. An I/O operation reads or writes one block of data from or to disk. Other parameters are: B, the number of items (generalized tuples) that can be stored in one page; n, the number of pages needed to store N generalized tuples (thus, n = N/B); and t, the number of pages needed to store the T generalized tuples in the result of a query evaluation (thus, t = T/B). At least two constraint language features should be supported by index structures:

• ALL selection. It retrieves all generalized tuples contained in a specified generalized relation whose extension is contained in the extension of a given generalized tuple specified in the query (called the query generalized tuple). From a spatial point of view, such a selection corresponds to a range query.

• EXIST selection. It retrieves all generalized tuples contained in a specified generalized relation whose extension has a non-empty intersection with the extension of a query generalized tuple. Equivalently, it finds a generalized relation that represents all relational tuples, implicitly represented by the input generalized relation, that satisfy the query generalized tuple. From a spatial point of view, such a selection corresponds to an intersection query.

Consider for example the generalized tuples representing the objects presented in Figure 6.14.
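For rectangle-shaped generalized tuples and a half-plane query such as Y ≤ X − 1, the two selections can be sketched directly (the rectangle coordinates come from relation r1 above; the corner-testing logic is our illustration, not the book's index structure):

```python
# EXIST and ALL selections of boxes against the half-plane Y <= a*X + b.
# For a box [xl,xh] x [yl,yh] and an upward-sloping line (a > 0), the
# most favorable box corner is (xh, yl) and the least favorable is (xl, yh).

def exist_select(boxes, a, b):
    """Names of boxes whose extension intersects the half-plane Y <= a*X + b."""
    return [n for n, (xl, xh, yl, yh) in boxes.items() if yl <= a * xh + b]

def all_select(boxes, a, b):
    """Names of boxes whose extension is entirely inside Y <= a*X + b."""
    return [n for n, (xl, xh, yl, yh) in boxes.items() if yh <= a * xl + b]

# Relation r1 from Figure 6.14: (xl, xh, yl, yh) per tuple.
r1 = {
    "r1,1": (1, 4, 1, 2),
    "r1,2": (2, 7, 2, 3),
    "r1,3": (3, 6, -1, 1.5),
}
print(exist_select(r1, 1, -1))  # ['r1,1', 'r1,2', 'r1,3']
print(all_select(r1, 1, -1))    # ['r1,3']
```

The output matches the selections discussed in the text: every tuple of r1 intersects the half-plane, but only r1,3 is contained in it.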
The EXIST selection with respect to the query generalized tuple Y ≤ X − 1 and relation r1 returns all three generalized tuples r1,1, r1,2 and r1,3. The ALL selection with respect to the query generalized tuple Y ≤ X − 1 and relation r1 returns only the generalized tuple r1,3. As constraints support the representation of infinite information, data structures defined to index relations (such as B-trees and B+-trees [Bayer and McCreight, 1972, Comer, 1979]) cannot be used directly in constraint databases, since they rely on the assumption that the number of tuples is finite. For this reason, specific classes of constraints for which efficient indexing data structures can be provided must be determined. Due to the analogies between constraint databases and spatial databases, efficient indexing techniques developed for spatial databases can often be applied to (linear) constraint databases. Efficient data structures are usually
required to process queries in O(log_B n + t) I/O operations, use O(n) blocks of secondary storage, and perform insertions and deletions in O(log_B n) I/O operations (this is the case for B-trees and B+-trees). Note that all complexities are worst-case. For spatial problems, by contrast, data structures with optimal worst-case complexity have been proposed only for some specific problems, in general dealing with 1- or 2-dimensional spatial objects. Nevertheless, several data structures proposed for the management of spatial data behave quite well on average for different source data. Examples of such data structures are grid files [Nievergelt et al., 1984], various quad-trees [Samet, 1989], z-orders [Orenstein, 1986], hB-trees [Lomet and Salzberg, 1990a], cell-trees [Gunther, 1989], and various R-trees [Guttman, 1984, Sellis et al., 1987] (see Chapter 2). Symmetrically, in the context of constraint databases two different classes of techniques have been proposed, the first consisting of techniques with optimal worst-case complexity, and the second consisting of techniques with good average bounds. Techniques belonging to the first class apply to (linear) generalized tuples representing 1- or 2-dimensional spatial objects and often optimize only the EXIST selection. Techniques belonging to the second class can index more general generalized tuples by applying some approximation. In the following, both approaches are surveyed.

6.5.1 Generalized 1-dimensional indexing

In relational databases, the 1-dimensional searching problem on a relational attribute X is defined as follows: Find all tuples such that their X attribute satisfies the condition a1 ≤ X ≤ a2.
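The relational version of the problem is the one a B+-tree solves in O(log_B n + t) I/O operations. Its in-memory analogue can be sketched with a sorted array and binary search (our illustration; `bisect` plays the role of the tree descent, the slice the role of the leaf scan):

```python
# 1-dimensional range search a1 <= X <= a2 over a sorted key set:
# locate the boundaries in O(log n), then report the t answers.
import bisect

def range_search(sorted_keys, a1, a2):
    lo = bisect.bisect_left(sorted_keys, a1)    # first key >= a1
    hi = bisect.bisect_right(sorted_keys, a2)   # first key > a2
    return sorted_keys[lo:hi]

xs = sorted([7, 3, 1, 9, 4, 6])
print(range_search(xs, 3, 7))   # [3, 4, 6, 7]
```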
The problem of 1-dimensional searching on a relational attribute X can be reformulated in constraint databases as the problem of 1-dimensional searching on the generalized relational attribute X: Find a generalized relation that represents all tuples of the input generalized relation such that their X attribute satisfies the condition a1 ≤ X ≤ a2. A first trivial, but inefficient, solution to the generalized 1-dimensional searching problem is to add the query range condition to each generalized tuple. In this case, the new generalized tuples represent all the relational tuples whose X attribute is between a1 and a2. This approach introduces a high level of redundancy in the constraint representation. Moreover, several inconsistent generalized tuples (with empty extension) can be generated. A better solution can be defined for convex theories. A theory Φ is convex if the projection on each variable X of any generalized tuple defined using Φ is a single interval b1 ≤ X ≤ b2. This is true when the extension of the generalized tuple is a convex set. The dense-order theory and the real polynomial inequality constraint theory are examples of convex theories. The solution is
based on the definition of a generalized 1-dimensional index on X as a set of intervals, where each interval is associated with a set of generalized tuples and represents the value of the search key for those tuples. Thus, each interval in the index is the projection on the attribute X of a generalized tuple. By using this index, the determination of a generalized relation representing all tuples of the input generalized relation whose X attribute satisfies a given range condition a1 ≤ X ≤ a2 can be performed by adding the condition only to those generalized tuples whose associated interval has a non-empty intersection with a1 ≤ X ≤ a2. Insertion (deletion) of a given generalized tuple is performed by computing its projection and inserting (deleting) the obtained interval into (from) the set of intervals. From the previous discussion it follows that the generalized 1-dimensional indexing problem reduces to the dynamic interval management problem on secondary storage. Dynamic interval management is a well-known problem in computational geometry, with many optimal solutions in internal memory [Chiang and Tamassia, 1992]. Secondary storage solutions for the same problem are, however, non-trivial, even in the static case. In the following, we survey some of the proposed solutions for secondary storage.

Reduction to stabbing queries. A first class of proposals is based on the reduction of the interval intersection problem to the stabbing query problem [Chiang and Tamassia, 1992]. Given a set of 1-dimensional intervals, to answer a stabbing query with respect to a point x, all intervals that contain x must be reported. The main idea of the reduction is the following [Kanellakis and Ramaswamy, 1996]. Intervals that intersect a query interval fall into four categories (see Figure 6.15).
Categories (1) and (2) can easily be located by sorting all the intervals with respect to their left endpoint and using a B+-tree to locate all intervals whose first endpoint lies in the query interval. Categories (3) and (4) can be located by finding all data intervals which contain the first endpoint of the query interval. This search is a stabbing query. By regarding an interval [x1, x2] as the point (x1, x2) in the plane, a stabbing query reduces to a special case of the 2-dimensional range searching problem. Indeed, all points (x1, x2) corresponding to intervals lie above the line X = Y. An interval [x1, x2] belongs to a stabbing query with respect to a point x if and only if the corresponding point (x1, x2) is contained in the region of the plane represented by the constraint X ≤ x ∧ Y ≥ x. Such 2-sided queries have their corner on the line X = Y; for this reason, they are called diagonal corner queries (see Figure 6.16).
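The equivalence between the two formulations can be checked mechanically (a naive sketch of ours; the actual structures discussed below organize the points so the corner query takes O(log_B n + t) I/Os rather than a linear scan):

```python
# Each data interval [x1, x2] becomes the point (x1, x2) above the line
# X = Y; a stabbing query at x is then the diagonal corner query
# X <= x AND Y >= x over those points.

def stab_naive(intervals, x):
    """Intervals containing x, by direct test."""
    return [iv for iv in intervals if iv[0] <= x <= iv[1]]

def stab_as_corner_query(points, x):
    """Same answer via the 2-dimensional reduction: X <= x and Y >= x."""
    return [(x1, x2) for (x1, x2) in points if x1 <= x and x2 >= x]

intervals = [(1, 4), (2, 7), (3, 6), (-3, -1), (5, 6)]
points = list(intervals)        # the interval-to-point mapping is the identity
assert stab_naive(intervals, 3) == stab_as_corner_query(points, 3)
print(stab_as_corner_query(points, 3))  # [(1, 4), (2, 7), (3, 6)]
```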
Figure 6.15. Categories of possible intersections of a query interval with a database of intervals.

Figure 6.16. Reduction of the interval intersection problem to a diagonal-corner searching problem with respect to x.

The first data structure proposed to solve diagonal corner queries is the meta-block tree; it does not support deletions (it is semi-dynamic) [Kanellakis and Ramaswamy, 1996]. The meta-block tree is fairly complicated; it has optimal worst-case space O(n) and optimal I/O query time O(log_B n + t). Moreover, it has O(log_B n + (log_B^2 n)/B) amortized insert I/O time. A dynamic (thus, also supporting deletions) optimal solution to the stabbing query problem [Arge and Vitter, 1996] is based on the definition of an external memory version of the internal memory interval tree. The interval tree for internal memory is a data structure that answers stabbing queries and stores and updates a set of intervals in optimal time [Chiang and Tamassia, 1992]. It consists of a binary tree over the interval endpoints. Intervals are stored in secondary structures associated with internal nodes of the binary tree. The extension of this data structure to secondary storage entails two issues. First, the fan-out of nodes must be increased. The fan-out that has been chosen is √B [Arge and Vitter, 1996]. This fan-out makes it possible to store all the needed information in internal nodes, while increasing the height of the tree only by a factor of two. If the interval endpoints belong to a fixed set E, the binary tree is replaced by a balanced tree with branching factor √B over the endpoints in E. Each leaf represents B consecutive points of E. Segments are associated with nodes, generalizing the idea of the internal memory data structure.
However, since a node now contains more endpoints, more than two secondary structures are required to store the segments associated with a node. The main problem of the previous structure is that it requires the interval endpoints to belong to a fixed set. In order to remove this assumption, the weight-balanced B-tree has been
introduced [Arge and Vitter, 1996]. The main difference between a B-tree and a weight-balanced B-tree is that in the former, for each internal node, the number of children is fixed, while in the latter only the weight, that is, the number of items stored under each node, is fixed. The weight-balanced B-tree removes the assumption on the interval endpoints while still retaining optimal worst-case bounds for stabbing queries.

Revisiting Chazelle's algorithm. The solutions described above for stabbing queries in secondary storage are fairly complex and rely on reducing the interval intersection problem to special cases of the 2-dimensional range searching problem. A different and much simpler approach to the static (thus, not supporting insertions and deletions) generalized 1-dimensional searching problem [Ramaswamy, 1997] is based on an algorithm developed by Chazelle [Chazelle, 1986] for interval intersection in main memory; it uses only B+-trees, achieving optimal time and using linear space. The proposed technique relies on the following consideration. A straightforward method to solve a stabbing query consists of identifying the set of unique endpoints of the set of input intervals. Each endpoint is associated with the set of intervals that contain it. These sets can then be indexed using a B+-tree, taking endpoints as key values. To answer a stabbing query it is sufficient to look for the endpoint nearest to the query point, on the right, and examine the intervals associated with it, reporting those intervals that contain the query point. This method answers stabbing queries in O(log_B n) I/O operations. However, it requires O(n^2) space. It has been shown [Ramaswamy, 1997] that the space complexity can be reduced to O(n) by appropriately choosing the considered endpoints. More precisely, let e1, e2, ..., e2n be the ordered list of all endpoints. A set of windows W1, ..., Wp is constructed over boundary points w1 = e1, ..., wp+1 = e2n, such that Wj = [wj, wj+1], j = 1, ..., p. Thus, the windows partition the interval between e1 and e2n into p contiguous intervals. Each window Wj is associated with the list of intervals that intersect Wj. Window-lists can be stored in a B+-tree, using their starting points as key values. A stabbing query at a point x can be answered by searching for the query point and retrieving the window-list associated with the window it falls into. Each interval contained in this list is then examined, and only the intervals containing the query point are reported. Algorithms have been proposed [Ramaswamy, 1997] to construct the windows appropriately, so that queries can be answered by the previous algorithm in O(log_B n), using only O(n) pages.
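A simplified sketch of this windowing scheme follows (our code; the window boundaries here are supplied by hand rather than chosen by the space-optimizing algorithms of [Ramaswamy, 1997], and a sorted list stands in for the B+-tree over window starting points):

```python
# Windows partition [e1, e2n]; each window stores the intervals that
# intersect it. A stabbing query locates its window by binary search,
# then filters the window-list against the query point.
import bisect

def build_windows(intervals, boundaries):
    """boundaries: sorted points w1 < ... < w_{p+1}; window Wj = [wj, w_{j+1}]."""
    lists = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        lists.append([iv for iv in intervals if iv[0] <= hi and iv[1] >= lo])
    return lists

def stab(boundaries, lists, x):
    j = bisect.bisect_right(boundaries, x) - 1          # window containing x
    j = min(max(j, 0), len(lists) - 1)
    return [iv for iv in lists[j] if iv[0] <= x <= iv[1]]  # report exact hits

intervals = [(1, 4), (2, 7), (3, 6)]
boundaries = [1, 3, 7]              # two windows: [1,3] and [3,7]
lists = build_windows(intervals, boundaries)
print(stab(boundaries, lists, 2))   # [(1, 4), (2, 7)]
```

With well-chosen windows, each window-list stays small enough that the total storage is O(n) pages while the search touches a single list.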
6.5.2 Indexing 2-dimensional linear constraints

The approaches briefly illustrated in Subsection 6.5.1 rely on the assumption that index values are represented by intervals. Thus, they can index generalized tuples using information about only one variable. Less work has been done on techniques for 2-dimensional generalized tuples with optimal worst-case complexity. One such technique [Bertino et al., 1997] deals with index values represented by generalized tuples with two variables, say X and Y, having the form C1 ∧ ... ∧ Cn, where each Ci, i = 1, ..., n, has the form Ci ≡ Y θ aiX + bi, θ ∈ {≤, ≥}. Besides the application to different types of generalized tuples, the main difference of this technique with respect to the ones presented in Subsection 6.5.1 is that it solves not only the EXIST selection but also the ALL selection. In both cases, the query generalized tuple must represent a half-plane. The main novelty of the approach is the reduction of both the EXIST and ALL selection problems, under the above assumptions, to a point location problem from computational geometry [Preparata and Shamos, 1985]. The proof of this reduction is based on the transformation of the extension of generalized tuples from a primal plane to a dual plane. In particular, each generalized tuple is transformed into a pair of non-intersecting, but possibly touching, open polygons³ in the plane, whereas a half-plane Y θ aX + b, θ ∈ {≤, ≥}, is translated into the point (a, b). This translation satisfies an interesting property: the EXIST and ALL selection problems with respect to a half-plane query Y θ aX + b reduce to the point location problem of the point (a, b) with respect to the constructed open polygons.
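The primal-plane side of this duality is easy to test directly for a convex figure: the line Y = aX + b misses the interior of a convex polygon exactly when all its vertices lie weakly on one side of the line. The sketch below (ours, with an assumed square example; it is not the dual-plane structure of [Bertino et al., 1997]) illustrates that test:

```python
# Primal-plane test behind the duality: does the line Y = a*X + b
# avoid the interior of a convex polygon? Compute the signed vertical
# distance y - (a*x + b) at each vertex; the line misses the interior
# iff all signs are weakly non-negative or weakly non-positive.

def line_misses_interior(vertices, a, b):
    signs = [y - (a * x + b) for (x, y) in vertices]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]    # a convex extension
print(line_misses_interior(square, 1, 5))    # True: Y = X + 5 passes above it
print(line_misses_interior(square, 1, -1))   # False: Y = X - 1 cuts through it
```

In the dual plane, the same answer is obtained by locating the point (a, b) with respect to the pair of open polygons built for the tuple.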
In particular, it can be shown that the point (a, b) belongs to one of the open polygons constructed for a generalized tuple t iff the line Y = aX + b does not intersect the interior of the figure representing the extension of t (see Figure 6.17). Using this property, point location algorithms in the dual plane, equivalent to the EXIST and ALL selections in the Euclidean plane, have been proposed. The same open polygons have then been used to show that an optimal dynamic solution to the ALL and EXIST selection problems exists, using simple data structures such as B+-trees, if the angular coefficient of the line associated with the half-plane query belongs to a predefined set.

6.5.3 Filtering

To facilitate the definition of indexing structures for arbitrary objects in spatial databases, a filtering approach is often used. The same approach can be used in constraint databases to index generalized tuples with complex extensions.
Figure 6.17. (a) A polygon p representing the extension of a linear generalized tuple; (b) a pair of open polygons representing p in the dual plane, together with the points representing lines q1, q2, q3, q4 in the dual plane.

Under the filtering approach, an object is approximated by some other object having a simpler shape. The approximating objects are then used as index objects. The evaluation of a query under this approach consists of two steps, filtering and refinement. In the filtering step, an index is used to retrieve only objects that are potentially relevant to a given query; for this purpose, the approximating figures are used instead of the objects themselves. During the refinement step, the set of objects retrieved by the filtering step is tested directly against the query, to determine the exact result. Here, the main topic is the definition of "good" approximating objects, ensuring a specific degree of filtering. The use of the minimum bounding box (MBB) to filter objects is common in spatial databases. In 2-dimensional space, the MBB of a given object is the smallest rectangle that encloses the object and whose edges are perpendicular to the standard coordinate axes. This definition generalizes to higher dimensions in a straightforward manner. The filtering method based on MBBs is simple and has a number of advantages over index methods working directly on objects:

• It has a low storage cost, because only a small number of intervals is maintained in addition to each object.
• There is a clear separation between the complexity of the object geometry and the complexity of the search. Index structures for (multidimensional) intervals have better worst-case performance than index techniques working on arbitrary objects. Indeed, several index structures with close-to-optimal worst-case bounds for managing (multidimensional) intervals have been proposed (see Chapter 2), whereas similar structures have not yet been defined for arbitrary objects.

The filtering approach based on MBBs, even if appealing, has some drawbacks. In particular, it may be ineffective if the set of objects returned by the filtering step is too large, which means that there are too many intersecting MBBs. Moreover, it does not scale well to large dimensions. The issue of handling objects in spaces of large dimension is less crucial for spatial databases, where we can generally rely on a dimension of 3 or less, but it is critical for constraint databases. In order to improve the selectivity of filtering, an approach based on the notion of minimum bounding polybox has been proposed [Brodsky et al., 1996]. A minimum bounding polybox for an object O is the minimum convex polyhedron that encloses O and whose facets are normal to preselected axes. These axes are not necessarily the standard coordinate axes and, furthermore, their number is not determined by the dimension of the space. Algorithms for computing optimal axes (according to specific optimality criteria with respect to storage overhead or filtering rate) in d dimensions have also been proposed [Brodsky et al., 1996].

Notes

1. We assume that buckets are numbered starting from 0.
2. FTP is the Internet standard high-level protocol for file transfer.
3. An open polygon is a finite chain of line segments with the first and last segments approaching ∞.
An open polygon is upward (downward) open if both segments approach +∞ (−∞).
References

Abel, D. J. and Smith, J. L. (1983). A data structure and algorithm based on a linear key for a rectangle retrieval problem. International Journal of Computer Vision, Graphics and Image Processing, 24(1):1-13.
Abel, D. J. and Smith, J. L. (1984). A data structure and query algorithm for a database of areal entities. Australian Computing Journal, 16(4):147-154.
Achyutuni, K. J., Omiecinski, E., and Navathe, S. (1996). Two techniques for on-line index modification in shared-nothing parallel systems. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, pages 125-136.
Ang, C. and Tan, K. (1995). The interval B-tree. Information Processing Letters, 53(2):85-89.
Arge, L. and Vitter, J. (1996). Optimal dynamic interval management in external memory. In Proc. 37th Symposium on Foundations of Computer Science, pages 560-569.
Aslandogan, Y. A., Yu, C., Liu, C., and Nair, K. R. (1995). Design, implementation and evaluation of SCORE. In Proc. 11th International Conference on Data Engineering, pages 280-287.
Bancilhon, F. and Ferran, G. (1994). ODMG-93: The object database standard. IEEE Bulletin on Data Engineering, 17(4):3-14.
Banerjee, J. and Kim, W. (1986). Supporting VLSI geometry operations in a database system. In Proc. 3rd International Conference on Data Engineering, pages 409-415.
Bartels, D. (1996). ODMG-93 - The emerging object database standard. In Proc. 12th International Conference on Data Engineering, pages 674-676.
Bayer, R. and McCreight, E. (1972). Organization and maintenance of large ordered indices. Acta Informatica, 1(3):173-189.
Bayer, R. and Schkolnick, M. (1977). Concurrency of operations on B-trees. Acta Informatica, 9:1-21.
Beck, J. (1967). Perceptual grouping produced by line figures. Perception and Psychophysics, 2:491-495.
Becker, B., Gschwind, S., Ohler, T., Seeger, B., and Widmayer, P. (1993). On optimal multiversion access structures. In Proc.
3rd International Symposium on Large Spatial Databases, pages 123-141.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985a). Empirical comparison of associative file structures. In Proc. International Conference on Foundations of Data Organization, pages 315-319.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985b). An experiment with balanced and unbalanced k-d trees for associative retrieval. In Proc.
9th International Conference on Computer Software and Applications, pages 256-262.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985c). Multikey retrieval from k-d trees and quad trees. In Proc. 1985 ACM SIGMOD International Conference on Management of Data, pages 291-301.
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-tree: An efficient and robust access method for points and rectangles. In Proc. 1990 ACM SIGMOD International Conference on Management of Data, pages 322-331.
Belkin, N. and Croft, W. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38.
Bell, T., Moffat, A., Nevill-Manning, C., Witten, I., and Zobel, J. (1993). Data compression in full-text retrieval systems. Journal of the American Society for Information Science, 44(9):508-531.
Bell, T., Moffat, A., Witten, I., and Zobel, J. (1995). The MG retrieval system: Compressing for space and speed. Communications of the ACM, 38(4):41-42.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517.
Bentley, J. L. (1979a). Decomposable searching problems. Information Processing Letters, 8(5):244-251.
Bentley, J. L. (1979b). Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, 5(4):333-340.
Bentley, J. L. and Friedman, J. H. (1979). Data structures for range searching. ACM Computing Surveys, 11(4):397-409.
Berchtold, S., Keim, D., and Kriegel, H. (1996). The X-tree: An index structure for high-dimensional data. In Proc. 22nd International Conference on Very Large Data Bases, pages 28-39.
Bertino, E. (1990). Query optimization using nested indices. In Proc. 2nd International Conference on Extending Database Technology, pages 44-59.
Bertino, E. (1991a). An indexing technique for object-oriented databases. In Proc.
7th International Conference on Data Engineering, pages 160-170.
Bertino, E. (1991b). Method precomputation in object-oriented databases. In Proc. ACM-SIGOIS and IEEE-TC-OA International Conference on Organizational Computing Systems, pages 199-212.
Bertino, E. (1994). On indexing configuration in object-oriented databases. VLDB Journal, 3(3):355-399.
Bertino, E., Catania, B., and Shidlovsky, B. (1997). Towards optimal two-dimensional indexing for constraint databases. Technical Report TR-196-97, Dipartimento di Scienze dell'Informazione, University of Milano, Italy.
Bertino, E. and Foscoli, P. (1995). Index organizations for object-oriented database systems. IEEE Transactions on Knowledge and Data Engineering, 7(2):193-209.
Bertino, E. and Guglielmina, C. (1991). Optimization of object-oriented queries using path indices. In Proc. International IEEE Workshop on Research Issues on Data Engineering: Transaction and Query Processing, pages 140-149.
Bertino, E. and Guglielmina, C. (1993). Path-index: An approach to the efficient execution of object-oriented queries. Data and Knowledge Engineering, 6(1):239-256.
Bertino, E. and Kim, W. (1989). Indexing techniques for queries on nested objects. IEEE Transactions on Knowledge and Data Engineering, 1(2):196-214.
Bertino, E. and Martino, L. (1993). Object-Oriented Database Systems - Concepts and Architectures. Addison-Wesley.
Bertino, E. and Quarati, A. (1991). An approach to support method invocations in object-oriented queries. In Proc. International IEEE Workshop on Research Issues on Data Engineering: Transaction and Query Processing, pages 163-169.
Blanken, H., Ijbema, A., Meek, P., and Akker, B. (1990). The generalized grid file: Description and performance aspects. In Proc. 6th International Conference on Data Engineering, pages 380-388.
Bookstein, A., Klein, S., and Raita, T. (1992). Model based concordance compression. In Proc. IEEE Data Compression Conference, pages 82-91.
Bowman, C., Danzig, P., Hardy, D., Manber, U., and Schwartz, M. (1995). The Harvest information discovery and access system. Computer Networks and ISDN Systems, 28(1-2):119-125.
Bowman, C., Danzig, P., Manber, U., and Schwartz, M. (1994). Scalable internet discovery: Research problems and approaches. Communications of the ACM, 37(8):98-107.
Bratley, P. and Choueka, Y. (1982). Processing truncated terms in document retrieval systems. Information Processing & Management, 18(5):257-266.
Bretl, R., Maier, D., Otis, A., Penney, D., Schuchardt, B., Stein, J., Williams, E., and Williams, M. (1989). The GemStone data management system. In Object-Oriented Concepts, Databases, and Applications, pages 283-308. Addison-Wesley.
Brinkhoff, T., Kriegel, H.-P., Schneider, R., and Seeger, B. (1994). Multi-step processing of spatial joins. In Proc. 1994 ACM SIGMOD International Conference on Management of Data, pages 197-208.
Brodsky, A., Lassez, C., Lassez, J., and Maher, M. (1996). Separability of polyhedra and a new approach to spatial storage. In Proc. 14th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, pages 54-65.
Brown, E. (1995). Fast evaluation of structured queries for information retrieval. In Proc. 18th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 30-38.
Buckley, C. and Lewit, A. (1985). Optimization of inverted vector searches. In Proc. 8th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 97-110.
Burkowski, F. (1992). An algebra for hierarchically organized text-dominated databases. Information Processing & Management, 28(3):333-348.
Callan, J. (1994). Passage-level evidence in document retrieval. In Proc. 17th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 302-309.
Cattell, R. (1993). The Object Database Standard: ODMG-93 Release 1.2. Morgan Kaufmann Publishers.
Cesarini, F. and Soda, G. (1982). Binary trees paging. Information Systems, 7(4):337-344.
Chan, C., Goh, C., and Ooi, B. C. (1997). Indexing OODB instances based on access proximity. In Proc. 13th International Conference on Data Engineering, pages 14-21.
Chan, C. Y., Ooi, B. C., and Lu, H. (1992). Extensible buffer management of indexes. In Proc. 18th International Conference on Very Large Data Bases, pages 444-454.
Chang, J. M. and Fu, K. S. (1979). Extended k-d tree database organization: A dynamic multi-attribute clustering method. In Proc. 3rd International Conference on Computer Software and Applications, pages 39-43.
Chang, S. K. and Fu, K. S., editors (1980). Pictorial Information Systems. Springer-Verlag.
Chang, S. K. and Hsu, A. (1992). Image information systems: Where do we go from here? IEEE Transactions on Knowledge and Data Engineering, 4(5):431-442.
Chang, S. K., Jungert, E., and Li, Y. (1989). Representation and retrieval of symbolic pictures using generalized 2D strings. In Proc.
Visual Communications and Image Processing Conference, pages 1360-1372.
Chang, S. K., Shi, Q. Y., and Yan, C. W. (1987). Iconic indexing by 2-D strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(3):413-428.
Chang, S. K., Yan, C. W., Dimitroff, D. C., and Arndt, T. (1988). An intelligent image database system. IEEE Transactions on Software Engineering, 15(5):681-688.
Chaudhuri, S. and Dayal, U. (1996). Decision support, data warehousing, and OLAP (tutorial notes). In Proc. 22nd International Conference on Very Large Data Bases.
Chazelle, B. (1986). Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15(3):703-724.
Cheong, C. (1996). Internet agents. New Riders - Macmillan Publishing.
Chiang, Y. and Tamassia, R. (1992). Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412-1434.
Chiu, D. K. Y. and Kolodziejczak, T. (1986). Synthesizing knowledge: A cluster analysis approach using event-covering. IEEE Transactions on Systems, Man and Cybernetics, 16(2):462-467.
Choenni, S., Bertino, E., Blanken, H., and Chang, T. (1994). On the selection of optimal index configuration in OO databases. In Proc. 10th International Conference on Data Engineering, pages 526-537.
Choueka, Y., Fraenkel, A., and Klein, S. (1988). Compression of concordances in full-text retrieval systems. In Proc. 11th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 597-612.
Choy, D. and Mohan, C. (1996). Locking protocols for two-tier indexing of partitioned data. In Proc. International Workshop on Advanced Transaction Models and Architectures, pages 198-215.
Chua, T. S., Lim, S. K., and Pung, H. K. (1994). Content-based retrieval of segmented images. In Proc. 2nd ACM Multimedia Conference, pages 211-218.
Chua, T. S., Tan, K. L., and Ooi, B. C. (1997). Fast signature-based color-spatial image retrieval. In Proc. 4th International Conference on Multimedia Computing and Systems.
Chua, T. S., Teo, K. C., Ooi, B. C., and Tan, K. L. (1996). Using domain knowledge in querying image database. In Proc. 3rd Multimedia Modeling Conference, pages 339-354.
Clarke, C., Cormack, G., and Burkowski, F. (1995). An algebra for structured text search and a framework for its implementation. Computer Journal, 38(1):43-56.
Cluet, S., Delobel, C., Lecluse, C., and Richard, P. (1989). Reloop, an algebra based query language for an object-oriented database system. In Proc. 1st International Conference on Deductive and Object Oriented Databases, pages 313-332.
Comer, D. (1979). The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137.
Costagliola, G., Tucci, M., and Chang, S. K. (1992). Representing and retrieving symbolic pictures by spatial relations. In Visual Database Systems II, pages 49-59.
Dao, T., Sacks-Davis, R., and Thom, J. (1996). Indexing structured text for queries on containment relationships. In Proc. 7th Australasian Database Conference, pages 82-91.
Deux, O. (1990). The story of O2. IEEE Transactions on Knowledge and Data Engineering, 2(1):91-108.
Eastman, C. M. and Zemankova, M. (1982). Partially specified nearest neighbor using kd trees. Information Processing Letters, 15(2):53-56.
Easton, M. (1986). Key-sequence data sets in indelible storage. IBM Journal of Research and Development, 30(12).
Edelsbrunner, H. (1983). A new approach to rectangular intersection. International Journal of Computational Mathematics, 13:209-219.
Edelstein, H. (1995). Faster data warehouses. In Information Week, pages 77-88.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194-203.
Elmasri, R., Wuu, G. T., and Kouramajian, V. (1990). The Time Index: An access structure for temporal data. In Proc. 16th International Conference on Very Large Data Bases, pages 1-12.
Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979). Extendible hashing - A fast access method for dynamic files. ACM Transactions on Database Systems, 4(3):315-344.
Faloutsos, C. (1988). Gray-codes for partial match and range queries. IEEE Transactions on Software Engineering, 14(10):1381-1393.
Faloutsos, C., Equitz, W., Flickner, M., Niblack, W., Petkovic, D., and Barber, R. (1994). Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3):231-262.
Faloutsos, C. and Jagadish, H. (1992). On B-tree indices for skewed distributions. In Proc. 18th International Conference on Very Large Databases, pages 363-374.
Faloutsos, C. and Roseman, S. (1989). Fractals for secondary key retrieval. In Proc. 1989 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 247-252.
Finkel, R. A.
and Bentley, J. L. (1974). Quad trees: A data structure for retrieval on composite keys. Acta Informatica, 4:1-9.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9):23-32.
Fox, E., editor (1995). Communications of the ACM, volume 38(4). Special issue on Digital Libraries.
Fox, E. and Shaw, J. (1993). Combination of multiple searches. In Proc. Text Retrieval Conference (TREC), pages 35-44. National Institute of Standards and Technology Special Publication 500-215.
Frakes, W. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and Algorithms. Prentice-Hall.
Francos, J. M., Meiri, A. Z., and Porat, B. (1993). A unified texture model based on a 2-D Wold-like decomposition. IEEE Transactions on Signal Processing, pages 2665-2678.
Freeston, M. (1987). The BANG file: A new kind of grid file. In Proc. 1987 ACM SIGMOD International Conference on Management of Data, pages 260-269.
Freeston, M. (1995). A general solution of the n-dimensional B-tree problem. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 80-91.
French, C. (1995). One size fits all. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 449-450.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209-226.
Gallager, R. and Van Voorhis, D. (1975). Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, IT-21(2):228-230.
Gargantini, I. (1982). An effective way to represent quadtrees. Communications of the ACM, 25(12):905-910.
Goh, C. H., Lu, H., Ooi, B. C., and Tan, K. L. (1996). Indexing temporal data using B+-tree. Data and Knowledge Engineering, 18:147-165.
Goldfarb, C. (1990). The SGML Handbook. Oxford University Press.
Golomb, S. (1966). Run-length encodings. IEEE Transactions on Information Theory, IT-12(3):399-401.
Gong, Y., Chua, H. C., and Guo, X. (1995). Image indexing and retrieval based on color histograms. In Proc. 2nd Multimedia Modeling Conference, pages 115-126.
Gonnet, G. and Baeza-Yates, R. (1991). Handbook of Data Structures and Algorithms. Addison-Wesley, second edition.
Gonnet, G.
and Tompa, F. (1987). Mind your grammar: A new approach to modeling text. In Proc. 13th International Conference on Very Large Databases, pages 339-346.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170.
Gravano, L., Chang, C., Garcia-Molina, H., and Paepcke, A. (1997). STARTS: Stanford proposal for internet meta-searching. In Proc. 1997 ACM SIGMOD International Conference on Management of Data.
Greene, D. (1989). An implementation and performance analysis of spatial data access methods. In Proc. 5th International Conference on Data Engineering, pages 606-615.
Gudivada, V. and Raghavan, R. (1995). Design and evaluation of algorithms for image retrieval by spatial similarity. ACM Transactions on Information Systems, 13(1):115-144.
Gunadhi, H. and Segev, A. (1993). Efficient indexing methods for temporal relation. IEEE Transactions on Knowledge and Data Engineering, 5(3):496-509.
Gunther, O. (1988). Efficient Structures for Geometric Data Management. Springer-Verlag.
Gunther, O. (1989). The design of the cell tree: An object-oriented index structure for geometric databases. In Proc. 5th International Conference on Data Engineering, pages 598-605.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proc. 1984 ACM SIGMOD International Conference on Management of Data, pages 47-57.
Hall, P. and Dowling, G. (1980). Approximate string matching. Computing Surveys, 12(4):381-402.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Harman, D., editor (1992). Proc. TREC Text Retrieval Conference. National Institute of Standards Special Publication 500-207.
Harman, D., editor (1995a). Information Processing & Management, volume 31(3). Special Issue: The Second Text Retrieval Conference (TREC-2).
Harman, D. (1995b). Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271-289.
Harman, D. and Candela, G. (1990). Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41(8):581-589.
Hearst, M. and Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proc. 16th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 59-68.
Henrich, A., Six, H.-W., and Widmayer, P. (1989a). The LSD tree: Spatial access to multidimensional point and non-point objects. In Proc. 15th International Conference on Very Large Data Bases, pages 45-53.
Henrich, A., Six, H.-W., and Widmayer, P. (1989b). Paging binary trees with external balancing. In Proc. International Workshop on Graphtheoretic Concepts in Computer Science.
Hinrichs, K. (1985). Implementation of the grid file: Design concepts and experience. BIT, 25:569-592.
Hinrichs, K. and Nievergelt, J. (1983). The grid file: A data structure designed to support proximity queries on spatial objects. In Proc. International Workshop on Graphtheoretic Concepts in Computer Science, pages 100-113.
Hirata, K., Hara, Y., Takano, H., and Kawasaki, S. (1996). Content-oriented integration in hypermedia systems. In Proc. 1996 ACM Conference on Hypertext, pages 11-21.
Hoel, E. and Samet, H. (1992). A qualitative comparison study of data structures for large line segment databases. In Proc. 1992 ACM SIGMOD International Conference on Management of Data, pages 205-214.
Hsu, W., Chua, T. S., and Pung, H. K. (1995). An integrated color-spatial approach to content-based image retrieval. In Proc. 3rd ACM Multimedia Conference, pages 305-313.
Hutflesz, A., Six, H.-W., and Widmayer, P. (1990). The R-file: An efficient access structure for proximity queries. In Proc. 6th International Conference on Data Engineering, pages 372-379.
Iannizzotto, G., Vita, L., and Puliafito, A. (1996). A new shape distance for content-based image retrieval. In Proc. 3rd Multimedia Modeling Conference, pages 371-386.
Imielinski, T. and Badrinath, B. (1994). Mobile wireless computing: Solutions and challenges in data management. Communications of the ACM, 37(10):18-28.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994a). Energy efficient indexing on air. In Proc. 1994 ACM SIGMOD International Conference on Management of Data, pages 25-36.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994b). Power efficient filtering of data on air. In Proc. 4th International Conference on Extending Database Technology, pages 245-258.
Ioka, M. (1989).
A method of defining the similarity of images on the basis of color information. Technical Report RT-0030, IBM Tokyo Research Lab.
Jaffar, J. and Lassez, J. (1987). Constraint logic programming. In Proc. 14th Annual ACM Symposium on Principles of Programming Languages, pages 111-119.
Jagadish, H. V. (1991). A retrieval technique for similar shape. In Proc. 1991 ACM SIGMOD International Conference on Management of Data, pages 208-217.
Jea, K. F. and Lee, Y. C. (1990). Building efficient and flexible feature-based indexes. Information Systems, 16(6):653-662.
Jenq, P., Woelk, D., Kim, W., and Lee, W. (1990). Query processing in distributed ORION. In Proc. 2nd International Conference on Extending Database Technology, pages 169-187.
Jensen, C. S., editor (1994). A consensus glossary of temporal database concepts.
Jensen, C. S., Mark, L., and Roussopoulos, N. (1991). Incremental implementation model for relational databases with transaction time. IEEE Transactions on Knowledge and Data Engineering, 3(4):461-473.
Jensen, C. S. and Snodgrass, R. (1994). Temporal specialization and generalization. IEEE Transactions on Knowledge and Data Engineering, 6(6):954-974.
Jhingran, A. (1991). Precomputation in a complex object environment. In Proc. 7th IEEE International Conference on Data Engineering, pages 652-659.
Jiang, P., Ooi, B. C., and Tan, K. L. (1996). An experimental study of temporal indexing structures. Unpublished manuscript, available at http://www.iscs.nus.sg/ooibc/tp.ps.
Kabanza, F., Stevenne, J., and Wolper, P. (1990). Handling infinite temporal data. In Proc. 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 392-403.
Kanellakis, P., Kuper, G., and Revesz, P. (1995). Constraint query languages. Journal of Computer and System Sciences, 51(1):26-52.
Kanellakis, P. and Ramaswamy, S. (1996). Indexing for data models with constraints and classes. Journal of Computer and System Sciences, 52(3):589-612.
Kaszkiel, M. and Zobel, J. (1997). Passage retrieval revisited. In Proc. 20th ACM-SIGIR International Conference on Research and Development in Information Retrieval.
Kemper, A., Kilger, C., and Moerkotte, G. (1994). Function materialization in object bases: Design, realization and evaluation.
IEEE Transactions on Knowledge and Data Engineering, 6(4):587-608.
Kemper, A. and Kossmann, D. (1995). Adaptable pointer swizzling strategies in object bases: Design, realization, and quantitative analysis. VLDB Journal, 4(3):519-566.
Kemper, A. and Moerkotte, G. (1992). Access support relations: An indexing method for object bases. Information Systems, 17(2):117-145.
Kent, A., Sacks-Davis, R., and Ramamohanarao, K. (1990). A signature file scheme based on multiple organizations for indexing very large text databases. Journal of the American Society for Information Science, 41(7):508-534.
Kilger, C. and Moerkotte, G. (1994). Indexing multiple sets. In Proc. 20th International Conference on Very Large Data Bases, pages 180-191.
Kim, K., Kim, W., Woelk, D., and Dale, A. (1988). Acyclic query processing in object-oriented databases. In Proc. 7th International Conference on Entity-Relationship Approach, pages 329-346.
Kim, W. (1989). A model of queries for object-oriented databases. In Proc. 15th International Conference on Very Large Data Bases, pages 423-432.
Kim, W., Kim, K., and Dale, A. (1989). Indexing techniques for object-oriented databases. In Object-Oriented Concepts, Databases, and Applications, pages 371-394. Addison-Wesley.
Knaus, D., Mittendorf, E., Schauble, P., and Sheridan, P. (1995). Highlighting relevant passages for users of the interactive SPIDER retrieval system. In Proc. 4th Text Retrieval Conference (TREC), pages 233-243.
Knuth, D. E. (1973). Fundamental Algorithms: The Art of Computer Programming, Volume 1. Addison-Wesley.
Knuth, D. E. and Wegner, L. M., editors (1992). Proc. IFIP TC2/WG2.6 2nd Working Conference on Visual Database Systems. North-Holland.
Kolovson, C. (1993). Indexing techniques for historical databases. In Temporal Databases: Theory, Design and Implementation, Chapter 17, pages 418-432. Benjamin/Cummings.
Kolovson, C. and Stonebraker, M. (1991). Segment indexes: Dynamic indexing techniques for multi-dimensional interval data. In Proc. 1991 ACM SIGMOD International Conference on Management of Data, pages 138-147.
Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., and Protopapas, Z. (1996). Fast nearest neighbor search in medical image databases. In Proc. 22nd International Conference on Very Large Data Bases, pages 215-226.
Koubarakis, M. (1994). Database models for infinite and indefinite temporal information. Information Systems, 19(2):141-173.
Kriegel, H. (1984). Performance comparison of index structures for multi-key retrieval. In Proc. 1984 ACM SIGMOD International Conference on Management of Data, pages 186-196.
Kriegel, H. and Seeger, B. (1986).
Multidimensional order preserving linear hashing with partial expansion. In Proc. 1st International Conference on Database Theory, pages 203-220.
Kriegel, H. and Seeger, B. (1988). PLOP-Hashing: A grid file without directory. In Proc. 4th International Conference on Data Engineering, pages 369-376.
Kroll, B. and Widmayer, P. (1994). Distributing a search tree among a growing number of processors. In Proc. 1994 ACM SIGMOD International Conference on Management of Data, pages 265-276.
Kukich, K. (1992). Techniques for automatically correcting words in text. Computing Surveys, 24(4):377-440.
Kumar, A., Tsotras, V. J., and Faloutsos, C. (1995). Access methods for bitemporal databases. In Proc. International Workshop on Temporal Databases, pages 235-254.
Kunii, T., editor (1989). Proc. IFIP TC2/WG2.6 1st Working Conference on Visual Database Systems. North-Holland.
Larson, P. (1978). Dynamic hashing. BIT, 18:184-201.
Lassez, J. (1990). Querying constraints. In Proc. 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 288-298.
Lee, D. T. and Wong, C. K. (1977). Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, 9(1):23-29.
Lee, S. Y. and Hsu, F. J. (1990). 2D C-String: A new spatial knowledge representation for image database system. Pattern Recognition, 23(10):1077-1087.
Lee, S. Y. and Leng, C. (1989). Partitioned signature files: Design issues and performance evaluation. ACM Transactions on Office Information Systems, 7(2):158-180.
Lee, S. Y., Yang, M. C., and Chen, J. W. (1992). Signature file as a spatial filter for iconic image database. Journal of Visual Languages and Computing, 3(4):373-397.
Lee, W. (1989). Mobile cellular telecommunication systems. McGraw-Hill.
Lin, K., Jagadish, H., and Faloutsos, C. (1995). The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517-542.
Litwin, W. (1980). Linear hashing: A new tool for file and table addressing. In Proc. 6th International Conference on Very Large Data Bases, pages 212-223.
Litwin, W. and Neimat, M. (1996). k-RP*S: A scalable distributed data structure for high-performance multi-attribute access. In Proc. 4th Conference on Parallel and Distributed Information Systems, pages 35-46.
Litwin, W., Neimat, M., and Schneider, D. (1993a). LH* - Linear hashing for distributed files. In Proc. 1993 ACM SIGMOD International Conference on Management of Data, pages 327-336.
Litwin, W., Neimat, M., and Schneider, D. (1994). RP*: A family of order-preserving scalable data structures. In Proc. 20th International Conference on Very Large Data Bases, pages 342-353.
Litwin, W., Neimat, N. A., and Schneider, D. A. (1993b). LH* - Linear hashing for distributed files. In Proc. 1993 ACM SIGMOD International Conference on Management of Data, pages 327-336.
Lomet, D. (1992). A review of recent work on multi-attribute access methods. ACM SIGMOD Record, 21(3):56-63.
Lomet, D. and Salzberg, B. (1989). Access methods for multiversion data. In Proc. 1989 ACM SIGMOD International Conference on Management of Data, pages 315-324.
Lomet, D. and Salzberg, B. (1990a). The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Transactions on Database Systems, 15(4):625-658.
Lomet, D. and Salzberg, B. (1990b). The performance of a multiversion access method. In Proc. 1990 ACM SIGMOD International Conference on Management of Data, pages 353-363.
Lomet, D. and Salzberg, B. (1993). Transaction time databases. In Temporal Databases: Theory, Design and Implementation, Chapter 16, pages 388-417. Benjamin/Cummings.
Lovins, J. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1-2):22-31.
Low, C. C., Ooi, B. C., and Lu, H. (1992). H-trees: A dynamic associative search index for OODB. In Proc. 1992 ACM SIGMOD International Conference on Management of Data, pages 134-143.
Lu, H. and Ooi, B. C. (1993). Spatial indexing: Past and future. IEEE Bulletin on Data Engineering, 16(3):16-21.
Lu, H., Ooi, B. C., and Tan, K. L. (1994). Efficient image retrieval by color contents. In Proc. 1994 International Conference on Applications of Databases, pages 95-108.
Lu, W. and Han, J. (1992). Distance-associated join indices for spatial range search. In Proc. 8th International Conference on Data Engineering, pages 284-292.
Lucarella, D. (1988). A document retrieval system based upon nearest neighbor searching. Journal of Information Science, 14:25-33.
Maier, D. and Stein, J. (1986). Indexing in an object-oriented database. In Proc. IEEE Workshop on Object-Oriented DBMSs, pages 171-182.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):2091-2110.
Manber, U. and Wu, S. (1994). GLIMPSE: A tool to search through entire file systems. In Proc.
1994 Winter USENIX Technical Conference, pages 23-32.
Maragos, P. (1989). Pattern spectrum and multiscale shape representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):701-716.
Maragos, P. and Schafer, R. W. (1986). Morphological skeleton representation and coding of binary images. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34:1228-1244.
Matsuyama, T., Hao, L., and Nagao, M. (1984). A file organization for geographic information systems based on spatial proximity. International Journal on Computer Vision, Graphics, and Image Processing, 26(3):303-318.
Mehlhorn, K. and Tsakalidis, A. (1990). Data structures. In Handbook of Theoretical Computer Science, Volume A, pages 301-341. Elsevier.
Mehrotra, R. and Gary, J. E. (1993). Feature-based retrieval of similar shapes. In Proc. 9th International Conference on Data Engineering, pages 108-115.
Melton, J. (1996). An SQL3 snapshot. In Proc. 12th International Conference on Data Engineering, pages 666-672.
Mittendorf, E. and Schauble, P. (1994). Document and passage retrieval based on hidden Markov models. In Proc. 17th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 318-327.
Miyahara, M. and Yoshida, Y. (1989). Mathematical transform of (R,G,B) color data to Munsell (H,V,C) color data. Journal of the Institute of Television Engineers, 43(10):1129-1136.
Moffat, A. and Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349-379.
Moffat, A., Zobel, J., and Sacks-Davis, R. (1994). Memory efficient ranking. Information Processing & Management, 30(6):733-744.
Morrison, D. (1968). PATRICIA - Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534.
Morton, G. (1966). A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa.
Moss, J. (1992). Working with persistent objects: To swizzle or not to swizzle. IEEE Transactions on Software Engineering, 18(8):657-673.
Nabil, M., Ngu, A. H. H., and Shepherd, J. (1996). Picture similarity retrieval using the 2D projection interval representation. IEEE Transactions on Knowledge and Data Engineering, 8(4):533-539.
Nagy, G. (1985). Image databases.
Image and Vision Computing, 3(3):111-117.
Nascimento, M. A. (1996). Efficient Indexing of Temporal Database via B+-trees. PhD thesis, School of Engineering and Applied Science, Southern Methodist University.
Nelson, R. and Samet, H. (1987). A population analysis for hierarchical data structures. In Proc. 1987 ACM SIGMOD International Conference on Management of Data, pages 270-277.
Ng, V. and Kameda, T. (1993). Concurrent accesses to R-trees. In Proc. 3rd International Symposium on Advances in Spatial Databases, pages 142-161.
Niblack, W., Barber, R., Equitz, W., Flickner, M., Glasman, E., Petkovic, D., Yanker, P., and Faloutsos, C. (1993). The QBIC project: Query images by content using color, texture and shape. In Storage and Retrieval for Image and Video Databases, Volume 1908, pages 173-187.
Nievergelt, J. and Hinrichs, K. (1985). Storage and access structures for geometric data bases. In Proc. International Conference on Foundations of Data Organization, pages 335-345.
Nievergelt, J., Hinterberger, H., and Sevcik, K. C. (1984). The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38-71.
Nievergelt, J. and Widmayer, P. (1997). Spatial data structures: Concepts and design choices. In Algorithmic Foundations of GIS, pages 1-61. Springer-Verlag.
Nori, A. (1996). Object relational database management systems (tutorial notes). In Proc. 22nd International Conference on Very Large Data Bases.
ObjectStore (1995). ObjectStore C++ - User Guide Release 4.0.
Ogle, V. E. and Stonebraker, M. (1995). Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9):40-48.
Ohsawa, Y. and Sakauchi, M. (1983). The BD-tree: A new n-dimensional data structure with highly efficient dynamic characteristics. In Proc. IFIP Congress, pages 539-544.
Ohsawa, Y. and Sakauchi, M. (1990). A new tree type data structure with homogeneous nodes suitable for a very large spatial database. In Proc. 6th International Conference on Data Engineering, pages 296-303.
O'Neil, P. and Graefe, G. (1995). Multi-table joins through bitmapped join indices. ACM SIGMOD Record, 24(3):8-11.
O'Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In Proc. 1997 ACM SIGMOD International Conference on Management of Data.
Ooi, B. C. (1990). Efficient Query Processing in Geographical Information Systems. Springer-Verlag.
Ooi, B. C., McDonell, K. J., and Sacks-Davis, R. (1987). Spatial kd-tree: An indexing mechanism for spatial databases. In Proc. 11th International Conference on Computer Software and Applications.
Ooi, B. C., Sacks-Davis, R., and Han, J. (1993).
Spatial indexing structures. Unpublished manuscript, available at http://www.iscs.nus.edu.sg/ooibc/.
Ooi, B. C., Sacks-Davis, R., and McDonell, K. J. (1991). Spatial indexing by binary decomposition and spatial bounding. Information Systems, 16(2):211-237.
Ooi, B. C., Tan, K. L., and Chua, T. S. (1997). Fast image retrieval using color-spatial information. Technical report, Department of Information Systems and Computer Science, NUS, Singapore.
Orenstein, J. A. (1982). Multidimensional tries for associative searching. Information Processing Letters, 14(4):150-157.
Orenstein, J. A. (1986). Spatial query processing in an object-oriented database system. In Proc. 1986 ACM SIGMOD International Conference on Management of Data, pages 326-336.
Orenstein, J. A. (1990). A comparison of spatial query processing techniques for native and parameter spaces. In Proc. 1990 ACM SIGMOD International Conference on Management of Data, pages 343-352.
Orenstein, J. A. and Merrett, T. H. (1984). A class of data structures for associative searching. In Proc. 1984 ACM-SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 181-190.
Ouksel, M. and Scheuermann, P. (1981). Multidimensional B-trees: Analysis of dynamic behavior. BIT, 21:401-418.
Overmars, M. H. and Leeuwen, J. V. (1982). Dynamic multi-dimensional data structures based on Quad- and KD-trees. Acta Informatica, 17:267-285.
Owolabi, O. and McGregor, D. (1988). Fast approximate string matching. Software - Practice and Experience, 18:387-393.
Papadias, D., Theodoridis, Y., Sellis, T., and Egenhofer, M. J. (1995). Topological relations in the world of minimum bounding rectangles: A study with R-trees. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 92-103.
Paredaens, J. (1995). Spatial databases, the final frontier. In Proc. 5th International Conference on Database Theory, pages 14-31.
Paredaens, J., Van den Bussche, J., and Van Gucht, D. (1994). Towards a theory of spatial database queries. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 279-288.
Persin, M. (1996). Efficient implementation of text retrieval techniques. Master's thesis, Department of Computer Science, RMIT, Melbourne, Australia.
Persin, M., Zobel, J., and Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749-764.
Pfaltz, J., Berman, W., and Cagley, E. (1980).
Partial-match retrieval using indexed descriptor files. Communications of the ACM, 23(9):522-528.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
Preparata, F. and Shamos, M. (1985). Computational Geometry: An Introduction. Springer-Verlag.
Rabitti, F. and Savino, P. (1991). Image query processing based on multi-level signatures. In Proc. 14th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 305-314.
Rabitti, F. and Stanchev, P. (1989). GRIM-DBMS: A graphical image database management system. In Proc. IFIP TC2/WG2.6 1st Working Conference on Visual Database Systems, pages 415-430.
Ramaswamy, S. (1997). Efficient indexing for constraints and temporal databases. In Proc. 6th International Conference on Database Theory, pages 419-431.
Ramaswamy, S. and Kanellakis, P. (1995). OODB indexing by class-division. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 139-150.
Roberts, C. (1979). Partial-match retrieval via the method of superimposed codes. Proceedings of the IEEE, 67(12):1624-1642.
Robinson, J. T. (1981). The k-d-b-tree: A search structure for large multi-dimensional dynamic indexes. In Proc. 1981 ACM SIGMOD International Conference on Management of Data, pages 10-18.
Rosenberg, J. B. (1985). Geographical data structures compared: A study of data structures supporting region queries. IEEE Transactions on Computer Aided Design, 4(1):53-67.
Rotem, D. (1991). Spatial join indices. In Proc. 7th International Conference on Data Engineering, pages 500-509.
Rotem, D. and Segev, A. (1987). Physical organization of temporal data. In Proc. 3rd International Conference on Data Engineering, pages 547-553.
Sacks-Davis, R., Kent, A., and Ramamohanarao, K. (1987). Multi-key access methods based on superimposed coding techniques. ACM Transactions on Database Systems, 12(4):655-696.
Sagiv, Y. (1986). Concurrent operations on B*-trees with overtaking. Journal of Computer and System Sciences, 33(2):275-296.
Salomone, S. (1995). Radio days. In Byte, Special Issue on Mobile Computing, page 107.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley.
Salton, G., Allan, J., and Buckley, C. (1993). Approaches to passage retrieval in full text information systems. In Proc. 16th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 49-58.
Salton, G. and McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Salzberg, B. (1994). On indexing spatial and temporal data.
Information Sys- tems, 19(6):447-465. Samet, H. (1989). The design and analysis of spatial data structures. Addison- Wesley. Scheuermann, P. and Ouksel, M. (1982). Multidimensional B-trees for associa- tive searching in database systems. Information Systems, 7(2):123-137.
Seeger, B. and Kriegel, H. (1988). Techniques for design and implementation of efficient spatial access methods. In Proc. 14th International Conference on Very Large Data Bases, pages 360-371.

Sellis, T., Roussopoulos, N., and Faloutsos, C. (1987). The R+-tree: A dynamic index for multi-dimensional objects. In Proc. 13th International Conference on Very Large Data Bases, pages 507-518.

Serra, J. (1988). Image Analysis and Mathematical Morphology, Volume 2: Theoretical Advances. Academic Press.

Shamos, M. I. and Bentley, J. L. (1978). Optimal algorithm for structuring geographic data. In Proc. 1st International Advanced Study Symposium on Topological Data Structure for Geographic Information Systems.

Sharma, K. D. and Rani, R. (1985). Choosing optimal branching factors for k-d-B trees. Information Systems, 10(1):127-134.

Shaw, G. and Zdonik, S. (1989). An object-oriented query algebra. In Proc. 2nd International Workshop on Database Programming Languages, pages 103-112.

Shen, H., Ooi, B. C., and Lu, H. (1994). The TP-index: A dynamic and efficient indexing mechanism for temporal databases. In Proc. 10th International Conference on Data Engineering, pages 274-281.

Sheng, S., Chandrasekaran, A., and Broderson, R. (1992). A portable multimedia terminal for personal communications. In IEEE Communications Magazine, pages 64-75.

Shidlovsky, B. and Bertino, E. (1996). A graph-theoretic approach to indexing in object-oriented databases. In Proc. 12th International Conference on Data Engineering, pages 230-237.

Snodgrass, R. (1987). The temporal query language TQuel. ACM Transactions on Database Systems, 12(2):247-298.

Sreenath, B. and Seshadri, S. (1994). The hcC-tree: An efficient index structure for object oriented databases. In Proc. 20th International Conference on Very Large Data Bases, pages 203-213.

Straube, D. and Ozsu, M. T. (1995). Query optimization and execution plan generation in object-oriented data management systems. IEEE Transactions on Knowledge and Data Engineering, 7(2):210-227.

Swain, M. J. (1993). Interactive indexing into image databases. In Storage and Retrieval for Image and Video Databases, Volume 1908, pages 95-103.

Tamminen, M. (1982). Efficient spatial access to a data base. In Proc. 1982 ACM SIGMOD International Conference on Management of Data, pages 200-206.

Tamura, H., Mori, S., and Yamawaki, T. (1978). Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6):460-472.
Tamura, H. and Yokoya, N. (1984). Image database systems: A survey. Pattern Recognition, 17(1):29-43.

Thom, J., Zobel, J., and Grima, B. (1995). Design of indexes for structured document databases. Technical Report TR-95-8, Collaborative Information Technology Research Institute, RMIT and The University of Melbourne.

Treisman, A. and Paterson, R. (1980). A feature integration theory of attention. Cognitive Psychology, 12:97-136.

Tsay, J. J. and Li, H. C. (1994). Lock-free concurrent tree structures for multiprocessor systems. In Proc. 1994 International Conference on Parallel and Distributed Systems, pages 544-549.

Valduriez, P. (1986). Optimization of complex database queries using join indices. IEEE Bulletin on Data Engineering, 9(4):10-16.

Valduriez, P. (1987). Join indices. ACM Transactions on Database Systems, 12(2):218-246.

van Rijsbergen, C. (1979). Information Retrieval. Butterworths, second edition.

Whang, K. and Krishnamurthy, R. (1985). Multilevel grid files. Technical Report RC-11516, IBM Thomas J. Watson Research Center.

Wilkinson, R. (1994). Effective retrieval of structured documents. In Proc. 17th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 311-317.

Witten, I., Moffat, A., and Bell, T. (1994). Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold.

Wu, S. and Manber, U. (1992). Agrep - A fast approximate pattern-matching tool. In Proc. 1992 Winter USENIX Technical Conference, pages 153-162.

Xie, Z. and Han, J. (1994). Join index hierarchy for supporting efficient navigation in object-oriented databases. In Proc. 20th International Conference on Very Large Data Bases, pages 522-533.

Zdonik, S. and Maier, D. (1989). Fundamentals of object-oriented databases. In Readings in Object-Oriented Database Management Systems.

Zhou, Z. and Venetsanopoulos, A. N. (1988). Morphological skeleton representation and shape recognition. In Proc. IEEE 2nd International Conference on ASSP, pages 948-951.

Zobel, J. and Dart, P. (1995). Finding approximate matches in large lexicons. Software - Practice and Experience, 25(3):331-345.

Zobel, J. and Dart, P. (1996). Phonetic string matching: Lessons from information retrieval. In Proc. 19th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 166-173.

Zobel, J., Moffat, A., and Ramamohanarao, K. (1995a). Inverted files versus signature files for text indexing. Technical Report TR-95-5, Collaborative Information Technology Research Institute, RMIT and The University of Melbourne.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1996). Guidelines for presentation and comparison of indexing techniques. ACM SIGMOD Record, 25(3):10-15.

Zobel, J., Moffat, A., and Sacks-Davis, R. (1992). An efficient indexing technique for full-text database systems. In Proc. 18th International Conference on Very Large Databases, pages 352-362.

Zobel, J., Moffat, A., and Sacks-Davis, R. (1993). Searching large lexicons for partially specified terms using compressed inverted files. In Proc. 19th International Conference on Very Large Databases, pages 290-301.

Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995b). Efficient retrieval of partial documents. Information Processing & Management, 31(3):361-377.
About the Authors

Elisa Bertino is full professor of computer science in the Department of Computer Science of the University of Milan. She has also been on the faculty in the Department of Computer and Information Science of the University of Genova, Italy. She has been a visiting researcher at the IBM Research Laboratory (now Almaden) in San Jose, and at the Microelectronics and Computer Technology Corporation in Austin, Texas. She is or has been on the editorial board of the following scientific journals: IEEE Transactions on Knowledge and Data Engineering, Theory and Practice of Object Systems Journal, Journal of Computer Security, Very Large Database Systems Journal, Parallel and Distributed Database, and the International Journal of Information Technology. She is currently serving as program co-chair of the 1998 International Conference on Data Engineering.

Beng Chin Ooi received his B.Sc. and Ph.D. in computer science from Monash University, Australia, in 1985 and 1989 respectively. He was with the Institute of Systems Science, Singapore, from 1989 to 1991 before joining the Department of Information Systems and Computer Science at the National University of Singapore. His research interests include database performance issues, database user interfaces, multimedia databases and applications, and GIS. He is the author of the monograph "Efficient Query Processing in Geographic Information Systems" (Springer-Verlag, 1990). He has published many conference and journal papers and serves as a PC member for a number of international conferences. He is currently on the editorial board of the following scientific journals: International Journal of Geographical Information Systems, Journal on Universal Computer Science, Geoinformatica, and International Journal of Information Technology.

Ron Sacks-Davis obtained his Ph.D. from the University of Melbourne in 1977. He currently holds the position of Professor and Institute Fellow at RMIT. He has published widely in the areas of database management and information retrieval and is an editor-in-chief of the International Journal on Very Large Databases (VLDB) and a member of the VLDB Endowment Board.

Kian-Lee Tan received his Ph.D. in computer science from the National University of Singapore in 1994. He is currently a lecturer in the Department of Information Systems and Computer Science, National University of Singapore. He has published numerous papers in the areas of multimedia information retrieval, wireless computing, and query processing and optimization in multiprocessor and distributed systems.

Justin Zobel obtained his Ph.D. in computer science from the University of Melbourne, where he was a member of staff from 1984 to 1990. He then joined the Department of Computer Science at RMIT, where he is now a senior lecturer. He has published widely in the areas of information retrieval, text databases, indexing, compression, string matching, and genomic databases.

Boris Shidlovsky received his M.Sc. in applied mathematics and Ph.D. in computer science from the University of Kiev, Ukraine, in 1984 and 1990 respectively. He was an assistant professor in the Department of Computer Science at the University of Kiev. From 1993 to 1996, he was with the Department of Computer Engineering at the University of Salerno, Italy, and is currently a member of the scientific staff at the Rank Xerox Research Centre, Grenoble, France. His research interests include design and analysis of algorithms, indexing and query optimization in advanced database systems, and processing of semistructured data on the Web.

Barbara Catania has been enrolled in the Ph.D. program in computer science at the University of Milano, Italy, since November 1993. She received the Laurea degree with honours in computer science from the University of Genova, Italy, in 1993. She has also been a visiting researcher at the European Computer-Industry Research Centre, Munich, Germany, where she participated in the ESPRIT project IDEA, sponsored by the European Economic Community. Her main research interests include constraint databases, deductive databases, and indexing techniques for constraint and object-oriented databases.
Index

O2, 4
X-tree, 25
(1, m) index, 201
1-dimensional generalized tuple, 218
2-dimensional generalized tuple, 218, 222
access support relation, 16, 19
access time, 199, 200, 202
active mode, 196
address calculation, 191
adjacency
  querying on, 154
aggregation, 7, 29
aggregation graph, 3
agrep, 213
ALL selection, 217, 222
Altavista, 211
AP-tree, 125-127
Archie, 211
B+-tree, 9, 20, 30
  of color-spatial index, 91
  with linear order, 129-132
B-tree, 2
  for lexicons, 159
battery, 196, 198, 200
bcast wait, 199
BD-tree, 54-55
binary join index, 10, 206
bitemporal database, 114
bitemporal interval tree, 140
bitemporal relation, 118
bitmap, 207
bitmap join index, 209
bitslices, 169
Boolean queries
  for text, 154-155
Boolean query evaluation
  for text, 169-170
bounding rectangle, 40
bounding structure, 41
broadcast channel, 197
broadcasted data, 196
bucket, 198
BV-tree, 63-64
caching, 36
CG-tree, 24
CH-tree, 21
color, 90
  CIE L*u*v, 108
  color histogram, 90
  Munsell HVC, 92
color index
  of color-spatial index, 94
color-spatial index
  for image, 91
compression
  of inverted lists, 161-164
configurable index, 200, 202
constraint, 214
constraint programming, 214
constraint theory, 216, 218
content-based index
  for image, 80
content-based retrieval
  for image, 78
convex theory, 218
cosine measure, 155-156
data warehouse, 204
decision support system, 203
delta code, 162
detail table, 205
diagonal corner query, 219
dimension table, 205
distributed index, 201
distributed RAM, 189
doze mode, 196
dual plane, 222
dual R-tree, 140
dumb terminal, 195
dynamic interval management, 219
effectiveness
  of ranking, 152
Elias codes, 161-162
emerging applications, 185-224
Excite, 211
EXIST selection, 217, 218
extension, 215
fact constellation schema, 205
fact table, 205
feature
  color, 90
  color-spatial, 91
  semantic object, 87
  shape, 84
  spatial relationship, 88
  texture, 89
feature extraction, 78
feature-based indexing, 78
file image, 191
file image adjustment, 192
filtering, 222
  for ranking, 172
fixed host, 194
flexible indexing, 202
gamma code, 162
GBD-tree, 54-55
GemStone, 4
generalized 1-dimensional indexing, 218
generalized concordance lists
  for text, 178
generalized database, 215
generalized relation, 215
generalized relational model, 215
generalized tuple, 215
Glimpse, 213
global index, 187
Golomb codes, 162-163
Gopher, 211
grid file, 64-67
H-tree, 23
Harvest, 214
hashing, 2
hB-tree, 49-51
hcC-tree, 24
image database, 77-112
image database system, 78
  architecture, 79
index construction
  for text, 164-166
index update
  for text, 166-168
indexing
  of documents, 153
indexing graph, 9
information retrieval, 152, 155-157
InfoSeek, 211
infrared technology, 194
inheritance, 5, 20, 29
inheritance graph, 4
inheritance hierarchy, 20
interleaving
  for ranking, 173
interval B-tree, 127-129
interval tree, 220
inverse document frequency, 156
inverted file
  for image, 83
inverted index, 212
  for text, 157-168
inverted lists
  for text, 158, 160-164
join
  explicit, 5
  implicit, 5
join index, 10
join index hierarchy, 19
K-D-B-tree, 48-49
kd-tree, 46-48
  non-homogeneous, 47
lexicons, 158-160
limiting accumulators
  for ranking, 172
linear hashing, 189
local index, 187
locational keys, 70-71
LSD-tree, 55-56
mapping table, 158
materialization technique, 204
meta-block tree, 220
metasearcher, 213
method invocation, 3, 36
minimum bounding polybox, 224
minimum bounding rectangle, 41, 223
mobile host, 194
mobile network, 194
multi-index, 9, 17
navigational access, 2
nested attribute, 3
nested index, 14, 17
nested predicate, 5, 10, 29
nested-inherited index, 29
non-configurable index, 200
NST-tree, 126
object identifier, 3
object query language, 2, 5
object-oriented data model, 1, 3
object-oriented database, 1-38
object-relational database, 1
ObjectStore, 4
OLAP, 203
OQL, 2
ordinal number, 207
palmtop, 195
partition, 186
partitioning degree, 186
passage retrieval, 180-181
path, 7
path index, 15, 17
path instantiation, 7, 15
path splitting, 18
path-expression, 5
pattern matching
  for text, 179-180
perceptually similar color, 108
phonetic matching
  for text, 180
PLOP-hashing, 68-69
point location, 222
pointer swizzling, 2, 36
precomputed join, 207
probe time, 199
projection, 16
proximity
  querying on, 154
query expansion
  for text, 181
query graph, 6
query precomputation, 204
R+-tree, 25, 60-63
R*-tree, 59-60
R-file, 67-68
R-tree, 25, 56-59, 132-137
  2-D R-tree, 133
  3-D R-tree, 133
ranked query evaluation
  for text, 170-175
ranking, 155-157
relevance
  judgments, 152
  of documents, 152
satellite network, 194
SC-index, 21
search engine, 211
semantic object, 87
sequential search, 212
set-oriented access, 2
SGML, 175
shape, 84
signature file
  for image, 84
  for text, 168-169
  of color-spatial index, 105
similarity, 155, 156
  measures, 79, 82, 155
    approximate match, 82
    Euclidean distance, 83
    exact match, 82
    signature-based, 107
    signature-based (weighted), 109
skd-tree, 51-54
SMAT
  of color-spatial index, 96
snowflake schema, 205
spatial access method
  for image, 83
spatial database, 39-75, 215
spatial index
  taxonomy, 42
    non-overlapping, 43
    overlapping, 44
    transformation approach, 43
spatial operators, 39
  adjacency, 40
  containment, 40
  intersection, 39, 41
spatial query processing, 40
  approximation, 40
  multi-step strategy, 42
spatial relationship, 88
SQL, 1
SQL-3, 2
stabbing query, 219
star schema, 205
stemming
  of words, 154
stopwords, 156, 175
storage on the air, 196
structured documents, 175-178
  indexing of, 177-178
suffixing
  of words, 154
summary table, 205
temporal database, 113-149, 215
temporal index, 121-142
  B+-tree with linear order, 129
temporal query, 119-121
  bitemporal key-range time-slice, 120
  bitemporal time-slice, 120
  key, 120
  key-range time-slice, 120
  time-slice, 119
    inclusion, 119
    intersection, 119
    point, 120
time-slice query
  containment, 120
text database, 151-182
text indexing, 157-169
text passage retrieval, 180-181
texture, 89
time
  lifespan, 115
  span, 115
  transaction time, 114
  valid time, 114
time index, 123-125
TP-index, 137-139
transaction time, 114-116
traversal strategy, 6
TREC, 159
TSB-tree, 122-123
tuning time, 200, 202
unary code, 161-162
valid time, 114, 116-117
variable-bit codes, 161-163
WAIS, 211
walkstation, 195
Web Crawler, 214
Web navigation, 210
Web robot, 214
WebCrawler, 211
weight, 221
weight-balanced B-tree, 220
Whois, 211
Whois++, 211
wireless interface, 194
WWW Worm, 214