INDEXING TECHNIQUES FOR
ADVANCED DATABASE SYSTEMS
The Kluwer International Series on
ADVANCES IN DATABASE SYSTEMS
Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907
Other books in the Series:
DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis
by Alexander Thomasian
ISBN: 0-7923-9741-X
TIME-CONSTRAINED TRANSACTION MANAGEMENT:
Real-Time Constraints in Database Transaction Systems
by Nandit R. Soparkar, Henry F. Korth, Abraham Silberschatz
ISBN: 0-7923-9752-5
SEARCHING MULTIMEDIA DATABASES BY CONTENT
by Christos Faloutsos
ISBN: 0-7923-9777-0
REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS
by Abdelsalam A. Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava
ISBN: 0-7923-9800-9
VIDEO DATABASE SYSTEMS: Issues, Products, and Applications
by Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed
ISBN: 0-7923-9872-6
DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS
by Nabil R. Adam and Aryya Gangopadhyay
ISBN: 0-7923-9924-2
INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES
by Thomas A. Mueck and Martin L. Polaschek
ISBN: 0-7923-9971-4
INDEXING TECHNIQUES FOR
ADVANCED DATABASE SYSTEMS
by
Elisa Bertino
University of Milano, Italy
Beng Chin Ooi
National University of Singapore, Singapore
Ron Sacks-Davis
RMIT, Australia
Kian-Lee Tan
National University of Singapore, Singapore
Justin Zobel
RMIT, Australia
Boris Shidlovsky
Grenoble Laboratory, France
Barbara Catania
University of Milano, Italy
SPRINGER SCIENCE+BUSINESS MEDIA, LLC
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available
from the Library of Congress.
ISBN 978-1-4613-7856-3 ISBN 978-1-4615-6227-6 (eBook)
DOI 10.1007/978-1-4615-6227-6
Copyright © 1997 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 1997
Softcover reprint of the hardcover 1st edition 1997
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photo-
copying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.
Printed on acid-free paper.
Contents
Preface VII
1. OBJECT-ORIENTED DATABASES 1
1.1 Object-oriented data model and query language 3
1.2 Index organizations for aggregation graphs 7
1.3 Index organizations for inheritance hierarchies 20
1.4 Integrated organizations 29
1.5 Caching and pointer swizzling 36
1.6 Summary 38
2. SPATIAL DATABASES 39
2.1 Query processing using approximations 40
2.2 A taxonomy of spatial indexes 42
2.3 Binary-tree based indexing techniques 46
2.4 B-tree based indexing techniques 56
2.5 Cell methods based on dynamic hashing 64
2.6 Spatial objects ordering 70
2.7 Comparative evaluation 71
2.8 Summary 73
3. IMAGE DATABASES 77
3.1 Image database systems 78
3.2 Indexing issues and basic mechanisms 80
3.3 A taxonomy on image indexes 84
3.4 Color-spatial hierarchical indexes 91
3.5 Signature-based color-spatial retrieval 105
3.6 Summary 109
4. TEMPORAL DATABASES 113
4.1 Temporal databases 114
4.2 Temporal queries 119
4.3 Temporal indexes 121
4.4 Experimental study 142
4.5 Summary 148
5. TEXT DATABASES 151
5.1 Querying text databases 152
5.2 Indexing 157
5.3 Query evaluation 169
5.4 Refinements to text databases 175
5.5 Summary 181
6. EMERGING APPLICATIONS 185
6.1 Indexing techniques for parallel and distributed databases 186
6.2 Indexing issues in mobile computing 194
6.3 Indexing techniques for data warehousing systems 203
6.4 Indexing techniques for the Web 210
6.5 Indexing techniques for constraint databases 214
References 225
Index 247
Preface
Database management systems are widely accepted as a standard tool for ma-
nipulating large volumes of data on secondary storage. To enable fast access
to stored data according to its content, databases use structures known as in-
dexes. While indexes are optional, as data can always be located by exhaustive
search, they are the primary means of reducing the volume of data that must
be fetched and processed in response to a query. In practice large database files
must be indexed to meet performance requirements.
Recent years have seen explosive growth in use of new database applications
such as CAD/CAM systems, spatial information systems, and multimedia in-
formation systems. The needs of these applications are far more complex than
traditional business applications. They call for support of objects with complex
data types, such as images and spatial objects, and for support of objects with
wildly varying numbers of index terms, such as documents. Traditional index-
ing techniques such as the B-tree and its variants do not efficiently support
these applications, and so new indexing mechanisms have been developed. As
a result of the demand for database support for new applications, there has
been a proliferation of new indexing techniques.
The need for a book addressing indexing problems in advanced applications
is evident. For practitioners and database and application developers, this
book explains best practice, guiding selection of appropriate indexes for each
application. For researchers, this book provides a foundation for development
of new and more robust indexes. For newcomers, this book is an overview of
the wide range of advanced indexing techniques.
The book consists of six self-contained chapters, each handled by area ex-
perts: Chapters 1 and 6 by Bertino, Catania, and Shidlovsky, Chapters 2, 3
and 4 by Ooi and Tan, and Chapter 5 by Sacks-Davis and Zobel. Each of the
first five chapters discusses indexing problems and techniques for a different
database application; the last chapter discusses indexing problems in emerging
applications.
In Chapter 1 we discuss indexes and query evaluation for object-oriented
databases. Complex objects, variable-length objects, large objects, versions,
and long transactions cannot be supported efficiently by relational database
systems. The inadequacy of relational databases for these applications has pro-
vided the impetus for database researchers to develop object-oriented database
systems, which capture sophisticated semantics and provide a close model of
real-world applications. Object-oriented databases are a confluence of two tech-
nologies: databases and object-oriented programming languages. However, the
concepts of object, method, message, aggregation and generalization introduce
new problems to query evaluation. For example, aggregation allows an object
to be retrieved through its composite objects or based on the attribute values
of its component objects, while generalization allows an object to be retrieved
as an instance of its superclass.
Spatial data is large in volume and rich in structures and relationships.
Queries that involve the use of spatial operators (such as spatial intersection
and containment) are common. Operations involving these operators are ex-
pensive to compute, compared to operations such as join, and indexes are
essential to reduction of query processing costs. Indexing in a spatial database
is problematic because spatial objects can have non-zero extent and are asso-
ciated with spatial coordinates, and many-to-many spatial relationships exist
between spatial objects. Search is based, not only on attribute values, but on
spatial properties. In Chapter 2, we address issues related to spatial indexing
and analyze several promising indexing methods.
Conventional databases only store the current facts of the organization they
model. Changes in the real world are reflected by overwriting out-of-date data
with new facts. Monitoring these changes and past values of the data is, how-
ever, useful for tracking historical trends and time-varying events. In temporal
databases, facts are not deleted but instead are associated with times, which
are stored with the data to allow retrieval based on temporal relationships. To
support efficient retrieval based on time, temporal indexes have been proposed.
In Chapter 4, we describe and review temporal indexing mechanisms.
In large collections of images, a natural and useful way to retrieve image
data is by queries based on the contents of images. Such image-based queries
can be specified symbolically by describing their contents in terms of image
features such as color, shape, texture, objects, and spatial relationship between
them; or pictorially using sketches or example images. Supporting content-
based retrieval of image data is a difficult problem and embraces technologies
including image processing, user interface design, and database management.
To provide efficient content-based retrieval, indexes based on image features
are required. We consider feature-based indexing techniques in Chapter 3.
Text data without uniform structure forms the main bulk of data in corpo-
rate repositories, digital libraries, legal and court databases, and document
archives such as newspaper databases. Retrieval of documents is achieved
through matching words and phrases in document and query, but for docu-
ments Boolean-style matching is not usually effective. Instead, approximate
querying techniques are used to identify the documents that are most likely to
be relevant to the query. Effectiveness can be enhanced by use of transforma-
tions such as stemming and methodologies such as feedback. To support fast
text searching, however, indexing techniques such as special-purpose inverted
files are required. In Chapter 5, we examine indexes and query evaluation for
document databases.
In the first five chapters we cover the indexing topics of greatest importance
today. There are however many database applications that make use of indexing
but do not fall into one of the above five areas, such as data warehousing, which
has recently become an active research topic due to both its complexity and
its commercial potential. Queries against warehouses require large numbers of joins and the calculation of aggregate functions. Another example is the use
of indexes to minimize energy consumption in portable equipment used in a
highly mobile environment. In Chapter 6 we discuss indexing mechanisms for
several such emerging database applications.
We are grateful to the many people and organizations who helped with
this book, and with the research that made it possible. In particular we thank
Timothy Arnold-Moore, Tat Seng Chua, Winston Chua, Cheng Hian Goh, Peng
Jiang, Marcin Kaszkiel, Alan Kent, Ramamohanarao Kotagiri, Wan-Meng Lee,
Alistair Moffat, Michael Persin, Yong Tai Tan, and Ross Wilkinson. Dave Abel,
Jiawei Han and Jürg Nievergelt read earlier drafts of several chapters, and
provided helpful comments. We are also grateful to the Multimedia Database
Systems group at RMIT, the RMIT Department of Computer Science, the
Australian Research Council and the Department of Information Systems and
Computer Science at the National University of Singapore.
Elisa Bertino
Barbara Catania
Beng Chin Ooi
Ron Sacks-Davis
Boris Shidlovsky
Kian-Lee Tan
Justin Zobel
1 OBJECT-ORIENTED DATABASES
There has been a growing acceptance of the object-oriented data model as
the basis of next generation database management systems (DBMSs). Both
pure object-oriented DBMS (OODBMSs) and object-relational DBMS (OR-
DBMSs) have been developed based on object-oriented concepts. Object-
relational DBMS, in particular, extend the SQL language by incorporating
all the concepts of the object-oriented data model. A large number of products
for both categories of DBMS are available today. In particular, all major vendors
of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
The widespread adoption of the object-oriented data model in the database
area has been driven by the requirements posed by advanced applications, such
as CAD/CAM, software engineering, workflow systems, geographic information
systems, telecommunications, and multimedia information systems, to name just a few. These applications require effective support for the management of complex objects. For example, a typical advanced application requires handling
text, graphics, bitmap pictures, sounds and animation files. Other crucial re-
quirements derive from the evolutionary nature of applications and include
multiple versions of the same data and long-lived transactions. The use of
an object-oriented data model satisfies many of the above requirements. For
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
example, an application's complex objects can be directly represented by the
model, and therefore there is no need to flatten them into tuples, as when re-
lational DBMSs are used. Moreover, the encapsulation property supports the
integration of packages for handling complex objects. However, because of the
increased complexity of the data model, and of the additional operational re-
quirements, such as versions or long transactions, the design of an OODBMS
or an ORDBMS poses several issues, both on the data model and languages,
and on the architecture [Kim et al., 1989, Nori, 1996, Zdonik and Maier, 1989].
An important issue is related to the efficient support of both navigational
and set-oriented accesses. Both types of accesses occur in applications typical
of OODBMS and ORDBMS and both must be efficiently supported. Navigational
access is based on traversing object references; a typical example is represented
by graph traversal. Set-oriented access is based on the use of a high-level,
declarative query language. Object query languages have by now reached a certain degree of maturity. A standard query language, known as OQL
(Object Query Language), has been proposed as part of the ODMG standard-
ization effort [Bartels, 1996, Cattell, 1993], whereas the SQL-3 standard, still
under development, is expected to include all major object modeling concepts
[Melton, 1996]. The two means of access are often complementary. A query
selects a set of objects. The retrieved objects and their components are then
accessed by using navigational capabilities [Bertino and Martino, 1993]. A brief
summary of query languages is presented in Section 1.1.
Different strategies and techniques are required to support the two above ac-
cess modalities. Efficient navigational access is based on caching techniques and
transformation of navigation pointers into main-memory addresses (swizzling),
whereas efficient execution of queries is achieved by the allocation of suitable
access structure and the use of sophisticated query optimizers. Access struc-
tures typically used in relational DBMSs are based on variations of the B-tree
structure [Comer, 1979] or on hashing techniques. An index is maintained on an
attribute or combination of attributes of a relation. Since an object-oriented
data model has many differences from the relational model, suitable index-
ing techniques must be developed to efficiently support object-oriented query
languages. In this chapter we survey some of the issues associated with indexing techniques and describe proposed approaches. We also briefly discuss caching and pointer swizzling techniques; for more details on these, we refer the reader to [Kemper and Kossmann, 1995]. In the remainder of this
chapter, we cast our discussion in terms of the object-oriented data model typical of OODBMSs, because most of the work on indexing techniques has been developed in the framework of OODBMSs. However, most of the discussion
applies to ORDBMSs as well.
The remainder of the chapter is organized as follows. Section 1.1 presents
an overview of the basic concepts of object-oriented data models, query lan-
guages, and query processing. For the purpose of the discussion, we consider
an object-oriented database organized along two dimensions: aggregation, and
inheritance. Indexing techniques for each of those dimensions are discussed in
Sections 1.2 and 1.3, respectively. Section 1.4 presents integrated organizations,
supporting queries along both aggregation and inheritance graphs. Section 1.5
briefly discusses method precomputation, caching and swizzling. Finally, Sec-
tion 1.6 presents some concluding remarks.
1.1 Object-oriented data model and query language
An object-oriented data model is based on a number of concepts [Bertino and
Martino, 1993, Cattell, 1993, Zdonik and Maier, 1989]:
• Each real-world entity is modeled by an object. Each object is associated
with a unique identifier (called an OID) that makes the object distinguishable from any other object in the database. OODBMSs provide objects with
persistent and immutable identifiers: an object's identifier does not change
even if the object modifies its state.
• Each object has a set of instance attributes and methods (operations). The
value of an attribute can be an object or a set of objects. The set of at-
tributes of an object and the set of methods represent the object structure
and behavior, respectively.
• The attribute values represent the object's state. This state is accessed
or modified by sending messages to the object to invoke the corresponding
methods.
• Objects sharing the same structure and behavior are grouped into classes.
A class represents a template for a set of similar objects. Each object is an
instance of some class. A class definition consists of a set of instance attributes
(or simply attributes) and methods. The domain of an attribute may be an
arbitrary class. The definition of a class C results in a directed graph (called an aggregation graph) of classes rooted at C. An attribute of any class in an aggregation graph is a nested attribute of the class at the root of the graph.
Objects, instances of a given class, have a value for each attribute defined
by the class. All methods defined in a class can be invoked on the objects,
instances of the class.
• A class can be defined as a specialization of one or more classes. A class
defined as a specialization is called a subclass and inherits attributes and methods from its superclasses. The specialization relationship among classes organizes them in an inheritance graph, which is orthogonal to the aggregation graph.

Figure 1.1. An object-oriented database schema.
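To make these concepts concrete, the following Python sketch models objects with immutable OIDs, attributes whose values may be other objects, and a subclass inheriting from its superclass. All class and attribute names here are illustrative only, not drawn from any particular OODBMS.

```python
import itertools

# Counter used to hand out persistent, immutable OIDs.
_oid_counter = itertools.count(1)

class DBObject:
    """An object with an immutable identifier and a mutable state."""
    def __init__(self, **attrs):
        self._oid = next(_oid_counter)   # assigned once, never changes
        self.__dict__.update(attrs)      # instance attributes = object state

    @property
    def oid(self):
        return self._oid

class Publisher(DBObject):
    pass

class Book(DBObject):
    pass

class Manual(Book):   # subclass: inherits structure and behavior of Book
    pass

kluwer = Publisher(name="Kluwer")
b = Book(title="Searching Multimedia Databases by Content", publisher=kluwer)
m = Manual(title="C++ Reference Manual", publisher=kluwer)

# Modifying the state does not change the identity.
old_oid = b.oid
b.title = "A new title"
assert b.oid == old_oid

# Navigating an aggregation link: Book.publisher.name.
assert b.publisher.name == "Kluwer"
# A Manual is also an instance of its superclass Book.
assert isinstance(m, Book)
```

The sketch illustrates why OIDs, not attribute values, identify objects: state changes leave identity untouched, and references between objects directly encode the aggregation graph.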
An example of an object-oriented database schema, which will be used as
running example, is graphically represented in Figure 1.1. In the graphical
representation, a box represents a class. Within each box are the names of the attributes of the class. Names labeled with a star denote multi-valued attributes. Two types of arcs are used in the representation. A simple arc from a class C to a class C' denotes that C' is the domain of an attribute of C. A bold arc from a class C to a class C' indicates that C is a superclass of C'.
In the remainder of the discussion, we make the following assumptions. First,
we consider classes as having the extensional notion of the set of their instances.
Second, we make the assumption that the extent of a class does not include the
instances of its subclasses. Queries are therefore made against classes. Note
that in several systems, such as GemStone [Bretl et al., 1989], O2 [Deux, 1990], and ObjectStore [ObjectStore, 1995], classes do not have mandatory associated extensions. Therefore, applications have to use collections, or
sets, to group instances of the same class. Different collections may be defined
on the same class. Therefore, increased flexibility is achieved, even if the data
model becomes more complex. When collections are the basis for queries, in-
dexes are allocated on collections and not on classes [Maier and Stein, 1986].
In some cases, even though indexes are on collections, the definitions of the classes of the indexed objects must satisfy certain constraints for the index to be allocated on the collections. For example, in GemStone an attribute on which an index is allocated must be defined as a constrained attribute in the class definition, that is, a domain must be specified for the attribute¹. Similarly, ObjectStore requires that an attribute on which an index has to be allocated be declared as indexable in the class definition.
As we discussed earlier, most OODBMSs provide an associative query lan-
guage [Bancilhon and Ferran, 1994, Cluet et al., 1989, Kim, 1989, Shaw and
Zdonik, 1989]. Here we summarize those features that most influence indexing
techniques:
• Nested predicates
Because of objects' nested structures, most object-oriented query languages
allow objects to be restricted by predicates on both nested and non-nested
attributes of objects. An example of a query against the database schema
of Figure 1.1 is:
Retrieve the authors of books published by Kluwer. (Q1)
This query contains the nested predicate "published by Kluwer". Nested
predicates are usually expressed using path-expressions. For example, the
nested predicate in the above query can be expressed as
Author.books.publisher.name = "Kluwer".
• Inheritance
A query may apply to just a class, or to a class and to all its subclasses. An
example of a query against the database schema of Figure 1.1 is:
Retrieve all instances of class Book and all its subclasses published in
1991. (Q2)
The above query applies to all the classes in the hierarchy rooted at class
Book.
• Methods
A method can be used in a query as a derived attribute method or a predicate
method. A derived attribute method has a function comparable to that of
an attribute, in that it returns an object (or a value) to which comparisons
can be applied. A predicate method returns the logical constants True or
False. The value returned by a predicate method can then participate in
the evaluation of the Boolean expression that determines whether the object
satisfies the query.
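As an illustration, query Q1's nested predicate can be evaluated without any index by following the path expression object by object. The sketch below uses a hypothetical in-memory layout of the schema of Figure 1.1; data values are invented for the example.

```python
class Obj:
    """Minimal stand-in for a database object: attributes via keywords."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

# Hypothetical instances: publishers, books, and authors linked by references.
kluwer = Obj(name="Kluwer")
aw = Obj(name="Addison-Wesley")
b1 = Obj(title="B1", publisher=aw)
b2 = Obj(title="B2", publisher=kluwer)
a1 = Obj(name="A1", books=[b1])
a2 = Obj(name="A2", books=[b1, b2])
authors = [a1, a2]

def satisfies(author):
    # Nested predicate Author.books.publisher.name = "Kluwer":
    # books is multi-valued, so the predicate holds if ANY book qualifies.
    return any(book.publisher.name == "Kluwer" for book in author.books)

result = [a.name for a in authors if satisfies(a)]
assert result == ["A2"]
```

Each dot in the path expression becomes one object dereference; the indexing techniques discussed below exist precisely to avoid this per-object traversal on large extents.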
A distinction often made in object-oriented query languages is between implicit join (also called functional join), deriving from the hierarchical nesting
of objects, and explicit join, similar to the relational join, where two objects are
explicitly compared on the values of their attributes. Note that some query lan-
guages only support implicit joins. The motivation for this limitation is based
on the argument that in relational systems joins are mostly used to recompose entities that were decomposed for normalization [Bretl et al., 1989] and
to support relationships among entities. In object-oriented data models there
is no need to normalize objects, since these models directly support complex
objects and multivalued attributes. Moreover, relationships among entities are
supported through object references; thus the same function that joins provide
in the relational model to support relationships is provided more naturally by
path-expressions. It therefore appears that in OODBMSs there is no strong
need for explicit joins, especially if path-expressions are provided. An example
of a path-expression (or simply path) is "Book.publisher.name" denoting the
nested attribute "publisher.name" of class Book. The evaluation of a query
with nested predicates may require the traversal of objects along aggregation
graphs [Bertino, 1990, Jenq et al., 1990, Kim et al., 1988, Graefe, 1993, Straube
and Ozsu, 1995]. Because in OODBMSs most joins are implicit joins along ag-
gregation graphs, it is possible to exploit this fact by defining techniques that
precompute implicit joins. We discuss these techniques in Section 1.2.
In order to discuss the various index organizations, we need to summarize
some topics concerning query processing and execution strategies. A query can
be conveniently represented by a query graph [Kim et al., 1989]. The query execution strategies vary along two dimensions. The first dimension concerns the
strategy used to traverse the query graph. Two basic class traversal strategies
can be devised:
• Forward traversal: the first class visited is the target class of the query (root
of the query graph). The remaining classes are traversed starting from the
target class in any depth-first order. The forward traversal strategy for query
Ql is (Author Book Publisher).
• Reverse traversal: the traversal of the query graph begins at the leaves and
proceeds bottom-up along the graph. The reverse traversal strategy for
query Ql is (Publisher Book Author).
The second dimension concerns the technique used to retrieve instances of
the classes that are traversed for evaluating the query. There are two ba-
sic strategies for retrieving data from a visited class. The first strategy, called
nested-loop, consists of instantiating separately each qualified instance of a class.
The instance attributes are examined for qualification, if there are simple pred-
icates on the instance attributes. If the instance qualifies, it is passed to its
parent node (in the case of reverse traversal) or to its child node (in case of
forward traversal). The second strategy, called sort-domain, consists of instan-
tiating all qualified instances of a class at once. Then all qualifying instances
are passed to their parent or child node (depending on the traversal strategy
used). The combination of the graph traversal strategies with instance retrieval
strategies results in different query execution strategies. We refer the reader
to [Bertino, 1990, Graefe, 1993, Jenq et al., 1990, Kim et al., 1988, Straube
and Ozsu, 1995] for details on query processing strategies for object-oriented
databases.
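The two graph-traversal strategies can be contrasted on query Q1. The sketch below, over a hypothetical in-memory layout, evaluates the same predicate forward (starting from the target class Author) and in reverse (starting from the qualifying publishers and climbing back through Book to Author via precomputed reverse maps).

```python
class Obj:
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

# Hypothetical instances mirroring the schema of Figure 1.1.
kluwer = Obj(name="Kluwer")
aw = Obj(name="Addison-Wesley")
b1 = Obj(publisher=aw)
b2 = Obj(publisher=kluwer)
a1 = Obj(name="A1", books=[b1])
a2 = Obj(name="A2", books=[b1, b2])
authors = [a1, a2]

def forward(authors):
    # Forward traversal: Author -> Book -> Publisher, descending from
    # the target class and testing the leaf predicate at the end.
    return {a.name for a in authors
            if any(b.publisher.name == "Kluwer" for b in a.books)}

def reverse(authors):
    # Reverse traversal: evaluate the leaf predicate (Publisher.name)
    # first, then climb back up Book -> Author using reverse maps.
    books_of = {}                        # book id -> authors referencing it
    for a in authors:
        for b in a.books:
            books_of.setdefault(id(b), set()).add(a.name)
    qualifying = [b for a in authors for b in a.books
                  if b.publisher.name == "Kluwer"]
    return {name for b in qualifying for name in books_of[id(b)]}

assert forward(authors) == reverse(authors) == {"A2"}
```

Both strategies return the same answer; they differ in which classes are instantiated first and hence in cost, which is exactly what the index organizations below aim to exploit.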
1.2 Index organizations for aggregation graphs
In this section, we first present some preliminary definitions. We then present
a number of indexing techniques that support efficient executions of implicit
joins along aggregation graphs. Therefore, these indexing techniques can be
used to efficiently implement class traversal strategies.
Definition. Given an aggregation graph H, a path P is defined as C1.A1.A2.....An (n ≥ 1) where:
• C1 is a class in H;
• A1 is an attribute of class C1;
• Ai is an attribute of a class Ci in H, such that Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n;
len(P) = n denotes the length of the path;
class(P) = {C1} ∪ {Ci | Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n} denotes the set of the classes along the path;
dom(P) denotes the class domain of attribute An of class Cn;
two classes Ci and Ci+1, 1 ≤ i ≤ n − 1, are called neighbor classes in the path.
□
A path is simply a branch in a given aggregation graph. Examples of paths
in the database schema in Figure 1.1 are:
• P1: Author.books.publisher.name
len(P1)=3, class(P1)={Author, Book, Publisher}, dom(P1)=string
• P2: Book.year
len(P2)=1, class(P2)={Book}, dom(P2)=integer
• P3: Organization.staff.books.publisher.name
len(P3)=4, class(P3)={Organization, Author, Book, Publisher}, dom(P3)=string
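Under this definition, len(P), class(P), and dom(P) can be computed mechanically from a schema description. A small sketch follows; the dictionary encoding of the schema of Figure 1.1 is a hypothetical convenience, not a structure from the text.

```python
# Hypothetical encoding of (part of) the schema in Figure 1.1:
# class -> {attribute: domain class or primitive type name}.
schema = {
    "Organization": {"staff": "Author"},
    "Author": {"name": "string", "books": "Book"},
    "Book": {"title": "string", "year": "integer", "publisher": "Publisher"},
    "Publisher": {"name": "string"},
}

def path_properties(path):
    """Return (len(P), class(P), dom(P)) for a dotted path string."""
    c1, *attrs = path.split(".")
    classes, current = [c1], c1
    for attr in attrs:
        current = schema[current][attr]   # domain of this attribute
        if current in schema:             # only classes belong to class(P)
            classes.append(current)
    return len(attrs), set(classes), current

assert path_properties("Author.books.publisher.name") == \
    (3, {"Author", "Book", "Publisher"}, "string")
assert path_properties("Book.year") == (1, {"Book"}, "integer")
```

The three example paths P1, P2, and P3 above all check out against this function, which is a direct transcription of the definition.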
The concept of path is closely associated with that of path instantiation. A
path instantiation is a sequence of objects found by instantiating a given path.
The objects in Figure 1.2 are instances of the classes shown in Figure 1.1. The
following are example instantiations of the path P3:
• PI1 = O[1].A[4].B[1].P[2].Addison-Wesley
(PI1 is shown in Figure 1.2 by arrows connecting the instances in PI1)
• PI2 = O[2].A[3].B[2].P[4].Kluwer
• PI3 = O[2].A[3].B[3].P[4].Kluwer
Figure 1.2. Instances of classes of the database schema in Figure 1.1.
The above path instantiations are all complete, that is, they start with an instance belonging to the first class of path P3 (that is, Organization), contain an instance for each class found along the path, and end with an instance of the class domain of the path (Publisher.name). Besides the complete instantiations, a path may also have partial instantiations. For example,
A[2].B[4].P[2].Addison-Wesley is a left-partial instantiation, that is, its first
component is not an instance of the first class of the path (Organization in the
example), but rather an instance of a class following the first class along the
path (Author in the example).
Similarly, a right-partial instantiation of a path ends with an object which
is not an instance of the class domain of the path. In other words, a right-
partial instantiation is such that the last object in the instantiation contains
a null value for the attribute referenced in the path. O[4] is a right-partial instantiation of path P3.
The last relevant concept we introduce here is the concept of indexing graph.
The concept of indexing graphs (IG) was introduced in [Shidlovsky and Bertino,
1996] as an abstract representation of a set of indexes allocated along a path
P. Given a path P = C1.A1.A2.....An, an indexing graph contains n + 1 vertices, one for each class Ci in the path plus an additional vertex denoting the class domain Cn.An² of the path, and a set of directed arcs. A directed arc
from vertex Ci to vertex Cj indicates that the indexing organization supports
a direct association between each instance of Ci and instances of Cj obtained
by traversing the path from the instance of Ci to class Cj. Note that if Ci and
Cj are neighbor classes, the indexing organization materializes an implicit join
between the classes.
1.2.1 Basic techniques
Multi-index
This organization was the first proposed for indexing aggregation graphs. It is
based on allocating a B+-tree index on each class traversed by the path. Therefore, given a path P = C1.A1.A2.....An, a multi-index [Maier and Stein, 1986] is defined as a set of n simple indexes (called index components) I1, I2, ..., In, where Ii is an index defined on Ci.Ai, 1 ≤ i ≤ n. All indexes I1, I2, ..., In-1 are identity indexes, that is, they have OIDs as key values. Only the comparison operators == (identical to) and ∼∼ (not identical to) are supported on an identity index. The last index In can be either an identity index or an equality index, depending on the domain of An. An equality index is a regular index, like the ones used in relational DBMSs, whose key values are primitive objects, such as numbers or characters. An equality index supports comparison operators such as = (equal to), ∼ (different from), <, ≤, >, ≥.
As an example consider path P1=Author.books.publisher.name. There will
be three indexes allocated for this path, as illustrated in Figure 1.3. In the
figure, each index is represented in a tabular form. An index entry is represented
as a row in the table. The first element of such a row is a key-value (given
in boldface), and the second element is the set of OIDs of objects holding this key-value for the indexed attribute. The first index, I1, is allocated on Author.books; similarly, indexes I2 and I3 are allocated on Book.publisher and Publisher.name, respectively.
Note that in the first index (I1) the special key-value Null is used to record
a right-partial instantiation. Therefore, the multi-index allows determining all
path instantiations having null values for some attributes along the path. By
contrast, determining left-partial instantiations does not require any special
key-value.
I1:
B[1]  A[4]
B[2]  A[3]
B[3]  A[3]
B[4]  A[2]
Null  A[4]

I2:
P[1]  Null
P[2]  B[1], B[4]
P[4]  B[2], B[3]

I3:
Academic Press  P[1]
Addison-Wesley  P[2]
Elsevier  P[3]
Kluwer  P[4]
Microsoft  P[5]

Figure 1.3. Multi-index for path P1 = Author.books.publisher.name.
Under this organization, solving a nested predicate requires scanning a num-
ber of indexes equal to the path length. For example, to select all authors whose
books were published by Kluwer (query Q1), the following steps are executed:
1. A look-up of index I3 with key-value "Kluwer"; the result is {P[4]}.
2. A look-up of index I2 with key-value P[4]; the result is {B[2], B[3]}.
3. A look-up of index I1 with key-values B[2] and B[3]; the result is {A[3]},
which is the result of the query.
Under this organization, the retrieval operation thus starts by scanning the
last index allocated on the path. The results of this index lookup are then
used as keys for a search on the index preceding the last one in the path, and
so forth until the first index is scanned. Therefore, this organization only
supports reverse traversal strategies. Its major advantage, compared to others
we describe later on, is the low update cost.
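The reverse-traversal look-up steps above can be sketched with ordinary dictionaries standing in for the three B+-tree components, using the data of Figure 1.3 (the function name is ours, not the authors'):

```python
# Each index component maps a key value to the set of OIDs holding
# that value (cf. Figure 1.3); dicts stand in for B+-trees.
I3 = {"Academic Press": {"P[1]"}, "Addison-Wesley": {"P[2]"},
      "Elsevier": {"P[3]"}, "Kluwer": {"P[4]"}, "Microsoft": {"P[5]"}}
I2 = {"P[2]": {"B[1]", "B[4]"}, "P[4]": {"B[2]", "B[3]"}}
I1 = {"B[1]": {"A[4]"}, "B[2]": {"A[3]"}, "B[3]": {"A[3]"},
      "B[4]": {"A[2]"}}

def reverse_traverse(key, indexes):
    """Scan the components from the last to the first, feeding each
    result set in as the key set of the preceding index."""
    oids = {key}
    for index in reversed(indexes):
        oids = set().union(*(index.get(k, set()) for k in oids))
    return oids

# Query Q1: authors of books published by Kluwer.
print(reverse_traverse("Kluwer", [I1, I2, I3]))  # {'A[3]'}
```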
The indexing graph for the multi-index is as follows. Let P be a path of
length n. The graph contains an arc from class Ci+1 to class Ci, for i =
1, ..., n. The IG for P3 = Organization.staff.books.publisher.name is shown in
Figure 1.4.a.
Join index
The notion of join index was introduced to efficiently perform joins in relational
databases [Valduriez, 1987]. However, the join index has also been used to
efficiently implement complex objects. A binary equijoin index is defined as
follows:
Given two relations R and S and attributes A and B, respectively from R
and S, a binary equijoin index is the set

BJI = { (ri, sk) | tuple ri.A = tuple sk.B }

where

• ri (sk) denotes the surrogate of a tuple of R (S);
• tuple ri (tuple sk) refers to the tuple having ri (sk) as surrogate.

OBJECT-ORIENTED DATABASES 11

Figure 1.4. Indexing graphs: a) multi-index; b) join indexes; c) nested index; d) path index; e) access support relation.
A BJI is implemented as a binary relation and two copies may be kept,
one clustered on ri and the other on sk; each copy is implemented as a B+-
tree. In aggregation graphs, a sequence of BJIs can be used in a multi-index
organization to implement the various index components along a given path.
We refer to such a sequence of join indexes as a JI organization. Consider path
P1 = Author.books.publisher.name. The join indexes allocated for this path are
listed below. They are illustrated together with some example index entries in
Figure 1.5.
• The first join index BJI1 is on Author.books. The copy denoted as BJI1(a)
in Figure 1.5 is clustered on OIDs of instances of Author, whereas the copy
denoted as BJI1(b) is clustered on OIDs of instances of Book.
• The second join index BJI2 is on Book.publisher. The copy denoted as BJI2(a)
in Figure 1.5 is clustered on OIDs of instances of Book, whereas the copy
denoted as BJI2(b) is clustered on OIDs of instances of Publisher.
• The third join index BJI3 is on the attribute Publisher.name. The copy de-
noted as BJI3(a) in Figure 1.5 is clustered on OIDs of instances of Publisher,
BJI1(a)
A[2]  B[4]
A[3]  B[2]
A[3]  B[3]
A[4]  B[1]

BJI2(a)
B[1]  P[2]
B[2]  P[4]
B[3]  P[4]
B[4]  P[2]

BJI3(a)
P[1]  Academic Press
P[2]  Addison-Wesley
P[3]  Elsevier
P[4]  Kluwer
P[5]  Microsoft

BJI1(b)
B[1]  A[4]
B[2]  A[3]
B[3]  A[3]
B[4]  A[2]

BJI2(b)
P[2]  B[1]
P[2]  B[4]
P[4]  B[2]
P[4]  B[3]

BJI3(b)
Academic Press  P[1]
Addison-Wesley  P[2]
Elsevier        P[3]
Kluwer          P[4]
Microsoft       P[5]

Figure 1.5. JI organization for path P1 = Author.books.publisher.name.
whereas the copy denoted as BJI3(b) is clustered on values of attribute
"name".
A JI organization supports both forward and reverse traversal strategies
when both copies are allocated for each join index. Reverse traversal is suitable
for solving queries such as query Q1 ("Retrieve the authors of books published
by Kluwer."). Forward traversal arises when given an object, all objects must be
determined that are referenced directly or indirectly by this object. An example
is the query "Determine the publishers of the books written by author A[3]".
Reverse traversal is already supported by the multi-index. However, that
technique does not support forward traversal, which must, therefore, be executed
by directly accessing the objects. The use of a sequence of JIs may make forward
traversal faster when object accesses are expensive (for example, very large
objects or non-optimal clustering). Moreover, forward traversal supported by a
sequence of JIs may be useful in complex queries when objects at the beginning
of the path have already been selected as the effect of another predicate in the
query. An example of a more complex query is "Select all books written by an
author from AT&T Lab". Suppose that an index is allocated on attribute
"Organization.name" and moreover a JI organization is allocated on the path
P=Organization.staff.books. A possible query strategy could be to first select
the OID of the organization named "AT&T Lab" using the index on attribute
"Organization.name", and then use the JI organization in forward traversal to
determine the books written by authors of the organization O[1] selected by
the first index scan.
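As a sketch of how the two clustered copies serve both directions, the forward and reverse traversals can be modeled with the entries of Figure 1.5 (dictionaries stand in for the B+-tree copies; the helper name is ours):

```python
# Forward copies (clustered on the first class of each pair).
BJI1_a = {"A[2]": {"B[4]"}, "A[3]": {"B[2]", "B[3]"}, "A[4]": {"B[1]"}}
BJI2_a = {"B[1]": {"P[2]"}, "B[2]": {"P[4]"}, "B[3]": {"P[4]"}, "B[4]": {"P[2]"}}
# Reverse copies (clustered on the second class).
BJI2_b = {"P[2]": {"B[1]", "B[4]"}, "P[4]": {"B[2]", "B[3]"}}
BJI1_b = {"B[1]": {"A[4]"}, "B[2]": {"A[3]"}, "B[3]": {"A[3]"}, "B[4]": {"A[2]"}}

def traverse(start_oids, copies):
    """Follow a chain of join-index copies in the given order."""
    oids = set(start_oids)
    for copy in copies:
        oids = set().union(*(copy.get(k, set()) for k in oids))
    return oids

# Forward: publishers of the books written by author A[3].
print(traverse({"A[3]"}, [BJI1_a, BJI2_a]))   # {'P[4]'}
# Reverse: authors of the books published by P[4] (Kluwer).
print(traverse({"P[4]"}, [BJI2_b, BJI1_b]))   # {'A[3]'}
```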
The IG for a JI organization along a path P is constructed as follows. For
each pair of neighbor classes Ci and Ci+1 along path P, the graph contains two
arcs (Ci, Ci+1) and (Ci+1, Ci). The former arc corresponds to the copy of the
binary join index between Ci and Ci+1 clustered on class Ci, while the latter
arc corresponds to the copy clustered on class Ci+1. The IG for the path P3 is
presented in Figure 1.4.b.
Note that when the JI organization is used for forward traversal, the se-
quence of B+-trees searched in the traversal corresponds to a chain of arcs in
the IG. Moreover, such chain consists of left-to-right directed arcs only. By
contrast, the use of the JI organization in a reverse traversal corresponds to a
chain of arcs in the IG containing only right-to-left directed arcs.
The usage of join indexes in optimizing complex queries has been discussed
in [Valduriez, 1986]. A major conclusion is that the most complex part (that is,
the joins) of a query can be executed through join indexes, without accessing
the base data. However, there are cases when traditional indexing (selection
indexes on join attributes) is more efficient than the usage of a join index. For
example, a traditional index is more efficient than a join index when the query
simply consists of a join preceded by a highly selective selection. The major
conclusion is that join indexes are more suitable for complex queries, that is,
queries involving several joins.
The update costs for the JI organization are in general double the costs
for the multi-index organization, since in the JI organization there are
two copies of each join index. The update costs of the JI organization can,
however, be reduced by allocating a single copy for one or more join indexes in
the organization, rather than two copies. Allocating a single copy, however,
makes forward or reverse traversal more expensive, depending on which copy
is allocated, and therefore the correct allocation decision must be based on the
expected query and update patterns and frequencies.
Nested index
Both the previous organizations require accessing, when solving a nested
predicate, a number of indexes proportional to the path length. Different orga-
nizations have been proposed to reduce the number of indexes accessed. The
first of these organizations is the nested index [Bertino and Kim, 1989] provid-
ing a direct association between an object of a class at the end of a path and
the corresponding instances of the class at the beginning of the path. Consider
path P1 = Author.books.publisher.name. A nested index allocated on this path
contains as key-values names of publishers. It associates with each publisher
name the OIDs of authors that have written a book published by this publisher.
Figure 1.6 shows some example entries for a nested index allocated on path P1.
Academic Press  Null
Addison-Wesley  A[2], A[4]
Elsevier        Null
Kluwer          A[3]
Microsoft       Null

Figure 1.6. Nested index for path P1 = Author.books.publisher.name.
Retrieval under this organization is quite efficient. A query such as Q1
is solved with only one index lookup. The major problem of this indexing
technique is update operations that require access to several objects in order
to determine the index entries to be updated. For example, suppose that book
B[4] is removed from the database. To update the index, the following steps
must be executed:
1. Access object B[4] and determine the value of nested attribute "Book.pub-
lisher.name"; result: "Addison-Wesley".
2. Determine all instances of class Author having B[4] in the list of authored
books; result: {A[2]}.
3. Remove A[2] from the index entry with key-value "Addison-Wesley";
after the removal the index entry for "Addison-Wesley" is {A[4]}.
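The three update steps can be sketched as follows, with dictionaries standing in for the nested index and a toy object store (all structures and names here are illustrative, not the authors' implementation):

```python
# Nested index on P1 (cf. Figure 1.6) and a toy object store.
nested = {"Addison-Wesley": {"A[2]", "A[4]"}, "Kluwer": {"A[3]"}}
books = {"B[4]": {"publisher": "P[2]"}}
publishers = {"P[2]": {"name": "Addison-Wesley"}}
authors = {"A[2]": {"books": {"B[4]"}}, "A[4]": {"books": {"B[1]"}}}

def remove_book(bid):
    # 1. Forward traversal: value of the nested attribute for the book.
    key = publishers[books[bid]["publisher"]]["name"]
    # 2. Reverse traversal: authors holding the book in their books list.
    owners = {a for a, obj in authors.items() if bid in obj["books"]}
    # 3. Remove those OIDs from the entry for the key value.
    nested[key] -= owners

remove_book("B[4]")
print(nested["Addison-Wesley"])  # {'A[4]'}
```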
As this example shows, update operations in general require both forward
and backward traversals of objects. Forward traversal is required to determine
the value of the indexed attribute (that is, the value of the attribute at the end
of the path) for the modified object. Reverse traversal is required to determine
the instances at the beginning of the path. The OIDs of those instances will
be removed (added) to the entry associated with the key value determined
by the forward traversal. Note that reverse traversal is very expensive when
there are no reverse references among objects. In such case, the nested index
organization may not be usable.
Note that a nested index as defined above can only be used for reverse traver-
sal. However, it would be possible, as for the JI organization, to allocate two
copies of a nested index: the first having as key-values the values of attribute
An at the end of the path (examples of entries of this copy for path P1 are the
ones we have shown earlier); the second having as key-values the OIDs of the
instances of the class at the beginning of the path. Therefore, for path P1 this
second copy would have the entries illustrated in Figure 1.7.
A[l] Null
A[2] Addison-Wesley
A[3] Kluwer
A[4] Addison-Wesley
A[5] Null
Figure 1.7. A nested index for path P1 = Author.books.publisher.name clustered on OIDs
of instances of the class at the beginning of the path.
The use of the above nested index would be more efficient than forward
traversal using the objects themselves.
The IG for a nested index allocated on a path P contains only two arcs,
namely (C1, Cn+1) and (Cn+1, C1). The former arc, however, is only inserted
in the IG if the second copy of the nested index, supporting forward retrieval,
is allocated. The IG for a nested index allocated on path P3 is shown in
Figure 1.4.c.
Path index
A path index [Bertino and Kim, 1989] is based on a single index, like the nested
index. The difference is that a path index provides an association between an
object O at the end of a path and all instantiations ending with O. For a path
of length n, the leaf-node records of a path index contain the instantiations,
implemented as records of n components. Example index entries for path P3
are given in Figure 1.8.
Note that a path index records, in addition to complete instantiations, left-
partial and right-partial instantiations. Unlike the nested index, a path index
can be used to solve nested predicates against all classes along the path. For
example, the path index on P1 can be used to determine all authors of books
published by Kluwer, or simply to find the books published by Kluwer.
Publisher.name   Path instantiations
Academic Press   Null
Addison-Wesley   O[1].A[4].B[1].P[2], A[2].B[4].P[2]
Elsevier         Null
Kluwer           O[2].A[3].B[2].P[4], O[2].A[3].B[3].P[4]
Null             O[4]

Figure 1.8. Path index for path P3 = Organization.staff.books.publisher.name.
This feature is also very useful when dealing with complex queries. It
supports a special kind of projection, called projection on path instantiation
[Bertino and Guglielmina, 1991, Bertino and Guglielmina, 1993]. This oper-
ation allows retrieving OIDs of several classes along the path with a single
index lookup. For example, suppose we wish to determine all authors who
have their books published by Kluwer in 1991. This query can be solved by
first performing an index lookup with key-value "Kluwer" and then per-
forming a projection on positions of classes Author (pos=1) and Book (pos=2)
on the selected index entries. That is, the first and second elements of each
path instantiation verifying the nested predicate are extracted from the index.
Therefore, the results of this projection in the above example are: {(A[3], B[2]),
(A[3], B[3])}. Then the second element of each pair is extracted. The corre-
sponding object is accessed and the predicate on attribute "year" is evaluated.
If this predicate is satisfied, the first element of the pair is returned as query
result. For example, given the two pairs above, instances B[2] and B[3] of
class Book would be accessed to verify whether the value of attribute "year" is
1991. Since only B[3] verifies the predicate, A[3] is returned as the query result.
An analysis of query processing strategies using this operation is presented in
[Bertino and Guglielmina, 1993].
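Projection on path instantiations for this query can be sketched with a path index on P1, whose instantiations have the form Author.Book.Publisher (the "year" values of the Book objects are assumed here so that, as in the text, only B[3] satisfies the predicate):

```python
# Path index on P1: key value -> path instantiations Author.Book.Publisher.
path_index = {
    "Kluwer": [("A[3]", "B[2]", "P[4]"), ("A[3]", "B[3]", "P[4]")],
}
# Assumed "year" attribute of the Book objects (only B[3] is from 1991).
book_year = {"B[2]": 1989, "B[3]": 1991}

def authors_published_by_in(publisher_name, year):
    # Index lookup, then projection on Author (pos=1) and Book (pos=2).
    pairs = {(inst[0], inst[1]) for inst in path_index.get(publisher_name, [])}
    # Access each projected book and evaluate the residual predicate on "year".
    return {author for author, book in pairs if book_year.get(book) == year}

print(authors_published_by_in("Kluwer", 1991))  # {'A[3]'}
```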
Updates on a path index are expensive, since forward traversals are required,
as in the case of the nested index. However, no reverse traversals are required.
Therefore, the path index organization can be used even when no reverse ref-
erences among objects on the path are present.
The IG for a path index allocated on a path P contains n arcs, namely
(Cn+1, Ci) for all i in the range 1, ..., n. The IG for a path index allocated
on path P3 is shown in Figure 1.4.d.
Access support relation (ASR)
This approach is very similar to the path index in that it involves calculating
all instantiations along a path and storing them in a relation. Given a path
P = C1.A1.A2. ... .An, all path instantiations are stored as records in an (n+1)-
ary relation. The ith attribute of that relation corresponds to the class Ci. Also,
both complete and partial instantiations are represented in the table. Example
index entries for path P3 are given in Figure 1.9. Two B+-trees are allocated
on the first and last attributes (classes C1 and Cn+1) of the access relation for
accelerating forward and reverse traversals. Like the path index, the ASR has
a low retrieval cost and a quite high update cost.
Org    Author  Book   Publisher  Publisher.name
O[1]   A[4]    B[1]   P[2]       Addison-Wesley
O[2]   A[3]    B[2]   P[4]       Kluwer
O[2]   A[3]    B[3]   P[4]       Kluwer
O[4]   Null    Null   Null       Null
Null   A[2]    B[4]   P[2]       Addison-Wesley
Null   Null    Null   P[1]       Academic Press
Null   Null    Null   P[3]       Elsevier

Figure 1.9. Access support relation for path P3 = Organization.staff.books.publisher.name.
In the IG for an ASR allocated on a path P, any vertex for class Ci, i =
2, ..., n − 1, has two incoming arcs (C1, Ci) and (Cn.An, Ci). Figure 1.4.e
presents the indexing graph for the ASR for path P3. It contains arcs outgoing
from the first and last classes in the path, on which the two B+-trees are allo-
cated.
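A minimal sketch of the ASR of Figure 1.9 as an (n+1)-ary relation, with simple scans standing in for the two B+-tree lookups (None plays the role of Null; the variable names are ours):

```python
# Access support relation for P3: one tuple per (possibly partial)
# path instantiation, as in Figure 1.9.
asr = [
    ("O[1]", "A[4]", "B[1]", "P[2]", "Addison-Wesley"),
    ("O[2]", "A[3]", "B[2]", "P[4]", "Kluwer"),
    ("O[2]", "A[3]", "B[3]", "P[4]", "Kluwer"),
    ("O[4]", None, None, None, None),
    (None, "A[2]", "B[4]", "P[2]", "Addison-Wesley"),
]

# Reverse traversal (B+-tree on the last attribute): organizations
# with staff whose books were published by Kluwer.
orgs = {t[0] for t in asr if t[4] == "Kluwer" and t[0] is not None}
print(orgs)  # {'O[2]'}

# Forward traversal (B+-tree on the first attribute): publisher names
# reachable from organization O[1].
names = {t[4] for t in asr if t[0] == "O[1]"}
print(names)  # {'Addison-Wesley'}
```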
Comparison
A comparison among three of the basic indexing techniques, namely multi-
index, nested index and path index, has been presented in [Bertino and Kim,
1989]. An important parameter in the evaluations is represented by the degree
of reference sharing. Two objects share a reference if they reference the same
object as value of an attribute. Therefore, this degree models the topology of
references among objects. A more accurate model of reference topology was
developed in [Bertino and Foscoli, 1995].
The main results of the comparison can be summarized as follows. For re-
trieval the nested index has the lowest cost as expected, and the path index
has lower cost than the multi-index. The nested index has a better perfor-
mance than the path index for retrieval, because a path index contains OIDs
of instances of all classes along the path, while the nested index contains OIDs
of instances of only the first class in the path. However, a single path index
allows predicates to be solved for all classes along the path, while the nested
index does not. For updates, the multi-index has the lowest cost. The nested
index has a slightly lower cost than the path index for path length 2. For paths
longer than 2, the nested index has a slightly lower cost than the path index if
updates are on the first two classes of the path; otherwise the nested index has
a significantly higher cost than the path index. Note, however, that the update
costs for the nested index are computed under the hypothesis that there are
reverse references among objects. When there are no reverse references, update
operations for the nested index become much more expensive.
1.2.2 Advanced index organizations
Each of the basic organizations described in the previous subsection is biased
towards a specific kind of operation (retrieval or update). No organization
supports equally well retrieval and update operations. In this subsection, we
present some advanced approaches which are characterized by a customization
component. Such a component allows tailoring the organizations with respect to
specific query and update patterns and frequencies. The customization requires
determining an index configuration which is optimal for a given set of operations
along the indexed path.
Path splitting
The path splitting approach [Bertino, 1994, Choenni et al., 1994] overcomes the
problem of biased performance of the three basic techniques, namely high update
costs in the nested and path index and high retrieval costs in the multi-index.
The approach is based on splitting a path into several shorter subpaths, and
allocating on each subpath one among the following basic organizations: multi-
index, nested index, path index. For example, path P3=Organization.staff.
books.publisher.name could be split into two subpaths:
• P31=Organization.staff.books with a multi-index allocated
• P32=Book.publisher.name with a path index allocated.
An algorithm determining optimal configurations for paths has been devel-
oped [Bertino, 1994]. The algorithm takes as input the frequency of retrieval,
insert, and delete operations for classes along the path. Moreover, it takes into
account whether reverse references exist among objects as well as all data logi-
cal and physical characteristics. The algorithm determines the optimal splitting
of a path into subpaths, and the organization to use for each subpath. The al-
gorithm also considers, for each subpath, the choice of allocating no index. An
interesting result obtained by running the algorithm is that when the degrees
of reference sharing along a path are very low (that is, close to 1) and reverse
references are allocated among objects, the best index configuration consists of
allocating no index on the path.
The overall index configuration obtained according to the path splitting
approach can be simply represented by an IG. As an example, consider the IG
for the configuration of path P3 consisting of subpaths P31 with a multi-index
allocated, and P32 with a path index allocated, shown in Figure 1.10.a.
Figure 1.10. Indexing graphs for advanced techniques: a) path splitting; b) ASR decomposition; c) join index hierarchy.
ASR decomposition
Under the ASR organization one table is maintained for all instantiations along
the path. Similarly to the path splitting approach, a path may be decomposed
and different access relations allocated for each subpath. Even though [Kemper
and Moerkotte, 1992] proves some properties of the ASR decomposition, it does
not provide any criteria or algorithm for "optimal" partitioning.
Figure 1.10.b shows the IG corresponding to a case where the ASR allocated
on path P3 is decomposed into two partitions.
Join index hierarchy
This is another approach based on the join index [Valduriez, 1987]. A complete
join index hierarchy (JIH) consists of basic join indexes and derived join indexes
[Xie and Han, 1994]. Basic indexes, which form the base of the JI hierarchy, are
supported for pairs of neighbor classes in a path P, whereas derived indexes are
supported for pairs of non-neighbor classes. Derived join indexes are built from
basic join indexes and, possibly, other derived join indexes. For the path P3,
Figure 1.11 shows the derived join between class Author (pos=2) and attribute
Publisher.name (pos=5).
Maintenance of the complete JI hierarchy is expensive in terms of both
storage and update costs. Therefore, a partial JI hierarchy which contains
all basic JIs and only several derived indexes seems to be more efficient for
Author Publisher.name
A[2] Addison-Wesley
A[3] Kluwer
A[4] Addison-Wesley
Figure 1.11. Derived join index between Author and Publisher.name.
most real cases. In the partial hierarchy, any derived join index needed for
executing a query but not included in the partial JI hierarchy is derived from
the indexes in the partial JI hierarchy through a sequence of join operations.
The selection of the derived JIs to be included in the partial JI hierarchy
is driven by some heuristics and metrics. As the performance tests reported
in [Xie and Han, 1994] show, a partial JI hierarchy behaves better than the
complete JI hierarchy and the ASR organization.
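Deriving a missing join index by joining existing ones can be sketched as a relational equijoin over the shared OIDs; composing the three basic indexes for the tail of P3 reproduces the derived index of Figure 1.11 (the helper names are ours):

```python
# Basic join indexes modeled as sets of surrogate pairs (cf. Figure 1.5).
bji_author_book = {("A[2]", "B[4]"), ("A[3]", "B[2]"),
                   ("A[3]", "B[3]"), ("A[4]", "B[1]")}
bji_book_pub = {("B[1]", "P[2]"), ("B[2]", "P[4]"),
                ("B[3]", "P[4]"), ("B[4]", "P[2]")}
bji_pub_name = {("P[1]", "Academic Press"), ("P[2]", "Addison-Wesley"),
                ("P[3]", "Elsevier"), ("P[4]", "Kluwer"),
                ("P[5]", "Microsoft")}

def compose(ji1, ji2):
    """Derive a join index by an equijoin on the shared middle OID."""
    return {(a, c) for a, b1 in ji1 for b2, c in ji2 if b1 == b2}

derived = compose(compose(bji_author_book, bji_book_pub), bji_pub_name)
print(sorted(derived))
# [('A[2]', 'Addison-Wesley'), ('A[3]', 'Kluwer'), ('A[4]', 'Addison-Wesley')]
```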
An IG corresponding to a partial JI hierarchy is characterized by the fol-
lowing property: if it contains an arc from class Ci to class Cj, then it contains
the arc from Cj to Ci as well. Figure 1.10.c shows the IG of a partial JI
hierarchy for path P3. Such a partial JI hierarchy supports basic join indexes
for the following pairs of neighbor classes: (Organization, Author), (Author,
Book), (Book, Publisher), (Publisher, Publisher.name). It moreover supports
an additional derived join index for the pair (Author, Publisher.name).
1.3 Index organizations for inheritance hierarchies
As we discussed in Section 1.1, an object-oriented query may apply to a class
only or to a class and all its direct and indirect subclasses. Since an attribute
of a class C is inherited by all its subclasses, a relevant issue concerns how
to efficiently evaluate a predicate against such an attribute when the scope of
the query is the inheritance hierarchy rooted at C. In this section we discuss
indexing techniques addressing such an issue. The various approaches are an-
alyzed with respect to storage overhead, update and retrieval costs. Retrieval
costs, in particular, depend on whether the query is a point query or a range
query. In a B+-tree index, a point query retrieves one leaf node only; the query
predicate is usually an equality predicate. By contrast, a range query specifies
an interval (or a set) of values for the search key and may require retrieving
several leaf nodes.
Consider an attribute A defined in a class C and inherited by all its sub-
classes. A query against attribute A is a single-class query (SC-query) if the
query scope consists of only one class from the inheritance hierarchy rooted at
C. Otherwise, the query is a class-hierarchy query (CH-query) and its scope
Book
1986  B[2]
1990  B[4]
1991  B[1], B[3]

Manual
1990  M[1]
1993  M[2]

Handbook
1990  H[1]

Figure 1.12. SC-index organization for the inheritance hierarchy rooted at class Book.
includes a subhierarchy of the inheritance hierarchy, that is, some class in the
hierarchy with all its subclasses. A CH-query is a rooted CH-query if the root
of the subhierarchy in the scope coincides with the root class C. Otherwise,
the query is a partial CH-query.
Consider the database schema shown in Figure 1.1. Consider the inheritance
hierarchy rooted at class Book and queries against its attribute "year" which
is inherited by classes Manual and Handbook. An example of SC-query is the
query which retrieves instances of one of the classes in the hierarchy (Book,
Manual or Handbook). The query against the attribute "year" which retrieves
instances of all the three classes is a rooted CH-query. If the class Manual
had a subclass called Manual_on_CD, then a query with classes Manual and
Manual_on_CD in the scope would be a partial CH-query.
SC-index and CH-tree
The inheritance hierarchy indexing problem was first addressed in [Kim et al.,
1989] where two possible approaches are proposed. The first approach, called
single-class index (SC-index), is based on maintaining a separate B+-tree on
the indexed attribute for each class in the inheritance hierarchy. Therefore, if
the inheritance hierarchy has m classes, the SC-index requires m B+-trees.
As an example, consider the inheritance hierarchy rooted at class Book in
Figure 1.1. If the attribute "year" is frequently referred to in queries against this
hierarchy, the SC-index approach requires building three indexes, one for each
class in the hierarchy, namely Book, Manual and Handbook. The evaluation of
a predicate against the attribute "year" would then require scanning the three
indexes and performing the union of the results. The three indexes against
the attribute "year" for the classes in the inheritance hierarchy rooted at class
Book are shown in Figure 1.12.
This approach is very efficient for SC-queries. However, it is not optimal for
CH-queries, because it requires scanning all the indexes allocated on the classes
in the queried inheritance hierarchy.
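The SC-index evaluation of a CH-query can be sketched with one dictionary per class standing in for the per-class B+-trees of Figure 1.12 (the function name is ours; the Manual entries follow Figure 1.13):

```python
# One B+-tree per class, modeled as a dictionary (cf. Figure 1.12).
sc_index = {
    "Book":     {1986: {"B[2]"}, 1990: {"B[4]"}, 1991: {"B[1]", "B[3]"}},
    "Manual":   {1990: {"M[1]"}, 1993: {"M[2]"}},
    "Handbook": {1990: {"H[1]"}},
}

def ch_query(scope, year):
    """Scan the index of every class in the scope and union the results."""
    return set().union(*(sc_index[c].get(year, set()) for c in scope))

print(ch_query(["Book"], 1990))                       # SC-query: {'B[4]'}
print(sorted(ch_query(["Book", "Manual", "Handbook"], 1990)))
# rooted CH-query: ['B[4]', 'H[1]', 'M[1]']
```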
The second approach, called class-hierarchy index (CH-tree), is based on
maintaining a unique B+-tree for all classes in the hierarchy. An index entry
in a leaf node may thus contain the OIDs of instances of any class in the
Book, Manual, Handbook
1986  (Book, {B[2]})
1990  (Book, {B[4]}), (Manual, {M[1]}), (Handbook, {H[1]})
1991  (Book, {B[1], B[3]})
1993  (Manual, {M[2]})

Figure 1.13. Entries of the CH-tree for the inheritance hierarchy rooted at class Book.
indexed inheritance hierarchy. A CH-tree allocated on the attribute "year" for
the inheritance hierarchy rooted at class Book is shown in Figure 1.13. Note,
from the figure, that the entry with key value 1990 contains three sets
of OIDs. The first set contains the OIDs of the instances of Book (B[4] in the
example), whereas the second and third sets contain OIDs of manuals (M[1])
and handbooks (H[1]), respectively. Generally, a leaf node in a CH-tree consists
of a key-value, a key-directory, and, for each class in the inheritance hierarchy,
the number of elements in the list of OIDs for instances of this class that hold
the key-value in the indexed attribute, and the list of OIDs. The key-directory
contains an entry for each class that has instances with the key-value in the
indexed attribute. An entry for a class consists of the class identifier and the
offset in the index record where the list of OIDs for the class is located.
Under the CH-tree organization, a SC-query is evaluated as follows. Let C
be the class against which the query is issued. The index is scanned to find the
leaf-node record with the key-value satisfying the query predicate. Then the
key-directory is accessed to determine the offset in the index record where the
list of OIDs of instances of C is located. If there is no entry for class C, then
there are no instances of C satisfying the predicate. A CH-query is processed
in the same way, except that the lookup in the key-directory is executed for
each class involved in the query.
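Leaf-record lookup in a CH-tree can be sketched with the key-directory modeled as a per-key dictionary over the entries of Figure 1.13 (offsets into the index record are abstracted away; names are ours):

```python
# One CH-tree leaf record per key value; its key-directory maps each
# class holding the value to that class's OID list (cf. Figure 1.13).
ch_tree = {
    1986: {"Book": {"B[2]"}},
    1990: {"Book": {"B[4]"}, "Manual": {"M[1]"}, "Handbook": {"H[1]"}},
    1991: {"Book": {"B[1]", "B[3]"}},
    1993: {"Manual": {"M[2]"}},
}

def lookup(year, scope):
    """SC- and CH-queries differ only in how many classes are probed
    in the key-directory of the matching leaf record."""
    directory = ch_tree.get(year, {})
    return set().union(*(directory.get(c, set()) for c in scope))

print(lookup(1990, ["Manual"]))                   # SC-query: {'M[1]'}
print(sorted(lookup(1990, ["Book", "Manual", "Handbook"])))
# CH-query: ['B[4]', 'H[1]', 'M[1]']
```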
In general, the performance of the CH-tree has an inverse trend with respect
to the SC-index. The CH-tree is more efficient for queries whose access scope
involves all classes (or a significant subset of the classes) in the indexed in-
heritance hierarchy, whereas a SC-index is effective for queries against a single
class. By contrast, the CH-tree retrieves many unnecessary leaf node pages
when the query applies to a single class only.
Results of an extensive evaluation of the two indexing techniques have been
reported in [Kim et al., 1989]. An important parameter in the evaluation is
the distribution of key values across the classes in the inheritance hierarchy. In
general, if each key value is taken by instances of only one class C (that is, dis-
joint distribution), the CH-tree is less efficient than the SC-index. Conversely,
if each key value is taken by instances of several classes, the CH-tree performs
better. Also, the update cost for the CH-tree is higher than for the SC-index,
because a B+-tree on a single class is expected to be much smaller than a
single index on the entire hierarchy.
H-tree
The skewed performance of the SC-index and CH-tree for SC- and CH-queries
led to more attempts to overcome the problem. The H-tree [Low et al., 1992]
is a variant of the SC-index which aims at improving the performance of the
SC-index for CH-queries. Like the SC-index, a separate B+-tree is maintained
on the indexed attribute for each class in the inheritance hierarchy. However,
unlike the SC-index, in the H-tree the B+-trees are linked based on their class-
subclass relationships by pointers in the internal nodes of the B+-tree. For
each pair of classes C and C' in the inheritance hierarchy, such that class C'
is a direct subclass of C, a set of additional pointers are maintained from the
internal nodes of the B+-tree allocated on class C to internal nodes in the B+-
tree allocated on class C'. The pointers connect internal-node separators for
the same values of the indexed attribute. Figure 1.14 shows a fragment of an H-tree
allocated on the inheritance hierarchy rooted at class Book which indexes the
"year" attribute.
Figure 1.14. Fragment of the H-tree organization for the inheritance hierarchy rooted at class Book.
To execute a CH-query, the H-tree performs a complete scan on the B+-
tree allocated on the query class, followed by a partial search on each of the
B+-trees allocated on the other classes in the subhierarchy rooted at the query
class. The partial search is performed by following the additional pointers from
the B+-tree allocated on the root class of the queried inheritance hierarchy
to the B+-trees of its subclasses. Unfortunately, the usage of
those additional pointers solves the problem of low performance only partially.
Although the H-tree reduces the number of accesses to the B+-tree internal
nodes, it still requires accessing more leaf node pages than those accessed under
the SC-index organization. Moreover, the reduced query cost is achieved at the
expense of the additional storage overhead for the pointers between B+-trees.
As a consequence, the update cost in the H-tree is higher than in the SC-index.
CG-tree
The CG-tree [Kilger and Moerkotte, 1994] enhances the H-tree by collecting
all pointers between different classes' indexes in special nodes which form one
additional level located just above the leaf-node level of the B+-trees.
Given an inheritance hierarchy of m classes, the CG-tree maintains m B+-
trees, one for each class. In each B+-tree, an additional level between the
internal and leaf nodes is included. Each node at this level contains a vector
of m elements (called class directory) of leaf node references. There is one
element in the array for each class in the indexed inheritance hierarchy. The ith
component of the class directory contains a reference to the leaf node containing
those elements of the class Ci whose keys have the same key values. The position
i of class Ci is given by the preorder traverse of the inheritance hierarchy.
The CG-tree has better performance than the H-tree, as it avoids reading
unnecessary internal nodes. However, it may still require reading unnecessary
leaf nodes. Moreover, the CG-tree has a high storage overhead and update cost
because of the class directories.
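As a toy illustration of the class-directory idea, the following sketch shows how a directory node lets a query touch only the leaf references of the classes in its scope; the preorder numbering and all names are our own, not the CG-tree paper's format:

```python
# Preorder positions for the Book hierarchy (assumed numbering).
PREORDER = {"Book": 0, "Manual": 1, "Handbook": 2}

class ClassDirectory:
    def __init__(self, num_classes):
        # One slot per class; None means no objects of that class
        # carry keys in this directory node's key range.
        self.slots = [None] * num_classes

    def leaves_for(self, query_classes):
        """Return leaf references for the given classes only,
        skipping the leaves of classes outside the query scope."""
        return [self.slots[PREORDER[c]]
                for c in query_classes
                if self.slots[PREORDER[c]] is not None]

d = ClassDirectory(3)
d.slots[PREORDER["Book"]] = "leaf-7"
d.slots[PREORDER["Manual"]] = "leaf-9"
# A query against Book and Manual never touches Handbook's leaves.
print(d.leaves_for(["Book", "Manual"]))   # ['leaf-7', 'leaf-9']
```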
hcC-tree
The hcC-tree [Sreenath and Seshadri, 1994] is another organization attempting
to combine the advantages of the SC-index and CH-tree. Like the CH-tree, it
is based on maintaining a single B+-tree-like data structure to index the entire
inheritance hierarchy. In addition to the usual internal and leaf nodes of a
standard B+-tree used for indexing the attribute values, it includes a new type
of node, the so-called OID nodes. The OID nodes lie one level below the leaf nodes
and contain the lists of OIDs related to the attribute values.
Given an inheritance hierarchy with m classes, the hcC-tree maintains m + 1
chains of OID nodes: m class chains (one chain for each class) and one
chain of OID nodes corresponding to the entire inheritance hierarchy. The
class chain for a class C groups the OIDs belonging to C, and the hierarchy
chain groups the OIDs of all instances of all the classes in the inheritance
hierarchy. Practically, a class chain looks like the chain of leaf nodes in an SC-
index, whereas the hierarchy chain is similar to the chain of leaf nodes in a
CH-tree. The OID nodes are referenced by entries in the leaf nodes. Each leaf-
node entry, in addition to key values, contains a bitmap with m bits and a set
P of (m + 1) pointers. Each bit in the bitmap corresponds to a class in the
OBJECT-ORIENTED DATABASES 25
inheritance hierarchy, such that if the ith bit is set, the ith pointer in P points to
the first node in the class chain for the class Ci containing OIDs with the key
value. Each internal-node entry consists of a key value, a node pointer and an
m-bit bitmap.
For SC-queries, the performance of the hcC-tree is comparable to that of the
SC-index, as it requires searching only one class chain. For rooted range
CH-queries, the hcC-tree's performance is comparable to that of the CH-tree, as
it requires searching only the hierarchy chain. However, for partial range CH-
queries, the hcC-tree behaves like the SC-index, because it requires searching a
number of class chains equal to the number of classes in the query class scope.
Furthermore, as the hcC-tree stores each OID twice (once in a class chain and once
in the hierarchy chain), it incurs a high storage overhead and update cost.
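The leaf-entry layout can be sketched as follows; this is a deliberate simplification under our own naming, not the exact hcC-tree node format:

```python
M = 3  # classes in the hierarchy: 0=Book, 1=Manual, 2=Handbook

class LeafEntry:
    def __init__(self, key, class_chain_heads, hierarchy_chain_head):
        self.key = key
        # bit i set <=> some instance of class i has this key value
        self.bitmap = sum(1 << i
                          for i, head in enumerate(class_chain_heads)
                          if head is not None)
        # m class-chain pointers plus the hierarchy-chain pointer
        self.pointers = list(class_chain_heads) + [hierarchy_chain_head]

    def chain_for_class(self, i):
        """Follow a class chain only if the bitmap says it is non-empty."""
        return self.pointers[i] if self.bitmap & (1 << i) else None

    def chain_for_hierarchy(self):
        return self.pointers[M]

e = LeafEntry(1991, ["book-chain", None, "handbook-chain"], "hier-chain")
print(e.chain_for_class(0))      # book-chain
print(e.chain_for_class(1))      # None -- no Manual with this key value
print(e.chain_for_hierarchy())   # hier-chain
```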
x-tree
All the above approaches basically use one of two mutually exclusive grouping
methods. The SC-index, the H-tree and the CG-tree group attribute values
in the leaf nodes of a B+-tree on the basis of the class in which the instances
with the value appear. By contrast, the CH-tree and the hcC-tree group on the
values of the indexed attribute, regardless of the class to which the instances
with the value belong. Because of this dichotomy, the various indexing techniques
behave differently for different queries. Indexing techniques based on the first
grouping method are always more efficient for SC-queries, whereas techniques
based on the second grouping method are always more efficient for CH-queries.
The above considerations have led researchers to the insight that the search
space for class-hierarchy indexing is actually 2-dimensional, with the index-
ing attribute values extended along one dimension (the attribute dimension) and
the classes in the hierarchy extended along the second dimension (the class dimension).
As a result, the grouping of the indexed values should extend in both directions. In
such a case, the several techniques supporting multi-dimensional indexing, such as the
R-tree, the quadtree and the grid file [Ooi, 1990], can be used for indexing inheri-
tance hierarchies. Figure 1.15 represents the data from the inheritance hierarchy
rooted at class Book as a 2-dimensional search space. Using such a representa-
tion, the query Q2 "Retrieve all instances of class Book and all its subclasses
printed in 1991" becomes a rectangular domain in the data plane.
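The 2-dimensional view can be made concrete with a small sketch; the point set below loosely mirrors Figure 1.15 (the exact years are illustrative), and a rooted CH-query becomes a rectangle over a contiguous preorder range of class positions:

```python
POS = {"Book": 0, "Manual": 1, "Handbook": 2}   # preorder positions

points = [  # (class, year, oid) -- sample data in the spirit of Figure 1.15
    ("Book", 1986, "B[1]"), ("Book", 1991, "B[2]"),
    ("Manual", 1991, "M[1]"), ("Handbook", 1991, "H[1]"),
    ("Manual", 1993, "M[2]"),
]

def ch_query(root, lo, hi, subtree_size):
    """Rooted CH-query as a 2-D range query: classes in the preorder
    range [pos(root), pos(root)+subtree_size), attribute values in [lo, hi]."""
    c0 = POS[root]
    return [oid for cls, year, oid in points
            if c0 <= POS[cls] < c0 + subtree_size and lo <= year <= hi]

# Q2: all instances of Book and its subclasses printed in 1991.
print(ch_query("Book", 1991, 1991, 3))   # ['B[2]', 'M[1]', 'H[1]']
```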
The x-tree [Chan et al., 1997] is a dynamic indexing technique similar to the
R-tree [Guttman, 1984] and the R*-tree [Beckmann et al., 1990]. Data are stored
in the leaf nodes, which all appear at the same level of the tree. Each leaf-node
entry consists of the key value K, the object identifier oid and the identifier
cid of the class the object belongs to. If all entries with the same key value K
do not fit in one leaf node, two or more nodes are allocated and all node entries
with the same class identifier are grouped together.
[Figure: objects B[1]-B[4], M[1], M[2] and H[1] plotted with years 1986-1993 along the attribute dimension and classes Book, Manual and Handbook along the class dimension; query Q2 appears as a rectangle.]
Figure 1.15. Objects from hierarchy rooted at Book as a 2-dimensional search plane.
The internal nodes contain entries of the form (cidSet, Kmin, Kmax, P),
where cidSet is a subset of the classes in the indexed inheritance hierarchy,
[Kmin, Kmax] is a subrange of the attribute domain, and P is a pointer to a
child node on the next level. In the internal nodes of the x-tree, all node entries
with the same set of classes are clustered together into the same record.
As the node-splitting strategy in the R-tree is more complicated than in
a B+-tree and often depends on the data shape and distribution, the x-tree uses
heuristics for node splitting based on a special proximity cost metric.
The heuristic generates a list of candidate node splits along both the class
dimension and the attribute dimension. The candidates are generated on the
basis of a low proximity cost of the split. After the generation step, the best
candidate is selected as the final node split.
As performance tests show, the x-tree outperforms the CH-tree for most
types of queries. As can be expected, the only exception is queries against
all the classes in the indexed inheritance hierarchy. In such a case, the x-tree
fetches about 80% more pages than the CH-tree. Also, like the R-tree, which
has a lower space utilization than the B+-tree, the x-tree is taller and requires
more storage space than the CH-tree.
Good worst case indexing techniques
The x-tree is more efficient than all the previous index organizations for a wide
range of queries and data distributions. Yet, it does not have good worst-
case performance, because it uses the R-tree as its underlying data structure and
relies on heuristics for node splitting.
An approach with a proven good worst-case performance was proposed in
[Kanellakis and Ramaswamy, 1996, Ramaswamy and Kanellakis, 1995]. A key
assumption is that the class-dimension in the 2-dimensional data space is static,
[Figure: a) a hierarchy of six classes A-F; b) a binary tree on the class dimension whose leaves, left to right, are {A}, {B}, {C}, {D}, {E}, {F}; c) a CH-query against class C shown as a range over the classes in the 2-dimensional data space.]
Figure 1.16. Class-division: a) Example hierarchy; b) Binary tree on the class-dimension;
c) A CH-query against class C in the 2-dimensional data space.
that is, no classes in the hierarchy may be removed or inserted, even though
objects of the classes may be updated. This reduces indexing the inheritance
hierarchy to a special case of external dynamic 2-dimensional range search-
ing, in which the data in the 2-dimensional space are points whose y-coordinates
belong to a static set corresponding to the set of classes.
A given class hierarchy H is preprocessed as follows. We create a family G
where each member is a set of classes from H. After the preprocessing, B+-tree
indexes are maintained for the union of the classes in each member of G. If a
CH-query is against class C in the hierarchy H, a subset of the indexes is queried,
which exactly covers C and its subclasses and which involves at most q indexes, where
q is a small integer. On the other hand, a class is allowed to appear in at most
a small number r of members of G, so an object can have at most r replicas.
Updates are processed by changing all replicas.
In other words, the preprocessing solves the following combinatorial problem,
which is named class-division of H according to maximal replication factor r
and maximal query factor q:
Input: Class hierarchy H with m classes, and positive integers r and q.
Output: A family G, whose members are sets of classes from H, such that
(1) No class appears in more than r members of G.
(2) For any class C in H, with C' its set of subclasses in H including C itself,
there are at most q members of G that exactly cover C' (the union of those at most
q members of G is C').
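A brute-force checker for these two conditions can be sketched over the six classes of Figure 1.16; the child lists and the example family G below are our own plausible reading, not taken from the paper:

```python
from itertools import combinations

# Assumed hierarchy over classes A-F (F is taken as an isolated class).
H = {"A": ["B", "D"], "B": ["C"], "C": [], "D": ["E"], "E": [], "F": []}

def subtree(c):
    """C': the set of C's subclasses in H, including C itself."""
    out = {c}
    for child in H[c]:
        out |= subtree(child)
    return out

def replication_ok(G, r):
    """Condition (1): no class appears in more than r members of G."""
    return all(sum(c in g for g in G) <= r for c in H)

def exact_cover(G, target, q):
    """Condition (2): at most q members of G whose union is exactly target."""
    for k in range(1, q + 1):
        for combo in combinations(G, k):
            if set().union(*combo) == target:
                return combo
    return None

# An example family G (ours, for illustration only).
G = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}, {"F"},
     {"B", "C"}, {"D", "E"}, {"A", "B", "C", "D", "E"}]
print(replication_ok(G, 3))                          # True
print(sorted(exact_cover(G, subtree("B"), 2)[0]))    # ['B', 'C']
```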
The SC-index is an example of class-division with q = m and r = 1. Similarly,
class-division is possible for q = 1 and r = m, when B+-tree indexes
are maintained for all subhierarchies in H and each object can have up to
m replicas. In the general case, there exists the following efficient space-time
tradeoff:
For any class hierarchy H with m classes, it is possible to perform class-division
of H according to r = ⌈log2 m⌉ + 1 and q = 2⌈log2 m⌉.
To prove this, we recall that every CH-query can be represented as a 2-
dimensional range query, with two ranges extended along the attribute dimen-
sion and the class dimension. To make all classes from a subhierarchy con-
tiguous along the class dimension, the classes of H are sorted according to the
preorder hierarchy traversal. When performing such a traversal of H, we build a
binary tree on the class dimension. The leaves of the tree, scanned from left to
right, contain the classes in the preorder of H, while an internal node
contains the union of the classes in all leaves of its subtree. Therefore, the tree has
m leaves and ⌈log2 m⌉ + 1 levels. In Figure 1.16.a the class hierarchy consists
of six classes and the preorder traversal of the hierarchy is given by {A, B, C,
D, E, F}. The binary tree built for the hierarchy is given in Figure 1.16.b.
Once the tree is built, the family G is obtained by generating family members
for all nodes of the tree. Because each class is present in at most one node on
each level and the binary tree has ⌈log2 m⌉ + 1 levels, no object has more than
⌈log2 m⌉ + 1 replicas in G.
A CH-query corresponds to a range along the class dimension in the preorder
sort. To minimize the number of members of G (or nodes of the tree) covering
the query class range, we select those nodes vi of the binary tree which are
completely contained in the query range while their parents are not. The query
issued against class C (see Figure 1.16.c) gives the class range {A, B, C}, and
the minimal cover for the range is given by the nodes {A, B} and {C} (see the
shaded nodes in Figure 1.16.b). In the worst case, the query class range has two such
nodes vi on each level of the tree, and 2⌈log2 m⌉ nodes in total. That is, one can
answer class-indexing queries on any class by looking at no more than 2⌈log2 m⌉
indexes. This gives the time-space tradeoff previously stated.
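The construction and cover selection above can be sketched as a segment-tree-style recursion over the preorder list; ranges are given as index pairs, and the balanced split is our own simplification of the binary tree:

```python
import math

PREORDER = ["A", "B", "C", "D", "E", "F"]   # classes in preorder

def cover(lo, hi, node_lo=0, node_hi=None):
    """Maximal binary-tree nodes (as index ranges) lying inside [lo, hi]."""
    if node_hi is None:
        node_hi = len(PREORDER) - 1
    if lo > node_hi or hi < node_lo:
        return []                            # node disjoint from the query
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]          # node fully inside: take it whole
    mid = (node_lo + node_hi) // 2
    return cover(lo, hi, node_lo, mid) + cover(lo, hi, mid + 1, node_hi)

# A query range covering the first five classes in preorder:
nodes = cover(0, 4)
print(nodes)                                 # [(0, 2), (3, 4)]
# The cover never exceeds the 2*ceil(log2 m) bound.
assert len(nodes) <= 2 * math.ceil(math.log2(len(PREORDER)))
```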
As a B+-tree is maintained for each member of G, this tradeoff allows
the construction of an efficient data structure in external storage which occupies
O(log2 m · (N/B)) pages and has worst-case I/O query time O(log2 m · logB N +
T/B), where B is the size of the external memory page, m is the number of
classes in the inheritance hierarchy, N is the number of objects in the inheri-
tance hierarchy, and T is the number of objects the query retrieves. The update
time in such a structure is O(log2 m · logB N).
The above scheme provides the worst-case complexities for any class hierar-
chy. However, for many hierarchies, the values of r and q may be further improved
by using heuristics, some of which are discussed in [Ramaswamy and Kanel-
lakis, 1995]. Also, an improvement of the data structure that reduces the query
time from O(log2 m · logB N + T/B) to O(logB N + log2 B + T/B) was proposed in
[Kanellakis and Ramaswamy, 1996].
1.4 Integrated organizations
Even though we have addressed indexing techniques separately for each dimen-
sion along which an object database is organized (namely, aggregation and in-
heritance), most object-oriented queries involve classes along both dimensions.
Such queries typically contain nested predicates and have as a target any num-
ber of classes in a given inheritance hierarchy. The query that retrieves all
books and manuals written by authors from AT&T Lab. is an example of
such queries. Developing integrated indexing techniques able to support such
queries is crucial. In principle, every indexing technique defined for one dimen-
sion could be combined with any technique defined for the other dimension.
However, no integrated indexing technique has been proposed, with the excep-
tion of the nested-inherited index [Bertino and Foscoli, 1995], that we describe
in the remainder of this section.
The nested-inherited index is defined as a combination of concepts from the
nested index, the join index and the CH-tree techniques. In order to present
this indexing technique, we need some additional definitions. To simplify the
following discussion, we make the assumption that a class occurs only once in
a path.
First we recall that, given a class C, C' denotes the set of classes in the
inheritance hierarchy rooted at C. As an example, consider the object-oriented
schema in Figure 1.1:
Book' = {Book, Manual, Handbook}.
Given a path P = C1.A1.A2...An (n ≥ 1), the scope of P is defined
as the set ∪Ci∈class(P) Ci'. Class C1 is the root of the scope. Given a class
C in the scope of a path, the position of C is given by an integer i, such
that C belongs to the inheritance hierarchy rooted at class Ci, where Ci ∈
class(P). The scope of a path simply represents the set of all classes along
the path and all their subclasses. For example, consider the path P = Orga-
nization.staff.books.publisher.name; scope(P) = {Organization, Author, Book,
Manual, Handbook, Publisher}. Class Organization is the root of P. Class Organization has posi-
tion one, class Author has position two, classes Book, Manual and Handbook
have position three, and class Publisher has position four. In the remainder of
the discussion, given an object O, we will use the term parent object to denote
an object that references O. For example, the parents of the instance M[1] of
class Manual are objects A[1] and A[4], instances of class Author.
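A minimal sketch of scope and positions for the example path, assuming per-class subclass lists (our own toy encoding of the schema):

```python
SUBCLASSES = {"Organization": [], "Author": [], "Publisher": [],
              "Book": ["Manual", "Handbook"], "Manual": [], "Handbook": []}

def star(c):
    """C': the classes of the inheritance hierarchy rooted at C."""
    out = [c]
    for s in SUBCLASSES[c]:
        out += star(s)
    return out

def scope_and_positions(path_classes):
    """Scope of the path and position of each class in the scope."""
    scope, pos = [], {}
    for i, c in enumerate(path_classes, start=1):
        for d in star(c):
            scope.append(d)
            pos[d] = i          # a subclass inherits its root's position
    return scope, pos

scope, pos = scope_and_positions(
    ["Organization", "Author", "Book", "Publisher"])
print(scope)           # all classes along the path plus their subclasses
print(pos["Manual"])   # 3
```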
Given a path P = C1.A1.A2...An, the nested-inherited index associates
with a value v of attribute An the OIDs of the instances of each class in the scope of P
having v as value of the (nested) attribute An. A nested-inherited index on path
P=Organization.staff.books.publisher.name associates with a given publisher
name all organizations having in their staff authors of books or manuals or
handbooks published by the publisher. Similarly for all the other classes in the
scope. Logically, the index will contain the following entries
Academic Press   (Publisher, {P[1]})
Addison-Wesley   (Organization, {O[1]}), (Author, {A[1], A[2], A[4]}),
                 (Book, {B[1], B[4]}), (Manual, {M[1]}),
                 (Publisher, {P[2]})
Elsevier         (Organization, {O[3]}), (Author, {A[5]}),
                 (Handbook, {H[1]}), (Publisher, {P[3]})
Kluwer           (Organization, {O[2]}), (Author, {A[3]}),
                 (Book, {B[2], B[3]}), (Publisher, {P[4]})
Microsoft        (Manual, {M[2]}), (Publisher, {P[5]})
Figure 1.17. Nested-inherited index for path P=Organization.Author.Book.Publisher.
The nested-inherited index, like the nested index and the path index, supports
efficient retrieval operations. However, unlike those two organizations, the
nested-inherited index does not require object traversals for update operations,
because of some additional information that is stored in the index. The format
of a non-leaf node has a structure similar to that of traditional indexes based
on B+-trees. The record in a leaf node, called the primary record, has a different
structure. It contains the following information:
• record-length
• key-length
• key-value
• class-directory
• for each class in the path scope, the number of elements in the list of OIDs
for the objects that hold the key-value in the indexed attribute, and the list
of OIDs.
The class-directory contains a number of entries equal to the number of
classes having instances with the key-value in the indexed attribute. For each
such class Ci, an entry in the directory contains:
• the class identifier
• the offset in the primary record where the list of OIDs of Ci instances are
stored
• the pointer to an auxiliary record where the list of parents is stored for each
instance of Ci. An auxiliary record is allocated for each class, except for the
root class of the path and for its subclasses. An auxiliary record consists of
a sequence of 4-tuples. A 4-tuple has the form:
(oidi, pointer to primary record, no-oids, {p-oidi1, ..., p-oidij}).
There are as many 4-tuples as the number of instances of Ci having the
key-value in the indexed attribute. For an object Oi, the tuple contains the
identifier of Oi, the pointer to the primary record, the number of parent
objects of Oi, and the list of parent objects. In the 4-tuple definition above,
no-oids denotes the number of parent objects, and p-oidij denotes the j-th
parent of Oi.
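The record layout just described can be sketched with in-memory structures; this is a toy stand-in for the on-page format, and the field names are ours:

```python
from dataclasses import dataclass, field

@dataclass
class AuxTuple:            # one 4-tuple per instance with the key value
    oid: str
    primary_ref: str       # back-pointer to the primary record
    parents: list          # parent OIDs; no-oids is len(parents)

@dataclass
class DirEntry:            # one class-directory entry
    class_id: str
    oids: list             # OIDs of this class holding the key value
    aux: list = field(default_factory=list)  # AuxTuples; empty for the
                                             # root class and its subclasses

@dataclass
class PrimaryRecord:
    key: str
    directory: dict        # class_id -> DirEntry

rec = PrimaryRecord("Addison-Wesley", {
    "Organization": DirEntry("Organization", ["O[1]"]),
    "Author": DirEntry("Author", ["A[1]", "A[2]", "A[4]"],
                       [AuxTuple("A[1]", "rec", ["O[1]"])]),
    "Book": DirEntry("Book", ["B[1]", "B[4]"]),
})
print(rec.directory["Book"].oids)   # ['B[1]', 'B[4]']
```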
Auxiliary records are stored in different pages than primary records. Given
a primary record, there are several auxiliary records that are connected to it. A
second B+-tree is superimposed on the auxiliary records. The second B+-tree
indexes the 4-tuples based on the OIDs that appear as the first elements of
4-tuples. Therefore, the index organization actually consists of two indexes.
The first, called the primary index, is keyed on the values of attribute An.
It associates with a value v of An the set of OIDs of instances of all classes
relative to the path that have v as value of the (nested) attribute. The second
index, called the auxiliary index, has OIDs as indexing keys. It associates
with the OID of an object O the list of OIDs of the parents of O. Leaf-
node records in the primary index contain pointers to the leaf-node records
in the auxiliary index, and vice versa. The purpose of the auxiliary index is
to provide all the information needed for updating the primary index without accessing
the objects themselves. Recall that when updates are executed, the nested
index may require object forward and reverse traversals, while the path index
only requires forward traversals. By contrast, the nested-inherited index does
not require any access to the objects. The rationale for this organization will
become clearer when discussing the operations.
Figure 1.18 provides an example of the partial index contents for the objects
shown in Figure 1.2.
The IG for a nested-inherited index contains three sets of arcs. First, because
the primary index associates each value of attribute Cn.An with the instances
of all classes in the scope of the indexed path, the IG contains arcs from vertex
Cn.An to classes Ci, where i = 1, ..., n. Second, it contains arcs from Ci to
Cn.An, i = 2, ..., n. Finally, the IG contains arcs from Ci+1 to Ci, i = 1, ..., n - 1.
The IG for the path P=Organization.staff.books.publisher.name is shown in
Figure 1.19.
We now discuss how retrieval, insert, and delete operations are performed
on the nested-inherited index. For ease of presentation, we will use examples
[Figure: a non-leaf node record in the primary B+-tree; a primary record for key "Addison-Wesley" with a class directory over Organization, Author, Book, Manual, Handbook and Publisher and the OID lists {O[1]}, {A[1], A[2], A[4]}, {B[1], B[4]}, {M[1]}, {P[2]}; an auxiliary record for class Author; a non-leaf node record in the auxiliary B+-tree.]
Figure 1.18. Example of index contents in a nested-inherited index.
Figure 1.19. Indexing graph of the nested-inherited index for path
P=Organization.Author.Book.Publisher.name.
to describe the operations. Formal algorithms are presented in [Bertino and
Foscoli, 1995].
Retrieval
The nested inherited index supports a fast evaluation of predicates on the
indexed attribute for queries having as target any class, or class hierarchy,
in the scope of the path ending with the indexed attribute. As an example,
consider a query that retrieves the organizations whose staff members have
published books with Addison-Wesley. This query is executed by first executing
a lookup on the primary index with key value equal to "Addison-Wesley". The
primary record is then accessed. A lookup in the class directory is executed
to determine the offset where the OIDs of Organization instances are stored.
Then those OIDs are fetched and returned as result of the query. For our query,
the result is {O[1]}.
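The retrieval pattern can be sketched over a toy in-memory stand-in for the primary index, where a dictionary replaces the B+-tree lookup and nested dictionaries replace the class directory and offsets:

```python
primary_index = {   # key value -> {class: OID list} (toy stand-in)
    "Addison-Wesley": {
        "Organization": ["O[1]"],
        "Author": ["A[1]", "A[2]", "A[4]"],
        "Book": ["B[1]", "B[4]"],
        "Manual": ["M[1]"],
        "Publisher": ["P[2]"],
    },
}

def retrieve(key, query_classes):
    """One primary-index lookup, then one class-directory lookup
    per class in the query scope; empty entries are skipped."""
    record = primary_index.get(key, {})
    result = []
    for cls in query_classes:
        result += record.get(cls, [])
    return result

# Organizations whose staff published with Addison-Wesley:
print(retrieve("Addison-Wesley", ["Organization"]))          # ['O[1]']
# Books (and subclasses) published by Addison-Wesley:
print(retrieve("Addison-Wesley", ["Book", "Manual", "Handbook"]))
# ['B[1]', 'B[4]', 'M[1]']
```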
We now consider a query that retrieves the books published by Addison-
Wesley. The same steps as before are executed. The only difference is that the
class-directory lookup is executed for classes Book, Manual, and Handbook.
Since the entry for class Handbook is empty, only the record portions for classes
Book and Manual are accessed, with offsets obtained from the class-directory.
The query result, {B[1], B[4], M[1]}, is generated by merging the lists of OIDs
returned for classes Book and Manual. Therefore, the retrieval operation is
similar to retrieval in a CH-tree [Kim et al., 1989]. The main difference,
however, is that a nested-inherited index can be used for queries on all class
hierarchies found along a given path. By contrast, the CH-tree is allocated on
a single inheritance hierarchy. Therefore, if a path has length n, the number of
CH-trees allocated would be n.
Insert
Suppose that a new manual B[5] with author A[4] is created with P[2] as value
of attribute "publisher". B[5] is therefore a new parent of P[2]. The overall
effect of the insertion in the index must be that B[5] is added to the primary
record with key-value equal to "Addison-Wesley", and to the parent list of P[2].
The following steps are executed:
1. The auxiliary index is accessed with key-value equal to P[2].
2. The 4-tuple of P[2] is retrieved and modified by adding B[5] to the list of
P[2] parents.
3. From the 4-tuple of P[2] the pointer to the primary record is determined.
4. The primary record is accessed.
5. A look-up of the class directory in the primary record is executed to deter-
mine the offset where the OIDs of the class Book are stored.
6. B[5] is added to the list of OIDs stored at the offset determined at the
previous step.
7. A 4-tuple for B[5] is inserted in the auxiliary index with {A[4]} as the author
list.
Note that there is no need to execute a look-up of the primary index, since the
address of the primary record can be directly determined from the auxiliary
record.
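The seven steps can be sketched over toy dictionaries standing in for the two B+-trees; note in particular that no primary-index lookup is needed, since the auxiliary 4-tuple carries the back-pointer (all names here are our own simplification):

```python
auxiliary_index = {   # oid -> {"primary": key, "parents": [...]}
    "P[2]": {"primary": "Addison-Wesley",
             "parents": ["B[1]", "B[4]", "M[1]"]},
}
primary_records = {   # key -> class directory: class -> OID list
    "Addison-Wesley": {"Book": ["B[1]", "B[4]"], "Manual": ["M[1]"]},
}

def insert(new_oid, new_cls, target_oid, parents):
    aux = auxiliary_index[target_oid]               # steps 1-2: aux lookup,
    aux["parents"].append(new_oid)                  # add to parent list
    record = primary_records[aux["primary"]]        # steps 3-4: back-pointer
    record.setdefault(new_cls, []).append(new_oid)  # steps 5-6: class slot
    auxiliary_index[new_oid] = {"primary": aux["primary"],  # step 7: new
                                "parents": parents}         # 4-tuple

insert("B[5]", "Book", "P[2]", ["A[4]"])
print(primary_records["Addison-Wesley"]["Book"])   # ['B[1]', 'B[4]', 'B[5]']
```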
Delete
Suppose now that manual M[1] is removed. The overall effect of this operation
on the index must be that M[1] and all instances referencing M[1] (that is, O[1],
A[1] and A[4]) be eliminated from the primary record with key-value equal to
"Addison-Wesley". Moreover, the 4-tuples for instances M[1], O[1], A[1] and
A[4] must be eliminated. Finally, M[1] must be eliminated from the parent list
of P[2]. Note that the update to the parent list of P[2] may not be needed
if P[2] is removed as well; in this case it may be better to accumulate several
delete operations on the same index. However, we will include that update to
exemplify the algorithm.
1. The value of attribute "publisher" of M[1] is determined. This value is the
OID P[2].
2. The auxiliary index is accessed with key-value equal to P[2].
3. The 4-tuple of P[2] is retrieved and modified by removing M[1] from the list
of parents of P[2].
4. From the 4-tuple of P[2] the pointer to the primary record is determined.
5. The primary record is accessed.
6. A look-up is executed on the class-directory in the primary record to de-
termine the offset where the OIDs of the class Manual are stored and the
pointer to the auxiliary record for class Manual.
7. M[1] is removed from the list of OIDs stored at the offset determined at the
previous step.
8. The auxiliary record of class Manual is accessed and the 4-tuple containing
as first element the OID M[1] is determined. From this tuple, the OIDs of
the M[1] parents are determined. Those are A[1] and A[4]. Then the 4-tuple
of M[1] is removed.
9. The 4-tuples of A[1] and A[4] are accessed to retrieve the parent lists.
10. A lookup is executed on the class-directory in the primary record to deter-
mine the offset where the OIDs of the class Author are stored.
11. A[1] and A[4] are removed from the list of OIDs stored at the offset deter-
mined at the previous step.
12. A lookup is executed on the class-directory in the primary record to deter-
mine the offset where the OIDs of class Organization are stored.
13. O[1] is removed from the list of OIDs stored at the offset determined at the
previous step.
The delete operation may appear rather costly. However, note that the
primary record is accessed only once from secondary storage. Several modifi-
cations may be required on this record. However, the record can be kept in
memory and written back after all modifications have been executed. Also note
that the algorithm may require accessing several auxiliary records. However,
they are all connected to the same primary record. Therefore, they are likely
to be in the same page.
A preliminary comparison among the nested-inherited index and two other
organizations has been presented in [Bertino, 1991a, Bertino and Foscoli, 1995].
The first of the two organizations is a multi-index organization and simply con-
sists of allocating an index on each class in the scope of the path. In the exam-
ple of path P=Organization.staff.books.publisher.name, seven indexes would
be allocated. The second organization, called the inherited-multi-index, consists
of allocating an inherited index on each inheritance hierarchy found along the
path. Therefore, the inherited-multi-index is a combination of the CH-tree
organization (defined for inheritance hierarchies) with the multi-index organi-
zation (defined for aggregation hierarchies). For the same path P, there would be
a CH-tree rooted at class Book (thus indexing Book, Manual and Handbook),
and three B+-tree indexes on classes Organization, Author and Publisher. Ma-
jor results from the comparison are the following:
jor results from the comparison are the following:
• The nested-inherited index has the best retrieval performance.
• The nested-inherited index has quite good performance for the insert oper-
ation, since it requires an additional cost of at most three I/O operations
with respect to the other two organizations.
• The delete operation for the nested-inherited index has in the worst case an
additional cost of 4 x i (where i is the position of the class in the path) with
respect to the other organizations.
An accurate model of those costs has been recently developed in [Bertino
and Foscoli, 1995].
The nested-inherited index does not support any customization with respect
to the operation profile (see Subsection 1.2.1). Nevertheless, it may be success-
fully used in the path splitting approach together with other basic techniques
as an index allocated on some subpath which contains one or more inheritance
hierarchies.
1.5 Caching and pointer swizzling
The indexing techniques we discussed so far are based on object structures, that
is, on object attributes. Another possibility is to provide indexing based on ob-
ject behavior, that is, on method results [Bretl et al., 1989]. Techniques based
on this approach have been proposed in [Bertino, 1991b, Bertino and Quarati,
1991, Jhingran, 1991, Kemper et al., 1994]. Most techniques are based on
precomputing or caching the results of method invocations. Moreover, precom-
puted results can be stored in an index, or other access structures, so that it is
possible to efficiently evaluate queries containing the invocation of the method.
A major issue of this approach is how to detect when the computed method
results are no longer valid. In most approaches some dependency information is
kept. This dependency information keeps track of which objects (and possibly
which attributes of each object) have been used to compute a given method.
When an object is modified, all method precomputed results that have used
that object are invalidated. Different solutions can be devised to the problem
of dependencies, also depending on the characteristics of the method. In the
approach proposed in [Kemper et al., 1994], a special structure (implemented
as a relation) keeps track of these dependencies. A dependency has the format
(oidj, method_name, <oid1, oid2, ..., oidk>).
This dependency records the fact that the object whose identifier is oidj
has been used in computing the method named method_name with input
parameters <oid1, oid2, ..., oidk>. Note that the input parameters also include
the identifier of the object to which the message invoking the method has
been sent.
A more sophisticated approach has been proposed in [Bertino and Quarati,
1991]. If a method is local, that is, uses only the attributes of the object
upon which it has been invoked, all dependencies are kept within the object
itself. Those dependencies are coded as bit-strings, therefore they require a
minimal space overhead. If a method is not local, that is, uses attributes of
other objects, all dependencies are stored in a special object. All objects whose
attributes have been used in the precomputation of a method, have a reference
to this special object. This approach is similar to the one proposed in [Kemper
et al., 1994]. The main difference is that, in the approach proposed by Bertino
and Quarati, dependencies are not stored in a single data structure but are
distributed among several "special objects". The main advantage of this
approach is that it provides greater flexibility with respect to object allocation
and clustering. For example, a "special object" may be clustered together with
one of the objects used in the precomputation of the method, depending on the
expected update frequencies.
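A minimal sketch of relation-style dependency records and invalidation, in the spirit of the approach above (all names are hypothetical):

```python
cache = {}     # (method_name, args) -> precomputed result
deps = []      # (used_oid, method_name, args) dependency records

def remember(method, args, result, used_oids):
    """Store a precomputed result and record which objects it read."""
    cache[(method, args)] = result
    deps.extend((oid, method, args) for oid in used_oids)

def invalidate(modified_oid):
    """Drop every cached result that depends on the modified object."""
    for oid, method, args in deps:
        if oid == modified_oid:
            cache.pop((method, args), None)

remember("total_pages", ("B[1]",), 320, used_oids=["B[1]", "P[2]"])
invalidate("P[2]")                            # the publisher was modified
print(("total_pages", ("B[1]",)) in cache)    # False
```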
To further reduce the need for invalidation, it is important to determine the
actual attributes used in the precomputation of a method. As noted in [Kemper
et al., 1994], not all attributes are used in executing all methods. Rather, each
method is likely to require only a small fraction of an object's attributes. Two basic
approaches can be devised to exploit this observation. The first approach is
called static and it is based on inspecting the method implementation. There-
fore, for each method the system keeps the list of attributes used in the method.
In this way, when an attribute is modified, the system has only to invalidate
a method if the method uses the modified attribute. Note, however, that an
inspection of method implementations actually determines all attributes that
can be possibly used when the method is executed. Depending on the method
execution flow, some attributes may never be used in computing a method
on a given object. This problem is solved by the dynamic approach. Under
this approach, the attributes used by a method are actually determined only
when the method is precomputed. Upon precomputation of the method, the
system keeps track of all attributes actually accessed during the method exe-
cution. Therefore, the same method precomputed on different objects may use
different sets of attributes for each one of these objects. Performance studies
of method precomputation have been carried out in [Jhingran, 1991, Kemper
et al., 1994].
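The dynamic approach can be sketched in a few lines: attribute reads are intercepted while the method result is computed, so only the attributes actually touched are recorded (the class and attribute names below are our own):

```python
class TrackedObject:
    def __init__(self, **attrs):
        object.__setattr__(self, "_attrs", attrs)
        object.__setattr__(self, "_used", set())

    def __getattr__(self, name):
        # Called for attributes not in the instance dict: record the read.
        attrs = object.__getattribute__(self, "_attrs")
        if name in attrs:
            object.__getattribute__(self, "_used").add(name)
            return attrs[name]
        raise AttributeError(name)

book = TrackedObject(title="DB Indexing", year=1997, pages=300, price=80)

def shelf_label(b):            # a "method" touching only two attributes
    return f"{b.title} ({b.year})"

label = shelf_label(book)
print(sorted(book._used))      # ['title', 'year'] -- only these reads can
                               # invalidate the precomputed result
```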
Besides caching and precomputing, a close class of techniques, commonly
referred to as "pointer swizzling" [Kemper and Kossmann, 1995, Moss, 1992],
was investigated for managing references among main-memory resident per-
sistent objects. Pointer swizzling is a technique to optimize accesses through
such references to objects residing in main-memory. Generally, each time an
object is referenced through its OID, the system has to determine whether the
object is already in main memory by performing a table lookup. If the object
is not already in main memory, it must be loaded from secondary storage. The
basic idea of pointer swizzling is to materialize the address of a main-memory
resident persistent object in order to avoid the table lookup. Thus, pointer
swizzling converts database objects from an external (persistent) format con-
taining OIDs into an internal (main-memory) format, replacing the OIDs by the
main-memory addresses of the referenced objects. Though the choice of a specific
swizzling strategy is strongly influenced by the characteristics of the underlying
object lookup mechanism, a systematic classification of pointer swizzling
techniques, quite independent from system characteristics, has been developed
[Moss, 1992]. Later, this classification was extended and a new dimension of
swizzling techniques, when swizzling objects can be replaced from the main-
memory buffer, was proposed [Kemper and Kossmann, 1995].
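The basic idea can be sketched as follows. `ObjectManager`, `fault`, and `deref` are hypothetical names, the "disk" is a plain dictionary, and the sketch shows in-place swizzling on first dereference, not the API of any system cited above.

```python
# Sketch of pointer swizzling: references start out as OIDs and are replaced
# ("swizzled") by direct in-memory references on first dereference, so that
# subsequent traversals skip the resident-object-table lookup.

class OID:
    def __init__(self, value):
        self.value = value

class DBObject:
    def __init__(self, oid, data, ref=None):
        self.oid, self.data, self.ref = oid, data, ref   # ref: an OID or a DBObject

class ObjectManager:
    def __init__(self, disk):
        self.disk = disk      # simulated secondary storage: oid value -> (data, ref)
        self.table = {}       # resident object table: oid value -> DBObject
        self.lookups = 0      # counts table lookups

    def fault(self, oid):
        """Unswizzled access path: table lookup, loading from 'disk' if needed."""
        self.lookups += 1
        if oid.value not in self.table:
            data, ref = self.disk[oid.value]
            self.table[oid.value] = DBObject(oid, data, ref)
        return self.table[oid.value]

    def deref(self, obj):
        """Dereference obj.ref, swizzling the OID into a direct reference."""
        if isinstance(obj.ref, OID):
            obj.ref = self.fault(obj.ref)    # one lookup, then the OID is replaced
        return obj.ref                       # later calls: no table lookup at all
```

The first `deref` pays for one table lookup; every later traversal of the same reference follows the materialized main-memory pointer directly.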
1.6 Summary
In this chapter, we have discussed a number of indexing techniques specifi-
cally tailored for object-oriented databases. We have first presented indexing
techniques supporting an efficient evaluation of implicit joins among objects.
Several techniques have been developed. None of them, however, is optimal
with respect to both retrieval and update costs. Techniques providing lower
retrieval costs, such as path indexes or access relations, have greater update
costs than techniques, such as the multi-index, which in turn have greater
retrieval costs.
Then we have discussed indexing techniques for inheritance hierarchies. Fi-
nally, we have presented an indexing technique that provides integrated support
for queries on both aggregation and inheritance hierarchies [Bertino and Foscoli,
1995].
Overall, an open problem is to determine how all those indexing techniques
perform for different types of queries. Studies along that direction have been
carried out in [Bertino, 1990, Kemper and Moerkotte, 1992, Valduriez, 1986].
Similar studies should be undertaken for all the other techniques. Another
open problem concerns optimal index allocation.
In the chapter we have also briefly discussed techniques for an efficient exe-
cution of queries containing method invocations. This is an interesting problem
that is peculiar to object-oriented databases (and in general, to DBMSs sup-
porting procedures or functions as part of the data model). However, few
solutions have been proposed so far and there is, moreover, the need for com-
prehensive analytical models.
Notes
1. Note that in GemStone, unlike other OODBMSs, attributes do not necessarily have
a domain.
2. For the sake of homogeneity, we will denote the class domain Cn.An as class Cn+1.
3. The set containing class C itself and all classes in the inheritance hierarchy rooted at C
is denoted as C'.
4. Note that if a class occurs at several points in a path, the class has a set of positions.
2 SPATIAL DATABASES
Many applications (such as computer-aided design (CAD), geographic infor-
mation systems (GIS), computational geometry and computer vision) operate
on spatial data. Generally speaking, spatial data are associated with spatial
coordinates and extents, and include points, lines, polygons and volumetric
objects.
While it appears that spatial data can be modeled as a record with multiple
attributes (each corresponding to a dimension of the spatial data), conven-
tional database systems are unable to support spatial data processing effec-
tively. First, spatial data are large in quantity, complex in structures and
relationships, and often represent non-zero sized objects. Take GIS, a popular
type of spatial database systems, as an example. In such a system, the database
is a collection of data objects over a particular multi-dimensional space. The
spatial description of objects is typically extensive, ranging from a few hun-
dred bytes in land information system (commonly known as LIS) applications
to megabytes in natural resource applications. Moreover, the number of data
objects ranges from tens of thousands to millions.
Second, the retrieval process is typically based on spatial proximity, and em-
ploys complex spatial operators like intersection, adjacency, and containment.
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
Such spatial operators are much more expensive to compute compared to the
conventional relational join and select operators. This is due to irregularity in
the shape of the spatial objects. For example, consider the intersection of two
polyhedra. Besides the need to test all points of one polyhedron against the
other, the result of the operation is not always a polyhedron but may sometimes
consist of a set of polyhedra.
Third, it is difficult to define a spatial ordering for spatial objects. The con-
sequence of this is that conventional techniques (such as sort-merge techniques)
that exploit ordering can no longer be employed for spatial operations.
Efficient processing of queries manipulating spatial relationships relies upon
auxiliary indexing structures. Due to the volume of the set of spatial data
objects, it is highly inefficient to precompute and store spatial relationships
among all the data objects (although there are some proposals that store pre-
computed spatial relationships [Lu and Han, 1992, Rotem, 1991]). Instead,
spatial relationships are materialized dynamically during query processing. In
order to find spatial objects efficiently based on proximity, it is essential to have
an index over spatial locations. The underlying data structure must support
efficient spatial operations, such as locating the neighbors of an object and
identifying objects in a defined query region.
In this chapter, we review some of the more promising spatial data struc-
tures that have been proposed in the literature. In particular, we focus on
indexing structures designed for non-zero sized objects. The review of these
indexes is organized in two steps: first, the structures are described; second,
their strengths and weaknesses are highlighted. The readers are referred to
[Nievergelt and Widmayer, 1997, Ooi et al., 1993] for a comprehensive survey
on spatial indexing structures.
The rest of this chapter is organized as follows. In Section 2.1, we briefly
discuss various issues related to spatial processing. Section 2.2 presents a tax-
onomy of spatial indexing structures. In Section 2.3 to Section 2.6, we present
representative indexing techniques that are based on binary tree structure, B-
tree structure, hashing and space-filling techniques. Section 2.7 discusses the
issues in evaluating the performance of spatial indexes and reviews the
approaches adopted in the literature. Finally, we summarize in Section 2.8.
2.1 Query processing using approximations
Spatial data such as objects in spatial database systems, and roads and lakes
in GIS, do not conform to any fixed shape. Furthermore, it is expensive to
perform spatial operations (for example, intersection and containment) on their
exact location and extent. Thus, a simpler structure (such as a bounding
rectangle) that approximates an object is usually coupled with a spatial index.
Such bounding structures allow efficient proximity query processing by
preserving the spatial identification and dynamically eliminating many poten-
tial tests efficiently. Consider the intersection operation. If two objects
intersect, then their bounding structures also intersect. Conversely, if the bounding
structures of two objects are disjoint, then the two objects do not intersect.
This property reduces the testing cost since the test on the intersection of two
polygons or a polygon and a sequence of line segments is much more expensive
than the test on the intersection of two bounding structures.
By far, the most commonly used approximation is the container approach. In
the container approach, the minimum bounding rectangle/circle (box/sphere)
- the smallest rectangle/circle (box/sphere) that encloses the object - is
used to represent an object, and the actual object is examined only when the
test on the container succeeds. The bounding box (rectangle) is used
throughout this chapter as the approximation technique for discussion purposes.
The k-dimensional bounding box can be easily defined as a single-dimensional
array of k entries: (I0, I1, ..., Ik-1), where Ii is a closed bounded interval [a, b]
describing the extent of the spatial object along dimension i. Alternatively, the
bounding box of an object can be represented by its centroid and extensions
on each of the k directions.
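A sketch of this interval representation and the predicates built on it follows; the helper names are hypothetical, chosen for illustration.

```python
# A k-dimensional bounding box as a list of closed intervals (lo, hi), one
# per dimension, with the intersection and containment predicates used for
# filtering, and the alternative centroid-plus-extents representation.

def intersects(a, b):
    """Two boxes intersect iff their intervals overlap along EVERY dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def contains(outer, inner):
    """True iff `inner` lies entirely within `outer` in every dimension."""
    return all(lo1 <= lo2 and hi2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(outer, inner))

def from_centroid(centroid, extents):
    """Build the interval array from a centroid and per-dimension extensions."""
    return [(c - e, c + e) for c, e in zip(centroid, extents)]
```

Both predicates cost only 2k comparisons, which is why the box test is so much cheaper than an exact geometric test on the objects themselves.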
Objects extended diagonally may be badly approximated by bounding boxes,
and false matches may result. A false match occurs when the bounding boxes
match but the actual objects do not match. If the approximation technique is
very inefficient, yielding very rough approximations, additional page accesses
will be incurred. More effective approximation methods include convex hull
[Preparata and Shamos, 1985] and minimum bounding m-corner. The covering
polygons produced by these two methods are however not axis-parallel and
hence incur more expensive testing. The construction cost of approximations
and storage requirement are higher too.
Decomposition of regions into convex cells has been proposed to improve ob-
ject approximation [Gunther, 1988]. Likewise, an object may be approximated
by a set of smaller rectangles/boxes. In the quad-tree tessellation approach
[Abel and Smith, 1984], an object is decomposed into multiple sub-objects
based on the quad-tree quadrants that contain them. A drawback of this
decomposition is that the object identity has to be stored in multiple locations in an index.
The problems of the redundancy of object identifiers and the cost of object-
reconstruction can be very severe if the decomposition process is not carefully
controlled. They can be controlled to a certain extent by limiting the num-
ber of elements generated or by limiting the accuracy of the decomposition
[Orenstein, 1990].
The object approximation and spatial indexes supporting such concepts are
used to eliminate objects that could not possibly contribute to the answer of
queries. This results in a multi-step spatial query processing strategy [Brinkhoff
et al., 1994]:
1. The indexing structure is used to prune the search space to a set of candidate
objects. This set is usually a superset of the answer.
2. Based on the approximations of the candidate objects, some of the false hits
can be further filtered away. The effectiveness of this step depends on the
approximation techniques.
3. Finally, the actual objects are examined to identify those that match the
query.
Clearly, the multi-step strategy can effectively reduce the number of pages
accessed and the amount of redundant data to be fetched and tested through the
index mechanism, and reduce the computation time through the approximation
mechanism.
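The three steps above can be sketched as follows. The index step is simulated here by a linear scan over bounding boxes (a real system would use a spatial index), and `exact_test` stands in for an exact geometry predicate; all names are illustrative.

```python
# Sketch of the multi-step (filter-and-refine) spatial selection strategy
# [Brinkhoff et al., 1994]. Objects are (oid, bounding_box, geometry) triples;
# boxes are lists of (lo, hi) intervals, one per dimension.

def box_intersects(a, b):
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def spatial_select(objects, query_box, exact_test):
    # Step 1: prune the search space to a candidate set via the (simulated)
    # index; this set is usually a superset of the answer.
    candidates = [o for o in objects if box_intersects(o[1], query_box)]
    # Step 2: with plain bounding boxes the approximation test coincides with
    # step 1; a finer approximation (convex hull, m-corner) would filter more
    # false hits here.
    # Step 3: examine the exact geometry of the surviving candidates.
    return [oid for oid, box, geom in candidates if exact_test(geom, query_box)]
```

An object whose box intersects the query but whose exact geometry does not (a false hit) survives step 1 and is eliminated only in step 3.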
The commonly used conventional key-based range (associative) search, which
retrieves all the data falling within the range of two specified values, is general-
ized to an intersection search. In other words, given a query region, the search
finds all objects that intersect it. The intersection search can be easily used to
implement point search and containment search. For point search, the query
region is a point, and is used to find all objects that contain it. Containment
search is a search for all objects that are strictly contained in a given query
region and it can be implemented by ignoring objects that fail such a condition
in intersection search.
The search operation supported by an index can be used to facilitate a spatial
selection or spatial join operation. While a spatial selection retrieves all objects
of the same entity based on a spatial predicate, a spatial join is an operation
that relates objects of two different entities based on a spatial predicate.
2.2 A taxonomy of spatial indexes
Various types of data structures, such as B-trees [Bayer and McCreight, 1972,
Comer, 1979], ISAM indexes, hashing and binary trees [Knuth, 1973], have
been used as a means for efficient access, insertion and deletion of data in large
databases. All these techniques are designed for indexing data based on pri-
mary keys. To use them for indexing data based on secondary keys, inverted
indexes are introduced. However, this technique is not adequate for a database
where range searching on secondary keys is a common operation. For this
type of applications, multi-dimensional structures, such as grid-files [Nievergelt
et al., 1984], multi-dimensional B-trees [Kriegel, 1984, Ouksel and Scheuer-
mann, 1981, Scheuermann and Ouksel, 1982], kd-trees [Bentley, 1975] and
quad-trees [Finkel and Bentley, 1974] were proposed to index multi-attribute
data. Such indexing structures are known as point indexing structures as they
are designed to index data objects which are points in a multi-dimensional
space.
Spatial search is similar to non-spatial multi-key search in that coordinates
may be mapped onto key attributes and the key values of each object represent
a point in a k-dimensional space. However, spatial objects often cover irregular
areas in multi-dimensional spaces and thus cannot be solely represented by
point locations. Although techniques such as mapping regular regions to points
in higher dimensional spaces enable point indexing structures to index regions,
such representations do not help support spatial operators such as intersection
and containment.
Based on existing classification techniques [Lomet, 1992, Seeger and Kriegel,
1988], the techniques used for adapting existing indexes into spatial indexes can
be generally classified as follows:
The transformation approach. There are two categories of transformation
approach:
• Parameter space indexing. Objects with n vertices in a k-dimensional space
are mapped into points in an nk-dimensional space. For example, a two-
dimensional rectangle described by the bottom left corner (x1, y1) and the
top right corner (x2, y2) is represented as a point in a four-dimensional
space, where each attribute is taken from a different dimension. After the
transformation, points can be stored directly in existing point indexes. An
advantage of such an approach is that there is no major alteration of the
multi-dimensional base structure. The problem with the mapping scheme is
that the spatial proximity between the k-dimensional objects may no longer
be preserved when represented as points in an nk-dimensional space. Con-
sequently, intersection search can be inefficient. Also, the complexity of the
insertion operation typically increases with higher dimensionality.
• Mapping to single attribute space. The data space is partitioned into grid
cells of the same size, which are then numbered according to some curve-
filling methods. A spatial object is then represented by a set of numbers
or one-dimensional objects. These one-dimensional objects can be indexed
using conventional indexes such as B+-trees.
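Both transformations can be sketched for the two-dimensional case. The cell numbering below is plain row-major for readability; actual systems typically number cells along a space-filling curve (for example, z-order) to better preserve proximity. All names are illustrative.

```python
# Sketches of the two transformation approaches for 2-D rectangles given as
# ((x1, y1), (x2, y2)), bottom-left and top-right corners.

def to_parameter_space(rect):
    """Corner transformation: a 2-D rectangle becomes a point in 4-D, which
    can then be stored in any point indexing structure."""
    (x1, y1), (x2, y2) = rect
    return (x1, y1, x2, y2)

def to_cell_numbers(rect, cell_size, cells_per_row):
    """Mapping to a single-attribute space: the data space is divided into
    equal-sized grid cells, numbered here in row-major order. A rectangle is
    represented by the set of numbers of the cells it overlaps; these
    one-dimensional values can be indexed with a B+-tree."""
    (x1, y1), (x2, y2) = rect
    cells = set()
    for cy in range(int(y1 // cell_size), int(y2 // cell_size) + 1):
        for cx in range(int(x1 // cell_size), int(x2 // cell_size) + 1):
            cells.add(cy * cells_per_row + cx)
    return cells
```

Note how the second mapping already exhibits the duplication inherent in this approach: a rectangle spanning several cells is represented by several one-dimensional values.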
The non-overlapping native space indexing approach. This category
comprises two classes of techniques:
• Object duplication. A k-dimensional data space is partitioned into pairwise
disjoint subspaces. These subspaces are then indexed. An object identifier
is duplicated and stored in all the subspaces it intersects.
• Object clipping. This technique is similar to the object duplication approach.
Instead of duplicating the identifier, an object is decomposed into several
disjoint smaller objects so that each smaller sub-object is totally included in
a subspace.
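The two techniques can be contrasted with a one-dimensional sketch, where the space is partitioned into disjoint, equal-width subspaces; this is an illustrative simplification, not any particular index's algorithm.

```python
# Object duplication vs. object clipping over a one-dimensional space
# partitioned into disjoint subspaces of width `width`. An object is a
# closed interval (lo, hi).

def duplicate(obj_id, interval, width):
    """Duplication: store the identifier in every subspace the object
    intersects (the object itself is kept whole)."""
    lo, hi = interval
    return {s: obj_id
            for s in range(int(lo // width), int(hi // width) + 1)}

def clip(obj_id, interval, width):
    """Clipping: decompose the object into disjoint pieces, each totally
    included in one subspace."""
    lo, hi = interval
    pieces = []
    for s in range(int(lo // width), int(hi // width) + 1):
        pieces.append((obj_id, (max(lo, s * width), min(hi, (s + 1) * width))))
    return pieces
```

Either way, an object spanning several subspaces produces several index entries, which is exactly the extra storage and update cost noted below.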
The most important property of object duplication or clipping is that the data
structures used are straightforward extensions of the underlying point indexing
structures. Also, both points and multi-dimensional non-zero sized objects
can be stored together in one file without having to modify the structure.
However, an obvious drawback is the duplication of objects which requires extra
storage and hence more expensive insertion and deletion procedures. Another
limitation is that the density (the number of objects that contain a point) in
a map space must be less than the page capacity (the maximum number of
objects that can be stored in a page).
The overlapping native space indexing approach. The basic idea of
this approach to indexing a spatial database is to hierarchically partition the
data space into a manageable number of smaller subspaces. While a point
object is totally included in an unpartitioned subspace, a non-zero sized object
may extend over more than one subspace. Rather than supporting disjoint
subspaces as in the non-overlapping space indexing approach, the overlapping
native space indexing approach allows overlapping subspaces such that objects
are totally included in only one of the subspaces. These subspaces are organized
as a hierarchical index and spatial objects are indexed in their native space. A
major design criterion for indexes using such an approach is the minimization
of both the overlap between bounding subspaces and the coverage of subspaces.
A poorly designed partitioning strategy may lead to unnecessary traversal of
multiple paths. Further, dynamic maintenance of effective bounding subspaces
incurs high overhead during updates.
A number of indexing structures use more than one extending technique.
Since each extending method has its own weaknesses, the combination of two or
more methods may help to compensate for each other's weaknesses. However,
an often overlooked fact is that the use of more than one extending method may
also produce a counter effect: inheriting the weaknesses from each method.
Figure 2.1 shows the evolution of spatial indexing structures, adapted
from [Lu and Ooi, 1993]. A solid arrow indicates a relationship between a new
structure and the original structures that it is based upon. A dashed arrow
indicates a relationship between a new structure and the structures from which
the techniques used in the new structure originated, even though some were
proposed independently of the others.

Figure 2.1. Evolution of spatial index structures.

quent sections, the indexes are classified into four groups based on their base
structures: namely, binary trees, B-trees, hashing, and space filling methods.
Most spatial indexing structures (such as R-trees, R*-trees, skd-trees) are
nondeterministic in that different sequences of insertions result in different tree
structures and hence different performance even though they have the same set
of data. The insertion algorithm must be dynamic so that the performance of
an index will not be dependent on the sequence of data insertion. During the
design of a spatial index, issues that need to be minimized are:
• The area of covering rectangles maintained in internal nodes.
• The overlaps between covering rectangles for indexes developed based on the
overlapping native space indexing approach.
• The number of objects being duplicated for indexes developed based on the
non-overlapping native space indexing approach.
• The directory size and its height.
There is no straightforward solution to fulfill all the above conditions. The
fulfillment of the above conditions by an index can generally ensure its efficiency,
but this may not be true for all applications. The design of an index also needs
to take computational complexity into consideration, although this is a less
dominant factor given the increasing computational power of today's
systems. Other factors that affect the performance of information retrieval as a
whole include buffer design, buffer replacement strategies, space allocation on
disks, and concurrency control methods.
2.3 Binary-tree based indexing techniques
The binary search tree is a basic data structure for representing data items
whose index values are ordered by some linear order. The idea of repetitively
partitioning a data space has been adopted and generalized in many sophisti-
cated indexes. In this section, we will examine spatial indexes originated from
the basic structure and concept of binary search trees.
2.3.1 The kd-tree
The kd-tree [Bentley, 1975], a k-dimensional binary search tree, was proposed
by Bentley to index multi-attribute data. A node in the tree (see Figure 2.2)
serves two purposes: representation of an actual data point and direction of a
search. A discriminator, whose value is between 0 and k-1 inclusive, is used to
indicate the key on which the branching decision depends. A node P has two
children, a left son LOSON(P) and a right son HISON(P). If the discriminator
value of node P is the jth attribute (key), then the jth attribute of any node in
the LOSON(P) is less than the jth attribute of node P, and the jth attribute
of any node in the HISON(P) is greater than or equal to that of node P. This
property enables the range along each dimension to be defined during a tree
traversal such that the ranges are smaller in the lower levels of the tree.
(a) The planar representation. (b) The structure of a kd-tree.
Figure 2.2. The organization of data in a kd-tree.
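A minimal sketch of the homogeneous kd-tree just described, with insertion and a range search that prunes subtrees using the discriminator. Cycling the discriminator through the k dimensions by level is one common convention; this is an illustrative sketch, not Bentley's original code.

```python
# Homogeneous kd-tree sketch: each node stores an actual data point and
# directs the search via its discriminator (depth mod k).

class KDNode:
    def __init__(self, point):
        self.point, self.loson, self.hison = point, None, None

def insert(root, point, k, depth=0):
    if root is None:
        return KDNode(point)
    d = depth % k                        # discriminator for this level
    if point[d] < root.point[d]:
        root.loson = insert(root.loson, point, k, depth + 1)
    else:                                # equal-or-greater values go to HISON
        root.hison = insert(root.hison, point, k, depth + 1)
    return root

def range_search(root, lo, hi, k, depth=0, out=None):
    """Report points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if out is None:
        out = []
    if root is None:
        return out
    d = depth % k
    if all(l <= c <= h for c, l, h in zip(root.point, lo, hi)):
        out.append(root.point)
    if lo[d] < root.point[d]:            # LOSON may still contain matches
        range_search(root.loson, lo, hi, k, depth + 1, out)
    if hi[d] >= root.point[d]:           # HISON may still contain matches
        range_search(root.hison, lo, hi, k, depth + 1, out)
    return out
```

The pruning in `range_search` is exactly the property noted above: the discriminator bounds the range along one dimension at each level, so whole subtrees outside the query range are never visited.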
Complications arise when an internal node is deleted. When an internal
node is deleted, say Q, one of the nodes in the subtree whose root is Q must
be obtained to replace Q. Suppose i is the discriminator of node Q, then
the replacement must be either a node in the right subtree with the smallest
ith attribute value in that subtree, or a node in the left subtree with the
biggest ith attribute value. The replacement of a node may also cause successive
replacements.
To reduce the cost of deletion, a non-homogeneous kd-tree [Bentley, 1979b]
was proposed. Unlike a homogeneous index, a non-homogeneous index does
not store data in the internal nodes and its internal nodes are used merely as
directory. When splitting an internal node, instead of selecting a data point,
the non-homogeneous kd-tree selects an arbitrary hyperplane (a line in the
two-dimensional space) to partition the data points into two groups having
almost the same number of data points; all data points reside in the leaf
nodes.
The kd-tree has been the subject of intensive research over the past decade
[Banerjee and Kim, 1986, Beckley et al., 1985a, Beckley et al., 1985b, Beckley
et al., 1985c, Bentley and Friedman, 1979, Bentley, 1979a, Chang and Fu,
1979, Eastman and Zemankova, 1982, Friedman et al., 1987, Lee and Wong,
1977, Matsuyama et al., 1984, Ohsawa and Sakauchi, 1983, Orenstein, 1982,
Overmars and Leeuwen, 1982, Robinson, 1981, Rosenberg, 1985, Shamos and
Bentley, 1978, Sharma and Rani, 1985]. Many variants have been proposed
in the literature to improve its performance with respect to issues such as
clustering, searching, storage efficiency and balancing.
2.3.2 The K-D-B-tree
To improve the paging capability of the kd-tree, the K-D-B-tree was proposed
[Robinson, 1981]. K-D-B-tree is essentially a combination of a kd-tree and a
B-tree [Bayer and McCreight, 1972, Comer, 1979], and consists of two basic
structures: region pages and point pages (see Figure 2.3). While point pages
contain object identifiers, region pages store the descriptions of subspaces in
which the data points are stored and the pointers to descendant pages. Note
that in a non-homogeneous kd-tree [Bentley, 1979b], a space is associated with
each node: a global space for the root node, and an unpartitioned subspace
for each leaf node. In the K-D-B-tree, these subspaces are explicitly stored in
a region page. These subspaces (for example, 811, 812 and 813) are pairwise
disjoint and together they span the rectangular subspace of the current region
page (for example, 81), a subspace in the parent region page.
During insertion of a new point into a full point page, a split will occur. The
point page is split such that the two resultant point pages will contain almost
the same number of data points. Note that a split of a point page requires an
extra entry for the new point page; this entry will be inserted into the parent
region page. Therefore, the split of a point page may cause the parent region
page to split as well, which may further ripple all the way to the root; thus the
tree is always perfectly height-balanced.
When a region page is split, the entries are partitioned into two groups
such that both have almost the same number of entries. A hyperplane is used
to split the space of a region page into two subspaces and this hyperplane
may cut across the subspaces of some entries. Consequently, the subspaces
that intersect with the splitting hyperplane must also be split so that the new
subspaces are totally contained in the resultant region pages. Therefore, the
split may propagate downward as well. If the constraint of splitting a region
page into two region pages containing about the same number of entries is not
enforced, then downward propagation of split may be avoided. The dimension
for splitting and the splitting point are chosen such that both the resultant
pages have almost the same number of entries and the number of splittings is
minimized. However, there is no discussion on the selection of splitting points.
(a) Planar partition. (b) A hierarchical K-D-B-tree structure.
Figure 2.3. The K-D-B-tree structure.
The upward propagation of a split will not cause the underflow of pages
but the downward propagation is detrimental to storage efficiency because a
page may contain fewer entries than the usual page threshold, typically half of the page
capacity. To avoid unacceptably low storage utilization, local reorganization
can be performed. For example, two or more pages whose data space forms a
rectangular space and who have the same parent can be merged followed by a
resplit if the resultant page overflows.
The K-D-B-tree has incorporated the pagination of the B-tree and the tree
is height-balanced as a result. Nevertheless, poorer storage efficiency is the
trade-off.
2.3.3 The hE-tree
In the K-D-B-tree, a region node is split by cutting the region with a plane,
possibly cutting through some subregions as well. The child nodes with their
space being cut must also invoke the splitting process, causing sparse nodes at
lower levels. To overcome such a problem, a new multi-attribute index structure
called the holey brick B-tree (the hB-tree) [Lomet and Salzberg, 1990a] allows
the data space to be holey, enabling removal of any data subspace from a
data space. The concept of holey bricks is not new - it has been used to
improve the clustering of data in a kd-tree known as the BD-tree [Ohsawa and
Sakauchi, 1983]. The hB-tree structure is based on the K-D-B-tree structure
and hence preserves the height-balanced property. However, it allows the data
space associated with a node to be non-rectangular and it uses kd-trees for
space representation in its internal nodes. In an hB-tree, the leaf nodes are
known as data nodes and the internal nodes as index nodes. The data space
of an index node is the union of its child node subspaces which are obtained
through kd-tree recursive partitioning.
(a) Internal structure of an hB-tree index node. (b) The resultant pages after a split.
Figure 2.4. The hB-tree structure.
A k-dimensional data space represented by its boundaries requires 2k co-
ordinates. To obtain a data space of interest to the search, half of the data
subspaces in a node have to be searched on average and for each data space,
2k comparisons are required. For m data spaces, we need on average m · k
comparisons. The m data subspaces derived through kd-tree recursive parti-
tioning can be represented by a kd-tree with m - 1 kd-tree nodes. It requires
one comparison at each internal node and 2k comparisons for the unpartitioned
subspace. The average number of comparisons is much smaller than that of the
boundary representation. The use of kd-trees therefore reduces the search time
as well as the storage space requirement.
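The comparison counts above can be checked with a small calculation. This mirrors the chapter's rough accounting; the function names are illustrative, and `kdtree_comparisons` counts one comparison per internal kd-tree node plus 2k for the final unpartitioned subspace.

```python
# Rough comparison counts for locating a data subspace within an index node
# holding m subspaces in a k-dimensional space.

def boundary_comparisons(m, k):
    """Boundary representation: on average half of the m subspaces are
    examined, at 2k comparisons each."""
    return (m / 2) * 2 * k           # = m * k on average

def kdtree_comparisons(m, k):
    """kd-tree representation: at most one comparison per internal node
    (m - 1 nodes for m subspaces) plus 2k for the final subspace."""
    return (m - 1) + 2 * k
```

For example, with m = 16 subspaces in two dimensions, the boundary representation costs about 32 comparisons on average while the kd-tree representation costs at most 19, and the gap widens as m grows.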
Like conventional kd-trees, internal nodes of the kd-tree structure in an hB-
tree index node partition the search space recursively. Its leaf nodes reference
some index nodes of the hB-tree. However, multiple leaves of a kd-tree structure
may refer to the same hB-tree index node (see Figure 2.4a), giving rise to the
"holey brick" representation. As such, the hB-tree is not truly a tree. During
a split, the kd-tree is split into two subtrees, with each having between 1/3 and
2/3 of the nodes. In order to achieve this, a subtree may have to be extracted
from the original tree structure. This causes duplication of a portion of the
tree close to the root in the parent index node. A leaf node of such a kd-
tree references either an hB-tree data node, an index node, or a marker (ext
in Figure 2.4b) indicating that a subtree has previously been extracted and
is referenced from a higher level index node. The deletion algorithm is not
addressed in the paper.
The hB-tree overcomes the problem of sparse nodes in the K-D-B-tree. How-
ever, this is achieved at the expense of more expensive node splitting and node
deletion. The multiple references of an hB-tree node may cause a path to be
traversed more than once. Of course, this can be avoided by checking the list
of traversed hB-tree nodes. Deletion may result in the kd-tree being collapsed
to remove the duplicated portion of kd-trees, followed by a resplit if necessary.
2.3.4 The skd-tree
Ooi et al. [Ooi et al., 1987, Ooi et al., 1991] developed an indexing structure
called the spatial kd-tree (the skd-tree) in an attempt to avoid object duplica-
tion and object mapping. At each node of a kd-tree, a value (the discriminator
value) is chosen in one of the dimensions to partition a k-dimensional space
into two subspaces. The two resultant subspaces, HISON and LOSON, nor-
mally have almost the same number of data objects. Point objects are totally
included in one of the two resultant subspaces, but non-zero sized objects may
extend over to the other subspace. To avoid the division of objects and the
duplication of identifiers in several subspaces, and yet to be able to retrieve
all the wanted objects, a virtual subspace for each original subspace was in-
troduced such that all objects are totally included in one of the two virtual
subspaces [Ooi et al., 1987]. With this method, the placement of an object in
a subspace is based solely upon the value of its centroid.
Since a space is always divided into two, an additional value for each subspace
is required: the maximum of the objects in the LOSON subspace (maxLOSON),
and the minimum of the objects in the HISON subspace (minHISON), along
the dimension defined by the discriminator. Thus, the structure of an internal
node of the skd-tree consists of two child pointers, a discriminator (0 to k-1 for
a k-dimensional space), a discriminator-value, maxLOSON and minHISON
along the dimension specified by the discriminator. The maximum range value
52 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
of LOSON (maxLOSON) is the nearest virtual line that bounds the data objects
whose centroids are in the LOSON subspace, and the minimum range value of
HISON (minHISON) is the nearest virtual line that bounds the data objects
whose centroids are in the HISON subspace.
Leaf nodes contain min-range and max-range (in place of maxLOSON and
minHISON of an internal node), respectively describing the minimum and max-
imum values of objects in the data page along the dimension specified by bound,
and a pointer to the secondary page which contains the object bounding rect-
angles and identifiers. The minimum and maximum values could be kept for k
dimensions. However, for storage efficiency, the range along one dimension that
results in the smallest bounding rectangle is chosen. It has been shown [Ooi,
1990] that keeping the ranges for all k dimensions increases the height of the
tree when it is stored as a multiway tree, and hence the improvement would be
fairly marginal. Figure 2.5
shows the structure of a two-dimensional skd-tree and illustrates the virtual
boundary (dotted line), minHISON or maxLOSON of each resultant subspace.
An implicit rectangular space is associated with each node and it is ma-
terialized during traversal. This rectangle is tested against the query region,
and the subtree is examined if they intersect. Since the virtual boundary may
sometimes bound the objects tighter than the partitioning line, the intersec-
tion search takes advantage of the existing virtual boundary to prune the search
space efficiently. To further exploit the virtual boundaries, a containment search,
which retrieves all spatial objects contained in a given query rectangle, was pro-
posed. During tree traversal, the algorithm always selects the boundaries that
yield smaller search space. The direct support of containment search is useful
to operators like within and contain. The search rapidly eliminates all objects
that are not totally contained in the query region.
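The pruning rule described above can be sketched in Python. This is a minimal illustration under our own assumptions, not code from the original paper: the names SkdNode and intersection_search are ours, rectangles are modeled as per-dimension (lo, hi) intervals, and leaves as plain lists of (rectangle, identifier) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]  # ((lo, hi), ...) per dimension

@dataclass
class SkdNode:
    disc: int            # discriminator dimension, 0..k-1
    disc_value: float    # partitioning value along dimension `disc`
    max_loson: float     # virtual boundary: max extent of LOSON objects along `disc`
    min_hison: float     # virtual boundary: min extent of HISON objects along `disc`
    loson: object = None # SkdNode, or a list of (rect, oid) pairs at a leaf bucket
    hison: object = None

def intersects(r: Rect, q: Rect) -> bool:
    return all(lo <= qhi and qlo <= hi
               for (lo, hi), (qlo, qhi) in zip(r, q))

def intersection_search(node, query: Rect, out: List):
    if isinstance(node, list):               # leaf bucket: test each object
        out.extend(oid for rect, oid in node if intersects(rect, query))
        return
    d = node.disc
    qlo, qhi = query[d]
    # LOSON objects lie within (-inf, max_loson] along d; the virtual
    # boundary may bound the objects tighter than the partition line itself.
    if qlo <= node.max_loson:
        intersection_search(node.loson, query, out)
    # HISON objects lie within [min_hison, +inf) along d.
    if qhi >= node.min_hison:
        intersection_search(node.hison, query, out)
```

Note how a query strip falling between max_loson and min_hison prunes both subtrees, which the partition line alone could not do.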
Inserting index records for new data objects is similar to insertion into a
point kd-tree. As new index records are added to a bucket, the bucket is split
if it overflows. At each node, the algorithm uses the centroid of the bounding
rectangle of the new object to determine in which subspace the object will be
placed, and updates the virtual boundary if necessary.
To delete an object, the centroid of its bounding rectangle is used to deter-
mine where the object resides. The removal of an object may cause a bucket
to underflow, and merging or reinsertion is then required. If the neighboring
node is a leaf-node, then the two buckets are merged and the resultant bucket
is resplit if overflow occurs. Otherwise, the records are required to be inserted
into the neighboring subtree, and the neighboring node is promoted to replace
the parent node. The merging follows the principle of the buddy system [Niev-
ergelt et al., 1984]; that is, the region of two merged nodes is rectangular and
is a proper subspace derivable from discriminator values in parent nodes. The
major problem with deletion occurs when an object that contributes to the bound-
ary of a virtual space is deleted. A new, tighter boundary then needs to replace
the old one, which may no longer bound the remaining objects effectively. The
operation can be expensive, as several pages whose space is adjacent to the deleted
boundary need to be searched. The cost can be reduced by periodically sweeping
the subtrees that are affected by deletions. It should be noted that delaying the
search for a replacement boundary does not result in any invalid answers.
SPATIAL DATABASES 53
[Figure 2.5. The structure of a spatial kd-tree: (a) a 2-d directory of the skd-tree;
(b) a 2-d space coordinate representation.]
The directory of the skd-tree is stored in secondary memory. The bottom-up
approach for binary tree paging [Cesarini and Soda, 1982] is modified to store
the skd-tree as a multiway-tree. When such a page splits, one of the subtrees
is migrated to an existing page that can accommodate the subtree or a new
page, and the root of the subtree is promoted to the parent page.
It was shown that the containment search is insensitive to the different sizes
of objects and distribution of objects, and it is always more efficient than the
intersection search due to a smaller search space [Ooi et al., 1991]. It can be
noticed that the leaf nodes of the skd-tree take up about half of the storage
requirement for the directory. The main objective of having such a layer of leaf
nodes is to reduce the fetching of data pages. Experiments have been conducted
to evaluate the performance of skd-trees with and without the leaf nodes, under
different data distributions [Ooi, 1990]. The experiments show that for uniform
distributions of spatial objects, the leaf nodes can reduce the page accesses.
However, when the distributions are skewed, the extra layer is not effective,
and the larger directory incurs more page reads than the modified skd-tree
(the variant without leaf nodes) does. The modified skd-tree, which has fewer
nodes, saves up to 40% of the directory storage space.
2.3.5 The BD- and GBD-trees
The BD-tree [Ohsawa and Sakauchi, 1983], a variant of kd-trees, allows a more
dynamic partitioning of space. Each non-leaf node in the BD-tree contains
a variable-length string, called the discriminator zone (DZ) expression, con-
sisting of 0's and 1's. A 0 means "<" and a 1 means "≥", with the leftmost digit
corresponding to the first binary division, and the nth bit corresponding to
the nth binary division. The string describes the left subspace while the right
subspace is its complement. Each string uniquely describes a space. A data
space whose DZ expression (for example, 0100) is an initial substring of a longer
DZ expression (for example, 010001) encloses the data space of the latter. A BD-
tree is different from a kd-tree in the following aspects. One, the data space
of a BD-tree node is not a hyper-rectangle. The use of complement makes the
space holey. Two, unlike the conventional kd-tree, the use of DZ expression
enables rotation, achieving a greater degree of balancing. Three, the partition
divides a space into two equal sized subspaces. Four, the discriminators are
used cyclically so that each bit of a DZ expression can be correctly associated
with a dimension.
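The prefix property of DZ expressions, and the cyclic halving that produces them, can be sketched as follows. This is an illustrative reconstruction under our own assumptions (a rectangular data space halved cyclically over the dimensions); the function names are ours, not from the BD-tree paper.

```python
def dz_contains(e1: str, e2: str) -> bool:
    """The data space of DZ expression e1 encloses that of e2
    iff e1 is an initial substring (prefix) of e2."""
    return e2.startswith(e1)

def dz_encode(point, bounds, depth: int) -> str:
    """Encode a k-d point as a DZ string of `depth` bits by halving the
    space cyclically over the dimensions (0 means "<", 1 means ">=")."""
    k = len(point)
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    bits = []
    for i in range(depth):
        d = i % k                       # discriminators are used cyclically
        mid = (lo[d] + hi[d]) / 2
        if point[d] < mid:
            bits.append("0")
            hi[d] = mid                 # descend into the lower half
        else:
            bits.append("1")
            lo[d] = mid                 # descend into the upper half
    return "".join(bits)
```

For example, dz_contains("0100", "010001") holds, matching the enclosure relation described in the text.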
The BD-tree is expanded to a balanced multi-way tree called the GBD-
tree (generalized BD-tree) [Ohsawa and Sakauchi, 1990]. In addition to a DZ
expression, a bounding rectangle is used to describe a data space that bounds
the objects whose centroids fall inside the region defined by the DZ expression.
Centroids of objects are used to determine placement of objects in the correct
bucket. While a DZ expression is used to determine the position in the tree
structure where an entity is located based on its centroid, a bounding rectangle
is used in intersection search.
In an internal node, each entry describes a data space obtained through
binary decomposition. The union of these data spaces forms the data space of
the node. While the data spaces described by the entries' DZ expressions do
not overlap, their associated bounding rectangles overlap. During point search
of an entity, an inclusion check of the DZ expression of the entity is performed
against the DZ expression of a node. For the data space that includes the entity,
its subtree is traversed. For the intersection search, the bounding rectangles
stored in a node are used instead to select subtrees for traversal.
When a leaf node overflows, it is split into two. A recursive binary decom-
position on alternating axes is performed on the overflowed data space until a
subspace contains at least 2(M+1)/3 entries, where M is the maximum number
of entries a node can contain. While the smaller space has a new DZ expression,
the other subspace takes the DZ expression of the space before splitting. We
call such a space a complementary subspace. A new entry is inserted into the
parent node and the affected bounding rectangles are re-adjusted accordingly.
In an internal node splitting, the subspaces are checked in decreasing order
of their sizes to find a data space that contains almost (M + 1)/2 entries. A
data space described by the DZ expression el contains the data space described
by the DZ expression e2, if el forms the initial substring of e2. In the testing, all
DZ expressions must be checked. The worst case is when a node is split into two
nodes respectively having M entries and one entry. The DZ expression obtained
is used as the DZ expression of a new node. The other new node, which re-
uses the original node, is assigned with the DZ expression of the original space.
When an entry is deleted, a node may underflow. Like B-trees, tree collapsing
is required.
Conceptually, the GBD-tree is similar to the BANG file [Freeston, 1987].
The use of bounding rectangles can be applied to the BANG file. The GBD-
tree has been shown to have better efficiency than the R-tree in terms of tree
construction time for a small set of data [Ohsawa and Sakauchi, 1990].
2.3.6 The LSD-tree
As an improvement to the fixed size space partitioning of the grid files, a binary
tree, called the Local Split Decision tree (LSD-tree), that supports arbitrary
split position was proposed [Henrich et al., 1989a]. A split position can be
chosen such that it is optimal with respect to the current cell. The directory
of an LSD-tree is similar to that maintained by the kd-tree [Bentley, 1975].
Each node of the LSD-tree represents one split and stores the split dimension
(cf: the discriminator of kd-trees) and position (cf: the discriminator value of
kd-trees), and each leaf node points to a data bucket.
In an LSD-tree, the nodes in a directory T are divided into two directories:
the internal directory and the external directory. The internal directory consists
of a subtree that contains the root and is stored in main memory. The external
directory consists of multiway-trees and is stored in secondary memory. In an
external directory page, the subtree is organized as a heap. When a directory
page is split, the root node of that directory page is inserted into the directory
T and the left and right subtrees are stored in two distinct directory pages.
The main objective of the paging algorithm [Henrich et al., 1989b] is to ensure
that the heights of multiway-trees differ by at most one directory page. The
proposed paging strategy is similar to binary paging strategy [Cesarini and
Soda, 1982], although the latter makes no distinction between the external
and internal directories. The major difference is that the internal directory is
restructured such that the heights of multi-way trees in the external directory
always differ by at most one page. To achieve this, nodes close to the boundary
that separates the internal and external directories must be moved around
between these two directories. Note that the size of the internal directory
depends on the allocated internal memory. As with kd-trees, rotation of the tree
is not possible; if the data is very skewed, the property that the heights differ
by at most one cannot be upheld.
The deletion algorithm is not presented. We believe that the deletion of
[Cesarini and Soda, 1982] can be applied here.
2.4 B-tree based indexing techniques
B+-trees have been widely used in data intensive systems to facilitate query
retrieval. The wide acceptance of the B+-tree is due to its elegant height-balanced
structure, which makes it ideal for disk I/O, where data is transferred in units
of pages. It has become an underlying structure for many new indexes. In this
section, we discuss indexes based on the concept of the hierarchical structure
of B+-trees.
2.4.1 The R-tree
The R-tree [Guttman, 1984] is a multi-dimensional generalization of the B-tree
that preserves height balance. Like the B-tree, node splitting and merging are
required for inserting and deleting objects. The R-tree has received a great
deal of attention due to its well defined structure and the fact that it is one of
the earliest proposed tree structures for non-zero sized spatial object indexing.
Many papers have used the R-tree as a model to measure the performance of
their structures.
An entry in a leaf node consists of the object-identifier of a data object
and a k-dimensional bounding rectangle which bounds the object. In a
non-leaf node, an entry contains a child-pointer pointing to a lower level node
in the R-tree and a bounding rectangle covering all the rectangles in the lower
nodes in the subtree. Figure 2.6 illustrates the structure of an R-tree.
[Figure 2.6. The structure of an R-tree: (a) a planar representation; (b) the
directory of an R-tree.]
In order to locate all objects which intersect a query rectangle, the search
algorithm descends the tree from the root. The algorithm recursively traverses
down the subtrees of bounding rectangles that intersect the query rectangle.
When a leaf node is reached, bounding rectangles are tested against the query
rectangle and their objects are fetched for testing if they intersect the query
rectangle.
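The recursive descent just described can be sketched as follows. This is an illustration of the algorithm rather than a faithful page-oriented implementation; nodes are modeled as plain dictionaries, and the names are ours.

```python
def overlaps(a, b):
    """Axis-aligned rectangles given as ((lo, hi), ...) per dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def search(node, query, out):
    """Collect identifiers of all objects whose bounding rectangles
    intersect the query rectangle, descending every intersecting subtree."""
    if node["leaf"]:
        out.extend(oid for rect, oid in node["entries"]
                   if overlaps(rect, query))
    else:
        for rect, child in node["entries"]:
            if overlaps(rect, query):   # several entries may qualify
                search(child, query, out)
```

Because covering rectangles in an internal node may overlap, several subtrees may be descended for one query, which is exactly why coverage and overlap minimization matter.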
To insert an object, the tree is traversed and all the rectangles in the current
non-leaf node are examined. The constraint of least coverage is employed to
insert an object: the rectangle that needs the least enlargement to enclose the
new object is selected; the one with the smallest area is chosen if more than
one rectangle meets the first criterion. The nodes in the subtree indexed by
the selected entry are examined recursively. Once a leaf node is obtained, a
straightforward insertion is made if the leaf node is not full. However, the leaf
node needs splitting if it overflows after the insertion is made. For each node
that is traversed, the covering rectangle in the parent is readjusted to tightly
bound the entries in the node. For a newly split node, an entry with a covering
rectangle that is large enough to cover all the entries in the new node is inserted
in the parent node if there is room in the parent node. Otherwise, the parent
node will be split and the process may propagate to the root.
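The least-coverage rule for choosing a subtree can be sketched as below; the helper names are ours, and rectangles are again per-dimension (lo, hi) intervals.

```python
def area(rect):
    p = 1.0
    for lo, hi in rect:
        p *= hi - lo
    return p

def enlarged(rect, new):
    """Smallest rectangle covering both `rect` and `new`."""
    return tuple((min(lo, nlo), max(hi, nhi))
                 for (lo, hi), (nlo, nhi) in zip(rect, new))

def choose_subtree(entries, new_rect):
    """Least-coverage rule: pick the entry whose covering rectangle needs
    the least area enlargement to enclose new_rect; ties are broken by
    choosing the rectangle with the smallest area."""
    def key(entry):
        rect, _child = entry
        return (area(enlarged(rect, new_rect)) - area(rect), area(rect))
    return min(entries, key=key)
```

The tuple key encodes both criteria at once: Python's min compares the enlargement first and falls back to the area only on a tie.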
To remove an object, the tree is traversed and each entry of a non-leaf node
is checked to determine if the object overlaps its covering rectangle. For each
such entry, the entries in the child node are examined recursively. The deletion
of an object may cause the leaf node to underflow. In this case, the node needs
to be deleted and all the remaining entries of that node are reinserted from
the root. The deletion of an entry may also cause further deletion of nodes
in the upper levels. Thus, entries belonging to a deleted ith level node must
be reinserted into the nodes in the ith level of the tree. Deletion of an object
may change the bounding rectangle of entries in the ancestor nodes. Hence
readjustment of these entries is required.
In searching, the decision to visit a subtree depends on whether the covering
rectangle overlaps the query region. It is quite common for several covering
rectangles in an internal node to overlap the query rectangle, resulting in the
traversal of several subtrees. Therefore, the minimization of overlaps of covering
rectangles as well as the coverage of these rectangles is of primary importance
in constructing the R-tree.
The heuristic optimization criterion used in the R-tree is the minimization
of the area of the covering rectangles of internal nodes. Two algorithms are
involved in this minimization: the insertion algorithm and the node splitting
algorithm.
Of the two, the splitting algorithm affects the index efficiency more. Guttman
[Guttman, 1984] presented and studied splitting algorithms with exponential,
quadratic and linear cost, and showed that the performance of the quadratic
and linear algorithms were comparatively similar. The quadratic algorithm
in a node splitting first locates the two entries that are furthest apart, that is,
the pair of entries that would waste the largest area if they were put in the same
group. These two rectangles are known as the seeds, and the pair chosen tends
to be small relative to the others. Two groups are formed, each with one seed.
For the remaining entries, each entry rectangle is used to calculate the area
enlargement required in the covering rectangle of each group to include the
entry. The difference of two area enlargements is calculated and the entry that
has the maximum difference is selected as the next entry to be included into the
group whose covering rectangle needs the least enlargement. As the selection
is mainly based on the minimal enlargement of covering rectangles, and a
rectangle that has been enlarged before requires less expansion to include the
next rectangle, it is quite common for a single covering rectangle to be enlarged
until its group has M - m + 1 rectangles (M is the maximum number of entries
per node). The two resultant groups will respectively contain M - m + 1 and
m rectangles. The linear algorithm chooses the first two objects based on the
separation between the objects in relation to the width of the entire group along
the same dimension. Greene proposed a slightly different splitting algorithm
[Greene, 1989]. In her splitting algorithm, the two most distant rectangles are
selected and, for each dimension, the separation is calculated. Each separation
is normalized by dividing it by the interval of the covering rectangle on the
same dimension, instead of by the total width of the entire group [Guttman,
1984]. Along the dimension with the largest normalized separation, rectangles
are ordered on the lower coordinate. The list is then divided into two groups,
with the first (M + 1)/2 rectangles into the first group and the rest into the
other.
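Guttman's quadratic split can be sketched as follows. This is a simplified illustration: it follows the seed selection and pick-next rules described above, but omits the safeguard that stops assignment once one group must absorb all remaining entries to reach the minimum fill m; the function names are ours.

```python
from itertools import combinations

def area(r):
    p = 1.0
    for lo, hi in r:
        p *= hi - lo
    return p

def cover(a, b):
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def quadratic_split(rects):
    # Seeds: the pair of rectangles that would waste the most area
    # if placed in the same group (dead area of their joint cover).
    i, j = max(combinations(range(len(rects)), 2),
               key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))
    groups, covers = [[i], [j]], [rects[i], rects[j]]
    rest = [x for x in range(len(rects)) if x not in (i, j)]
    while rest:
        # Pick-next: the entry whose two enlargement costs differ the most
        # is assigned first, to the group needing the lesser enlargement.
        def cost(x):
            return [area(cover(covers[g], rects[x])) - area(covers[g])
                    for g in (0, 1)]
        x = max(rest, key=lambda e: abs(cost(e)[0] - cost(e)[1]))
        d = cost(x)
        g = 0 if d[0] < d[1] else 1
        groups[g].append(x)
        covers[g] = cover(covers[g], rects[x])
        rest.remove(x)
    return groups
```

Its quadratic cost comes from the seed search over all pairs; the pick-next loop then re-evaluates the remaining entries at each step.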
2.4.2 The R*-tree
Minimization of both coverage and overlaps is crucial to the performance of
the R-tree. It is, however, impossible to minimize the two at the same time. A
balancing criterion must be found such that near-optimality of both minimiza-
tions can produce the best result. Beckmann et al. introduced an additional
optimization objective concerning the margin of the covering rectangles; squar-
ish covering rectangles are preferred [Beckmann et al., 1990]. Since clustering
rectangles with little variance of the lengths of the edges tend to reduce the area
of the cluster's covering rectangle, a criterion that favors squarish cov-
ering rectangles is used in the insertion and splitting algorithms. This variant
of R-tree is referred to as the R*-tree.
In the leaf nodes of the R*-tree, a new record is inserted into the page whose
covering rectangle, if enlarged, has the least overlap with the other covering
rectangles. A tie is resolved by choosing the entry whose rectangle needs the
least area enlargement. However, in the internal nodes, an entry whose covering
rectangle needs the least area enlargement is chosen to include the new record,
and a tie is resolved by choosing the entry with the smallest resultant area.
The improvement is particularly significant when both the query rectangles
and data rectangles are small, and when the data is non-uniformly distributed.
In the R*-tree splitting algorithm, along each axis, the entries are sorted by
the lower value, and also sorted by the upper value of the entry rectangles. For
each sort, M - 2m + 2 distributions of splits are considered; in the kth
distribution (1 ≤ k ≤ M - 2m + 2), the first group contains the first m - 1 + k
entries and the other group contains the remaining M - m - k + 2 entries. For
each split, the total area, the sum of edges and the overlap-area of the two new
covering rectangles are used to determine the split. Note that not all three can
be minimized at the same time. Three selection criteria were proposed based on
the minimum over one dimension, the minimum of the sum of the three values
over one dimension or one sort, and the overall minimum. In the algorithm,
the minimization of the edges is used.
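The enumeration of distributions for one sorted axis can be sketched as below. This is an illustrative reconstruction of the bookkeeping only (margin, overlap, and area per distribution), not the full R*-tree axis-selection algorithm; candidate_splits and the helpers are our names.

```python
def area(r):
    p = 1.0
    for lo, hi in r:
        p *= hi - lo
    return p

def margin(r):
    """Sum of edge lengths of a rectangle (the 'margin' criterion)."""
    return sum(hi - lo for lo, hi in r)

def cover_all(rs):
    return tuple((min(r[d][0] for r in rs), max(r[d][1] for r in rs))
                 for d in range(len(rs[0])))

def overlap(a, b):
    p = 1.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        w = min(ahi, bhi) - max(alo, blo)
        if w <= 0:
            return 0.0
        p *= w
    return p

def candidate_splits(sorted_rects, m):
    """For M+1 rectangles sorted along one axis, yield the M - 2m + 2
    distributions: the first group holds the first m - 1 + k entries,
    k = 1 .. M - 2m + 2, with the margin, overlap and area of each split."""
    M = len(sorted_rects) - 1
    for k in range(1, M - 2 * m + 3):
        g1 = sorted_rects[:m - 1 + k]
        g2 = sorted_rects[m - 1 + k:]
        c1, c2 = cover_all(g1), cover_all(g2)
        yield (k, margin(c1) + margin(c2),
               overlap(c1, c2), area(c1) + area(c2))
```

A caller would repeat this for both the lower-value and upper-value sorts on every axis, pick the split axis by the minimum total margin, and then choose the distribution with the minimum overlap.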
Dynamic hierarchical spatial indexes are sensitive to the order of the inser-
tion of data. A tree may behave differently for the same data set but with a
different sequence of insertions. Data rectangles inserted previously may result
in a bad split in the R-tree after some insertions. Hence it may be worthwhile
to do some local reorganization, which is, however, expensive. The R-tree deletion
algorithm provides reorganization of the tree to some extent, by forcing the
entries in underflowed nodes to be inserted from the root. The performance
study shows that the deletion and reinsertion can improve the R-tree perfor-
mance quite significantly [Beckmann et al., 1990]. Using the idea of reinsertion
of the R-tree, Beckmann et al. proposed a reinsertion algorithm when a node
overflows. The reinsertion algorithm sorts the entries in decreasing order of
the distance between the centroid of the entry rectangle and the centroid of the
covering rectangle, and reinserts the first p (a tunable parameter) entries. In some cases, the
entries are reinserted back into the same node and hence a split is eventually
necessary. The reinsertion increases the storage utilization, but it can be
expensive when the tree is large. The experimental study conducted indicates that
the R*-tree is more efficient than some other variants, and the R-tree using the
linear splitting algorithm is substantially less efficient than the one with the
quadratic splitting algorithm [Beckmann et al., 1990].
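The candidate selection for forced reinsertion can be sketched as follows; the function name and the (rectangle, identifier) entry representation are ours, and the surrounding reinsertion machinery is omitted.

```python
import math

def centroid(r):
    return tuple((lo + hi) / 2 for lo, hi in r)

def forced_reinsert_candidates(entries, node_cover, p):
    """Sort the entries of an overflowing node by the distance between
    their centroid and the centroid of the node's covering rectangle,
    in decreasing order, and take the first p for reinsertion."""
    c = centroid(node_cover)
    def dist(entry):
        rect, _oid = entry
        return math.dist(centroid(rect), c)   # Euclidean distance
    ranked = sorted(entries, key=dist, reverse=True)
    return ranked[:p], ranked[p:]             # (to reinsert, to keep)
```

The entries farthest from the node's center are the ones most likely to fit better elsewhere, which is why they are reinserted first.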
2.4.3 The R+-tree
The R+-tree [Sellis et al., 1987] is a compromise between the R-tree and the
K-D-B-tree [Robinson, 1981] and was proposed to overcome the problem of the
overlapping covering rectangles of internal nodes of the R-tree. The R+-tree
differs from the R-tree in the following constraints: nodes of an R+-tree are
not guaranteed to be at least half filled; the entries of any internal node do not
overlap; and an object identifier may be stored in more than one leaf node.
The duplication of object identifiers leads to the non-overlapping of entries.
In a search, the subtrees are examined only if the corresponding covering rect-
angles intersect the query region. The disjoint covering rectangles avoid the
multiple search paths of the R-tree for point queries. For the space in Fig-
ure 2.7, only one path is traversed to search for all objects that contain point
P7; whereas for the R-tree, two search paths exist. However, for certain query
rectangles, searching the R+-tree is more expensive than searching the R-tree.
For example, suppose the query region is the left half of object rs. To retrieve
all objects that intersect the query region using the R-tree, two leaf nodes have
to be searched, respectively through Rs and Rs, and it incurs five page ac-
cesses. To evaluate such a query, three leaf nodes of the R+-tree have to be
searched, respectively through R6, R9, and R10, and a total of six page accesses
is incurred.
[Figure 2.7. The structure of an R+-tree: (a) a planar representation; (b) the
directory of an R+-tree.]
To insert an object, multiple paths may be traversed. At a node, the subtrees
of all entries with covering rectangles that intersect with the object bounding
rectangle must be traversed. On reaching the leaf nodes, the object identifier
will be stored in the leaf nodes; multiple leaf nodes may store the same object
identifier.
Three cases of insertions need to be handled with care [Gunther, 1988, Ooi,
1990]. The first is when an object is inserted into a node where the covering
rectangles of all entries do not intersect with the object bounding rectangle.
The second is when the bounding rectangle of the new object only partially
intersects with the bounding rectangles of entries; this requires the bounding
rectangle to be updated to include the new object bounding rectangle. Both
cases must be handled properly so that the coverage of bounding rectangles
and the duplication of objects can be minimized.
The third case is more serious in that the covering rectangles of some entries
can prevent each other from expanding to include the new object. In other
words, some space ("dead space") within the current node cannot be covered
by any of the covering rectangles of the entries in the node. If the new object
occupies such a region, it cannot be fully covered by the entries. To avoid
this situation, it is necessary to look ahead to ensure that no dead space will
result when finding the entries to include an object. Alternatively, the crite-
rion proposed by Guttman [Guttman, 1984] can be used to select the covering
rectangles to include a new object. When a new object cannot be fully covered,
one or more of the covering rectangles are split. This means that the split may
cause the children of the entries to be split as well, which may further degrade
the storage efficiency.
During an insertion, if a leaf node is full and a split is necessary, the split
attempts to reduce the identifier duplications. Like the K-D-B-tree, the split
of a leaf node may propagate upwards to the root of the tree and the split
of a non-leaf node may propagate downwards to the leaves. The split of a
node involves finding a partitioning hyperplane to divide the original space
into two. The selection of a partitioning hyperplane was suggested to be based
on the following four criteria: the clustering of entry rectangles, minimal total
x- and y-displacement, minimal total space coverage of two new subspaces,
and minimal number of rectangle splits. While the first three criteria aim to
reduce search by tightening the coverage, the fourth criterion confines the height
expansion of the tree. The fourth criterion can only minimize the number of
covering rectangles of the next lower level that must be split as a consequence.
It cannot guarantee that the total number of rectangles being split is minimal.
Note that all four criteria cannot possibly be satisfied at the same time.
While the R+-tree overcomes the problem of overlapping rectangles of the R-
tree, it inherits some problems of the K-D-B-tree [Robinson, 1981]. Partitioning
a covering rectangle may cause the covering rectangles in the descendant sub-
tree to be partitioned as well. Frequent downward splits tend to partition the
already under-populated nodes, and hence the nodes in an R+-tree may contain
fewer than M/2 entries. Object identifiers are duplicated in the leaf nodes; the
extent of duplication depends on the spatial distribution and the size of
the objects. To delete an object, it is necessary to delete all identifiers that
refer to that object. Deletion may necessitate major reorganization of the tree.
2.4.4 The BV-tree
The BV-tree, proposed by Freeston, is a generalization of the B-tree to higher
dimensions [Freeston, 1995]. While the BV-tree guarantees that it can specialize
to (and hence preserves the properties of) a B-tree in the one-dimensional case,
at higher dimensions it may not be height-balanced, and its storage utilization
is reduced to no worse than 33% (instead of 50% in the B-tree). Despite foregoing
these two properties, it is able to maintain logarithmic access and update
times.
Based on the BANG file [Freeston, 1987], a subspace S is split into two
regions S1 and S2 such that the boundary of S1 encloses that of S2. Each
region is uniquely identified by a key, and the key is used to direct the search in
the BV-tree. Although the physical boundaries of regions may be recursively
nested, there is no correspondence between the level of nesting of a region and
the index tree hierarchy which represents it. In fact, whenever the boundary of
a region r1 directly encloses the boundary of a region r2 resulting from a
split, r1 is "promoted" closer to the root. To facilitate searching correctly,
the actual level to which r1 belongs (called a guard) is stored.
Figure 2.8 illustrates a BV-tree. As shown in the figure, the boundary of region
a0 encloses that of region b0, which in turn encloses the boundaries of regions
c0, d0 and e0. In this example, region b0 has been promoted to the root as it
serves as a guard for region b1.
[Figure 2.8. The structure of a BV-tree: (a) a planar representation; (b) the
BV-tree.]
The search begins at the root, and descends down the tree. At each node,
every entry is checked to identify a guard set that represents regions that best
match the search region. Two types of entries can be found in the guard set:
those that correspond to the guards of an unpromoted entry, and the best-
matching unpromoted entry that encloses the best-matching guard. As the tree is
descended from level h to level h - 1, the guard sets found at levels h - 1 and
h are merged in the process of which some may be pruned away. Once the leaf
node is reached, the guard set contains the regions where the search region may
be found. The data corresponding to the regions of the guard set are searched
to answer the query.
During insertion, a complication arises when a promoted region is to be split
into two such that one region encloses higher-level regions while the other does
not. In this case, the entry for the second region will have to be demoted to its
unpromoted position in the tree. Deletion may require merging and resplitting.
This requires finding a region to merge, and finding a way to split the merged
region again.
2.5 Cell methods based on dynamic hashing
Both extendible hashing [Fagin et al., 1979] and linear hashing [Kriegel and
Seeger, 1986, Larson, 1978] lend themselves to an adaptable cell method for
organizing k-dimensional objects. The grid file [Nievergelt et al., 1984] and the
EXtendible CELL (EXCELL) method [Tamminen, 1982] are extensions of dy-
namic hashed organizations incorporating a multi-dimensional file organization
for multi-attribute point data. We shall restrict our discussion to the grid file
and its variants.
2.5.1 The grid file
The grid file structure [Nievergelt et al., 1984] consists of two basic structures:
k linear scales and a k-dimensional directory (see Figure 2.9). The fundamental
idea is to partition a k-dimensional space according to an orthogonal grid. The
grid on a k-dimensional data space is defined as scales which are represented by
k one-dimensional arrays. Each boundary in a scale forms a (k-1)-dimensional
hyperplane that cuts the data space into two subspaces. Boundaries form k-
dimensional unpartitioned rectangular subspaces, which are represented by a
k-dimensional array known as the grid directory. The correspondence between
directory entries and grid cells (blocks) is one-to-one. Each grid cell in the grid
directory contains the address of a secondary page, the data page, where the
data objects that are within the grid cell are stored. As the structure does not
have the constraint that each grid cell must at least contain m objects, a data
page is allowed to store objects from several grid cells as long as the union of
these grid cells together form a rectangular rectangle, which is known as the
storage region. These regions are pairwise disjoint, and together they span the
SPATIAL DATABASES 65
data space. For most applications, the size of the directory dictates that it be
stored on secondary storage; the scales, however, are much smaller and may be
cached in main memory.
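The two-step lookup just described, consulting the in-memory scales to locate the grid cell and then fetching the directory entry for the data-page address, can be sketched as follows. The scale and directory representations (sorted boundary lists and nested arrays) and the page names are illustrative assumptions, not the original implementation.

```python
from bisect import bisect_right

def grid_lookup(point, scales, directory):
    """Exact-match search in a grid file (illustrative sketch).

    scales    -- k sorted lists of partition boundaries, one per dimension
    directory -- k-dimensional nested lists; each cell holds a data-page id
    """
    cell = directory
    for coord, scale in zip(point, scales):
        # The scale tells us which slice of this dimension the point falls in.
        cell = cell[bisect_right(scale, coord)]
    return cell  # address of the data page holding the point's storage region

# A 2-dimensional example: x is cut at 10, y is cut at 5 and 8.
scales = [[10], [5, 8]]
directory = [["P0", "P1", "P2"],   # x < 10
             ["P3", "P4", "P5"]]   # x >= 10
print(grid_lookup((12, 6), scales, directory))   # page "P4"
```

Since the scales are cached in main memory, only the directory access and the data-page access touch disk, which is the origin of the "two disk access" property discussed later.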
[Figure: grid directory cells referencing the data pages that hold the point data.]
Figure 2.9. The grid file layout.
Like other tree structures, splitting and merging of data pages are required
during insertion and deletion respectively. Insertion of an object entails
determining the correct grid cell and fetching the corresponding page, followed
by a simple insertion if the data page is not full. If the page
is full, a split is required. The split is simple if the storage region covers more
than one grid cell and not all the data in the region fall within the same cell:
the grid cells are divided between the existing data page and a new page, with the
data objects distributed accordingly. However, if the storage region covers only
one grid cell, or all the data of a region fall within one cell, then the grid
has to be extended by a (k-1)-dimensional hyperplane that partitions the stor-
age region into two subspaces. A new boundary is inserted into one of the
k grid scales and, to maintain the one-to-one correspondence between the grid and
the grid directory, a (k-1)-dimensional cross-section is added to the grid di-
rectory. The resulting two storage regions are disjoint and, to each region, a
corresponding data page is attached. The objects stored in the overflowing page
are distributed between the two pages, one new and one existing. Other
66 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
grid cells that are partitioned by the new hyperplane are unaffected since both
parts of the old grid cell will now be sharing the same data page.
Deletions may cause the occupancy of a storage region to fall below an
acceptable level, triggering merging operations. When the joint occupancy
of a storage region whose records have been deleted and an adjacent storage
region drops below a certain threshold, the data pages are merged into one. Based
on the average bucket occupancy obtained from simulation studies, Nievergelt
et al. [Nievergelt et al., 1984] suggested that 70% is an appropriate occupancy
level for the resulting bucket. Two different methods were proposed for merging: the
neighbor system and the buddy system. The neighbor system allows two data pages
whose storage regions are adjacent to merge so long as the new storage region
remains rectangular; this may lead to "dead space" where neighboring pages
prevent any merging for a particular under-populated page. A more restrictive
merging policy like the buddy system is required to prevent the dead space.
For the buddy system, two pages can be merged provided their storage regions
could have been obtained from the next larger storage region by the splitting
process. However, total elimination of dead space in a k-dimensional space is
not always possible. The merging process will also make the boundary along
the two old pages redundant, when there are no storage regions adjacent to
the boundary. In this case, the redundant boundary is removed from its scale
and the one-to-one correspondence is maintained by removing the redundant
entries from the grid directory.
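The neighbor-system admissibility test reduces to checking that the union of two storage regions is itself rectangular. A minimal sketch, assuming regions are represented as per-dimension (lo, hi) intervals (an illustrative encoding, not the original one):

```python
def can_merge_neighbors(r1, r2):
    """Neighbor-system test: two storage regions may merge only if their
    union is again rectangular. Regions are lists of (lo, hi) intervals,
    one interval per dimension."""
    diffs = [d for d in range(len(r1)) if r1[d] != r2[d]]
    if len(diffs) != 1:          # must agree in all but one dimension
        return False
    d = diffs[0]
    # In the single differing dimension the intervals must be adjacent.
    return r1[d][1] == r2[d][0] or r2[d][1] == r1[d][0]

# Two 2-d regions sharing the full y-extent and adjacent along x:
print(can_merge_neighbors([(0, 4), (0, 8)], [(4, 6), (0, 8)]))  # True
# Diagonal neighbors would produce an L-shaped (non-rectangular) union:
print(can_merge_neighbors([(0, 4), (0, 4)], [(4, 6), (4, 8)]))  # False
```

The buddy system adds a further restriction not captured here: the merged region must also be one that the splitting process could have produced.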
The grid file has also been proposed as a means for spatial indexing of non-
point objects [Nievergelt and Hinrichs, 1985]. To index k-dimensional data
objects, a mapping from the k-dimensional space to an nk-dimensional space, where
objects exist as points, is necessary. One disadvantage of the mapping scheme is
that it is harder to perform directory splitting in the higher-dimensional space
[Whang and Krishnamurthy, 1985]. To index a rectangle, it is represented as
(cx, cy, dx, dy), where (cx, cy) is the centroid of the object and (dx, dy) are the
extensions of the object from the centroid. The (cx, cy, dx, dy) representation
causes objects to cluster close to the cx-axis in the cx-dx plane, while objects
cluster along the diagonal x1 = x2 under the corner-based (x1, x2, y1, y2)
representation. For ease of grid partitioning, the former
representation is therefore preferred. For an object (cx, cy, dx, dy) to intersect
the query region (qcx, qcy, qdx, qdy), the following conditions must be
satisfied:
cx - dx < qcx + qdx and
cx + dx > qcx - qdx and
cy - dy < qcy + qdy and
cy + dy > qcy - qdy
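The four inequalities transcribe directly into a predicate. The centroid-extension tuple layout below is an illustrative assumption:

```python
def intersects(obj, query):
    """Intersection test in the centroid-extension parameter space,
    transcribing the four inequalities of the grid-file mapping scheme."""
    cx, cy, dx, dy = obj
    qcx, qcy, qdx, qdy = query
    return (cx - dx < qcx + qdx and
            cx + dx > qcx - qdx and
            cy - dy < qcy + qdy and
            cy + dy > qcy - qdy)

# A 4x2 rectangle centred at (5, 5) against a query window centred at (8, 5):
print(intersects((5, 5, 2, 1), (8, 5, 2, 2)))   # True: 5 + 2 = 7 > 8 - 2 = 6
print(intersects((5, 5, 2, 1), (12, 5, 2, 2)))  # False: 7 is not > 12 - 2 = 10
```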
Consider Figure 2.10a, where rectangle q is the query rectangle. The inter-
section search region on the cx-dx plane, the shaded region in Figure 2.10b,
is given by the first two inequalities of the above intersection con-
dition. Note that the search region can be very large if the global space is
large and the largest rectangle extension along the x-axis is not bounded. In
Figure 2.10, a known upper bound, udx, on any rectangle extension along
the x-axis reduces the search region to the enclosed shaded region. The same
argument applies to the other coordinate. Objects that fall in both search
regions satisfy the intersection condition.
[Figure panels: (a) Object distribution. (b) Search regions on the cx-dx plane. (c) Search regions on the cy-dy plane.]
Figure 2.10. Intersection search region in the grid file.
The mapping of regions from a k-dimensional space to points in an nk-
dimensional space undesirably changes the spatial neighborhood properties.
Regions that are spatially close in a k-dimensional space may be far apart when
they are represented as points in an nk-dimensional space. Consequently, the
intersection search may not be efficient.
2.5.2 The R-file
The grid file structure was originally designed to guarantee two disk accesses for
exact match queries, one to access the directory and the other to access the data
page. The "two disk access" property can only be ensured if the directory is
stored as an array and all grid cells are of the same size. However, with such an
implementation, the size of the directory is doubled whenever a new boundary
is introduced. Most of these directory entries correspond to empty grid cells
that do not contain any data objects. Simulation results [Nievergelt et al.,
1984] indicate that the size of the directory grows approximately linearly with
the size of the file. To alleviate this problem, multi-level directories [Blanken
et al., 1990, Hinrichs, 1985, Hutflesz et al., 1990, Freeston, 1987, Whang and
Krishnamurthy, 1985] where grid cells are organized in a hierarchical structure
have been suggested. We shall present the R-file approach, which is designed for
non-zero sized objects. In the R-file [Hutflesz et al., 1990], cells are partitioned
using the partitioning strategy of the grid file, and a cell is split when it overflows.
In order for cells to tightly contain the spatial objects, cells are partitioned
recursively by repeated halving until the smallest cell that encloses the spatial
objects is obtained. Spatial objects that are totally contained in a cell are
stored in its corresponding data page, and those that intersect the partitioning
line are stored in the original cell. If the number of spatial objects that intersect
a partitioning line is more than what can be stored in a data page, a partitioning
line along another dimension will be used. If all records lie on the crossing point of
the partitioning lines, they cannot be separated by any partitioning line, and in
such a case a chain of buckets is used.
After a split, the original cell and the two new cells overlap; to keep the
directory small, empty cells are not maintained. Both the original and the new
cells then hold almost the same number of spatial objects. Figure 2.11
illustrates a case in point. Even so, a high number of cells will be inspected
for intersection queries, especially the original large cells. The fact that
spatial objects stored in the original unpartitioned cells tend to intersect the
partitioning lines of those cells indicates the clustering property of these objects.
In order to make intersection search more efficient, two extra values that bound
the objects in the partitioning dimension are kept with the original cells. Due
to the overlapping cells, the directory is potentially large. To avoid storing the
cell boundaries, a z-ordering scheme [Orenstein, 1986] is used to number the
cells. With such a scheme, cells are partitioned cyclically. For each cell, the
directory stores the cell number, the bounding interval, and the data bucket
reference. Experiments conducted in [Hutflesz et al., 1990] strongly indicate that
the bounding information leads to substantial savings in page accesses.
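The z-ordering used to number cells can be illustrated by Morton bit interleaving. This generic sketch is not the R-file's exact cell-numbering scheme (which numbers the overlapping cells produced by cyclic halving), but it conveys how a cell number encodes a position without storing cell boundaries:

```python
def z_order(ix, iy, bits=8):
    """Interleave the bits of two cell coordinates (a Morton code).
    Generic sketch of the z-ordering idea [Orenstein, 1986], not the
    R-file's exact numbering scheme."""
    z = 0
    for b in range(bits):
        z |= ((ix >> b) & 1) << (2 * b)        # x bits in even positions
        z |= ((iy >> b) & 1) << (2 * b + 1)    # y bits in odd positions
    return z

# Cell numbers along a z-order curve over a 2x2 grid:
print([z_order(x, y, 2) for y in range(2) for x in range(2)])  # [0, 1, 2, 3]
```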
2.5.3 PLOP-hashing
In [Kriegel and Seeger, 1988], the grid file was extended for the storage of
non-zero sized objects. The method is a multi-dimensional dynamic hashing
scheme based on Piecewise Linear Order Preserving (PLOP) hashing. Like the
grid file, the data space is partitioned by an orthogonal grid. However, instead
of using k arrays to store scales that define partitioning hyperplanes, k binary
trees are used to represent the linear scales. Each internal node of a binary tree
stores a (k-1)-dimensional partitioning hyperplane. Each leaf node of a binary
tree is associated with a k-dimensional subspace (a slice), where the interval
along its associated axis is a sub-interval and the other k-1 intervals assume
the intervals of the global space. Each slice is addressed by an index i stored in
its leaf node. To each cell, a page is allocated to store all points that fall in the
[Figure panels: (a) Original space. (b) First bucket. (c) Second & third bucket. (d) Fourth bucket.]
Figure 2.11. The R-file.
unpartitioned subspace. From the indexes stored in the k binary trees, the address
of a page can be computed. Adopting a bounding scheme similar to that of the
skd-tree, two extra values are stored in a leaf node to bound the objects whose
centroids are in the corresponding slice along the axis with which the binary tree is
associated. Hence, an object is inserted into the grid cell that contains its
centroid. The regions defined by the two extra values may overlap, and they
are used for intersection search.
The file organizations based on hashing are generally designed for multi-
dimensional point data. To use them for spatial indexing, the mapping of
objects from a k-dimensional space to an nk-dimensional space or the duplication of
object identifiers is generally required. Indexing in a parameter space is
not efficient for general spatial query retrieval [Guttman, 1984, Whang and
Krishnamurthy, 1985].
2.6 Spatial objects ordering
Existing DBMSs support efficient one-dimensional indexes and provide fast ac-
cess to one-dimensional data. If multi-dimensional objects can be converted to
one-dimensional objects, such indexes can be used directly without alteration.
The mapping functions used must preserve the proximity between
data well enough to yield reasonably good spatial search. The idea is
to assign a number to each grid region in the space; these numbers
are then used to obtain a representative number for the spatial objects. Tech-
niques for ordering multi-dimensional objects using single-dimensional values
have been proposed. These include the Peano curve [Morton, 1966], locational
keys [Abel and Smith, 1983], Z-ordering [Orenstein and Merrett, 1984], the Hilbert
curve [Faloutsos and Roseman, 1989], and Gray ordering [Faloutsos, 1988]. We
discuss the method based on locational keys proposed by Abel and Smith [Abel
and Smith, 1983].
A space is recursively divided into four equal-sized subspaces, forming a
hierarchy of quadrants. For each subspace, a unique numeric key of base 5 is
attached. All objects falling within a given subspace are assigned the subspace's
key. The key k for a subspace at level h (> 1) can be derived from the key k'
of the ancestor subspace by the following formula:
k = k' + 5^(m-h)          if k is the SW son of k'
k = k' + 2 * 5^(m-h)      if k is the NW son of k'
k = k' + 3 * 5^(m-h)      if k is the SE son of k'
k = k' + 4 * 5^(m-h)      if k is the NE son of k'
Here m is an arbitrary maximum number of levels in decomposition, which
is greater than h. The global space has Sm as the key.
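The derivation can be sketched as follows; keys are kept as plain integers and printed as base-5 numerals. Taking the global space's key to be the m-digit base-5 numeral 1000 (an assumption consistent with the keys cited for Figure 2.12) reproduces, for example, the key 1300:

```python
def child_key(parent_key, h, m, quadrant):
    """Derive the level-h key from the parent's key per the formula above.
    Keys are integers; read as base-5 numerals they spell the quadrant path."""
    q = {"SW": 1, "NW": 2, "SE": 3, "NE": 4}[quadrant]
    return parent_key + q * 5 ** (m - h)

def base5(k):
    """Render an integer key as a base-5 numeral string."""
    digits = ""
    while k:
        digits = str(k % 5) + digits
        k //= 5
    return digits or "0"

m = 4                  # maximum decomposition depth, as in Figure 2.12
root = 5 ** (m - 1)    # global space: the base-5 numeral 1000
print(base5(child_key(root, 2, m, "SE")))  # "1300", a level-2 key cited in the text
```

The trailing zeros mark levels at which no further decomposition took place; the quadrant path behind a given key (here, an SE step at level 2) is inferred for illustration.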
Figure 2.12 illustrates an example of key assignment (base 5), where the
maximum level of decomposition is 4. One can notice that, when the locational
keys of the same level are traced, the ordering is a form of N- or Z-ordering.
To assign a key to a rectangle, the smallest block which completely covers
the rectangle is used. An inherent problem of such an assignment is that an
object's bounding rectangle may be very much smaller than the associated
quadrant, as a consequence of the bounding rectangle spanning one or more
subspace divisions. To alleviate this problem, a decomposition technique [Abel
and Smith, 1984] is used, where a rectangle may be represented by up to four
adjacent quadrants. Rectangles B and C in Figure 2.12b illustrate the cases
where one and two quadrants are used: key 1300 for rectangle B, and keys
1422 and 1424 for rectangle C. By associating each rectangle with a collection
of quadrants, a better approximation of a rectangle is achieved. This form of
representation requires an object identifier to be stored in multiple locations.
(a) Assignment of locational keys. (b) Assignment of covering nodes.
Figure 2.12. Ordering based on locational keys.
However, even if this approach is adopted, the size of the representative quad-
rant may still be much larger than the size of the object's bounding rectangle.
A B+-tree is used to index the objects based on their associated locational keys.
For an intersection search, all quadrants that intersect the query region have
to be scanned. The major advantage of the use of the locational key is that
B+-tree structures are widely supported by conventional DBMSs.
2.7 Comparative evaluation
In this section, we briefly summarize some comparative studies that have been
conducted in the literature.
Greene evaluated the performance of R-trees and R+-trees [Greene, 1989].
In the comparison, it was found that the R+-tree
requires many more splits, especially for large data objects, but fewer splits for
smaller data objects. For a uniform distribution of square rectangles that
fully covers the map space, 30% of the objects are duplicated. Interestingly, the
results show that for the case where the coverage is 100% and the objects are
long and narrow along the x-axis, the duplication decreases. This
is likely due to the better grouping achieved along the x-axis. In general, the
query efficiency tests show that R+-trees perform better for smaller objects and
slightly worse for larger objects. The study in fact exhibits a similar pattern
of results to that of the kd-trees extended using the overlapping approach and
the non-overlapping approach [Ooi, 1990].
Ooi et al. [Ooi et al., 1991] compared the performance of the skd-tree and
the R-tree. The results indicate that the skd-tree is a more efficient structure
than the R-tree with nearly the same storage requirement. The containment
search provided by the skd-tree is more efficient than its intersection search
and is less sensitive to skewed data.
In [Hoel and Samet, 1992], Hoel and Samet conducted a qualitative compar-
ative study of the performance of three spatial indexes, namely the R*-tree, the
R+-tree, and the PMR quadtree [Nelson and Samet, 1987], on large line seg-
ment databases. A number of spatial queries on line segments were tested:
finding all line segments incident at a given point; the other endpoint of the
line segment incident at a given point; the nearest line segments to a given point;
the line segments whose MBRs contain a given point; and all line segments
intersecting a given rectangular window. In their implementation, the execution time
of query retrieval is the prime objective, which is sometimes achieved at the
expense of somewhat higher storage cost. The difference in performance
is not very great, although the PMR quadtree has a slight edge over the other
two, and the R+-tree is slightly better than the R*-tree because of the disjoint
decomposition of line segments. The R+-tree required considerably more space
than the other two structures. However, the study did not result in claims of
convincing superiority for any of the three tested indexes. This could be due
to the use of line segments, which are much simpler than non-zero sized and
irregularly shaped objects.
In [Ooi, 1990], the efficiency of three extension methods was studied using
a family of kd-trees, namely the skd-tree [Ooi et al., 1987], the Matsuyama kd-tree
[Matsuyama et al., 1984], and the 4d-tree [Banerjee and Kim, 1986]. Databases
of 12,000 objects were generated with different distributions of object sizes and
object locations. The average data density used is 3; however, for very skewed
object placements, the data density at certain locations could be much higher. The
study shows that the Matsuyama kd-tree, which adopts the non-overlapping
native space indexing approach, performs efficiently in terms of page accesses
for small objects. As the object sizes become bigger, its performance degrades.
The 4d-tree is the least efficient structure. Its nodes store less information than
those of the skd-tree, which accounts for a smaller directory size, but intersection
search is not supported efficiently because of its inability to prune the search
space effectively.
In [Papadias et al., 1995], the topological relationships of meet, overlap,
inside, covered-by, covers, contains, and disjoint between MBRs were studied.
The efficiency of the R-tree, R+-tree, and R*-tree were then studied using three
databases of 10,000 objects, with different sizes of MBRs, and 100 queries. For
small MBRs (less than 0.02% of the map area) and medium MBRs (less than
0.1% of the map area), R*-trees and R+-trees outperform the R-tree, with the
R+-tree slightly more efficient than the R*-tree. However, for large MBRs (less
than 0.5% of the map area), the R+-tree becomes less efficient than the other
two due to additional levels caused by duplications. The R+-tree does not work
for high data density [Greene, 1989, Papadias et al., 1995].
We also set out to investigate the performance of the R-tree and R*-tree for
high-dimensional data. We implemented both structures using C on the Sun
SPARC workstation running SunOS 5.5. The size of a disk page used for both
trees is 4 KByte. The quadratic cost splitting algorithm [Guttman, 1984] is
adopted for the R-tree, and the quadratic cost version of evaluating the overlap
of a given node is also implemented for the R*-tree. To deal with paging, a
priority-based page replacement strategy that adopts a least-useful policy is
employed [Chan et al., 1992]. A page is useful if it will be referenced again in
the traversal; otherwise, it is useless. The strategy favors useless pages that
are at the higher levels of the tree, and useful pages that are at the lower levels
of the tree. We conducted our experimental study on a real data set consisting
of Fourier points in high-dimensional space (2, 4, 8 and 16 dimensions) derived
from the contours of industrial parts. The database used is the same one employed in
[Berchtold et al., 1996], except that we extracted a subset of 1 million objects.
Figure 2.13 shows some representative results, which are largely consistent
with previous work. First, as expected, the R*-tree is more space-efficient than the
R-tree (see Figure 2.13a). Second, the R*-tree's insertion cost is larger than that
of the R-tree, and as the number of dimensions increases, the relative difference
widens. This is consistent with the result in [Beckmann et al., 1990].
For point query retrieval, we performed 1000 queries and used the average
number of disk accesses as the metric. The 1000 points are randomly selected
from the respective test data of each dimensionality. We observe that when the
number of dimensions is small (see Figure 2.13c), both the R*-tree and the R-tree
perform equally well (with the R*-tree slightly better). This result is again consis-
tent with the findings in [Papadias et al., 1995] for large databases. However,
as the number of dimensions increases, the R*-tree requires more disk accesses
than the R-tree during retrieval. We also evaluated 1000 range queries, and
the result is shown in Figure 2.13d. The result confirms the observation that the
R*-tree outperforms the R-tree only at low dimensions, and is inferior to the
R-tree at higher dimensions. Finally, from the results, we note that neither the
R-tree nor the R*-tree scales well with the number of dimensions.
2.8 Summary
We have reviewed a number of indexes that are suitable for indexing non-zero
sized objects in spatial database systems. These have been categorized based
on their extending methods and the base structures. We have also discussed
[Figure panels, each plotting the R-tree against the R*-tree as the number of dimensions grows from 2 to 16: (a) Storage cost (average page utilization). (b) Insertion cost (average disk accesses). (c) Point query cost (average disk accesses). (d) Range query cost (average disk accesses).]
Figure 2.13. Comparison of R-tree and R*-tree.
the strengths and weaknesses of these techniques. Despite the large body of existing
work, we believe the area will remain a very fruitful and challenging one for the next
decade, with several promising research directions.
First, there is clearly a lack of benchmarks for evaluating spatial indexes.
This can be attributed to the many factors that need to be considered in eval-
uating a spatial index. Concerning the data, spatial data varies widely in
size; spatial objects come in irregular shapes; and objects are not uniformly
distributed in the data space. Furthermore, queries range from simple point
queries to complex spatial join operations that come in different flavors (inter-
section, containment and proximity). Designing a suite of benchmarks is an
important issue that cannot be ignored.
Second, as pointed out, the evaluation of spatial indexes has been rather
limited. Most of the performance studies used the R-tree as the basis for comparison.
Furthermore, most of the work used synthetic data. We believe that more
extensive and comprehensive performance studies using real data sets will be
necessary and useful for practitioners as well as developers.
Third, the scalability (in terms of the number of dimensions of the data space)
of existing indexes has not been adequately addressed. Most of the work is
restricted to two-dimensional space. Recent work by Berchtold et al. [Berchtold
et al., 1996] addressed the scalability of indexes with respect to the number of
dimensions, and showed that the R*-tree does not scale well; instead, it
degenerates drastically. The same paper also shows that the TV-tree [Lin
et al., 1995] can perform poorly as the number of dimensions increases. While
the X-tree [Berchtold et al., 1996] appears to be a promising scalable index, we
believe that designing scalable high-dimensional indexes will be highly exciting
and rewarding.
3 IMAGE DATABASES
Images have always been an essential and effective medium for presenting vi-
sual data. With advances in today's computer technologies, it is not surprising
that in many applications, much of the data consists of images. In medical applications,
images such as X-rays, magnetic resonance images and computer tomography
images are frequently generated and used to support clinical decision making.
In geographic information systems, maps, satellite images, demographics and
even tourist information are often processed, analyzed and archived. In police
department criminal databases, images like fingerprints and pictures of crimi-
nals are kept to facilitate identification of suspects. Even in offices, information
may arrive in many different forms (memos, documents, and faxes) that can
be digitized electronically and stored as images.
The traditional database management systems, which have been effective
in managing structured data, are unable to provide satisfactory performance
for images that are non-alphanumeric and unstructured. The growing need for
image information systems has led to the design and implementation of image
database systems [Chang and Fu, 1980, Chang and Hsu, 1992, Kunii, 1989,
Knuth and Wegner, 1992, Nagy, 1985, Ogle and Stonebraker, 1995, Tamura
and Yokoya, 1984].
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
In this chapter, we focus on content-based retrieval techniques, that is, tech-
niques that retrieve images based on their visual properties such as the texture,
color and shape of objects. In particular, we look at the critical issue of speed-
ily finding the correct images in a large image database system based on
image features. For a large collection of images, sequentially comparing
access methods that exploit the image features to narrow the search space are
necessary.
We begin our discussion by looking at what constitutes an image database
system. Following that, in Section 3.2, we shall discuss some of the issues
involved in the design of a content-based index. In the same section, we also
review indexing mechanisms that can be used to support content-based re-
trieval. In Section 3.3, we provide a taxonomy of existing image indexes. The
taxonomy is based on the image features used for indexing. Following that, we
present four indexes that facilitate speedy retrieval of images based on color-
spatial information. In Section 3.4, we examine three hierarchical indexes that
integrate multiple existing indexes into a single structure, and in Section 3.5,
we present a signature-based technique. Finally, we conclude with a specu-
lation on future trends in Section 3.6.
3.1 Image database systems
An image database system must deal with both structured and unstructured
data. Furthermore, an image database system also distinguishes itself by the
following additional functionalities:
• Feature extraction. In order to organize the images and their associated
information, it is necessary for the system to understand the contents of the
images. Thus, the system must be able to analyze an image to extract key
features such as the shape of objects in an image, its color components and
texture.
• Feature-based indexing. Traditional database systems index their data by
key attributes which are usually numeric or fixed-length text data. For
image database systems, the system must build indexes based on the features
extracted. Such feature-based indexes can then be used to facilitate efficient
search of a large collection of images and other related information based on
the features of the images.
• Content-based retrievals. Image database systems should support a wide
range of queries. In particular, queries that involve the contents of the
image, expressed in words/text or pictorial form, are important and crucial.
IMAGE DATABASES 79
• A measure of similarity. Since content-based queries are usually inexact, the
system requires a measure to capture what humans perceive as similarity
between two images. However, as the notion of similarity is inherently im-
precise, the similarity measure must be carefully designed not to exclude
relevant images, while at the same time minimizing the irrelevant images
in the results.
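One concrete example of such a measure is histogram intersection, used in early color-indexing work: two normalized color histograms are scored by summing their bin-wise minima, giving 1.0 for identical color distributions and 0.0 for disjoint ones. The three-bin histograms below are illustrative, not drawn from any particular system:

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized color histograms: the sum of the
    bin-wise minima. 1.0 = identical distributions, 0.0 = disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Three-bin histograms (fractions of pixels per color group):
query = [0.5, 0.3, 0.2]
print(round(histogram_intersection(query, [0.5, 0.3, 0.2]), 2))  # 1.0
print(round(histogram_intersection(query, [0.2, 0.2, 0.6]), 2))  # 0.6
```

Grouping perceptually similar colors into a small number of bins, as suggested above, keeps both the representation and this comparison cheap.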
[Figure: the preprocessing module passes an input image through image input/scanner and feature extraction, then updates the index/database. The query module supports interactive query formulation, runtime feature extraction, feature matching, and browsing & feedback, drawing on the feature/image database under a concurrency control & recovery manager, and outputs the retrieved images to the user.]
Figure 3.1. Architecture of an image database system.
Figure 3.1 shows the (generic) architecture of an image database system. Im-
ages are preprocessed to extract the key features used for searching. The images
and the feature indexes are then stored in the database. During retrieval, fea-
tures are extracted from the query image, and matched against those stored to
retrieve images that are similar to it. As a consequence of the need to retrieve
images based on similarity, the user interface will usually incorporate some
browsing and feedback mechanisms to facilitate reformulation of queries to im-
prove accuracy. Like traditional database systems, concurrency control and
recovery managers are also critical components of an image database system.
Supporting a fully functional image database system is a difficult problem
and embraces different technologies such as image processing, user interface
design, and database management. In fact, early systems are largely attribute-
based or free-text-based and hardly have any real content-based support. For
attribute-based systems, images are treated as binary large objects (BLOBs).
A conventional DBMS, extended with the capability to handle BLOBs, can
be used to manage the images. Access to the unstructured images is achieved
through the structured attributes of the images. Hence, no special effort is re-
quired to design the organization technique, indexing mechanisms (such as B+-
trees and inverted files) and query processing methods of these systems. However,
this approach is not capable of handling the more user-friendly content-based
queries.
The free-text-based approach applies the concepts of document retrieval
techniques to provide "content-based" functionalities by manual description
of the image and treating the image description as those of a document. Image
access is done through the accompanying image description. For example, for
the query "Retrieve all images that show a girl skating in an ice rink", the
description "a girl skating in an ice rink" is used to retrieve the images. The
system attempts to match this description against those of the images stored in the
database. Indexing methods that can be used include signature file access meth-
ods, inverted file access methods, and direct (or sequential) file access methods.
Besides being unable to facilitate true content-based queries, the free-text-based
approach has other limitations: a free-text description of an image is highly
variable, owing to the ambiguities of the natural language used to annotate
images and the different interpretations of an image; an image description is
usually incomplete, since an image is semantically richer than its text descrip-
tion; and the vocabularies of the person creating the index and the user (or
even of different users) may not match. As such, the effectiveness of this
approach is fairly limited. Readers are referred to Chapter 5 for an in-depth
discussion of text indexing techniques.
3.2 Indexing issues and basic mechanisms
3.2.1 Key issues in content-based index design
Designing an access method for an image database system is more complex
than for a traditional database system. This is because the features to be indexed
(hereafter referred to as indexing features) are usually unstructured.
Three key issues that must be addressed in designing an index structure for
content-based image retrieval are:
• Determine a representation for the indexing feature.
• Determine a similarity measure between two images based on their repre-
sentations.
• Determine an appropriate index organization.
IMAGE DATABASES 81
For the first issue, a suitable representation must be determined and used
to represent the indexing feature. Some of the desirable properties of a repre-
sentation include
• Exactness. For a representation to be useful, it has to capture the essential
details of the indexing feature;
• Space efficiency. The representation should keep the storage cost low. To
this end, approximate representations rather than exact representations are
often used. For example, instead of representing the shape of an object,
its bounding box can be used. As another example, grouping colors that
are perceptually similar can reduce the number of colors that need to be
maintained by the system without sacrificing retrieval accuracy.
• Computationally inexpensive similarity matching. It should be easier and
faster to compute the similarity between the representations than between
their features. In general, computing the degree of similarity between ap-
proximate representations is less computationally intensive. For example,
computing the intersection of two polygons is more costly than computing
the intersection of two rectangles that represent them.
• Preservation of the similarity between the features. Two features that are
similar should remain so under their representations.
• Automatic extraction. The representation should be automatically extracted,
rather than manually generated.
• Insensitivity to noise, distortion, rotation. Any noise or distortion should
not affect the representation drastically. In other words, two features of the
same image, one without noise, and the other distorted by some noise, should
be represented in a similar way (if not exactly). Similarly, the representation
of a feature, regardless of whether the image has been rotated or not, should
be the same.
It is hard to find an effective representation with all the desirable properties. In
fact, some of the above properties conflict. For example, representing the color
of an image as a vector (color histogram) which has all the above properties
has been shown to be less effective than one that also captures the spatial
information. However, the latter representation of color incurs more storage,
and is more sensitive to the orientation of the image.
Before moving on, we would like to look at two methods that can be used to
represent image features coarsely. These methods have the advantages of space
efficiency as well as reducing the dimensionality of the indexes (for vector-based
representations). They can be categorized as follows:
• Partitioning. This method partitions an image space into a fixed size grid.
Each such cell is assigned a label and can be used to approximate the size
of an object or the spatial location of a feature. For example, the set of cells
that contains an object serves as an indication of the size of the object. As
another example, the location of an object can be determined by the position
of the cell it is in.
• Grouping. This method combines several components of a feature into
groups, and represents the image feature in terms of the groups instead
of the large number of components. For example, the basic color feature can
have over 100 different colors, but can be grouped into a small number of
groups based on the fact that many colors are perceived to be similar by
humans. As another example, the shape of an object can be described by a
small number of primitives such as lines and arcs.
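As a concrete sketch of these two methods (a hypothetical illustration, not taken from any of the systems surveyed; the grid size and the per-channel color quantization are assumed parameters):

```python
def partition_cells(bbox, grid=8, extent=256):
    """Partitioning: return the set of fixed-grid cell labels (row, col)
    that a bounding box (x1, y1, x2, y2) overlaps in an extent x extent
    image space. The number of cells hints at the object's size; the
    cells themselves approximate its location."""
    x1, y1, x2, y2 = bbox
    cell = extent / grid
    return {(r, c)
            for r in range(int(y1 // cell), int(y2 // cell) + 1)
            for c in range(int(x1 // cell), int(x2 // cell) + 1)}

def group_color(rgb):
    """Grouping: quantize each RGB channel to 2 levels, collapsing the
    full color space into at most 8 perceptually coarse groups."""
    r, g, b = rgb
    return (r // 128, g // 128, b // 128)
```

A small object then maps to a few cells, and two near-identical shades of red fall into the same group.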
A coarse representation can be used as a quick means of pruning away irrelevant
images, and a finer representation is usually necessary in order to restrict the
set of potential candidate images to a manageable size.
The second issue follows from the first. The similarity measure between
the indexing features of two images, say S1, may no longer be appropriate on
the representations. Thus, an appropriate similarity measure on the represen-
tations, say S2, has to be derived. The main criterion for such a similarity
measure is that two features that are similar under S1 should remain so under
S2. In fact, since the representations may be approximate, we expect the num-
ber of images that are similar to a query image under S2 to be larger than that
under S1. There are several alternatives to determine the similarity between
two features through their representations:
• Exact match. In this approach, the representation of an image feature is
usually coarse, in the sense that images with similar features will be mapped
to the same representation. As a result, an exact match on the representation
can be used to search for similar features.
• Approximate match. Under this approach, the degree of similarity between
the image representations is computed based on some approximation tech-
niques. One advantage of this category is that the image representation can
be exact. Where approximate representations are used, we can expect more
irrelevant images to be retrieved as well.
Finally, an appropriate index organization should be determined to organize
the representations in a manner that the similarity measure can be supported
efficiently. Other important criteria for selection of an index structure include
storage efficiency and maintenance (update) overhead. To a certain extent, the
representation and similarity measure determine the index structure. For exam-
ple, if the image feature is represented as a vector, and the similarity measure
is the Euclidean distance, then a natural choice is the multi-dimension point
access method. Here, the vector is mapped to a point in a multi-dimensional
space, and a region search can be used to search for similar images in the multi-
dimensional space. On the other hand, if the image features are represented as
rectangles in the image space, then a spatial access method may be employed.
In fact, as we shall see in Section 3.3, most of the image indexes are based on
existing techniques. As such, we shall review some of these techniques before
proceeding to look at the taxonomy.
3.2.2 Basic indexing schemes
Spatial access methods. Spatial access methods are file structures used to
organize large collection of multi-dimensional points or geometric objects to
facilitate efficient range or nearest neighbor searches. It turns out that we can
easily exploit such techniques to speed up retrieval of images. The basic idea is
to extract k image features from each image, thus mapping images into points
in a k-dimensional feature space. Once this is done, any spatial access methods
can be used as the index, and similarity queries will then correspond to nearest
neighbor or range searches. As an example, let us consider the color feature.
In general, the color feature can be represented as a k-tuple for a system that
supports k colors, and the values of the tuple of an image are the percentages
of the colors in the image.
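For instance, the mapping of an image to a point in a k-dimensional color feature space can be sketched as follows (an illustrative fragment; the pixel and palette encodings are assumptions):

```python
from collections import Counter

def color_point(pixels, palette):
    """Map an image, given as a list of pixel colors, to a point in a
    k-dimensional feature space: one coordinate per supported color,
    holding the fraction of the image's pixels with that color."""
    counts = Counter(pixels)
    n = len(pixels)
    return tuple(counts[c] / n for c in palette)
```

A similarity query then becomes a nearest neighbor or range search around the query image's point.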
Many spatial access methods have been proposed in the literature. These
include methods that transform geometric objects into points in a higher di-
mensionality space such as the grid file [Hinrichs and Nievergelt, 1983]; meth-
ods that linearize spatial data such as quad-trees [Gargantini, 1982] and "z-
ordering" [Orenstein, 1986]; and methods that are based on trees such as the
family of R-trees [Guttman, 1984]. However, most of these methods suffer from
the so-called "high-dimensionality curse"; that is, these techniques perform no
better than sequential scanning as the number of dimensions becomes suffi-
ciently large [Faloutsos et al., 1994]. For example, for R-trees, performance
begins to degrade drastically as the dimensionality hits 20 and above. We refer
the readers to Chapter 2 for a survey on spatial access methods.
Inverted file. In an inverted file index, an inverted list is created for each
distinct key (indexed feature). The inverted list essentially consists of a list
of pointers to the objects that contain features that are similar to the indexed
feature. Given an image feature, the inverted file is scanned, and all images
with the features that are similar to it can thus be retrieved speedily. However,
the inverted file method incurs high storage overhead and is also expensive to
update. Some recent work has been done to address the storage problem [Witten
et al., 1994, Moffat and Zobel, 1996].
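A minimal sketch of an inverted file over discretized image features (names are illustrative; real systems compress the posting lists, as the work cited above discusses):

```python
from collections import defaultdict

class InvertedFile:
    """One posting list of image identifiers per distinct feature key."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, image_id, features):
        """Register an image under every feature it contains."""
        for f in features:
            self.postings[f].append(image_id)

    def lookup(self, feature):
        """Return the identifiers of all images indexed under `feature`."""
        return self.postings.get(feature, [])
```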
Signature file. The signature file access method is an efficient access method
for objects that can be characterized by a set of descriptors, making it suitable
for indexing unstructured data such as textual documents (characterized by
a set of keywords) and images (characterized by a set of semantic objects or
colors). Each descriptor of an image can be represented as a string of bits, and
an image signature can be obtained by superimposing (inclusive-OR) all the
descriptors of the image. The signatures of all images can then be maintained
in a file called the signature file. During query retrieval, the descriptors of the
query image can be coded into a signature, and the signature file is then used
as a filtering mechanism to eliminate most of the unqualifying data so that
only a portion of the data file needs to be accessed. The retrieval performance,
however, can be hampered by a high false drop probability (due to the signa-
tures of irrelevant images matching that of the query image). Variations of signature file
access methods have been proposed to improve on the retrieval efficiency of the
signature file. These include the single-level signature file [Roberts, 1979], the multi-
level signature file [Sacks-Davis et al., 1987], and the partitioning approach [Lee
and Leng, 1989].
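The superimposed-coding scheme described above can be sketched as follows (a hypothetical illustration: the signature width, the number of bits set per descriptor, and the hashing are all assumed parameters):

```python
def descriptor_signature(descriptor, bits=16, weight=3):
    """Hash one descriptor to a bit string with up to `weight` bits set."""
    sig = 0
    for i in range(weight):
        sig |= 1 << (hash((descriptor, i)) % bits)
    return sig

def image_signature(descriptors, bits=16):
    """Superimpose (inclusive-OR) all descriptor signatures of an image."""
    sig = 0
    for d in descriptors:
        sig |= descriptor_signature(d, bits)
    return sig

def may_match(image_sig, query_sig):
    """Signature filter: an image qualifies only if every query bit is
    set in its signature. False drops remain possible, so qualifying
    images must still be verified against the data file."""
    return image_sig & query_sig == query_sig
```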
3.3 A taxonomy of image indexes
Existing image indexing mechanisms can be classified based on the image fea-
tures used for indexing. For each image feature, further classifications can be
made with respect to the semantic representations used for the feature. A dif-
ferent type of semantic representation entails a different indexing method. In
this section, we provide a taxonomy of image indexing schemes based on such
classifications. All the schemes discussed have been reported in the literature;
for some features, other schemes that might also be applicable but have not
been reported are excluded from our discussion. The taxonomy is summarized
in Figure 3.2.
3.3.1 Shape feature
The shape feature is extremely useful for image database systems like an X-
ray system or a criminal picture identification system. In an X-ray system,
queries like "Retrieve all kidney X-rays with a kidney stone of this shape" are
very common. For a criminal picture system, we expect queries like "Retrieve
all criminals with a round face shape". The example shape, the shape of a
kidney stone in the first case, and round in the second, can be supplied using
an example image.
[Figure 3.2 (tree diagram): the content-based features color, color-spatial, texture, shape, spatial relationship and semantic objects, each branching into its representations (e.g. color histogram, multi-level histogram, Tamura features, geometric properties, rectangular cover, 2-D string, similarity against representative objects) and the index structures used (multi-dimensional index, inverted file, signature file, multi-level signature file, two-level color B+-tree, three-tier index, Sequenced Multi-Attribute Tree, sequential file).]

Figure 3.2. A taxonomy of image indexing schemes.
A shape boundary can be represented using any of 16 primitive shape features.
Each primitive feature is either a line or an arc, with a starting point, an
ending point, and so on, and can be denoted by a distinct character. Thus,
the boundary information can be compactly stored as a one-dimensional string
[Jea and Lee, 1990]. The
shape features of a shape boundary can then be represented by substrings of
the one-dimensional string. This simple representation allows the exploitation
of existing efficient string matching algorithms. Since objects with the same
shape will be encoded in the same manner, exact string matching is performed
instead. To index the string representation, an inverted file is used.
A closely related work by Mehrotra and Gary [Mehrotra and Gary, 1993]
used a set of structural components to represent shape boundary. These com-
ponents are modeled as an ordered set of interest points such as locally maximal
curvature points or vertices of the polygonal approximation. A shape feature
can be obtained by fixing the number of points to be used to represent the
shape feature. The feature is then mapped into a point in a multi-dimensional
space, where the dimension is given by the number of points used to repre-
sent the shape. The similarity measure can then be given by the Euclidean
distance between pairs of points in the multi-dimensional space. A multi-
dimensional point access method is used for indexing the shape feature.
In [Jagadish, 1991], a collection of rectangles that forms a rectangular cover
of the shape is used. Since shapes vary widely from object to object, the
number of rectangles can be very large. To reduce the storage requirement, at
most k rectangles in the cover are used to represent the shape. The k rectangles
picked must capture the most important features of the shape "sequentially",
that is, the k rectangles form a sequence. As each rectangle is represented by
two pairs of coordinates, and there are at most k rectangles, the shape feature
can be easily mapped into a point in a 4k-dimensional space. Thus, a multi-
dimensional point access method can be readily used for indexing the shape
feature. Similarity retrieval based on Euclidean distance is performed using a
region search query.
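The mapping from a rectangle sequence to a 4k-dimensional point might look like this (an illustrative sketch; the zero-padding convention for covers with fewer than k rectangles is an assumption):

```python
def cover_to_point(rectangles, k):
    """Map a sequence of at most k cover rectangles, each given as
    (x1, y1, x2, y2), to a point in 4k-dimensional space."""
    point = []
    for rect in rectangles[:k]:
        point.extend(rect)
    point.extend([0.0] * (4 * k - len(point)))  # pad short covers
    return tuple(point)

def euclidean(p, q):
    """Distance used for similarity retrieval via region search."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```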
Shape can also be represented based on the concept of mathematical mor-
phology [Korn et al., 1996, Maragos and Schafer, 1986, Zhou and Venetsanopou-
los, 1988], which employs a primitive shape to interact with an image to extract
useful information about its geometrical and topological structure. A (2M+1)
vector, called the size distribution of a shape [Serra, 1988], can be used to store
the measurements of the area of an image at different (2M+1 of them) scales.
The pattern spectrum [Maragos, 1989] turns out to be a compact representation
that captures the same information. The advantage of the scheme is that it is
essentially invariant to rotation and translation, and can highlight differences at
several scales. In [Korn et al., 1996], the pattern spectrum is first employed to
capture the shape information of an image (in the domain of a tumor database).
The information is then mapped into the (2M+1) vector of the size distribution
so that a multi-dimensional point index can be employed to index the shape
information. While similarity retrieval is essentially a nearest neighbor search,
the paper also presented a distance function, max-granulometric distance, that
guarantees no false dismissals.
Numerical vectors have also been employed to model shape. These include
using the coefficients of the 2-D Discrete Fourier Transform or Discrete Wavelet
Transform [Mallat, 1989], as well as the first few moments of inertia [Faloutsos et al.,
1994, Flickner et al., 1995]. These techniques usually map the shape feature
to a multi-dimensional point access method and use the Euclidean distance for
similarity retrieval. Alternatively, the shape features can be represented by the
geometric properties of the image such as shape factors (for example, ratio of
height to width), mesh features, the moment features and curved line features.
In this case, the inverted file has been used for indexing.
For a system that is based on the shape feature, unless the images have very
distinct shapes, the performance may suffer. As such, shape is usually employed
in specialized domains.
3.3.2 Semantic objects
If objects within an image are prominent and can be easily recognized, retrieval
can be achieved based on the objects. Queries can be evaluated by matching
the list of objects of a query image against the list of objects of images in the
database. Two methods have been adopted in the literature:
• An object in an image may be analyzed to determine its degree of similarity
against a set of distinct objects. This degree of similarity is represented as
a belief interval (bi) [Rabitti and Stanchev, 1989] that indicates how closely
an image object is compared to the represented object used in the system.
An inverted file is used to maintain for each distinct object a list of (bi,
ptr) pairs where ptr is a pointer to an image that contains an object that
resembles the indexed object with a belief interval of bi. In this way, given a
query image object, one first determines the corresponding distinct object it
belongs to, from which one can obtain all objects that are similar to it. By
sorting the list in non-ascending order, the system can control the degree of
similarity desired.
• An object may also be represented by an object signature. An image signa-
ture is obtained by superimposing all the object signatures of the objects in
the image [Rabitti and Savino, 1991]. The signature file access method can
then be used to speed up the retrieval process. A query image's set of sig-
natures can be obtained, and its image signature is first used to prune away
images that are irrelevant. Candidate images are then further examined by
comparing their object signatures against those of the query image.
The object-based approach is, however, limited by current image analysis
techniques. Unless objects are very well defined, it still requires substantial
human intervention in order to ensure that the objects are correctly extracted.
3.3.3 Spatial relationship
In an object-based system, a query image with a ball above a box may also
result in images with a ball next to a box or a box above a ball being retrieved.
A more discriminating way to retrieve images is to facilitate a more precise
querying that specifies both the semantic objects in the images as well as the
spatial relationships between the objects. As an example, consider the query
"Retrieve all paintings with a house and a tree on its left". Here, the house
and tree are the objects while "to the left" is a spatial relationship between the
two. In [Chang et al., 1987, Chang et al., 1988], a semantic representation for
spatial relationship using a two-dimensional string (2-D string) was proposed.
An image is first preprocessed to obtain the symbols that represent the objects
it contains. The 2-D string representation is then a projection of the symbols
along the x-axis and the y-axis, and consists of a pair of one-dimensional strings
(1-D strings), each representing the ordering and spatial relationships of the
objects along the projected axis. For example, consider an image with three
objects such that O1 is to the left of O2, which is to the left of O3. The projection
on the x-axis results in the 1-D string O1 < O2 < O3, where "<" is a spatial
operator that denotes "to the west or to the south of". In [Chang et al., 1987],
only three spatial operators are used: "=" to mean "at the same spatial location
as", ":" to represent "in the same grid cell as", and "<" as explained. During
query processing, the 2-D string representation of the query image is obtained,
and compared against those in the database. Similarity retrieval is supported
using an exact representation and an approximate matching algorithm.
Variations and extensions to the 2-D strings have been explored [Chang
et al., 1989, Lee and Hsu, 1990, Costagliola et al., 1992, Lee et al., 1992].
In particular, a multi-level signature file access method has been adopted as
follows. An image can be partitioned into an M x N grid. For each object, an
M x N bit object signature can be obtained by setting bit (i-1)·M + j to 1 if
the object occurs in cell (i, j); otherwise the bit is cleared. An image signature
can then be obtained by superimposing the object signatures. Querying is
performed by determining the object and image signatures of the query image,
and using them to filter the images to be retrieved.
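The grid-signature construction just described can be sketched directly from the text's bit formula (a minimal sketch; the cell lists are assumed to use 1-based (i, j) coordinates):

```python
def object_signature(cells, m, n):
    """m*n-bit object signature for an M x N grid: set bit (i-1)*m + j
    for every cell (i, j) that the object occupies."""
    sig = 0
    for i, j in cells:
        sig |= 1 << ((i - 1) * m + j)
    return sig

def grid_image_signature(object_cells, m, n):
    """Superimpose (inclusive-OR) the object signatures of all objects."""
    sig = 0
    for cells in object_cells:
        sig |= object_signature(cells, m, n)
    return sig
```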
The effectiveness of exploiting spatial relationships, as already mentioned,
can be drastically affected by the orientation of the images since the relation-
ships between objects may no longer be preserved.
3.3.4 Texture
Texture is an important property that can be used as a cue for image retrieval.
In particular, because it can be extracted from both gray-level images as well
as color images, it can be used in many applications. However, the extraction
of texture information is a computationally intensive operation.
One of the most popular texture representations is the Tamura features
[Tamura et al., 1978]. While texture can be captured by six basic compu-
tational forms (coarseness, contrast, directionality, linelikeness, regularity and
roughness), it has been shown that the first three suffice to discriminate
between texture differences in images. As such, these three forms
(coarseness, contrast and directionality) have been widely used in texture recog-
nition. These three components are briefly summarized here:
• Coarseness. The coarseness component measures the scale of the texture
(for example, pebbles versus boulders). When two patterns differ only in
scale, then the magnified one is considered to be coarser. For patterns with
different structures, those that have larger element size or fewer element
repetitions are perceived to be coarser by the human eye. Coarseness can
be computed using moving windows of different sizes. The essence of the
method adopted in [Tamura et al., 1978] is to pick the coarsest texture as
the best size. For every region in an image, its coarseness is represented by
the largest best size texture, Sbest. The coarseness of the image can then be
obtained by taking the average of Sbest over the image.
• Contrast. The contrast component can be thought of as representing the
quality of the image. A good quality image is one that is sharp in contrast,
while a low quality image is blurred. The human eye can easily discriminate
between a sharp image and a blurred one. As an image's contrast can be varied
by stretching or shrinking its gray scale, the intensity of each pixel of an
image can be multiplied by a positive constant to derive different contrast
values. The contrast can then be obtained as a function of the variance of
the gray-level histogram [Tamura et al., 1978].
• Directionality. Directionality describes whether an image has a favored di-
rection (like grass) or whether it is isotropic (like a smooth object such as
glass). The human eye can easily differentiate a directional pattern from
one that is non-directional. In [Tamura et al., 1978], the degree of direc-
tionality is calculated using a histogram of local edge probabilities against
their directional angle. Although this measure does not categorize images as
directional or non-directional, this histogram representation can sufficiently
capture the global features of the images such as long lines and simple curves.
Clearly, texture can be modeled as a 3-tuple (coarseness, contrast, direction-
ality). Moreover, since images are alike if their coarseness, contrast and
directionality are similar, the Euclidean distance can be used as a measure of
the degree of similarity between images. To speed up the retrieval process, the
texture feature can be represented as a point in a 3-dimensional space, with
region search being used to prune the search space.
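A region search over this 3-dimensional texture space might be sketched as follows (illustrative only; a real system would use a spatial index rather than this linear scan):

```python
def similar_textures(query, database, radius):
    """Return ids of images whose (coarseness, contrast, directionality)
    point lies within `radius` of the query point, i.e. a region search
    under the Euclidean distance. `database` maps image id -> 3-tuple."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return [img_id for img_id, point in database.items()
            if dist(point, query) <= radius]
```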
There are other representations of texture, such as the Simultaneous Au-
toregressive (SAR) model and the Wold features [Francos et al., 1993]. Both
methods also represent texture as a vector of numbers, and compare images based
on the Euclidean distance. As such, a multi-dimensional indexing mechanism
can be used to index the texture features also.
3.3.5 Color
A natural way to retrieve colorful images would be to retrieve them by color.
The color composition of an image is a global property which does not require
knowledge of the component objects of an image. Moreover, color distribution
is independent of view and resolution, and color recognition can be carried out
automatically without human intervention.
A semantic representation for color is the color histogram, which cap-
tures the color composition of images [Swain, 1993]. Using the RGB color
space, the histogram comprises a set of "bins" each representing a color that
is obtained by a range of red, blue and green values. The number of pixels of
an image falling into each of these bins can be obtained by counting the pixels
with the corresponding color. The histogram is then normalized by dividing its
entries by the total number of pixels of the image. The normalized histogram is
size-independent and it enables images of different sizes to be compared mean-
ingfully. The degree of similarity between two images is determined by the
extent of the intersection between the histograms. Query by visual example
is possible by matching the histograms. Object recognition is also achieved
by using the color composition of the object. However, to support indexing
using color histograms, a multi-dimensional indexing method is necessary and
the number of dimensions required is of very high order (which is the num-
ber of distinct colors to be supported). The color histogram of an image is
mapped into a point in the multi-dimensional space, and a region query can be
performed to find matching images.
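Histogram normalization and the intersection measure can be sketched as follows (a minimal illustration of the approach in [Swain, 1993]; the bin layout is left abstract):

```python
def normalize(counts):
    """Divide each bin's pixel count by the total number of pixels,
    making histograms of different-sized images comparable."""
    n = sum(counts)
    return [c / n for c in counts]

def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: the sum of bin-wise
    minima, reaching 1.0 for identical color compositions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```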
However, it has become clear that color alone is not sufficient to characterize
an image. For example, consider two images - one with the top half blue and
bottom half red, while the other's top left and bottom right quadrants are red
and its bottom left and top right quadrants are blue. Although these two images
are similar in color composition, they are entirely different to a human observer.
This is because the ways the colors are clustered and the positions of the clusters
are very different from one another in the two images. As such, several recent
studies have proposed to integrate color and its spatial distribution to facilitate
image retrieval [Chua et al., 1997, Gong et al., 1995, Hsu et al., 1995, Lu et al.,
1994, Ooi et al., 1997]. Most of the indexing mechanisms proposed for color-
spatial information are multi-layered: the two-level B+-tree [Gong et al.,
1995], the three-tier color index [Lu et al., 1994] and the Sequenced Multi-Attribute
Tree (SMAT) [Ooi et al., 1997]. An exception to this trend is based on the
signature file approach [Chua et al., 1997].
3.4 Color-spatial hierarchical indexes
In this section, we describe three indexes that have been proposed to integrate
color and spatial information for image retrieval. All these schemes are hierar-
chical indexes in that multiple indexing mechanisms are integrated to form a
single index structure. The search process begins from the top level index, and
moves down to the lowest level index, traversing along the path that satisfies
the search criterion.
3.4.1 Two-level B+-tree structure
In [Gong et al., 1995], the color-spatial information of an image is modeled by
splitting the image into 9 equal sub-areas (3 x 3), and the color information
within each sub-area is represented by a color histogram. In this way, by
matching the corresponding color histograms of two images, one can obtain
a more accurate similarity (in terms of color-spatial information) between the
two images than the traditional histogram-based approach. Although color
histogram is a multi-dimensional representation, Gong et al. cleverly mapped
it into a numerical key. This not only turns the computationally intensive
matching process into simple numerical-key comparisons, it also facilitates the
exploitation of existing single-dimensional indexing structures such as the
B+-tree. As a result, a two-level B+-tree structure was proposed to speed
up the retrieval process. We shall first look at the retrieval technique, followed
by the transformation of color-histogram to a numerical key before proceeding
to examine the index structure.
The retrieval technique. Given an image, it is first processed to extract its
9 color histograms. Each histogram is then mapped into two levels of informa-
tion. The first level describes the composition of colors corresponding to the
histogram of the region. However, instead of using the full set of colors (which
is very large), the colors are grouped into 11 "bins" only. The grouping of
colors is based on the observation that some colors are perceived to be similar
by humans. This is accomplished in two steps:
• The RGB color space is transformed into Munsell's HVC color space [Miya-
hara and Yoshida, 1989]. This is necessary because it is not possible to
determine the similarity between two colors based on the RGB color space.
Instead, the HVC color space describes colors in terms of hue (the color type),
value (brightness) and chroma (saturation), and the perceptual differences
can be determined by the geometric distances.
• The HVC color space is grouped coarsely into 11 bins, each of which can be
distinguished from the others as a distinct color by subjective perception.
The grouping is based on the argument that two images with the same visual
content but taken with minor differences in illuminating conditions should
not be considered as different images.
Furthermore, instead of the traditional approach of using the normalized pixel
count to represent the proportion of the groups, each group is assigned a range
which bounds the percentage of pixels in the image with colors of the group.
A total of 9 disjoint ranges are predetermined and used: [0,5), [5,15), [15,25),
..., [65,75), [75,100]. Because of the groupings, two histograms are considered
to be similar if all the corresponding ranges of the 11 bins are the same. This
simplifies the histogram matching process, but the coarse grouping increases the
probability of retrieving irrelevant images, and of missing relevant images whose
color compositions fall into neighboring ranges.
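Mapping a pixel percentage to its predetermined range can be sketched as follows (a direct reading of the nine ranges listed above):

```python
# Upper bounds of the first eight ranges: [0,5), [5,15), ..., [65,75);
# anything at or above 75 falls into the final range [75,100].
RANGE_BOUNDS = [5, 15, 25, 35, 45, 55, 65, 75]

def range_index(percentage):
    """Return the index (0-8) of the disjoint range containing the
    given pixel percentage."""
    for idx, upper in enumerate(RANGE_BOUNDS):
        if percentage < upper:
            return idx
    return len(RANGE_BOUNDS)
```

Two histograms are then deemed similar when all eleven bins yield the same range indices.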
The second level of information contains the average H, average V, and
average C values of all the 11 histogram bins. As in the color composition, the
H, V and C values are grouped into 9, 4 and 4 groups respectively, with intervals
of 40°, 2.5 and 7.5. This level is used as a secondary similarity measure to
complement the histogram metrics in order to reduce the number of irrelevant
images retrieved.
During query retrieval, the query image is processed to extract its 9 histograms.
For each histogram, the two levels of information are obtained from
the sample query. The level 1 information is used to prune away dissimilar
images, and candidate images are further examined and compared on their H,
V and C group values.
The index: Two-level B+-tree structure. The above retrieval mechanism
has the nice property that only exact matches need to be performed: two
histograms are similar if they have the same range values for the 11 histogram
bins, and for each pair of bins, the groups for the H, V and C values are
the same. As such, the authors proposed that the first level information be
mapped into a composite key with 12 attributes: the first attribute indicates
the histogram region (one of the 9 regions), and each of the other 11 attributes
corresponds to one histogram bin and has a value that indicates its range (since
the set of ranges is predetermined, fixed and disjoint, a range is represented
by a number rather than by its bounds). Similarly, the second level
information is mapped into a 34-attribute composite key: the first attribute
represents the histogram region, and the other 33 attributes are split into 11
groups of 3 attributes, each group for a histogram bin, with one attribute for
the group number of the H value, one for the group number of the V value, and
one for the C value.
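This key construction can be sketched as follows. This is not the authors' code; the range boundaries come from the text above, and the group-interval arithmetic for H, V and C is our assumption based on the stated interval widths (40°, 2.5 and 7.5).

```python
# Sketch of the two composite keys (assumed layout, not the authors' code).

# The 9 predetermined pixel-percentage ranges: [0,5), [5,15), ..., [65,75), [75,100].
RANGE_BOUNDS = [0, 5, 15, 25, 35, 45, 55, 65, 75, 100]

def range_number(pct):
    """Map a pixel percentage to its range number (0..8)."""
    for r in range(len(RANGE_BOUNDS) - 1):
        if RANGE_BOUNDS[r] <= pct < RANGE_BOUNDS[r + 1]:
            return r
    return 8  # pct == 100 falls in the closed last range [75,100]

def level1_key(region, bin_pcts):
    """12-attribute key: histogram region followed by 11 range numbers."""
    assert len(bin_pcts) == 11
    return (region,) + tuple(range_number(p) for p in bin_pcts)

def level2_key(region, hvc_avgs):
    """34-attribute key: region followed by (H, V, C) group numbers per bin.
    H is split into groups of 40 degrees, V into groups of 2.5, and C into
    groups of 7.5 (assumed from the interval widths given in the text)."""
    assert len(hvc_avgs) == 11
    key = [region]
    for h, v, c in hvc_avgs:
        key += [int(h // 40), int(v // 2.5), int(c // 7.5)]
    return tuple(key)
```

Because the ranges and groups are fixed and disjoint, two images match exactly when their keys are equal, which is what makes a B+-tree lookup sufficient.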
[Figure: Level 1: B+-tree on normalized pixel count; Level 2: B+-tree on
average H, V and C values.]
Figure 3.3. The two-level B+-tree structure.
A two-level B+-tree can then be exploited to speed up the retrieval process;
Figure 3.3 shows the structure. The top level index is a B+-tree built on the
12-attribute key, and is used to facilitate the histogram matching process. Each
entry in the leaf nodes of this level is associated with an independent B+-
tree that is built on the 34-attribute key. This second level tree is devised to
facilitate the comparison of the average H, V and C values. Internal nodes
store the maximum values of the child nodes in order to direct the search.
Since images with the same histogram configuration will have the same first
part of the key, they can be found in the same leaf node of the top level tree,
and hence in the same second level tree associated with that leaf node. Thus the
images in the second level tree will be fetched only if matching at both levels
is successful.
3.4.2 Three-tier color index
To handle speedy image retrieval based on the positional information of color,
Lu, Ooi and Tan proposed a three-tier color index [Lu et al., 1994]. While layers
1 and 2 prune away irrelevant images based on colors, layer 3 matches images
based on their color positions as well. We shall first look at layers 1 and 3
individually and their motivations before presenting the index structure as a
whole. The second layer is the R-tree structure.
Layer 1: Dominant color classification. The first layer is the dominant
color classification. For each image, a fixed number of dominant colors is ex-
tracted. The dominant colors are those with the largest pixel counts.
Based on the dominant colors, the image can be assigned to a partition. In this
way, images with the same dominant colors can be found in the same partition.
The underlying assumption is that images with the same dominant colors tend
to be more similar than images that match on the less dominant colors. Thus,
during the image retrieval process, only a few partitions with the similar sets of
dominant colors need to be examined, while the other partitions with different
dominant colors can be ignored.
Let k denote the number of dominant colors. Then the number of classes is
given by:
number of classes = nCk = n! / ((n - k)! k!)
where n is the number of colors supported in the system. Figure 3.4 illustrates
this layer when k = 3.
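The class count is just the binomial coefficient, so it can be checked directly; a small illustration in Python:

```python
import math

# Number of dominant-color classes for n supported colors and k dominant colors.
def number_of_classes(n, k):
    return math.factorial(n) // (math.factorial(n - k) * math.factorial(k))

# e.g. with n = 11 supported colors and k = 3 dominant colors:
print(number_of_classes(11, 3))  # 165, equal to math.comb(11, 3)
```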
Layer 3: Multi-level color histogram. The third layer is a complete quad-
tree structure, called the multi-level color histogram, used to capture spatial
distribution of colors. The basic idea is to capture the set of histograms for
an image by recursively decomposing the image. For an image, its multi-level
color histogram comprises several levels. The top level (root) of the tree corre-
sponds to a histogram that gives the color composition of the entire image. The
second level consists of four histograms that represent the color composition of
the top left, top right, bottom left and bottom right quadrants of the image
respectively. At the next level, we have the set of histograms that are obtained
from further splitting each quadrant of the image into four equal parts, where
each histogram is a description of the color content of each smaller part. This
process is repeated for the number of levels desired. In general, at the ith level,
the image is subdivided into 4^(i-1) regular regions, and each region has its own
histogram to describe its color composition. For example, in Figure 3.4, the
third layer is a 3-level color histogram.
With multi-level color histograms, since every level captures the color com-
position of the entire image, any level can be used to compute the similarity
between two images. For a level, the degree of similarity is given by the sum of
the intersections of the corresponding pairs of histograms at the level. In other
words, at the ith level, the similarity value is computed as follows:
S_i = (1 / 4^(i-1)) * sum_{j=1}^{4^(i-1)} sum_{k=1}^{m} min(NH_k^j(Q), NH_k^j(D))

where m is the number of colors supported by the system, Q and D are the
query and database images, and NH_k^j(I) is the normalized pixel count of
the kth color in the jth histogram of the image I.
As the lower level of the tree reflects more closely the color composition and
distribution of the image, it is clear that the similarity value decreases as the
tree is traversed downwards. This observation leads to a filtering mechanism
during image retrieval. During query processing, the query image and the
database images are compared based on their color histograms. The top-level
histograms are first compared. If they match within some threshold value, the
next level will be searched and compared, and so on. Only when the threshold
value at the leaf level is met will the image be retrieved. The target image
will be "discarded" if the similarity value fails to meet the threshold at any
level of the tree. As it costs less to compute the similarity value at the higher
levels of the tree, a significant amount of processing time may be saved and
unnecessary accesses to irrelevant images can be minimized.
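The level-wise intersection and the top-down filtering described above can be sketched as follows. This is a simplified illustration, under the assumption that each region's histogram holds pixel counts normalized within that region:

```python
# Sketch of multi-level color histogram matching (illustrative, not the
# authors' code). A "level" is a list of 4^(i-1) per-region histograms.

def level_similarity(q_level, d_level):
    """S_i for one level: the average histogram intersection over the
    4^(i-1) regions of that level."""
    regions = len(q_level)
    total = 0.0
    for q_hist, d_hist in zip(q_level, d_level):
        total += sum(min(q, d) for q, d in zip(q_hist, d_hist))
    return total / regions

def matches(query_levels, db_levels, threshold):
    """Top-down filtering: descend to the next level only while the
    current level's similarity meets the threshold."""
    for q_level, d_level in zip(query_levels, db_levels):
        if level_similarity(q_level, d_level) < threshold:
            return False
    return True
```

Since the root level is a single cheap comparison, most non-matching images are rejected before the more numerous lower-level histograms are ever touched.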
The index: Three-tier color index. Figure 3.4 shows the three-tier color
index which employs three levels of pruning to speed up retrieval. The first
layer is the dominant color classification. It allows us to prune away images
belonging to classes that would never satisfy the query to narrow the search
space to some classes.
Layer 2 is a multi-dimensional R-tree structure to further prune away images
within the candidate partitions that are not relevant. This is achieved as
follows. For each partition, an R-tree is used to organize the images within the
class based on the proportion of the dominant colors in the images. Since the
dominant colors are sufficient to discriminate between images, the dimensional-
ity required is relatively small. Thus, images that are similar will be spatially
close to one another, and a region query will be able to restrict the search to
the relevant images within the partition.
Finally, the last layer, which is the multi-level color histogram, compares the
histograms of the query image with those of the remaining potential candidate
images. Images that fail the test need not be retrieved. Thus, we can see that
the three-tier color index can minimize accesses to the image collections to only
images that are most likely to satisfy the query.
3.4.3 SMAT: A height-balanced color-spatial index
In the two color-spatial approaches presented above, the spatial distribution of
colors is coarsely captured by the various histograms. There is no indication
of how the color is distributed in the image space within each space represented
by a histogram.
Another problem with the two approaches is that though the individual tree
structures (B+-tree, R-tree, Dominant Color Classification) employed in the
respective layers are height-balanced, the entire hierarchical index structure
may not be so. For example, in the two-level B+-tree structure, if the database
images are skewed such that many images have similar color compositions,
then a small number of the B+-trees at the second layer will be much larger
(and taller) than the rest. Retrieving these images will result in longer access
times. The same scenario holds for the three-tier color index. To resolve this
problem calls for a new notion of height-balancing, and new height-balanced
index structures to be developed.
In this section, we look at a height-balanced color-spatial index developed by
Ooi et al. [Ooi et al., 1997]. We shall describe the representation of the color-
spatial information, the algorithm to extract it and the retrieval technique
before looking at the proposed hierarchical index structure.
Representing the color-spatial information. It has been observed that
humans are prone to focus on large patches of colors, rather than on small
patches that are scattered around [Beck, 1967, Treisman and Paterson, 1980].
The resultant effect is that given two images, they will appear to be similar
[Figure: Tier 1: dominant color classification (partitions for k = 1, k = 2, ...);
Tier 3: multi-level color histogram.]
Figure 3.4. The three-tier color index.
if both of them have large patches (referred to as clusters) of similar colors at
roughly the same locations in the images. For example, Figure 3.5 shows three
images and the corresponding eight largest clusters, sorted in descending order.
These clusters have been extracted using the proposed color-spatial technique
to be discussed shortly. From the cluster representation of image A (Figure
3.5(b)), it can be seen that several clusters contain color 4 (pink). The cluster
representation in image B (Figure 3.5(d)) also shows that there are dominant
clusters containing color 4 (pink) that fall in the same region and intersect
those clusters in image A. Hence, the two images are "similar" in terms of
color and spatial information. Similarly, based on the cluster representation in
Figure 3.5(f), it is clear that image C is different from the other two images
since there is no common color or location between them. Based on this
observation, the work [Ooi et al., 1997] represented the color-spatial information
of an image as a set of single-colored clusters in the image space, and these
clusters are used to facilitate image retrieval.
Extracting the color-spatial information. To extract the color and spa-
tial information, a heuristic similar to the one adopted in [Hsu et al., 1995] was
employed. The heuristic, which comprises three phases, represents the color-
spatial information as a set of k single-colored regions, for some predetermined
value k which is expected to be small.
In the first phase, a set of k representative colors of an image is selected.
The colors selected are those with the largest number of pixel counts in the
image. This set of colors is called the dominant colors. In the second phase,
a set of clusters for each of the dominant colors is determined. The algo-
rithm adopted is based on the maximum entropy discretization method [Chiu
and Kolodziejczak, 1986]. Briefly, for each selected color in the first phase,
the maximum entropy discretization algorithm is applied to the image space
to extract the spatial information of the color. Initially, the entire image is
regarded as one whole region. In the first pass, the image is partitioned into
four regions, and the process is repeated on the four regions recursively. For
each region, an evaluation criterion is used to determine whether further par-
titioning is needed. The result of applying the algorithm is a set of
representative regions for each selected color. Each region is represented as a
rectangle within the image space.
At the end of phase two, a large set of single-colored clusters has been
derived. In phase three, these clusters are ranked (regardless of color) in de-
scending order of their sizes (area of the rectangles). The k largest clusters will
be picked as the dominant clusters to be used as the color-spatial information
of the image.
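The second phase can be illustrated with a simplified recursive quadrant split. This is only a sketch: the stopping rule below (a color-density threshold) stands in for the maximum entropy criterion, whose evaluation function is not spelled out in the text.

```python
# Sketch of the recursive quadrant decomposition of phase two
# (illustrative; the density-based stopping rule is an assumption).

def extract_regions(pixels, color, x0, y0, x1, y1, min_size=8, density=0.8):
    """Return rectangles (x0, y0, x1, y1) dominated by `color`.
    `pixels[y][x]` holds a color index; a region is kept whole once at
    least `density` of its pixels have the color."""
    w, h = x1 - x0, y1 - y0
    if w <= 0 or h <= 0:
        return []
    count = sum(1 for y in range(y0, y1) for x in range(x0, x1)
                if pixels[y][x] == color)
    if count == 0:
        return []
    if count / (w * h) >= density:
        return [(x0, y0, x1, y1)]   # uniform enough: keep the whole region
    if w <= min_size or h <= min_size:
        return []                   # too small to split further
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    regions = []
    for (a, b, c, d) in [(x0, y0, mx, my), (mx, y0, x1, my),
                         (x0, my, mx, y1), (mx, my, x1, y1)]:
        regions += extract_regions(pixels, color, a, b, c, d, min_size, density)
    return regions
```

Phase three then simply sorts the rectangles returned for all dominant colors by area and keeps the k largest.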
(a) Image A
(c) Image B
(e) Image C

(b) A's 8 largest clusters:

Cluster  Dominant Color  Xmin  Ymin  Xmax  Ymax  Area
1        17              116   0     147   114   3,534
2        17              147   0     173   114   2,964
3        4               20    0     30    114   1,140
4        4               30    0     40    114   1,140
5        17              61    8     116   15    385
6        17              61    0     116   7     385
7        4               0     0     19    17    323
8        4               0     18    19    35    323

(d) B's 8 largest clusters:

Cluster  Dominant Color  Xmin  Ymin  Xmax  Ymax  Area
1        4               147   0     173   114   2,964
2        4               0     0     23    114   2,622
3        40              86    20    105   104   1,596
4        37              60    25    76    114   1,424
5        4               72    21    86    114   1,302
6        4               78    15    147   31    1,104
7        4               24    62    38    114   728
8        4               24    0     78    13    702

(f) C's 8 largest clusters:

Cluster  Dominant Color  Xmin  Ymin  Xmax  Ymax  Area
1        3               150   0     166   114   1,824
2        3               0     0     12    114   1,368
3        3               166   0     173   114   798
4        39              35    3     54    26    437
5        39              80    3     105   19    400
6        42              34    47    56    65    396
7        39              108   45    157   53    392
8        39              30    27    54    43    384

Figure 3.5. Three images and their 8 largest clusters.
The similarity function used for image retrieval computes the degree of over-
lap between the rectangles of the source and target images. Two rectangles
overlap only if they have the same color, and they intersect in the image space;
the degree of overlap is given by the number of pixels intersected.
The retrieval process using the color-spatial information is as follows. The
image database is initially preprocessed to determine the clusters (color-spatial
information) of the images. Given a sample query image, its k clusters are first
extracted. The color-spatial information of each image in the database is then
compared with those of the query image using the similarity function described
above. The images can then be ranked based on the percentage of overlap,
retrieved and displayed in that order.
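The overlap-based similarity function can be sketched directly. Rectangles are taken as (color, x0, y0, x1, y1) tuples; this layout is an illustrative assumption, not the paper's representation.

```python
# Sketch of the cluster-overlap similarity function (assumed tuple layout).

def overlap(r1, r2):
    """Pixel overlap of two rectangles (color, x0, y0, x1, y1): non-zero
    only when the colors match and the rectangles intersect."""
    c1, ax0, ay0, ax1, ay1 = r1
    c2, bx0, by0, bx1, by1 = r2
    if c1 != c2:
        return 0
    w = min(ax1, bx1) - max(ax0, bx0)
    h = min(ay1, by1) - max(ay0, by0)
    return w * h if w > 0 and h > 0 else 0

def similarity(query_clusters, image_clusters):
    """Total overlap between the k clusters of the query and target images."""
    return sum(overlap(q, d) for q in query_clusters for d in image_clusters)
```

Comparing every query cluster against every cluster of every image is the O(N * k^2) cost that SMAT, described next, is designed to avoid.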
The index: Sequenced multi-attribute tree. Even though the approach
restricts the number of clusters per image to k, the number of cluster com-
parisons to be performed is still very large, about O(N · k^2), where N is the
number of images in the database. Since only a small number of images is
likely to match the sample image, a large number of unnecessary comparisons
are being performed. To minimize the expensive comparisons, an index struc-
ture, the Sequenced Multi-Attribute Tree (SMAT), is proposed. SMAT is based
on three observations on the similarity function of the color-spatial approach:
• Color must be matched before the spatial property as color is deemed a more
important feature.
• If two clusters of two images share the same spatial property but different
color content, then the two clusters will not contribute to the similarity
function.
• If two clusters of two images share the same color but with non-overlapping
spatial properties, then the two clusters will also not contribute to the sim-
ilarity function.
SMAT is a multi-tier tree structure, where each layer corresponds to an
indexing attribute. For example, the top layer can be based on color, the
second is based on color percentage or size of the cluster, and the last is based on
spatial property. Each layer can be constructed using any indexing mechanism.
For example, the top layer can be implemented using a single dimensional
indexing structure such as the B+-tree [Comer, 1979]. On the other hand, the
lowest layer can employ a multi-dimensional indexing structure like the R-tree
[Guttman, 1984]. Except for the lowest level, entries in the leaf nodes of all
levels point to the roots of the trees in the next level. Only the leaf nodes of the
lowest level tree contain pointers to the image data. Thus, SMAT essentially
consists of multiple trees integrated together in a hierarchical manner. To
reach the lowest layer of the SMAT where the actual images are pointed to,
the query must satisfy the conditions relating to the discriminating keys in all
the higher layers. Any condition violated in any layer will terminate the search
path prematurely.
In [Ooi et al., 1997], a variation of the R-tree structure [Guttman, 1984] was
employed to implement a 2-tier SMAT structure. Figure 3.6 shows the struc-
tural view of the SMAT structure implemented. The first layer discriminates
clusters based on color. Since color is a single-dimensional attribute, the R-tree
used at this layer is a single-dimensional R-tree (1-D R-tree). Each entry has
a color range that defines the data space of the subtree pointed to by its child
pointer. The color ranges of internal nodes do not overlap, unless they are
exactly the same range. This occurs only when the data is very skewed. En-
tries of the leaf nodes of the first layer R-tree are of the form (color-range, BR,
PTR), where BR defines the spatial bounding rectangle which contains all the
clusters' color rectangles within the image space, and PTR points to an R-tree
of the next layer. Spatial information is required at the leaf node for balanc-
ing purposes. Suppose, for a given color range, the next layer R-tree pointed
by PTR outgrows others and the next split involves its root node (PTR). By
splitting such a node, the height of SMAT will increase. To enable some form
of balancing, the node is split according to the splitting strategy adopted at
the second layer, but the entry is inserted into the leaf node of the first layer
instead. In other words, two entries with the same color range (at the first
layer) are created, but with different bounding rectangles.
The second layer is based on the spatial information of the clusters. Each
entry of the internal node contains a rectangle that defines its child node's data
space and a pointer pointing to the subtree. The second layer R-tree is like a
2-dimensional spatial R-tree structure. For the leaf nodes, entries are of the
form (color, coordinates, PTR). The color attribute contains the color of the
cluster, the coordinates attribute contains the four coordinates of the cluster,
and PTR is a pointer to the address in the database that contains the image
data (see Figure 3.6). The image data contains the ID of the image, and the
colors and coordinates of the k dominant clusters. This information is used
in computing the similarity function (we shall see how this is used when we
discuss the matching algorithm).
Matching and searching a SMAT. The matching algorithm retrieves im-
ages that are similar to a sample image. Given a sample image, the algorithm
extracts k dominant clusters. For each of the clusters extracted, it determines
the set of images that are similar to it. This is done by traversing SMAT to
determine the clusters that match the clusters of the sample image. It suffices
to know that the search algorithm returns a list of pointers to a file that con-
[Figure: Level 1 is a 1-D R-tree acting as the color discriminator; its leaf
entries (color range, bounding rectangle) point to Level 2, a 2-D R-tree acting
as the spatial discriminator, whose leaf entries (color, coordinates) point to
the image data (IMAGE-ID).]
Figure 3.6. The SMAT structure.
tains information on potential matching images. Recall that this information
includes the image id and the (color, cluster) pairs of the image. From this
information, the algorithm proceeds to compute the similarity value of the sam-
ple image and the candidate image, and rank the candidate image accordingly.
Since it is possible that other clusters of the sample image may also match
the same candidate image at a later iteration, the image ids are maintained
in a hash table to avoid subsequent comparisons and retrieval. Finally, all the
images can be retrieved based on the image ids.
The search algorithm of a SMAT structure is fairly straightforward, and
follows from the way an R-tree is searched. The algorithm descends the 1-D
R-tree from the root, and at each internal node, entries are checked. For each
color range that contains the search color, the subtree is searched. When a
leaf node is reached, the color of the search cluster is used to check for any
entries whose color range contains the color. For all color ranges that qualify,
their spatial bounding rectangles are checked to see if they intersect the search
cluster. For qualified entries, the search continues to the corresponding 2-D
R-trees at the next layer. While the traversal of the 1-D R-tree often leads to
a distinct path (unless there are duplicates), more than one subtree under the
2-D R-tree may need to be searched. Nevertheless, the search algorithm can
eliminate irrelevant clusters of the indexed images and examine only clusters
near the search area.
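The two-layer filtering logic can be illustrated with flat lists standing in for the R-trees. This is a simplification: a real SMAT traverses internal nodes at each layer, but the pruning tests (color range containment, then rectangle intersection) are the same.

```python
# Simplified sketch of a 2-tier SMAT lookup (flat lists instead of R-trees).

def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def smat_search(layer1, color, rect):
    """layer1: list of (color_lo, color_hi, bounding_rect, layer2_entries);
    layer2_entries: list of (color, rect, image_ptr). Returns pointers to
    images whose clusters share the color and overlap the search rectangle."""
    results = []
    for lo, hi, br, layer2 in layer1:
        if lo <= color <= hi and intersects(br, rect):   # layer-1 pruning
            for c, r, ptr in layer2:                     # layer-2 check
                if c == color and intersects(r, rect):
                    results.append(ptr)
    return results
```

Entries whose color range or bounding rectangle fails the test are never descended into, which is how irrelevant clusters are eliminated early.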
Inserting color clusters into SMAT. Inserting image clusters into a SMAT
raises some interesting issues concerning the growth of the tree. The first issue
concerns the initial loading of SMAT. In this case, the tree is not "mature" in
the sense that not all layers may have been constructed. The question of when
SMAT grows from one layer to the next arises. The second issue deals with the
height-balancing of SMAT. While the R-tree is height-balanced, SMAT may
not be fully height-balanced as images may be inserted towards one end of the
SMAT.
The strategy adopted lets SMAT grow downward until some criterion is met,
and grow upward when height imbalance occurs. Initially, the heights of all the
layers are predetermined. For a SMAT structure with k layers, L1, L2, ..., Lk,
let the predetermined height for layer Li be hi. Note that hi, for all i in [1, k],
changes dynamically as SMAT grows. During initial loading, SMAT is not
fully developed, and so hi is used to guide the growth of layer Li downward as
follows: layer Li+1 will appear only if all the nodes along the path leading to
the leaf node of layer Li in which the new record is to be inserted are full, and
the length of the path has reached hi. This is to ensure that the height of the
SMAT is maintained and not increased further unless necessary. To illustrate,
consider the I-D R-tree in Figure 3.6. Suppose, leaf node 1 is full and h1 is set
to 2, and a new cluster is to be inserted into leaf node 1. If node 3 is full, instead
of allowing the 1-D R-tree to grow, the tree grows downward by creating the
next layer tree, and the record is inserted there. On the other hand, if node 3
is not full, then creating the next layer will undoubtedly increase the height of
the search path by one. Instead, leaf node 1 should be split as normal.
Once all the layers of SMAT are developed, the issue of height-balancing
becomes a concern since it affects the retrieval time of SMAT. Although the
R-tree is height-balanced, SMAT may not be so. This happens especially if there
are a lot of clusters of a particular color. Thus, there is no guarantee that all the
trees in the second layer index will grow and shrink at the same rate. This
means that it is possible that a particular tree in a level may grow much faster
than the other trees in the same level, causing the SMAT to be skewed to one
side. That is to say, the basic SMAT structure can only be locally balanced,
but not globally height-balanced.
Since SMAT is a multi-tier structure, the concept of height-balance is
slightly different from that of a single-structure index. A SMAT structure is
height-balanced if the following two conditions are met:
• Each tree structure within a layer is height-balanced.
• The difference in the heights of trees within a layer, say Li, is at most ei for
some predetermined ei for each layer.
Figure 3.7 illustrates a height-balanced tree. As can be seen, in the worst case,
the difference in height between trees within a k-layer SMAT is sum_{i=2}^{k} e_i.
To keep SMAT height-balanced, the upper layers are allowed to grow once
the lowest layer has been established. The minimum heights of the trees at each
layer are maintained. If there is an increase in the height of a tree (at a layer)
as a result of an insertion, the new height of the tree is compared against the
minimum height at that layer. If the difference between the two is above a
certain predetermined threshold, then rebalancing is activated. Rebalancing is
performed as follows. Let the layer where rebalancing is needed be Li, and its
parent layer be Li-1. Let the root of the tree that causes height imbalance at
Li be Ri, and the leaf node of Li-1 that points to Ri be LNi. Let the entry
in LNi that points to Ri be I_old. The information at Ri is used to insert a
new entry, I_new, into LNi. I_old is set to point to the left child of Ri, and
I_new is set to point to the right child of Ri. Ri can then be removed. Note
that the corresponding bounding information in I_old needs to be updated too.
The insertion algorithm that SMAT adopts within a tree is similar to that
used in R-trees in that new clusters are added to the leaves, nodes that overflow
are split, and splits are propagated up the tree. The splitting algorithm adopted
is based on the quadratic-cost algorithm of R-tree by Guttman [Guttman, 1984].
[Figure: a k-layer SMAT in which the trees at layer i have heights between
h_i and h_i + e_i.]
Figure 3.7. A height-balanced SMAT.
The algorithm attempts to find a small-area split, but is not guaranteed to find
one with the smallest area possible. There is, however, the additional task of
handling height-balancing.
3.5 Signature-based color-spatial retrieval
In this section, we present a signature-based color-spatial retrieval technique
[Chua et al., 1997]. The mechanism involves several components, and we discuss
each of them in a subsection. First, the color-spatial information has to be
extracted and represented. Next, we describe the retrieval process that is based
on the color-spatial information. In particular, the retrieval process requires
a measure to compute the similarity between two images (in terms of their
color-spatial representation). We also discuss an approach which incorporates
the concept of perceptually similar colors and weighting of colors.
3.5.1 Representing the color-spatial information
The proposed color-spatial approach partitions each image into a grid of m x n
cells of equal size. Figure 3.8 shows an example of an image being partitioned
into a 4 x 8 grid. Instead of obtaining the color-spatial information at pixel-
level, the colors that can be used to represent a cell are determined. This
is done as follows. For a given color, each cell is examined to determine the
percentage of the total number of pixels in the cell having that color. If this
percentage is greater than a pre-defined threshold value, then the cell is said
to be represented by that color. This approach is equivalent to applying the
maximum entropy discretization algorithm [Chiu and Kolodziejczak, 1986] under
the assumption of uniform color distribution. Note that, depending on the
threshold value, a cell may have no color representative or it may have more
than one representative.
[Figure legend: unshaded cell, does not satisfy the threshold; shaded cell,
satisfies the threshold.]
Figure 3.8. An image partitioned into a 4 x 8 grid.
For the approach to be practical and useful, several issues have to be ad-
dressed. First, the number of colors can be very large, resulting in a large set of
color-spatial information. This is resolved by restricting the number of colors
for an image to a set of C colors (called the dominant colors) of the image. C
is expected to be small as most images are usually dominated by a few colors.
To select the C dominant colors, the heuristic employed in [Hsu et al., 1995]
is adapted. The heuristic works as follows. Two color histograms, Hi and He,
representing the color composition of the entire image and the center of the
image are obtained. First, Ci (Ci < C) colors that have the largest number of
pixels in Hi are picked. Next, the Ci colors picked are eliminated from con-
sideration when the remaining Ce (= C - Ci) colors are to be picked. The Ce
colors are obtained from the remaining colors with the largest number of pixels
in He. While the first set of colors represents the background colors, the second
set represents the object colors (based on the inherent assumption that objects
usually appear in the center of an image). Unlike the algorithm in [Hsu et al.,
1995], where the background and the object colors are selected alternately,
the modification is to reduce the probability that the most dominant color in
the center of the image (representing the object) is in fact one of the dominant
background colors. This is based on the observation that a significant portion
of the center region of an image can be covered by the background colors.
The second issue concerns the representation of the color-spatial information.
It turns out that the proposed approach has a very nice property - given a
color, a cell is either represented or not represented by it. As such, each cell
can be represented by a bit - if the cell satisfies the threshold value, the bit
is set; otherwise, it is cleared. Hence, for each color, a bitstream (called the
color signature) that captures the spatial distribution of that color is obtained.
In the color signature, bit (i · n + j) corresponds to cell (i, j). Referring
to Figure 3.8 again, suppose a color qualifies to be the representative of cells
0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30 and 31; its corresponding 32-bit color signature will
be 10001111001000110000000001100011. Given an image with k colors, there
will be k color signatures. These color signatures can be superimposed (bitwise
logical-OR) to obtain an image signature.
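The signature construction above can be reproduced in a few lines; this is a small sketch in which bit c of the signature corresponds to cell c, matching the 4 x 8 example:

```python
# Building a color signature for the 4 x 8 grid example: one bit per cell,
# set when the cell is represented by the color.

GRID_CELLS = 32  # 4 x 8 grid

def color_signature(cells):
    """Return the signature as a bit string, bit c corresponding to cell c."""
    bits = ['0'] * GRID_CELLS
    for c in cells:
        bits[c] = '1'
    return ''.join(bits)

def image_signature(signatures):
    """Superimpose (bitwise logical-OR) the color signatures of an image."""
    combined = 0
    for s in signatures:
        combined |= int(s, 2)
    return format(combined, '0{}b'.format(GRID_CELLS))

sig = color_signature([0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30, 31])
print(sig)  # 10001111001000110000000001100011 (the example from the text)
```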
3.5.2 The retrieval process
From the human perception point of view, two images are perceived to be alike
if the color compositions of the two images are similar, and the distributions of
the colors in the images are similar. Under the signature-based representation
of color-information, the above two points can be translated into the following
two conditions to facilitate efficient retrieval:
• The images have the same representative sets of colors.
• The signatures representing both images are similar in that they may only
differ in some of the bits. This only requires a simple operation (logical AND)
to compute the intersection between two images for a particular color.
We discuss in the next few subsections several similarity measures that have
been used [Chua et al., 1997] to indicate the similarity between two images
based on their signatures.
Basic similarity function. For the signature-based color-spatial approach,
recall that each bit in a signature represents a particular cell in the image.
Let Qi and Di denote the signatures of color i for a query image Q and a
database image D respectively. Then, the two images have the color i at
the same particular region (cell) if and only if the corresponding bits in both
signatures are set; otherwise the two images are not similar at the region. Let
the representative color sets of Q and D be CQ and CD respectively. Then,
the similarity measure, SIMbasic, between Q and D for a color i E CQ can be
determined as:
SIM_basic(Q, D, i) = BitSet(Q_i ∧ D_i) / BitSet(Q_i)   if color i ∈ C_D
                   = 0                                  otherwise       (3.1)

where BitSet(BS) denotes the number of bits in the bitstream BS that are set,
and '∧' represents the bitwise logical-AND operation. Now, if a large part of
108 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
cells in Q has the same color as that in D, then the similarity computed will
be close to 1. The similarity measure between two images Q and D is then
given by:
SIMbasic(Q, D) = Σ_{∀i ∈ CQ} SIMbasic(Q, D, i)
(3.2)
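Equations 3.1 and 3.2 can be sketched in code as follows; this is a minimal illustration with hypothetical 8-bit (8-cell) signatures, and the function names are ours:

```python
def bitset(bs):
    """BitSet: the number of set bits in a signature."""
    return bin(bs).count('1')

def sim_basic(q_sigs, d_sigs, i):
    """Equation 3.1: the fraction of color-i cells of Q that D shares;
    '&' computes the intersection of the two signatures."""
    if i not in d_sigs:
        return 0.0
    return bitset(q_sigs[i] & d_sigs[i]) / bitset(q_sigs[i])

def sim_basic_total(q_sigs, d_sigs):
    """Equation 3.2: summed over the representative colors of Q."""
    return sum(sim_basic(q_sigs, d_sigs, i) for i in q_sigs)

# Hypothetical signatures for two images.
q = {'red': 0b11110000, 'blue': 0b00001111}
d = {'red': 0b11000000}            # D has no blue cells at all
assert sim_basic(q, d, 'red') == 0.5   # 2 of Q's 4 red cells match
assert sim_basic(q, d, 'blue') == 0.0
assert sim_basic_total(q, d) == 0.5
```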
Similarity function with perceptually similar colors. Because of the
effectiveness of using perceptually similar colors [Niblack et al., 1993], Chua
et al. also incorporated the contributions of perceptually similar colors in their
similarity measure. To determine the degree of similarity between two colors,
the method proposed by Ioka [Ioka, 1989] was adopted. The method first
transforms colors in the RGB space to the CIE (Commission Internationale
de l'Eclairage) L*u*v* space, and the similarity between two colors can be
measured from the Euclidean distance between the colors in the CIE L*u*v*
space. The Euclidean distance between two colors, i and j, in the L*u*v*
space is computed as:
D(i, j) = sqrt( (Li* − Lj*)² + (ui* − uj*)² + (vi* − vj*)² )
(3.3)
Let M denote the number of L*u*v* colors the system can support. The degree
of similarity between two colors, i and j, is given by:
SIM(i, j) = 0                           if D(i, j) > p × Dmax
          = 1 − D(i, j) / (p × Dmax)    otherwise
(3.4)
where Dmax = max D(i, j), i ≠ j, 1 ≤ i, j ≤ M, and p is a predetermined
threshold value between 0 and 1 (in our study, we have arbitrarily set p to 0.2).
Essentially, p x Dmax represents the tolerance in which two colors are considered
to be similar. If SIM(i, j) > 0, then color i is said to be perceptually similar to
color j, and vice versa. The larger the value of SIM(i, j), the more similar the
two colors are. If SIM(i, j) = 0, it means that the two colors are not perceived
to be similar. The similarity values computed for all pairs of colors are stored
in an M × M matrix, called the color similarity matrix (denoted SM), where
entry (i, j) corresponds to the value of SIM(i, j). SM is stored in a flat file and
will be frequently used during the retrieval process to determine the similarity
between two colors.
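A sketch of how the color similarity matrix SM might be computed from Equations 3.3 and 3.4; the function names and the toy L*u*v* coordinates are ours:

```python
import math

def luv_distance(c1, c2):
    """Euclidean distance between two (L*, u*, v*) colors (Eq. 3.3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def similarity_matrix(colors, p=0.2):
    """Color similarity matrix SM (Eq. 3.4): SM[i][j] is 0 when the
    distance exceeds the tolerance p * Dmax, and falls off linearly
    from 1 otherwise."""
    M = len(colors)
    d_max = max(luv_distance(colors[i], colors[j])
                for i in range(M) for j in range(M) if i != j)
    tol = p * d_max
    return [[0.0 if luv_distance(ci, cj) > tol
             else 1.0 - luv_distance(ci, cj) / tol
             for cj in colors] for ci in colors]

# Three hypothetical colors spread along the L* axis.
sm = similarity_matrix([(0, 0, 0), (10, 0, 0), (100, 0, 0)], p=0.2)
assert sm[0][0] == 1.0    # a color is fully similar to itself
assert sm[0][1] == 0.5    # distance 10, tolerance 0.2 * 100 = 20
assert sm[0][2] == 0.0    # distance 100 exceeds the tolerance
```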
Under the signature approach, the contribution of the perceptually similar
colors of color i for query image Q and database image D is computed as
follows:
SIMpercept(Q, D, i) = Σ_{j ∈ Sp} ( BitSet(Qi | Dj) / BitSet(Qi) ) × SM(i, j)
(3.5)
IMAGE DATABASES 109
where Sp is the set of colors that are perceptually similar to color i as de-
rived from the color similarity matrix SM. SM(i,j) denotes the (i,j) entry
of matrix SM. To take the contributions of perceptually similar colors into
consideration, Equations 3.1 and 3.5 can be combined to obtain the perceived
similarity between two signatures on color i as follows:
SIMcolor-spatial(Q, D, i) = SIMbasic(Q, D, i) + SIMpercept(Q, D, i) (3.6)
Thus, the similarity measure for query image Q and database image D is the
sum of the similarity for each color in the representative set CQ for image Q,
and is given as follows:
SIMcolor-spatial(Q, D) = Σ_{∀i ∈ CQ} SIMcolor-spatial(Q, D, i) (3.7)
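Equations 3.5 through 3.7 can be sketched as follows, assuming the per-color signatures and the matrix SM (here a dictionary keyed by color pairs) have been computed beforehand; all names are illustrative:

```python
def bitset(bs):
    """Number of set bits (BitSet in the text)."""
    return bin(bs).count('1')

def sim_basic(q_sigs, d_sigs, i):
    """Equation 3.1."""
    if i not in d_sigs:
        return 0.0
    return bitset(q_sigs[i] & d_sigs[i]) / bitset(q_sigs[i])

def sim_percept(q_sigs, d_sigs, i, sm, sp):
    """Equation 3.5: sp[i] lists the colors perceptually similar to i
    (the set Sp), and sm[(i, j)] is the corresponding entry of SM."""
    return sum(bitset(q_sigs[i] & d_sigs[j]) / bitset(q_sigs[i]) * sm[(i, j)]
               for j in sp.get(i, []) if j in d_sigs)

def sim_color_spatial(q_sigs, d_sigs, sm, sp):
    """Equations 3.6 and 3.7: basic plus perceptual contributions,
    summed over the representative colors of Q."""
    return sum(sim_basic(q_sigs, d_sigs, i) +
               sim_percept(q_sigs, d_sigs, i, sm, sp)
               for i in q_sigs)

# D has no 'red', but its 'crimson' cells overlap Q's 'red' cells,
# so the perceptual term still contributes.
q = {'red': 0b1100}
d = {'crimson': 0b1000}
sp = {'red': ['crimson']}
sm = {('red', 'crimson'): 0.5}
assert sim_color_spatial(q, d, sm, sp) == 0.25
```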
Weighted similarity function. In the above similarity measure, all the
dominant colors have been implicitly assigned the same weight. However, in
some applications, it may be desirable to give the object colors a higher weight.
This is particularly useful when the object is at the center and the user is only
interested in retrieving images containing similar objects at similar locations.
The authors also proposed a weighted similarity measure which is given as
follows:
SIMweighted(Q, D) = Σ_{i ∈ Ci} SIMcolor-spatial(Q, D, i) +
                    wt × Σ_{i ∈ Cc} SIMcolor-spatial(Q, D, i)
(3.8)
where Ci and Cc are the sets of background and object colors of Q respectively,
and wt (> 1) is the weight given to the object colors. A weight greater than
1 can be assigned to the object colors to give a higher score to images whose
object colors are similar to those of the query image.
3.6 Summary
In this chapter, we have surveyed content-based indexing mechanisms for image
database systems. We have looked at various methods of representing and
organizing image features such as color, shape and texture in order to facilitate
speedy retrieval of images, and how similarity retrievals can be supported. In
particular, we have provided a more in-depth discussion of color-spatial techniques that
exploit colors as well as their spatial distribution for image retrieval.
As images will continue to play an important role in many applications,
we believe the need for efficient and effective retrieval techniques and access
methods will increase. While much work has been done in recent years,
there remains much to be explored in this field. In what follows, we outline several
promising areas (not meant to be exhaustive) that require further research.
Performance evaluation
This chapter has presented a representative set of indexes for content-based im-
age retrievals. Unlike other related areas such as spatial databases, the number
of indexes proposed to facilitate speedy retrieval of images is still very small.
This is probably because content-based image retrieval has been largely studied
by researchers in the pattern recognition and imaging communities, whose focus
has been on extracting and understanding features of the image content, and
on studying the retrieval effectiveness of the features (rather than on efficiency
issues). It is not surprising then that the indexes discussed have not been
extensively evaluated. Besides [Ooi et al., 1997], which reported a preliminary
performance comparison demonstrating that SMAT outperforms the R-tree in most
cases, most of the other works have only been compared with the sequential scanning
approach.
We believe that a comparative study is not only necessary but will be useful
for application designers and practitioners to pick the best method for their
applications. It will also help researchers to design better indexes that overcome
the weaknesses and preserve the strengths of existing techniques. Another
aspect of performance study, which is applicable for indexes in general, is the
issue of scalability. Again, most of the existing work has been performed on
small databases. How well such indexes scale is unclear until they
have been put to the test. Readers are referred to [Zobel et al., 1996] for
some guidelines on comparative performance studies of indexing techniques.
More on access methods
The focus of this chapter has been on content-based access methods. There are
many other content-based retrieval techniques that have been proposed in the
literature [Aslandogan et al., 1995, Chua et al., 1994, Gudivada and Raghavan,
1995, Hirata et al., 1996, Iannizzotto et al., 1996, Nabil et al., 1996] and shown
to be effective (in terms of recall and precision). These works, however, have
not addressed the issue of speedy retrievals. Designing efficient access methods
for these promising methods will make them more practical and useful.
Another promising direction is to further explore color and its spatial dis-
tribution. One issue is to exploit the colors that are perceptually similar. For
example, out of the 16.7 million shades of color displayable on a 24-bit
color monitor, the human eye can only differentiate up to 350,000 shades. As
such, colors that are perceived to be similar should contribute to the comparison
of color similarity. While some work has been done in this direction [Chua
et al., 1997, Niblack et al., 1993], perceptually similar colors are considered in
the computation of the degree of similarity, rather than being modeled in the
feature representation. We believe the latter can be more effective in pruning
the search space. Another issue is to exploit texture and color for segmentation
of an image space. Indexing of clusters based on both texture and color may
be more effective.
Concurrent access and distributed indexing
Traditionally, image retrieval systems have been used for archival systems that
are usually static in that the images are rarely updated. As such, the issue of
supporting concurrent accesses is not critical. Instead, in such applications,
the access methods should be designed to exploit this static characteristic.
However, as multimedia applications proliferate, we expect to see more
real-time applications as well as applications running in parallel or distributed
environments. In both cases, existing techniques will have to be extended to
support concurrent accesses. Some techniques have been developed for centralized
systems [Bayer and Schkolnick, 1977, Sagiv, 1986, Ng and Kameda, 1993]
as well as for parallel and distributed environments [Achyutuni et al., 1996, Kroll
and Widmayer, 1994, Litwin et al., 1993b, Tsay and Li, 1994]. But we believe
more research tailored to image data, especially to techniques that involve
hierarchical structures, is needed.
Integration and optimization
The retrieval results of an image database system are usually not very precise.
The effectiveness of using the content of an image for retrieval depends very
much on the image representation and the similarity measure. It has been
reported that using colors and textures can achieve a retrieval effectiveness of
up to 60% in recall and precision [Chua et al., 1996]. Furthermore, different
retrieval models based on different combinations of visual attributes and text
descriptions achieve almost the same levels of retrieval effectiveness. Moreover,
each model is able to retrieve a different subset of relevant images. This is
because each image feature only captures a part of the image's semantics. The
problems then include selecting an "optimal" set of image features that best fits
an application, as well as developing techniques that can integrate them to
achieve optimal results. One promising method is to use content-based
techniques as the basis, but also to exploit the semantic meanings of the images and
queries to support concept-based queries. Such techniques have been known as
semantic-based retrieval techniques. Typically, some form of knowledge base
is required, rendering such techniques domain-specific. In [Chua et al., 1996],
the domain knowledge is supplied by users as part of a query. The query is
modeled as a hierarchy of concepts through a concept specification language.
Concepts are defined in terms of multiple image content attributes such
as text, colors and textures. Each concept has three components: its name,
its relationships with other concepts, and rules for its identification within the
images' contents. In answering queries, the respective indexes are used to speed
up the retrievals for concepts that are at the leaf of the hierarchy, and their
results combined based on the hierarchy of concepts defined. More studies are
certainly needed along this direction.
4 TEMPORAL DATABASES
Apart from some primary keys and keys that rarely change, many attributes
evolve and take new values over time. For example, in an employee relation,
employees' titles may change as they take on new responsibilities, as will their
salaries as a result of promotion or increment. Traditionally, when data is
updated, its old copy is discarded and the most recent version is captured.
Conventional databases that have been designed to capture only the most recent
data are known as snapshot databases. With the increasing awareness of the
values of the history of data, maintenance of old versions of records becomes
an important feature of database systems.
In an enterprise, the history of data is useful not only for control purposes,
but also for mining new knowledge to expand its business or to move on to a new
frontier. Historical data is increasingly becoming an integral part of corporate
databases despite its maintenance cost. In such databases, versions of records
are kept and the database grows as time progresses. Data is retrieved based
on the time for which it is valid or recorded. Databases that support the storage
and manipulation of time varying data are known as temporal databases.
In a temporal database, the temporal data is modeled as collections of line
segments. These line segments have a begin time, an end time, a time-invariant
attribute, and a time-varying attribute.
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
Temporal data can either be valid time
or transaction time data. Valid time represents the time interval when the
database fact is true in the modeled world, whereas transaction time is when a
transaction is committed. A less commonly used time is the user-defined time,
and more than one user-defined time is allowed.
A database that supports transaction time may be visualized as a sequence
of relations indexed by time and is referred to as a rollback database. The
database can be rolled back to a previous state. Here the rollback database
is distinguished from the traditional snapshot database where temporal at-
tributes are not supported and no rollback facility is supported. A database
that supports valid time records a history of the enterprise being modeled as
it is currently known. Unlike rollback databases, these historical databases al-
low retroactive changes to be made to the database as errors are identified. A
database that supports both time dimensions is known as a bitemporal database.
Whereas a rollback database views records as being valid at some time as of
that time, and a historical database always views records as being valid at some
moment as of now, a bitemporal database makes it possible to view records as
being valid at some moment relative to some other moment.
One of the challenges for temporal databases is to support efficient query
retrieval based on time and key. To support temporal queries efficiently, a
temporal index that indexes and manipulates data based on temporal relation-
ships is required. Like most indexing structures, the desirable properties of a
temporal index include efficient usage of disk space and speedy evaluation of
queries. Valid time intervals of a time-invariant object can overlap, but each
interval is usually closed. On the other hand, transaction time intervals of a
time-invariant object do not overlap, and its last interval is usually not closed.
Both properties present unique problems to the design of time indexes. In this
chapter, we briefly discuss the characteristics of temporal applications, tempo-
ral queries, and various promising structures for indexing temporal relations.
We also report on an evaluation of some of the indexing mechanisms to provide
insights on their relative performance.
4.1 Temporal databases
In this section, we briefly describe some of the terms and data types used in
temporal databases. For a complete list of terms and their definitions, please
refer to [Jensen, 1994].
An instant is a time point on an underlying time dimension. In our discus-
sions that follow, we use 0 to mark the beginning of time, and time point to
mean an instant on the discrete time axis. A time interval [Ts, Te] is the time
between two time points, Ts and Te, where Ts ≤ Te, with the inclusion of the
end time. Note that the closed range representation is equivalent to the non-closed
range representation, since [Ts, Te] = [Ts, Te + 1). A chronon is a non-decomposable
time interval of some fixed minimal duration. In some applications, chronons
have been used to represent an interval. A span or time span is a directed du-
ration of time. It is the length of the time with no specific starting and ending
time points. A lifespan of a record is the time when it is defined. A lifespan
of a version (tuple) of a record is the time in which it is defined with certain
time-varying key values. For indexing structures that support time intervals,
start time and version lifespan are two parameters that may affect their query
and storage efficiency.
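The equivalence [Ts, Te] = [Ts, Te + 1) between closed and half-open intervals on a discrete time axis can be checked with a small sketch (the helper names are ours):

```python
def closed_to_half_open(ts, te):
    """On a discrete time axis, the closed interval [Ts, Te] covers
    exactly the same chronons as the half-open interval [Ts, Te + 1)."""
    return (ts, te + 1)

def in_closed(ts, te, t):
    return ts <= t <= te

def in_half_open(ts, te, t):
    return ts <= t < te

# Both representations cover exactly the same time points:
ts, te = 3, 7
hs, he = closed_to_half_open(ts, te)
assert all(in_closed(ts, te, t) == in_half_open(hs, he, t)
           for t in range(0, 12))
```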
4.1.1 Transaction time relations
Transaction time refers to the time when a new value is posted to the database
by a transaction [Jensen, 1994]. For example, suppose a transaction time rela-
tion is created at time Ti , so that Ti is the transaction time value for all the
tuples inserted at the creation of the relation. The lifespan of these tuples is
[Ti, NOW]. The right end of the lifespan is open at this time; it can be
assumed to have the value NOW to indicate a progressing time span. At time
Tj when a new version of an existing record is inserted, the lifespan of the new
version is [Tj , NOW], and that of the previous version is [Ti , Tj). Transaction
times which are system generated follow the serialization order of transactions,
and hence are monotonically increasing. As such, a transaction time database
can be rolled back to some previous state along its transaction time dimension.
There are two representations for transaction time intervals. One approach
is to model transaction time as an interval [Snodgrass, 1987] and the other is
to model transaction time using a time point [Jensen et al., 1991, Lomet and
Salzberg, 1989, Nascimento, 1996]. The latter approach implicitly models an
interval by using the time when a new version is inserted as the start time
of its transaction time, and the time point immediately before the insertion
of the next version as its transaction end time. In what follows,
we shall use the single time point representation to model transaction time.
However, explicit representation of transaction time intervals is often used for
performance reasons.
To illustrate the concept of temporal relations, we use a tourist relation that
keeps track of the movement of tourists to study the tourism industry. The
relation has a time invariant attribute, pid, and a time varying attribute, city.
At time 0, the relation is created and the transaction time value for the current
tuples is 0 (Table 4.1). The lifespan of these tuples is [0, NOW]. At time 3, the
tuple with pid = p1 is updated; the new city value is Los Angeles (Table 4.2).
Table 4.1. A tourist transaction time relation at time 0.
tuple pid city Tt
t1 p1 New York 0
t2 p2 Washington 0
t3 p3 New York 0
Table 4.2. The tourist transaction time relation at time 3.
tuple pid city Tt
t1 p1 New York 0
t2 p2 Washington 0
t3 p3 New York 0
t4 p1 Los Angeles 3
t5 p6 Seattle 3
To keep the history, a new tuple t4 is inserted. Thus, the lifespan for t1 is [0,
3) and the lifespan of t4 is [3, NOW].
In the transaction time relation, there are no retroactive updates (updates
that are valid in the past) and predictive updates (updates that will be valid
in the future). Each transaction is committed immediately with the current
transaction time. For instance, if at time 2, the city for p1 changes to Seattle,
this update cannot be committed at time 3. If a tuple will be updated at time
4, this update cannot be reflected in Table 4.2, because predictive update is
not supported in the transaction time relation. Note that time intervals that
are still valid at the present time point are not closed. In other words, the end
time progresses with the current time.
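A minimal sketch (the class and helper names are ours) of this insert-only behavior, reproducing Tables 4.1 and 4.2 and deriving version lifespans from the single transaction-time points, with NOW modeled as infinity:

```python
NOW = float('inf')  # stands for the progressing current time

class TransactionTimeRelation:
    """Insert-only relation: an update never modifies old tuples;
    the lifespan of a version is derived from the next version's Tt."""
    def __init__(self):
        self.tuples = []  # (pid, city, Tt), in transaction order

    def insert(self, pid, city, tt):
        # Transaction times follow the serialization order, so no
        # retroactive or predictive updates are possible.
        assert all(tt >= t for _, _, t in self.tuples)
        self.tuples.append((pid, city, tt))

    def lifespan(self, index):
        """[Tt, next version's Tt) for the version at `index`."""
        pid, _, tt = self.tuples[index]
        later = [t for p, _, t in self.tuples[index + 1:] if p == pid]
        return (tt, later[0] if later else NOW)

# Tables 4.1 and 4.2 from the text:
r = TransactionTimeRelation()
r.insert('p1', 'New York', 0)     # t1
r.insert('p2', 'Washington', 0)   # t2
r.insert('p3', 'New York', 0)     # t3
r.insert('p1', 'Los Angeles', 3)  # t4
r.insert('p6', 'Seattle', 3)      # t5
assert r.lifespan(0) == (0, 3)    # t1: [0, 3)
assert r.lifespan(3) == (3, NOW)  # t4: [3, NOW]
```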
4.1.2 Valid time relations
The transaction time dimension only represents the history of transactions; it
does not model real-world activity. We need a time dimension to model the history of
an enterprise such that the database can be rolled back to the right time-slice
with respect to the enterprise activity. Valid time is the time when a fact is
true. In a valid time relation, a time interval [Ts , Te] is used to indicate when
the tuple is true. Valid time intervals are usually supplied by the user, and each
Table 4.3. The tourist valid time relation at time 0.
tuple pid city Ts Te
t1 p1 New York 0 3
t2 p2 Washington 0 NOW
t3 p3 New York 0 NOW
Table 4.4. The tourist valid time relation at time 3.
tuple pid city Ts Te
t1 p1 New York 0 3
t6 p1 Seattle 2 3
t2 p2 Washington 0 NOW
t3 p3 New York 0 NOW
t4 p1 Los Angeles 3 NOW
t5 p6 Seattle 3 6
t7 p5 Washington 4 6
new tuple is inserted into the relation with its associated valid time interval.
A time-invariant key can have different versions with overlapping valid time,
provided the temporal attributes of these versions are different. Time intervals
that progress with the current time are open. Since valid times are usually determined by
users, new tuples often have closed intervals that end before or after the current
time NOW.
Tables 4.3 and 4.4 show the valid time relation of tourist. At time 0, the
tuples are inserted with their valid time ranges. Assume that in the period [2, 3], the city
for p1 is changed from New York to Seattle, and from time 3, it is changed
again to Los Angeles. The relation in Table 4.4 represents these updates. Note
also that the valid time relation in Table 4.4 can capture proactive insertions,
for example, tuple t7 which has the valid time interval [4, 6] appears in the
relation at time 3.
Unlike a transaction time relation, a valid time relation supports retroactive
and predictive updates. If an error is discovered in an older version of a record,
it is modified with the correct value, the old value being substituted by the new
one. Hence it is not possible to roll back to the past as in a transaction time
database.
Table 4.5. The tourist bitemporal relation at time 0.
tuple pid city Ts Te Tt
t1 p1 New York 0 3 0
t2 p2 Washington 0 NOW 0
t3 p3 New York 0 NOW 0
Table 4.6. The tourist bitemporal relation at time 5.
tuple pid city Ts Te Tt
t1 p1 New York 0 3 0
t6 p1 Seattle 2 3 3
t2 p2 Washington 0 NOW 0
t3 p3 New York 0 NOW 0
t4 p1 Los Angeles 3 NOW 3
t5 p6 Seattle 3 6 3
t7 p5 Washington 4 6 3
t8 p5 Washington 5 8 5
4.1.3 Bitemporal relations
In some applications, both the transaction time and valid time must be mod-
eled. This is to facilitate queries for records that are valid at some valid time
point and as of some transaction time point. A relation that supports both
times is known as a bitemporal relation, which has exactly one system sup-
ported valid time and exactly one system supported transaction time. Table 4.5
illustrates an example of the tourist bitemporal relation at time 0.
From Table 4.6, note that tuples t7 and t8, with the same pid and city
values, bear overlapping valid times [Ts, Te]. This is possible because the two
tuple versions have different transaction time values. In a valid time relation,
however, this situation cannot be represented.
Like a valid time relation, the bitemporal relation also supports retroactive
and predictive versioning.
4.2 Temporal queries
Various types of queries for temporal databases have been discussed in the
literature [Gunadhi and Segev, 1993, Salzberg, 1994, Shen et al., 1994]. As in
other application domains, temporal indexing structures must be able to support
a common set of simple and frequently used queries efficiently. In this section,
we describe a set of common temporal queries. These queries should be used
to benchmark the efficiency of a temporal index.
We use the tourist relation shown in Table 4.7 as an example in our discussion
that follows. We assume that the time granularity for this application is one
day for both valid and transaction time. Consider the first tuple. The object
with pid pI is at New York from day 0 to day 2 inclusive. Its transaction time
starts at day 1 and ends when there is an update to the tuple.
A set of canonical queries was initially proposed by Salzberg [Salzberg, 1994].
We extend this set of queries by further classifying temporal queries in each
query type based on the search predicates - intersection, inclusion, contain-
ment and point. Such finer classification can provide insights into the effec-
tiveness of the indexes on different kinds of search predicates. For queries
that involve only one time and one key, the key can either be a time-invariant
attribute or a time-varying attribute, and the time can either be valid time
or transaction time. The single time dimension queries are more meaningful
for valid time databases, but they can also be applied to transaction time; the
search remains the same although the semantics of time may be different. The
following constitutes the common set of temporal queries:
1. Time-slice queries. Find all valid versions during the given time interval
[Ts, Te]. For a valid time database, the answer is a list of tuples whose valid
times fall within the query time interval. For a transaction time database, the
answers are snapshots during the query time interval, and hence the predicate
"as of" is used for transaction time.
Based on the search operation on the temporal index, time-slice queries can
be further classified as:
• Intersection queries. Given a time interval [Ts , Te], retrieve all the
versions whose time intervals intersect it. For example, a valid time
query to find all tourists who are in the US during the interval [3, 7] would
return 9 tuples: t2, t3, t4, t5, t6, t7, t10, t12 and t14.
• Inclusion queries. Given a time interval [Ts, Te], retrieve all the versions
whose valid time intervals are included in it. For example, the query
"Find all tourists who stay in a city between day 3 and day 7" would
return 2 tuples: t5 and t10.
• Containment queries. Given a time interval [Ts, Te], retrieve all the
versions whose valid time intervals contain it. For example, the query
"Find all tourists who stay in a city from day 3 to day 5" would result
in 5 tuples: t3, t4, t7, t10 and t14.
• Point queries. Given a specific time point t (instant), retrieve all the
versions whose valid intervals contain the time point. Point queries
can be viewed as a special case of intersection queries or containment
queries where the time interval [Ts, Te] is reduced to a single time instant
t. For example, the query "Find all tourists who are in the US on day 1"
would result in 3 tuples: t1, t3 and t4.
2. Key-range time-slice queries. Find all tuples which are in a given key range
[ks, ke] and valid during the given time interval [Ts, Te]. It is a conjunction
of keys and time. Like the time-slice query, the time-slice part of
the query can assume one of the predicates described above. For example,
the query to find all tourists who are in New York during the interval [3,7]
is a key-range time-slice query with intersection predicate. The result of the
query is now 2 tuples instead: t3 and t6. As another example, the query
"Retrieve all tourists who are in cities with names beginning in the range
[D,N] on day 1" would be a point key-range time-slice query that results in
3 tuples: t1, t3 and t4.
The key-range time-slice query is an exact-match query if both ranges are
reduced to single values; that is, find the versions of the record with key k at
time t. An example of this category is "Find all tourists who visited New
York on day 1", which results in tuples t1 and t3.
3. Key queries. Find all the historical versions of the records in the given key
range [ks, ke]. Such a query is a pure key-range query over the whole lifespan.
For example, the query "Find all tourists who visited New York" is a past
versions query. This query will return the tuples: tl, t3, t6, t9 and tIl.
4. Bitemporal time-slice queries. Find all versions that are valid during the
given time interval [Ts, Te] as of a given transaction time Tt.
5. Bitemporal key-range time-slice queries. Find all versions which are in the
given key range [ks, ke] and valid during the given time interval [Ts, Te]
as of a given transaction time Tt.
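The four search predicates can be sketched against the tourist relation of Table 4.7; this illustrative snippet models NOW as infinity and reproduces the example results given above:

```python
NOW = float('inf')  # open right end of intervals still valid at present

# The tourist relation of Table 4.7: (tuple, pid, city, Ts, Te).
R = [
    ('t1',  'p1', 'New York',       0, 2),
    ('t2',  'p2', 'Washington',     5, NOW),
    ('t3',  'p3', 'New York',       0, 6),
    ('t4',  'p4', 'Detroit',        0, 7),
    ('t5',  'p5', 'Washington',     4, 6),
    ('t6',  'p5', 'New York',       7, NOW),
    ('t7',  'p6', 'Seattle',        3, NOW),
    ('t8',  'p4', 'Washington',    10, NOW),
    ('t9',  'p3', 'New York',      12, NOW),
    ('t10', 'p1', 'Los Angeles',    3, 6),
    ('t11', 'p7', 'New York',      14, NOW),
    ('t12', 'p1', 'Detroit',        7, 9),
    ('t13', 'p1', 'Detroit',       10, 12),
    ('t14', 'p9', 'Los Angeles',    3, 8),
    ('t15', 'p1', 'San Francisco', 13, NOW),
]

def intersection(ts, te):
    """Versions whose valid interval intersects [ts, te]."""
    return [t for t, _, _, s, e in R if s <= te and e >= ts]

def inclusion(ts, te):
    """Versions whose valid interval is included in [ts, te]."""
    return [t for t, _, _, s, e in R if ts <= s and e <= te]

def containment(ts, te):
    """Versions whose valid interval contains [ts, te]."""
    return [t for t, _, _, s, e in R if s <= ts and te <= e]

def point(t):
    """Versions whose valid interval contains the instant t."""
    return intersection(t, t)

# The examples from the text:
assert intersection(3, 7) == ['t2', 't3', 't4', 't5', 't6', 't7',
                              't10', 't12', 't14']
assert inclusion(3, 7) == ['t5', 't10']
assert containment(3, 5) == ['t3', 't4', 't7', 't10', 't14']
assert point(1) == ['t1', 't3', 't4']
```

A key-range time-slice query simply adds a conjunctive filter on the key attribute (for example, city) to the same time predicate.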
To answer time-slice queries, the index must be able to support retrieval
based on time. The key-range time-slice queries require the search to be based
on both key and line segments. To support valid time, an index must support
dynamic addition, deletion and update of data on the time-dimension, and
Table 4.7. A tourist relation for running examples.
tuple pid city period trans_time
t1 p1 New York [0, 2] 1
t2 p2 Washington [5, now] 1
t3 p3 New York [0, 6] 1
t4 p4 Detroit [0, 7] 2
t5 p5 Washington [4, 6] 2
t6 p5 New York [7, now] 3
t7 p6 Seattle [3, now] 3
t8 p4 Washington [10, now] 3
t9 p3 New York [12, now] 3
t10 p1 Los Angeles [3, 6] 3
t11 p7 New York [14, now] 4
t12 p1 Detroit [7, 9] 4
t13 p1 Detroit [10, 12] 5
t14 p9 Los Angeles [3, 8] 6
t15 p1 San Francisco [13, now] 6
support time that is beyond the current time. In other words, retroactive and
proactive updates are required. An index that has been designed for valid time
can be easily extended for transaction time even though a transaction database
can be thought of as an evolving collection of objects. The major differences
are that delete operations are not required for transaction time databases, and
time increases on one end dynamically as it progresses. However, it is much
more difficult to extend a transaction time index for indexing valid time data
since transaction time indexes are designed based on the fact that transaction
times do not overlap, and this property is quite often built into the index.
Further, some transaction time indexes are specifically designed for intervals
that are always appended from the current time, and do not support retroactive
updates or proactive insertions.
4.3 Temporal indexes
Without considering the semantics of time, temporal data can be indexed as
line segments based on its start time, end time, or the whole interval, together
with the time-varying attribute or time-invariant attribute. Indexing structures
based on start time or end time are straightforward and structurally similar to
existing indexes such as B+-tree [Comer, 1979]. Such an index is not efficient for
answering queries that involve time-slice since no information on the data space
is captured in the index. To search for time intervals with a given interval, a
large portion of the leaf nodes have to be scanned. To alleviate such a problem,
temporal data can be duplicated at the data buckets whose data space of time
intervals it intersects. However, duplication increases storage cost and the
height of the index, which affects the query cost. Alternatively, temporal data
can be indexed directly as line segments or mapped into point data and indexed
using multi-dimensional indexes. As such, most temporal indexes proposed so
far are mainly based on the conventional B+-tree and spatial indexes like the
R-tree [Guttman, 1984].
In this section, we review several promising indexes for temporal data. They
are the Time-Split B-tree [Lomet and Salzberg, 1989, Lomet and Salzberg,
1990b, Lomet and Salzberg, 1993], the Time Index [Elmasri et al., 1990], the
Append-Only tree [Gunadhi and Segev, 1993], the R-tree [Guttman, 1984], the
Time-Polygon tree [Shen et al., 1994], the Interval B-tree [Ang and Tan, 1995],
and the B+-tree with Linearized Order [Goh et al., 1996]. Where necessary,
we also discuss the extensions that have to be incorporated for such indexes to
facilitate retrieval by both key and time dimensions.
4.3.1 B-tree based indexes
The Time-Split B-tree. The Time-Split B-Tree (TSB-tree) [Lomet and
Salzberg, 1989, Lomet and Salzberg, 1990b, Lomet and Salzberg, 1993] is a
variant of the Write-Once B-Tree (WOBT) [Easton, 1986]. The TSB-tree is
one of the first temporal indexes that support search based on key attribute
and transaction time. An internal node contains entries of the form <att-value,
trans-time, Ptr>, where att-value is the time-invariant attribute value of
a record, trans-time is the timestamp of the record, and Ptr is a pointer to a
child node [Lomet and Salzberg, 1989].
Searching algorithms are affected by how a node is split and the information
it captures about its data space. Therefore, we shall begin by looking at the
splitting strategy. In the TSB-tree, two types of node splits are supported: key
value and time splits. A key split is similar to a node split in a conventional
B+-tree where a partition is made based on a key value. A TSB-tree after a key
split is shown in Figure 4.1. For the time split, an appropriate time is selected
to partition a node into two. Unlike a key split, all record entries that persist
through the split time are replicated in the new node, which stores entries with
time greater than the split time. Figure 4.2 shows TSB-tree time splitting, in
which the record <p1, Detroit, 4> is duplicated in both the historical and new nodes.
If the number of distinct attribute values in a node is more than ⌊M/2⌋ (where M is
TEMPORAL DATABASES 123
[Figure content: an index page over data pages holding p1 New York T=1, p2 Washington T=1 and p3 New York T=1, after insertion of record <p9, Los Angeles, 6>.]
Figure 4.1. A key split of a leaf node in the TSB-tree based on p3.
the maximum number of entries in a node), a key split is performed; otherwise
the node is split based on time. If no split time can be used other than the lowest
time value among the index items, a key split is executed instead of a time split.
To search based on key and time, the index keys and times of internal nodes are
used to guide the search. With data replication, data whose time
intersects the data space defined in an index entry is properly contained in
its subtree, and this enables fast search-space pruning.
The TSB-tree can only support transaction times in the sense that times of
the same invariant key must strictly be in increasing order. In other words,
there is no time overlapping among versions of a record. When a record is
updated, the existing record becomes a historical record, and a new version
of the record is inserted. The TSB-tree can answer all the basic queries on
transaction time and time-invariant key.
The major problem of the TSB-tree is that data replication could be severe,
and hence this may affect its storage requirements and query performance. As
noted, the index cannot be used for valid time data.
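The split-selection rules just described can be sketched as follows. The flat entry lists, the capacity constant M, and the helper names are illustrative assumptions, not the TSB-tree's actual page layout.

```python
# A minimal sketch of the TSB-tree split decision: key split when the node
# holds many distinct keys (or no usable split time exists), time split
# otherwise, replicating versions that persist through the split time.

M = 4  # assumed maximum number of entries per node

def split_node(entries, split_time):
    """Split an overflowing node.

    entries: list of (key, trans_time, payload) tuples.
    Returns (old_node, new_node) after a key split or a time split.
    """
    distinct_keys = {key for key, _, _ in entries}
    lowest_time = min(t for _, t, _ in entries)

    # Key split: enough distinct keys, or no split time above the lowest time.
    if len(distinct_keys) > M // 2 or split_time <= lowest_time:
        keys = sorted(distinct_keys)
        median = keys[len(keys) // 2]
        left = [e for e in entries if e[0] < median]
        right = [e for e in entries if e[0] >= median]
        return left, right

    # Time split: the version of each key alive at split_time is replicated
    # in the new (current) node; the old node becomes historical.
    current_version = {}
    for key, t, payload in sorted(entries, key=lambda e: e[1]):
        if t <= split_time:
            current_version[key] = (key, t, payload)  # latest version so far
    historical = [e for e in entries if e[1] <= split_time]
    new = list(current_version.values()) + [e for e in entries if e[1] > split_time]
    return historical, new
```

Run on the data of Figure 4.2, the time split at T=5 replicates <p1, Detroit, 4> in both resulting nodes, as the text describes.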
The Time Index. Elmasri et al. [Elmasri et al., 1990] proposed the time
index to provide access to temporal data valid in a given time interval. The
technique duplicates the data on some selected time intervals and indexes them
using a B+-tree-like structure. Duplications not only incur additional cost
124 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
[Figure content: data pages with versions p1 New York T=1, p1 Los Angeles T=3 and p1 Detroit T=4; record <p9, Los Angeles, 6> is inserted and T=5 is chosen as the split time.]
Figure 4.2. Time splitting in the TSB-tree.
in insertion and deletion, but also degrade the space utilization and query
efficiency. In the worst case, where all intervals start at different instants but end
at the same instant, the storage cost is of order O(n²).
As for the query operation, reporting all intersections with a long interval
requires on the order of O(n²) operations, since most of the buckets need to be searched. To
reduce the number of duplications, an incremental scheme is adopted which
only allows the leading buckets to keep all their id's, whereas others maintain
the starting or ending instants [Elmasri et al., 1990]. Figure 4.3 depicts the
time index constructed using the most current snapshot of the tourist relation
in Table 4.7. In the figure, the "+" and "-" signs indicate the starting instant
and ending instant of an interval respectively. The number of duplications
has been reduced; however, there are still many duplications for tuples having
long intervals. To search from an instant onward, all the leading id buckets
belonging to the same leaf node have to be read and checked. For instance, the
query "Find all persons who were in the United States from day 4 to day 6" can
be answered by locating indexing point 4, and reconstructing the list of valid
tuples from the leading bucket and subsequent entries right up to indexing
point 6. To insert or delete a long time interval, the number of leading id
buckets to be read and updated can be high, on the order of O(n).
The time-index is likely to be efficient for short query intervals and short time
intervals. For long data intervals, the amount of duplication can be significant.
[Figure content: a sequence of indexing points with leading id buckets and incremental entries such as (t1, t4, t7), (+t10), (+t2, -t1, -t10), (t2, t7), (+t8, +t11, -t7), (t2, t11, t12), (+t14), (+t13).]
Figure 4.3. The time index constructed from the tourist relation.
This will affect query efficiency as the tree becomes taller and the number of
leaf nodes increases. In addition, index support is provided for only a single
notion of time (in this case, valid time) and it is not clear how this can be
naturally extended to support temporal queries involving both transaction and
valid time. Elmasri et al. [Elmasri et al., 1990] also suggested that their time
index can be appended to regular indexes to facilitate processing of historical
queries involving other non-temporal search conditions. For example, if queries
such as "Find all persons who entered the United States via LA and remained from
day 4 to day 6" are expected on a regular basis, they may be supported by
attaching a time index structure to each leaf entry of a B+-tree constructed for
the attribute city. Answering the above query involves traversing the first B+-
tree to identify the leaf entry corresponding to attribute value "LA", followed
by an interval search on the time index found there. However, this approach
may not be scalable since the number of time indexes will certainly grow to be
exorbitantly large in any nontrivial database.
The Append-Only tree. The Append-Only tree (AP-tree) [Gunadhi and
Segev, 1993] introduced by Gunadhi and Segev is a straightforward extension
of the B+-tree for indexing append-only valid time data. In an AP-tree, leaf
nodes of the tree contain all the start times of a temporal relation. In a non-leaf
node, the pointer of each time value points to a child node in which this time value
is the smallest (this rule does not apply to the first child node of each
index node). The AP-tree is illustrated in Figure 4.4.
Since both the update of an existing record and insertion of a new version will
only cause incremental append to the database, every insertion to the AP-tree
[Figure content: an AP-tree of order 3 whose internal node holds time values 0, 3, 4 and 5; leaf entries t1, t3, t4 represent tuples with Ts=0, and t7, t10 represent tuples with Ts=3.]
Figure 4.4. An AP-tree structure of order 3.
will always be performed directly at the rightmost leaf node. All the subtrees
but the rightmost one of the AP-tree are 100% full. When the rightmost leaf
node is full, the node is not split, but instead a new rightmost leaf node is
created and is attached to the most appropriate ancestor node. Therefore, the
AP-tree may not be height-balanced. One such example is shown in Figure 4.5.
The AP-tree structure is simple and is small in the sense that it does not
maintain additional information about its data space. However, searching for
a record can be fairly inefficient. To search for a record whose interval falls
within a given time interval as in a time-slice query, the end time of the search
interval is used to get the leaf node that contains the record whose start time
is just before the search end time. From that node, the leaf nodes on its left
are scanned.
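The time-slice search just described can be sketched by modeling the AP-tree simply as its ordered leaf level; `bisect` stands in for the descent through internal nodes, and all names are illustrative assumptions.

```python
import bisect

# A sketch of the AP-tree time-slice search: locate the rightmost start time
# not after the search end time, then scan the leaf entries leftwards,
# keeping tuples whose intervals are still alive at the query start.

def time_slice(start_times, tuples_at, q_start, q_end, end_time_of):
    """Report tuples whose interval intersects [q_start, q_end].

    start_times: sorted distinct start times (the leaf level).
    tuples_at:   dict start_time -> list of tuple ids starting then.
    end_time_of: dict tuple id -> its end time.
    """
    pos = bisect.bisect_right(start_times, q_end) - 1
    result = []
    for i in range(pos, -1, -1):
        for tid in tuples_at[start_times[i]]:
            if end_time_of[tid] >= q_start:
                result.append(tid)
    return result
```

The leftward scan cannot be pruned by start times alone, since an arbitrarily early start time may still pair with a late end time; this is why the text calls the search fairly inefficient.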
To answer queries involving both key and time-slice, a two-level index tree
called the nested ST-Tree (NST) was proposed. The first level of an NST is
a B+-tree that indexes key values, and the second level is an AP-tree that
indexes temporal data that correspond to records with the same key value. In
the B+-tree, each leaf node entry has two pointers, with one pointing to the
current version of the record with this key, and the other pointing to the root
node of the AP subtree. A query involving only key value can directly access
the most recent version of the record through the B+-tree. Figure 4.6 shows the
structure of the NST. An index structure similar to the NST was also proposed
to index time-varying attributes and time. Since the temporal attribute is not
unique, the qualified tuples will have overlapping associated time intervals.
[Figure content: (a) insertion of start time 12 into a full AP-tree; (b) insertion of start times 13 and 14. A new rightmost leaf node is attached to an ancestor without rebalancing.]
Figure 4.5. Append in the AP-tree.
The AP-tree only supports monotonic appending with increasing time values.
Therefore, the variety of update operations that can be supported is limited.
The basic AP-tree itself can support queries involving only time-slice. Even
so, the search for time-slice queries is not efficient. A more expensive structure
such as the NST has to be used to answer key-time queries. Clearly, for
time-slice queries, it is more efficient to use the AP-tree than the NST.
On the other hand, for key-range time-slice and past-versions queries, the
NST is superior. We use the term AP-tree to refer to either of them,
and the context determines which structure we are referring to.
The Interval B-tree. The Interval B-tree [Ang and Tan, 1995], based on the
interval tree [Edelsbrunner, 1983], was proposed for indexing valid time intervals.
The underlying structure of the interval B-tree is a B+-tree constructed
from the end points of the valid time intervals.
The interval B-tree consists of three structures: primary structure, secondary
structure and tertiary structure. The primary structure is a B+-tree which is
[Figure content: a B+-tree for the key index, AP-trees for the time index, and the data tuples.]
Figure 4.6. A nested ST-tree structure.
used to index the end points of the valid time intervals. Initially, it has one
empty leaf node. New intervals are inserted into this leaf node. When it
overflows, a parent node of this leaf is created, and the middle value of the
points, say m, is passed up into the newly created index node. The valid time
intervals that fall to the left of m are placed in the left leaf bucket, and those falling to
the right of it are placed in the right leaf bucket. Intervals spanning m will be
stored in a secondary structure attached to m in the index node. Figure 4.7
shows the interval B-tree after inserting tuples t1, t2, t3 and t4 of Table 4.7.
Suppose the bucket capacity is 3. When t4 is inserted, the leaf bucket overflows,
and 6, the middle value of {0, 0, 5, 6, 7, now}, is chosen as the item for the
index node. The tuple t1 is stored in the left child of the new index node, while
t2, t3 and t4 are in the secondary structure of index item 6. At this moment,
the right leaf bucket is empty because no intervals fall to the right of 6.
[Figure content: the primary B+-tree with an index bucket holding item 6, a secondary structure containing t2[5, now], t3[0, 6] and t4[0, 7], and the left and right leaf buckets.]
Figure 4.7. An interval B-tree after inserting t1, t2, t3 and t4.
After the creation of the first index node, any further interval insertion will
proceed from the root node of the primary structure. If an interval spans over
an index item, it is attached to the secondary structure of this item. A long
valid time interval may span over several index items; however, it should be
attached to only one of them. The rule is as follows. All the items in an index
node are maintained as a binary search tree called the tertiary structure. The
first item that entered this index node is the root of the binary search tree, and
the subsequent items having smaller (larger) values will be in the left (right)
subtree. Thus, in this binary search tree, the first item found to be spanned by
the valid time interval is used to hold it. Figure 4.8 shows insertion of the rest
of the tuples in Table 4.7.
After insertion, the root of the binary tree in the tertiary structure is 6.
Suppose we have a tuple t16 with time interval [5, 15] to insert. Although the
period covers both 6 and 12 in the index node, since 6 is encountered first in
the binary tree of the tertiary structure, the tuple is attached to 6.
The efficiency of the index is heavily dependent on the distribution of data
and the values picked as index items. A poor choice of index values may cause most
of the intervals to be stored in the secondary structures, resulting in a small
B+-tree with large secondary structures.
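The attachment rule can be sketched as a walk over the tertiary binary search tree; the class and function names below are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of attaching an interval via the tertiary structure:
# index items form a binary search tree in arrival order, and an interval
# is attached to the first item it spans along the search path.

class TertiaryNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.secondary = []  # intervals attached to this index item

def insert_item(root, value):
    """Insert an index item, in arrival order, into the tertiary BST."""
    if root is None:
        return TertiaryNode(value)
    if value < root.value:
        root.left = insert_item(root.left, value)
    else:
        root.right = insert_item(root.right, value)
    return root

def attach_interval(root, start, end, tid):
    """Attach interval [start, end] to the first spanned item found."""
    node = root
    while node is not None:
        if start <= node.value <= end:
            node.secondary.append(tid)   # first spanned item wins
            return node.value
        node = node.left if end < node.value else node.right
    return None  # spans no index item; would go into a leaf bucket
```

With items 6 (inserted first, hence the root) and 12, the interval [5, 15] spans both but is attached to 6, matching the example in the text.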
[Figure content: the primary structure with index items 6 and 12, the tertiary binary search tree rooted at 6, secondary structures holding t2[5, now), t3[0, 6), t4[0, 7), t5[4, 6), t7[3, now), t10[3, 6) and t14[3, 8) at item 6, and t6[7, now), t8[10, now), t9[12, now) and t13[10, 12) at item 12, together with the leaf buckets.]
Figure 4.8. The interval B-tree after insertion of all tuples.
B+-tree with Linear Order. Temporal data can also be linearized so that
the B+-tree structure can be employed without any modification. Goh et al.
[Goh et al., 1996] adopted this approach, which involves three steps: mapping
temporal data into a two-dimensional space, linearizing the points, and building
a B+-tree on the ordered points.
In the first step, the temporal data is mapped into points in a triangular
two-dimensional space: a time interval [Ts, Te] is transformed to a point
[Ts, Te - Ts]. Figure 4.9 illustrates the transformation of the time intervals to the
spatial representation for the tourist relation. The x-axis denotes the discrete
time points in the interval [0, now], and the y-axis represents the time duration
of a tuple. The points on the line named time frontier represent tuples with
ending time of now. The time frontier will move dynamically along with the
progress of time.
In the second step, points in the two-dimensional space are mapped to a
one-dimensional space by defining a linear order on them. Given two points,
P1(x1, y1) and P2(x2, y2), the paper proposes three linear orders:
• D(iagonal)-order (<D). P1 <D P2 iff (a) (x1 + y1) < (x2 + y2); or (b) (x1 +
y1) = (x2 + y2) and x1 < x2.
[Figure content: the triangular two-dimensional space, with the x-axis giving discrete time points in [0, now] and the y-axis giving interval durations; a dashed time frontier marks tuples ending at now, with Tn lying outside it.]
Figure 4.9. Spatial representation of the tourist relation.
• V(ertical)-order (<V). P1 <V P2 iff (a) x2 + y2 = now and x1 < x2; or
(b) x1 + y1 ≠ now and x2 + y2 ≠ now and x1 < x2; or (c) x1 + y1 ≠ now
and x2 + y2 ≠ now and x1 = x2, and y1 < y2.
• H(orizontal)-order (<H). P1 <H P2 iff (a) x2 + y2 = now and y1 < y2; or
(b) x1 + y1 ≠ now and x2 + y2 ≠ now and y1 < y2; or (c) x1 + y1 ≠ now
and x2 + y2 ≠ now and y1 = y2, and x1 < x2.
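As a rough sketch, the interval-to-point mapping and the D-order can be realized as a sort key, with a plain sort standing in for the B+-tree built over the resulting total order; the V- and H-orders would be analogous comparison rules. Function names are illustrative assumptions.

```python
# Map each interval [Ts, Te] to the point (Ts, Te - Ts) and order points
# by the D(iagonal)-order: by the diagonal x + y, ties broken by x.

def to_point(ts, te):
    """Map a time interval to its triangular-space point."""
    return (ts, te - ts)

def d_order_key(point):
    """Sort key realizing the D-order."""
    x, y = point
    return (x + y, x)

# Example: intervals [0, 6], [5, 9] and [3, 6] map to points whose
# diagonals x + y equal the interval end times 6, 9 and 6.
points = [to_point(0, 6), to_point(5, 9), to_point(3, 6)]
ordered = sorted(points, key=d_order_key)
```

Since x + y = Ts + (Te - Ts) = Te, the D-order effectively sorts points by their end times, which is why queries on end times (such as "left on or after day 5") map to contiguous runs in the linearized space.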
[Figure content: the sweep patterns over the triangular space for (a) the D-order, (b) the V-order and (c) the H-order.]
Figure 4.10. The three orderings for points in the two-dimensional space.
Figure 4.10 provides a graphic representation of the three linear orders de-
fined above. Clearly, by linearizing the points using any of the above orders,
we can construct a B+-tree on the temporal data. For instance, if we order the
Figure 4.11. Organizing the spatial representation of the tourist relation using a B+-tree
and linearizing using the D-order.
points of the tourist relation using the D-order, the resultant B+-tree structure
is depicted in Figure 4.11.
A temporal query can be mapped to a spatial search on the two-dimensional
space, which in turn can be translated to a range search operation on the linear
space defined by the ordering relation. For example, consider the query "Find
all persons who left the United States on or after day 5." This query can be
efficiently handled by traversing the D-order B+-tree and retrieving all points in
the interval [(0,5), (14, 0)]. However, not all temporal queries can be efficiently
handled using the D-order. For example, consider the query "List all persons
who entered the United States on or before day 5". The D-order performs
poorly for this query, while the V-order is superior. The paper suggests that
different indexes (constructed using different ordering relations) be used to
support the various types of queries.
The main advantage of this method is the ease with which this indexing
scheme can be implemented using existing DBMSs. The performance analysis
shows that it is more efficient than the time index in terms of both storage
utilization and query efficiency. However, the index is more suitable for valid
times, which are mostly closed intervals. For data with open intervals, expen-
sive reorganization is necessary.
4.3.2 Spatial index based indexing methods
The R-tree. Unlike spatial applications where non-spatial data are usually
stored and indexed separately from spatial data, temporal attribute data such
as time-invariant key and time-varying key are indexed together with temporal
data. The time dimension can be viewed as one of the dimensions in a multi-
dimensional space and indexed using some existing methods [Rotem and Segev,
1987].
In this section, we discuss how the R-tree [Guttman, 1984] can be used to
index temporal data. The R-tree is a multi-dimensional generalization of the
B-tree that preserves the height-balance property. A detailed description of the
R-tree can be found in Chapter 2.
For temporal applications, to index temporal data and its key, the R-tree
can be implemented as a two-dimensional R-tree (2-D R-tree) or a three-
dimensional R-tree (3-D R-tree). To use a 2-D R-tree, time intervals [Ts, Te]
are treated as line segments in a two-dimensional space, with keys on the
other dimension. To index temporal data using a 3-D R-tree, the time intervals
and keys have to be mapped into points (key, Ts, Te) in a three-dimensional
space. Figure 4.12 shows examples of data partitioning for the tourist relation
(see Table 4.7).
Both implementations can handle the pure time query, key-time query and
pure key query of the query set. For the 2-D R-tree, all searches are performed
as intersection search. For the 3-D R-tree, search intervals must be mapped
into the search regions in the triangular space. Figure 4.13 shows the query
regions on the time dimension for the four search operations. As an example,
consider the intersection search. Let the query time interval be [QTs, QTe].
For an interval in the database to intersect the query interval, either its end
time must be in the interval or its start time must be in the interval. Thus, no
record with end time less than QTs needs to be considered, and no record with
start time after QTe needs to be examined. We then have the query region as
indicated by the shaded portion.
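The pruning argument above amounts to a rectangular query region in the (Ts, Te) plane, which can be sketched as follows; TMAX is an assumed stand-in for the largest representable time value.

```python
# A sketch of the intersection search for the 3-D mapping: a stored interval
# [Ts, Te] intersects the query [QTs, QTe] exactly when Te >= QTs and
# Ts <= QTe, i.e. the point (Ts, Te) lies in the shaded rectangular region
# of Figure 4.13(a).

TMAX = 10**9  # assumed upper bound on time values

def intersection_query_region(q_ts, q_te):
    """Return the search region as ((ts_lo, ts_hi), (te_lo, te_hi))."""
    return ((0, q_te), (q_ts, TMAX))

def intersects(ts, te, q_ts, q_te):
    """Test whether stored interval [ts, te] falls in the query region."""
    (ts_lo, ts_hi), (te_lo, te_hi) = intersection_query_region(q_ts, q_te)
    return ts_lo <= ts <= ts_hi and te_lo <= te <= te_hi
```

An R-tree would test node MBRs against this region to prune subtrees; the predicate here tests individual points for clarity.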
Here it is important to note that the R-tree cannot directly handle intervals
with open end-time. An entry in the internal node of the R-tree contains an
MBR that describes the data space of its child node. When data intervals are
not closed, the MBR cannot be defined properly, and this affects the splitting
algorithm, which makes use of space coverage to distribute the data into two
groups. It is possible to use the current time, or the largest time arising from
proactive insertions, as an estimate during node splitting and data insertion.
One of the characteristics of temporal databases is that the historical data
is stored for a long time, and no deletion of past data is allowed. The size
of the database grows as time progresses, and so do its indexes. Kolovson
and Stonebraker proposed variants [Kolovson, 1993, Kolovson and Stonebraker,
1991] of the R-tree to index historical data. The R-tree is used to index time
intervals on one dimension and non-temporal attribute on the other. Three
variants that store some of the nodes on optical disk were proposed. The
first variant (MD-RT) maintains the whole R-tree based index structure on
a magnetic disk. There is no migration from a magnetic disk to an optical
[Figure content: (a) tuples represented as lines in a two-dimensional space, with keys p1 to p9 on one axis and time 0 to now on the other; (b) tuples represented as points (key, Ts, Te) in a three-dimensional space.]
Figure 4.12. Space partitioning in the R-tree.
[Figure content: shaded query regions in the (Ts, Te) plane, bounded by QTs, QTe and Tmax, for (a) intersection search, (b) inclusion search, (c) containment search and (d) point search.]
Figure 4.13. Query regions for the R-tree on the time dimension.
disk needed. The second variant (MD/OD-RT-1) keeps the R-tree and its root
node on the magnetic disk, and moves the left-most part of the leaf nodes to
an optical disk when the size of the R-tree index reaches the pre-defined size. All
internal nodes, except the root node, whose child nodes are entirely on the
optical disk are recursively vacuumed to the optical disk.
The third variant (MD/OD-RT-2) maintains two R-trees, both rooted on
magnetic disk. The first resides entirely on the magnetic disk, whereas the
second stores the root node on the magnetic disk and the lower-level nodes
on the optical disk. When the size of the first R-tree reaches the expected size,
all the nodes below its root node are moved to the optical disk. Meanwhile, the
references of the first R-tree's root node are inserted into the proper position of
the second R-tree. The new records will be inserted into the first R-tree while
the search operations will be performed on both R-trees.
The data is stored in the leaf nodes, and nodes do overlap in their data
space for long intervals. In the case that the interval data collections have non-
uniform length distributions, overlap between bounding rectangles can be quite
severe due to some long intervals. To handle this shortcoming, the Segment R-
tree (SR-tree) [Kolovson and Stonebraker, 1991, Kolovson, 1993] was proposed.
The SR-tree stores interval records in both non-leaf nodes and leaf nodes. An
interval I is stored in the highest level node N of a tree if it spans at least one
of the intervals represented by N's child nodes. If an interval segment spans
the region covered by a node and extends the boundary of its parent node, it
will be cut into a spanning portion and one or more remnant portions. The
portions are stored in the separate parts of the index structure. Figure 4.14
shows the case in point.
[Figure content: a line segment P spans node C and extends A's boundary; it is cut into a spanning portion, stored higher in the tree, and a remnant portion.]
Figure 4.14. An SR-tree with spanning portion and remnant portion.
An improved version of the SR-tree, called the Skeleton SR-tree, was proposed
to pre-partition the entire domain of the interval data into several sub-regions
based on estimation of the number of data records and approximation of dis-
tribution of intervals. The overlap between data space of leaf nodes is reduced.
Such an estimation may be easy to derive for certain applications (for example,
video rental) that have little variation in version lifespan. For applications
with wide variance in interval lifespan, the pre-partitioning is not effective.
The Time-Polygon index. The Time-Polygon Index (TP-Index) was proposed
to index valid time databases [Shen et al., 1994]. Like the B+-tree
with linear order, the TP-Index maps the time interval [Ts, Te] into a point
[Ts, Te - Ts] in a triangular two-dimensional space. However, the triangular
temporal space is partitioned into groups such that each group is a cluster
of data points suited to a certain search pattern. Partitioning along the X- and
Y-dimensions, and parallel to the time frontier, produces five polygonal shapes
as shown in Figure 4.15. Polygons used in the TP-index are not minimum
bounding polygons. The polygons are derived through recursive partitioning,
and can be easily merged when the tree is collapsing. The structure of the TP-
index is like that of an R-tree. Figure 4.16 shows the partition of the temporal
space and the TP-tree structure of the tourist relation. To support proactive
additions of records (for example, Tn in Figure 4.16(a)), a virtual time frontier
that assumes the largest Te (Tmax) has to be introduced, and partitions that
are adjacent to the time frontier have to be extended outward.
A-shape B-shape C-shape D-shape E-shape
Figure 4.15. The five polygon shapes in TP-tree.
The TP-index was designed solely to index valid time and handle time-
slice queries. To enable the TP-index to support the time-invariant key, it is
extended to index data in a three-dimensional space [Jiang et al., 1996]. In the
data space, the x-axis and y-axis hold the same definitions as before; the z-axis
denotes the key values of the data points in the space (see Figure 4.17).
Initially, data points are bounded in the three-dimensional temporal space.
When overflow occurs, these data points are partitioned into groups such that
each group can be stored in one data page. Partitions must cluster the data
points to be suited for temporal search patterns. There are three partitions for
the TP-tree: y-partition introduces a plane parallel to the x-z plane (called the
[Figure content: (a) partitioning of the triangular temporal space for the tourist relation, with the time frontier (now) and Tn lying outside it; (b) the corresponding TP-tree structure, with data buckets for polygons 1 to 4.]
Figure 4.16. A TP-tree for the tourist relation.
[Figure content: the three-dimensional space, with the x-y plane and time frontier (now) as before and the z-axis holding the key dimension; data points are bounded within the space.]
Figure 4.17. A three-dimensional spatial rendition of the TP-tree.
y-plane); time-partition introduces a plane parallel to the time frontier (called
the time-plane); and key-partition introduces a plane parallel to the x-y plane
(called the key-plane). The y-partition and time-partition for different bound-
ing polygons are similar to those described in [Shen et al., 1994]. Note that
after the key-partition, the shapes of the resultant bounding polygons are the
same as those before the partitioning. Searching based on time is similar to that
proposed in [Shen et al., 1994], where the search time intervals must be mapped
into appropriate query regions. The query regions for the various search operations
on the time dimension are shown in Figure 4.18. For example, consider
the query interval [QTs, QTe] for an inclusion search. Since all matching
intervals must start from QTs, those intervals that start before QTs should be
excluded. Similarly, since the query interval ends at QTe, all intervals that end
after QTe should be excluded. The resultant query region is thus the shaded
region shown in the figure.
4.3.3 Methods for bi-temporal databases
Until recently, most research on temporal indexing has addressed the indexing
problem along only one of the two time dimensions. Kumar, Tsotras
and Faloutsos [Kumar et al., 1995] proposed two access methods, the Bitemporal
[Figure content: shaded query regions bounded by QTs, QTe and Tmax for (a) intersection search, (b) inclusion search, (c) containment search and (d) point search.]
Figure 4.18. Query regions for the TP-tree.
Interval Tree and Dual R-trees, for indexing both transaction and valid time
dimensions.
The Bitemporal Interval Tree makes use of the Interval Tree [Edelsbrunner, 1983]
to index a finite set U that contains V valid time points. An interval tree
consists of a full binary tree and a number of doubly-linked lists. The V time
points are in the leaves, and each internal node contains the middle value of its
two immediate children. If the starting point of an interval falls in the left
subtree of an internal node and the ending point falls in the right subtree, the
interval is stored in the doubly-linked lists associated with this internal node.
The left and right lists contain the starting and ending points respectively.
In the Bitemporal Interval Tree, the lists are transformed into "conceptual"
lists of pages to facilitate the splitting policies of the MVBT [Becker et al.,
1993] so as to answer the bitemporal pure-time-slice (BPT) query. By carefully
paginating the whole indexing structure, the index can answer a BPT query in
O(log_b V + log_b n + a) I/O operations.
The authors also proposed a method that employs two R-trees (2-R) to
divide bitemporal records on transaction time. This method aims to eliminate
the large overlapping of the mix of rectangles with known ending transaction
time and those extending to now. A front R-tree indexes the records whose
transaction time is up to now, whereas a back R-tree indexes the records whose
transaction time lifespan is closed.
[Figure content: (a) the original representation of the time dimensions, with records t1 and t2 open along the transaction time axis and t3 closed at transaction time 3; (b) the back R-tree, holding t3; (c) the front R-tree, holding t1 and t2.]
Figure 4.19. The two R-tree method.
In Figure 4.19(a), there are three records in the bitemporal space. Records
t1 and t2 have open transaction time lifespans, and the transaction time of t3 is
closed at time 3. Note that the three records overlap along the transaction time
axis. To avoid this kind of overlapping, and so improve the performance of
the R-tree, the dual R-tree method keeps the record with a closed transaction time
range, that is t3, in the back R-tree (Figure 4.19(b)) and the records with open
transaction time ranges, that is t1 and t2, in the front R-tree (Figure 4.19(c)).
In the front R-tree, a bitemporal record can be represented as an interval line
parallel to the valid time axis. As a result, the overlapping is reduced. A
bitemporal query is answered by two searches: one for rectangles in the back
R-tree and the other for intervals in the front R-tree. The front R-tree needs
a slightly more expensive search algorithm due to the open intervals.
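The record routing implied by the 2-R method can be sketched as follows, with plain lists standing in for the two R-trees; the migration step on closing a lifespan is our assumption about how such records would be handled, and all names are illustrative.

```python
# A hedged sketch of the 2-R method: records with a closed transaction-time
# range go to the back R-tree (as rectangles); records still open at `now`
# go to the front R-tree (as valid-time intervals at a start point).

NOW = None  # marker for an open transaction end time

back_tree = []   # (tid, (tt_start, tt_end), (vt_start, vt_end))
front_tree = []  # (tid, tt_start, (vt_start, vt_end))

def insert_record(tid, tt_start, tt_end, vt_start, vt_end):
    """Route a bitemporal record to the front or back structure."""
    if tt_end is NOW:
        front_tree.append((tid, tt_start, (vt_start, vt_end)))
    else:
        back_tree.append((tid, (tt_start, tt_end), (vt_start, vt_end)))

def close_record(tid, tt_end):
    """On closing a transaction lifespan, migrate front -> back (assumed)."""
    for i, (t, tt_start, vt) in enumerate(front_tree):
        if t == tid:
            del front_tree[i]
            back_tree.append((tid, (tt_start, tt_end), vt))
            return

def bitemporal_query(tt, vt):
    """Find records alive at transaction time tt and valid time vt."""
    hits = [tid for tid, (s, e), (vs, ve) in back_tree
            if s <= tt <= e and vs <= vt <= ve]            # rectangles
    hits += [tid for tid, s, (vs, ve) in front_tree
             if s <= tt and vs <= vt <= ve]                # open intervals
    return hits
```

The two list comprehensions mirror the two searches described in the text: a rectangle test on the back structure and an open-interval test on the front structure.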
While it is difficult to extend index structures such as the AP-tree and TSB-
tree for bitemporal indexing, the R-tree and the TP-tree can be extended with
additional dimensions. For example, a 5-D R-tree or TP-tree could be used to
index the time-invariant key, the transaction time interval and the valid time
interval. However, the extension entails the redesign of more complex node splitting
algorithms and query retrieval algorithms. With an increase in the number of
dimensions, spatial indexes may not perform as well.
4.4 Experimental study
Indexes are data structures that quickly identify the locations at which indexed
data items are stored. They are therefore used as a speed-up device in query
evaluation algorithms. Desirable properties for these indexes include efficient
storage utilization and efficient query retrieval. In other words, the use of disk
space should be efficient, which indirectly determines the query efficiency of
an index, and an index must be able to answer basic queries efficiently. In
addition, index construction and update costs should not be too high, although
they are often treated as less important selection factors.
Various performance studies have been conducted. The TP-index was shown to be
superior to the Time Index for valid time databases [Shen et al., 1994].
The result is expected, as replication in the Time Index can be severe,
resulting in a much bigger tree. The Interval B-tree was shown to be more
efficient than the Time Index and the R-tree [Ang and Tan, 1995]. It is argued
that the query efficiency of the interval tree is on the order of O(log n + F),
where F is the cost of reporting the intersections.
4.4.1 Implementation of index and buffer management
Four indexes, the TSB-tree, AP-tree, 2-D R-tree and TP-tree were implemented
in C on a SUN SPARC workstation. In this section, we restrict ourselves to
the study on the indexes built on time-invariant key and transaction time.
For a large collection of temporal data (such as one million versions), the
index size can become fairly large, and it is unlikely that the entirety of the
index fits in memory. Instead, some index pages will be paged out as the tree is
traversed, and have to be re-fetched at a later time when they are re-referenced.
To reduce page re-fetching, a priority-based buffer replacement strategy [Chan
et aI., 1992] is used. The strategy employs the least useful policy (LUF policy)
and has been designed based on the way an index is traversed. For a fair
comparison, the replacement algorithm was extended for the two-level NST
index structure. Under the strategy, priorities are assigned to index pages. An
index page is useful if it will be referenced again in a traversal of an index
structure; otherwise, the index page is useless in the current traversal. Useful
pages have higher priorities than useless pages. As the main concern of the work
is minimizing the effect of page re-fetching on the performance comparison, the
buffer size was fixed at 32 pages, which is sufficient for traversing trees
with heights of up to 5 levels.
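The policy can be illustrated with a small sketch. This is only our toy reconstruction of the idea, not the actual LUF implementation of [Chan et al., 1992]: pages marked useless are evicted before any useful page, with least-recently-used order breaking ties, and the class and method names are invented.

```python
class Buffer:
    """Toy priority-based buffer in the spirit of the LUF policy:
    a useless page is always evicted before a useful one."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.pages = {}    # page id -> useful flag
        self.order = []    # access order, oldest first
        self.fetches = 0   # number of (re-)fetches from disk

    def access(self, page, useful):
        if page not in self.pages:
            self.fetches += 1                 # page must be (re-)fetched
            if len(self.pages) >= self.capacity:
                # evict the oldest useless page if one exists, else plain LRU
                victim = next((p for p in self.order if not self.pages[p]),
                              self.order[0])
                self.order.remove(victim)
                del self.pages[victim]
        else:
            self.order.remove(page)           # buffer hit: refresh recency
        self.pages[page] = useful
        self.order.append(page)
```

With a two-page buffer, accessing a useful root and two useless leaves evicts the older leaf rather than the root, so a later access to the root incurs no re-fetch.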
4.4.2 Data and query sets
The data sets employed in the study were generated using an extended version
of the Time-Integrated Testbed of the Department of Computer Science,
University of Arizona. The temporal relations were generated using Poisson
distributions with different mean values in arrival time (start time of an in-
terval) and version lifespan. Each database contains 1,000,000 versions. The
time-invariant attribute is uniformly distributed over [1,10000], and the number
of versions per key is randomly determined. For each version, its time-varying
attribute value is uniformly distributed in [1, 100000]. For each different set of
mean arrival and duration time, the data is generated with the constraint that
simulates transaction time. The data is generated in one go and pre-sorted
based on the start time. Each tuple is then inserted into the index. By doing
so, we did not have to modify the existing R-tree splitting algorithm. This is not
ideal as the latest versions of transaction time data give rise to open rather
than closed intervals. However, apart from the R-tree, the presence or absence
of open intervals does not affect the other three indexes.
Among the basic queries, we shall look at just two of them: time-slice in-
tersection queries and key-range time-slice intersection queries. Being more
general, an intersection query is expected to yield more results than the inclu-
sion, containment and point queries.
Each set of queries contains 100 queries with different keys and time ranges.
The keys are randomly picked from the key domain, that is, [1,10000]. Where there
is a key-range search, a predetermined fixed range is used to determine the
end of the range. The starting time of each time range is generated using the
Poisson distribution, together with a fixed range. Should the ending time exceed
the current time, the ending time is set to the current time.
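The generation procedure can be sketched as follows. This is an illustrative reconstruction, not the testbed's actual code; the function name and the default values not stated in the text (such as the current-time bound and the seed) are our own assumptions, and the Poisson arrival process is simulated by exponentially distributed gaps between start times.

```python
import random

def generate_queries(n_queries=100, key_domain=10000, key_range=1000,
                     mean_gap=5.0, time_range=15000, current_time=5_000_000,
                     seed=1):
    """Generate key-range time-slice queries as described in the text."""
    rng = random.Random(seed)
    queries, t = [], 0.0
    for _ in range(n_queries):
        key_lo = rng.randint(1, key_domain)        # key uniform over [1, 10000]
        key_hi = key_lo + key_range                # predetermined fixed key range
        t += rng.expovariate(1.0 / mean_gap)       # Poisson arrivals: exponential gaps
        t_end = min(t + time_range, current_time)  # clamp ending time to current time
        queries.append((key_lo, key_hi, t, t_end))
    return queries
```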
4.4.3 Some experimental results on indexing invariant keys and transaction
time
We report on some experimental results on the performance of the indexes
that are built on the time-invariant key and transaction time. For time-slice
intersection queries, the mean inter-arrival time is fixed at λ = 5, and the
mean duration times are fixed at μ = 200, 500, 1000. For key-range time-slice
intersection queries, the key range is fixed at 1000 (10% of domain) when the
effect of time range is studied, and the time range is fixed at 15000 when the
effect of key range is studied.
On time-slice intersection query. Figures 4.20a and b show the perfor-
mance of the TSB-tree, AP-tree, TP-tree and (2-D) R-tree for time-slice inter-
section search queries under mean inter-arrival times of 2 and 5, and a fixed
mean duration time of 200. Figure 4.21 shows the effect of longer lifespan on
the four indexes. The performance of all four indexes is affected by the search
time range used in the query - the longer the search range the worse the
performance.
A comparison of the results summarized in Figures 4.20 and 4.21 reveals that
while the mean duration time has little effect on a few indexes, the inter-arrival
time has significant effect on the performance of most indexes. Longer mean
inter-arrival time means less overlap in time intervals. For indexes such as
the TSB-tree and TP-tree, shorter inter-arrival times mean time intervals of
different keys are clustered closely, and the same search range intersects more
intervals and hence more pages are accessed. The performance of the 2-D R-tree
and the AP-tree is affected by the duration of time intervals. For the
R-tree which indexes time intervals as line segments, the degraded performance
is due to the fact that the minimum bounding rectangles (MBR) in the internal
nodes have more overlap for longer line segments. For the AP-tree, the opposite
effect is observed. Two factors contributed to this. First, (recall that) the data
set is non-overlapping for each key value. Second, a longer duration essentially
"stretches" the lifespan of the relation. As a result, the number of nodes to be
scanned by AP-tree is smaller for longer duration for the same query range.
It is clear that the TSB-tree performs the best. This can be attributed to
the fact that the TSB-tree has a high degree of data clustering in both key
and time dimensions. On the contrary, the AP-tree is inferior to all the other
techniques. Its page accesses exceed 2500 pages! This is because, to search for
the intervals intersecting the query interval [Ts, Te] in the AP-tree, a leaf
node is first determined using Te. All leaf nodes on its right, which contain
intervals whose start time is larger than Te, are ignored. Leaf nodes on its left
must be searched.
[Plots of page accesses against query time interval (1000 to 30000) for the TSB-tree, R-tree, TP-tree and AP-tree: (a) (λ, μ) = (2, 200); (b) (λ, μ) = (5, 200).]
Figure 4.20. Effect of arrival rate on time-slice intersection query.
[Plot of page accesses against query time interval (1000 to 30000) for the TSB-tree, AP-tree, TP-tree and R-tree.]
Figure 4.21. Effect of longer lifespan on time-slice intersection query, (λ, μ) = (5, 500).
On key-range time-slice intersection query. The results for the key-
range time-slice intersection queries are very similar to that for the time-slice
queries. Here, we shall present the results when (λ, μ) = (5, 200).
In order to see the effect of key range, the query time range is kept constant,
and similarly, to see the effect of the query time range, the key range is fixed.
Figure 4.22a shows the result when the key range is fixed at 1000, while Fig-
ure 4.22b looks at the effect of varying the key range when the time range is
fixed at 15000 time units. Like the time-slice query results, it can be observed
that the AP-tree is also more expensive than the others due to its two-level
structure. With such a structure, each AP-tree in the second level of the nested
structure is small, and many of such small trees must be searched. It can be
seen also that the AP-tree is more sensitive to the key ranges than time ranges
(see Figure 4.22b). This is logical since the first level of the nested structure
is the B+-tree for keys and the key range determines the number of AP-trees
in the second level that need to be searched. As the key range increases, the
performance deteriorates. Whereas for a fixed time range, the average number
of leaf nodes that need to be searched does not differ greatly.
The TSB-tree retains its good performance in key-range time-slice query
because of its high degree of data clustering in both key and time dimensions.
[Plots of page accesses for the TSB-tree, AP-tree, R-tree and TP-tree: (a) against time interval (1000 to 30000) with the key range fixed at 1000; (b) against key range (100 to 5000) with the time range fixed at 15000.]
Figure 4.22. Performance of intersection search in key-range time-slice query, (λ, μ) = (5, 200).
To answer past versions query efficiently, it is important to cluster data by
the time-invariant key in an indexing structure. By linking all the past versions
of a given key together, the best performance of this query can be expected.
However, although the TSB-tree, AP-tree, R-tree and TP-tree do have some
features of data clustering by key, none of them provides an explicit method to
link the historical versions of a given key. Hence, a search based on the key is
required. Among these four indexes, the AP-tree is likely to be more efficient
for the past versions query. For each key that satisfies the search condition, the
whole second level AP-tree is retrieved for all the versions.
4.5 Summary
In this chapter, we have surveyed a number of promising temporal indexes.
Many of these indexes were proposed for either valid time or transaction time
databases. Researchers have only recently started to work on indexing bitemporal
databases. For transaction time databases, the TSB-tree approach is very efficient
as it manages to keep the volume of I/O accesses low and uses tight bounding
intervals to support fast search. However, it cannot handle disjoint intervals (or
overlapping intervals) that may be present in the valid time databases. Direct
application of B-trees such as the AP-tree by indexing on a single time point
(starting or ending) is efficient in terms of storage space but is not efficient
for any search that involves intervals. Its inefficiency is due to the fact that no
information of the actual data space in the child nodes is captured for pruning
the search space. Hence, a simple time-slice search requires the scanning of a
large proportion of leaf nodes.
Spatial indexes such as the R-tree can be used for indexing both transaction
times and valid times. To index open intervals that move with current time
NOW, splitting algorithms that split nodes based on area of data space must
be re-designed to handle the situation where one side of the MBR is moving
with time. The R-tree can be used to index temporal data as line segments
or points. As indicated by the experiments, the performance of the R-tree
indexing lines is not as good as that of the TP-tree. However, should the lines
be mapped into points, its efficiency should become comparable to that of the
TP-tree.
Like other applications, data distribution affects the performance of tem-
poral indexes. For bitemporal databases, different distributions may exist for
the time-invariant keys, time-varying keys, the number of versions per key,
the arrival of new time-invariant keys, and for each key, the arrival of next
transaction-time versions and next valid-time versions, and for the relationship
between the two times, such as whether they are strongly bound [Jensen and
Snodgrass, 1994]. Generally, the distribution of time-invariant keys is likely to
be dependent on the applications, where they can be mapped into some sequential order.
Likewise, the distribution of time-varying keys is fairly dependent on the
application: some may be in increasing order (for example, salary) while others
are likely to be more random. The arrival of new keys and the arrival of new
versions tend to follow a Poisson distribution.
5 TEXT DATABASES
Text databases provide rapid access to collections of digital documents. Such
databases have become ubiquitous: text search engines underlie the online text
repositories accessible via the Web and are central to digital libraries and online
corporate document management.
Perhaps the key feature distinguishing text databases from other kinds of
database is the way in which they are accessed. Queries to conventional
databases are exact logical expressions used to satisfy information needs such
as "how many accounts have a negative balance" or "which students are en-
rolled in computer science". In contrast, queries to text databases are used
to satisfy inexact information needs such as "what is the economic impact of
recycling" or "what factors led to George Bush's loss in the 1992 presidential
election". This inexactness is not because users are unable to express needs
precisely; it is because the needs deal with imprecise real-world concepts that
cannot be described in a formal system. That is, it is usually not possible to
translate such information needs into a logical query expression that will fetch
only the documents that are answers-an information need and its answers are
not mathematically related.
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
Thus there is no exact mechanism for determining whether a document is
an answer; instead, queries to text databases are used to identify documents
that are likely to be pertinent to the query, that is, likely to be relevant. These
documents may even contradict each other-commentators may disagree as to
why Bush lost the election, for example. Thus document databases must be
designed to answer informal queries and produce the most likely answers. The
study of techniques for identifying documents that are relevant to an informa-
tion need is known as information retrieval.
Since answers have only a loose, informal correspondence to queries it follows
that the performance of query evaluation techniques is not just a consequence of
how fast they are or how economical they are with system resources. It is also
necessary to consider how good they are at identifying relevant documents,
that is, their effectiveness. The effectiveness of query evaluation techniques
can be formally measured by the proportion of retrieved documents that are
relevant and by the proportion of the relevant documents that are retrieved;
determination of relevance must be made by a human assessor. (It follows
that experiments in information retrieval are expensive, and tend to rely on
standard document collections and query sets for which relevance judgments
have been made.)
Text databases can also be used for more traditional forms of access to data.
For example, in a database of newspaper articles each document will include
the article's text, but will also include information such as authorship, date of
creation, and so on. A possible entry in a database of correspondence is shown
in Figure 5.1. Fields such as date could be queried in conventional ways and do
not require exotic query evaluation methods. It is the use of informal querying
that makes information retrieval systems different to other kinds of database.
In this chapter we describe the ways in which text databases might be ac-
cessed, kinds of queries, index structures to support these queries, and query
evaluation techniques.
5.1 Querying text databases
Simple text engines are familiar to anyone who uses the document repositories
available via the web. These engines can be used to find information about,
say, some individual-to find their home page perhaps-or to search for re-
search papers on a given topic. Typical queries are a list of keywords that the
user guesses will identify the desired information; the system responds with a
list of hits, some of which are relevant and some of which are (in the context
of the query) obviously junk. Based on information retrieval theory, the
better systems use effective query evaluation techniques that return relatively
few irrelevant documents.
From: Albert Einstein
Sender address: Old Grove Rd, Nassau Point, Peconic, Long Island
To: F.D. Roosevelt, President of the United States
Recipient address: White House, Washington D.C.
Date: 2nd August 1939
Sir:
Some recent work by E. Fermi and L. Szilard, which has been
communicated to me in manuscript, leads me to expect that the element
uranium may be turned into a new and important source of energy in
the immediate future. Certain aspects of the situation seem to call for
watchfulness and, if necessary, quick action on the part of the
administration. I believe, therefore, that it is my duty to bring to your
attention the following facts and recommendations.
In the course of the last four months it has been made
probable-through the work of Joliot in France as well as Fermi and
Szilard in America-that it may become possible to set up nuclear chain
reactions in a large mass of uranium, by which vast amounts of power
and large quantities of new radium-like elements would be generated.
Now it appears almost certain that this could be achieved in the
immediate future.
This new phenomenon would also lead to the construction of bombs, and
it is conceivable-though much less certain-that extremely powerful
bombs of a new type may thus be constructed ...
Figure 5.1. Example entry in newspaper database.
At the most abstract level, text databases are like conventional databases:
given a query, each entry in the database is compared to the query to determine
whether it is an answer. To allow this process to be efficient a data structure
known as an index is used. Central to effective information retrieval is the
ability to use all the terms (that is, words) in a document to compare it to a
query. That is, it is necessary to index every term in every document.
It is possible to automatically select a subset of the words in a document
to represent its content and to index these words only, or to manually assign
descriptive words or subject categories. However, automatic selection of
keywords is in general not successful; and, perhaps surprisingly, automatic indexing
of all words gives more effective retrieval than does manual indexing [Salton,
1989]. Moreover the cost of manual indexing for a realistically-sized database
is prohibitive. Thus searches on document databases use content-the full text
of each document-rather than descriptors of some kind.
5.1.1 Boolean queries
There are two principal approaches to querying text databases: Boolean and
ranked. Boolean query languages were for many years chosen for commercial
information retrieval systems. The basic concept is straightforward-queries
are Boolean expressions in which the atoms are words and are combined with
Boolean operators. For example the query
uranium AND
( (nuclear AND energy) OR (atomic AND bomb) )
could be used to retrieve the example document in Figure 5.1. Such queries are
effectively equivalent to conventional database queries (and as we discuss below
are evaluated in a similar way) but it is not easy for a typical user to translate
an information need into a Boolean query. Making good use of Boolean infor-
mation retrieval systems requires professional information providers who are
experts at interpreting user requests and translating them into formal queries.
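With each term mapped to the set of documents containing it, Boolean evaluation reduces to set operations. The following Python sketch (with invented document numbers) evaluates the query above:

```python
# Inverted index mapping each term to the set of documents containing it;
# the document numbers are illustrative only.
index = {
    "uranium": {3, 10, 12, 29},
    "nuclear": {5, 12, 29},
    "energy":  {12, 40},
    "atomic":  {7, 29},
    "bomb":    {29, 51},
}

def docs(term):
    """Documents containing a term (the empty set for unknown terms)."""
    return index.get(term, set())

# uranium AND ((nuclear AND energy) OR (atomic AND bomb))
answers = docs("uranium") & ((docs("nuclear") & docs("energy")) |
                             (docs("atomic") & docs("bomb")))
print(sorted(answers))  # prints [12, 29]
```

AND maps to set intersection and OR to set union, which is exactly how such queries are evaluated over inverted lists (as discussed later in the chapter).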
There are several ways in which Boolean query languages for text retrieval
can be extended to give the potential for better effectiveness. One extension of
particular value to English text is stemming or suffixing. In its simplest form,
suffixing allows partial match on strings, so that for example the query term
bomb*
would match any word starting with the string bomb. This allows users to match
variant forms of the same word, such as bomb, bombs, bombing, bombardier,
and so on. Alternatively automatic stemmers can be used; these are algorithms
that recognize the standard suffixes used in English (such as -ed, -es, -ation,
and -ness) and remove them prior to indexing [Harman, 1991, Lovins, 1968,
Porter, 1980]. Stemming is a form of word normalization; another, basic form
is case conversion.
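Both ideas can be illustrated with a toy Python sketch. This is far cruder than the published Lovins or Porter stemmers; the suffix list, the minimum-stem-length rule, and the function names are our own simplifications.

```python
# Suffixes checked longest-first so that "-ation" is tried before "-s".
SUFFIXES = ("ation", "ness", "ing", "ed", "es", "s")

def stem(word):
    """Toy suffix stripper: case conversion followed by suffix removal."""
    word = word.lower()                      # basic normalization
    for suf in SUFFIXES:
        # keep at least a 3-letter stem so "sing" is not reduced to "s"
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def matches(pattern, word):
    """Partial string match in the style of the query term bomb*."""
    if pattern.endswith("*"):
        return word.lower().startswith(pattern[:-1].lower())
    return word.lower() == pattern.lower()
```

For example, `stem` maps bombed, bombs and bombing all to bomb, while the wildcard `matches("bomb*", ...)` also accepts bombardier, just as in the example above.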
Another language extension is to allow querying on word proximity, and
in particular adjacency. In the query above, there was no requirement that
nuclear and energy be nearby in the text. If it is specified in the query
that they must be proximate or adjacent then it is more likely that retrieved
documents will contain these words as a phrase. The Boolean query languages
used in commercial text databases, such as the ISO standard 8777 or Common
Command Language, allow the user to require that two words are to be located
within any fixed number of word positions from each other.
Well-designed interfaces can also help to improve effectiveness, by for ex-
ample providing access to an online thesaurus that can be used to expand
the query. Such extensions however have no impact on the underlying query
evaluation mechanism.
5.1.2 Ranked queries
The other principal approach to text retrieval is ranking, in which a query
is an expression in natural language or a list of keywords; each document is
compared to the query and assigned a numerical similarity; and the documents
with the highest similarity values are retrieved for presentation to the user. In
contrast to Boolean queries, there is no precise delineation between answers
and non-answers; potentially every document in the database has a non-zero
similarity but only the first few documents presented for viewing (or, in the
case of information filtering [Belkin and Croft, 1992], those above a chosen
threshold) are seen by the user. There is a probabilistic assumption that the
highest-ranked documents are those most likely to be relevant; thus as the user
moves through the list of ranked documents the density of relevant documents
should diminish.
In many contexts ranked queries are simply lists of keywords, but in others
they may be substantial blocks of text. For example, the abstract of a paper-
or even a whole paper-could be used as a query to find other papers with
a similar topic; experiments with ranking have shown that longer queries are
better at identifying relevant documents. Thus a typical query might be a list
of keywords such as
nuclear atomic energy power
or a natural language description such as
Relevant documents will discuss the use of nuclear or atomic
energy as a power source.
The functions used to score documents with respect to queries are known as
similarity measures. Many years of information retrieval experiments, with
both small document collections and databases of gigabytes of text, have iden-
tified several families of effective similarity measures. (These experiments have
also shown that ranking is typically more effective than Boolean retrieval, even
for queries formulated by an expert.) We do not survey similarity measures
in this chapter, but instead illustratively focus on one: the cosine measure.
This measure is one of the most effective and has proven successful across a
wide range of databases, and is interesting because it makes use of at least as
much index information as other effective similarity measures. Discussion of
the cosine measure thus allows us to explain what information an index must
store.
Intuitively, we would like a document and query to be regarded as similar
if: most of the query terms occur in the document; they are frequent in the
document; the density of these words in the document is high; some allowance
is made for the "importance" of words, where one would usually regard a word
such as uranium to be more discriminating (and therefore more important)
than a word such as the. Mathematically these concepts can be captured as
follows. The cosine similarity of a document d and query q can be computed as

    C(q, d) = ( Σ_{t ∈ q ∩ d} w_{q,t} · w_{d,t} ) / ( W_q · W_d )

where w_{x,t} is the importance of word t in x and W_x is the length of x. In this
formulation of the cosine measure it can be seen that the numerator is high if
important words (that is, high w_{x,t} words) are in both query and document,
and that division by length ensures C(q, d) is high only if the document is dense
with query terms. Thus, given two documents containing the same query terms
with the same frequencies, the shorter of the two will have higher similarity.
Word importance is an abstract concept, but in practical ranking is effectively
captured by the formulations

    w_{q,t} = (log f_{q,t} + 1) · (log (N / f_t) + 1)    and
    w_{d,t} = log f_{d,t} + 1

Here f_{x,t} is the frequency of occurrence of t in x (that is, the number of times
term t occurs in document or query x) and there are N documents in the
database, of which f_t contain t. Thus a word that is rare in the collection
(that is, has a high inverse document frequency) or frequent in either query or
document attracts a high weight. The lengths are usually computed as

    W_d = sqrt( Σ_{t ∈ d} w_{d,t}² )

so that length is essentially a function of the number of distinct words. Note
that for a given query Wq is a constant and thus has no impact on the ranking
and is not calculated.
In principle, then, query evaluation for a query q consists of computing the
similarity C(q, d) for every document d in the database, then returning to the
user the documents with highest similarity.
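This process can be sketched directly from the formulations above. The Python code below is an illustration of ours (the names are invented); it uses natural logarithms, and the constant W_q is omitted since it does not affect the ordering.

```python
import math
from collections import Counter

def rank(query_terms, documents):
    """Rank documents (lists of terms) against a query by the cosine measure."""
    N = len(documents)
    doc_freqs = [Counter(d) for d in documents]      # f_{d,t} for each document
    f = Counter(t for c in doc_freqs for t in c)     # f_t: documents containing t
    q = Counter(query_terms)                         # f_{q,t}
    scores = []
    for d, fd in enumerate(doc_freqs):
        # document length W_d = sqrt( sum of w_{d,t}^2 over terms in d )
        W_d = math.sqrt(sum((math.log(ft) + 1) ** 2 for ft in fd.values()))
        num = 0.0
        for t, fqt in q.items():
            if t in fd:                              # sum over t in q and d
                w_qt = (math.log(fqt) + 1) * (math.log(N / f[t]) + 1)
                w_dt = math.log(fd[t]) + 1
                num += w_qt * w_dt
        scores.append((num / W_d if W_d else 0.0, d))
    return sorted(scores, reverse=True)              # highest similarity first
```

On a toy collection, a short document containing both query terms outranks a longer one containing only one of them, as the discussion of density above predicts.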
As for queries to traditional databases, it is valuable to try and improve a
ranked query before evaluating it, by removing noise and transforming it into a
better description of the information need. In particular, stopwords are usually
removed; these are frequent, non-discriminating words such as the and closed-
class or function words such as however that carry no meaning. Elimination
of stopwords has little impact on effectiveness but is important for efficiency,
because these words are so common. After stopping the query above might be
transformed to
Relevant documents discuss nuclear atomic energy power source
Stemming is as valuable for ranking as it is for Boolean queries, for example
yielding
relev document discus nuclear atom energ power source
for the query above. Elementary natural language techniques can also prove
valuable; such techniques include recognition and deletion of key phrases, such
as "we discuss" or "in this paper", and recognition of proper names and aliases,
so that for example "USA" and "United States" are indexed together. However,
while such techniques change the set of terms available for indexing, they do
not change the methods used to construct an index or to retrieve documents.
For further information on ranking and information retrieval, there are sev-
eral good textbooks [Frakes and Baeza-Yates, 1992, Salton, 1989, Salton and
McGill, 1983, van Rijsbergen, 1979, Witten et al., 1994]. Recent research de-
velopments in the area are presented in special issues of Communications of
the ACM [Fox, 1995] and Information Processing and Management [Harman,
1995a].
5.1.3 Indexing needs
The needs of querying determine the kinds of information that must be held
in an index. For both Boolean and ranked queries, the index must store every
distinct word occurring in the database and, for each word, the documents
the word occurs in. To support proximity queries the index must store the
positions at which each word occurs in each document; ordinal word numbers
are more useful than byte positions. To support ranked queries the index
must store the frequency of each word in each document. As we discuss later,
richer kinds of queries may require information about document structure. In
the following sections we describe index structures that have proved successful
for text databases, then explain query evaluation techniques that use these
structures.
5.2 Indexing
5.2.1 Inverted indexes
An index is a data structure for supporting a query evaluation technique. The
most commonly used structures for indexing text databases are inverted indexes,
[Diagram: a lexicon whose entries point to inverted lists of ordinal document numbers (for example 10, 29, 41), a mapping table, and the documents themselves.]
Figure 5.2. Arrangement of a simple inverted file.
a family of structures that can be readily adapted to each of the kinds of query-
ing discussed above. Inverted indexes are well-established-they have been used
in commercial text retrieval systems since before 1970-and in recent years re-
finements to inverted indexing have dramatically improved performance.
In outline an inverted index is extremely simple, consisting of a lexicon of the
distinct words to be indexed and for each word an inverted list of information
about that word. The lexicon must be organized to allow fast search for a
given word and each list should allow rapid processing to identify matching
documents. Thus in the most basic case the lexicon could be stored as an array
of words and each list as an array of ordinal document numbers. A mapping
table, also stored in an array, can then be used to map from document numbers
to matching documents. This arrangement is illustrated in Figure 5.2.
For example, each of the three query terms nuclear, energy, and uranium
has an entry in the lexicon (found, say, by binary search in the array) and
a corresponding pointer to the inverted list. Each list contains the document
number 12; the twelfth position in the mapping table thus points to a document
containing all of the query terms.
5.2.2 Search structures
For conventional databases, design of the search structure is crucial to perfor-
mance. For text databases, the major bottleneck is usually the fetching and
processing of the inverted lists, and any structure that allows reasonably fast
access to the distinct words of the database is likely to be satisfactory.
A typical arrangement would be to use a B-tree in which internal nodes con-
tain words and pointers to children and external leaves contain words, pointers
to inverted lists, and for each word the number of documents in the database
containing the word. For many text databases such a B-tree could easily be
held in memory, but the arrangement is also effective if space considerations
force B-tree nodes out to disk. Use of a B-tree means that the words can be
accessed in lexicographic order, allowing users to scan the lexicon and placing
words with the same root but variant suffixes together. If the lexicon is not
too large it is feasible to scan it for the strings that match a given pattern.
Other search structures have been proposed for lexicons but none offers any
clear advantage, while the logarithmic worst-case performance and good space
utilization of B-trees make them a desirable choice.
As a concrete example consider the database consisting of the 3 Gb of text
used in the first three years of the ongoing TREC information retrieval
experiment [Harman, 1992, Harman, 1995b]. This database contains just over
1,000,000 documents, and, coincidentally, just over 1,000,000 distinct words
at an average of about 9 characters each. There are around 480 × 10^6 word
occurrences in total or, discounting repetitions of words within a document,
there are 220 × 10^6 word-document pairs. (Note that figures of this kind are
to a certain extent dependent on how words are defined: whether punctuation
such as apostrophes are part of words or delimit them, for example, or whether
words are distinguished by case.)
Thus the complete TREC lexicon can be stored in the leaves of a B-tree of
around 20 to 24 megabytes, given 9 bytes for each word, 4 bytes each for a count
and a pointer, and making an allowance for space wastage. Assuming a block
size of 8 kilobytes, and therefore a branching factor of 2^8 to 2^9, the total space
for all internal nodes of the B-tree would occupy no more than 128 kilobytes
and thus even in the worst case only the leaves need be held on disk. In a basic
representation the inverted lists would contain 220 × 10^6 document identifiers of
four bytes each, or a little under 1 gigabyte in total. This high ratio of inverted
list size to lexicon size is typical of text databases, and is the reason that, in
contrast to other database applications, inverted lists are not stored directly
in the B-tree: their size would prohibit scanning of the lexicon.
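The arithmetic behind these estimates can be checked directly. In this sketch the per-entry byte counts come from the text, while the 30% wastage allowance is our own assumption.

```python
import math

# Back-of-envelope check of the lexicon figures quoted above. The per-entry
# byte counts come from the text; the 30% wastage allowance is an assumption.
WORDS = 1_000_000
LEAF_ENTRY = 9 + 4 + 4          # word + document count + list pointer, bytes
BLOCK = 8 * 1024                # 8 kilobyte B-tree blocks
WASTAGE = 1.3

leaf_bytes = WORDS * LEAF_ENTRY * WASTAGE
fanout = int(BLOCK / ((9 + 4) * WASTAGE))   # internal entry: word + child pointer
leaf_blocks = math.ceil(leaf_bytes / BLOCK)
internal_blocks = math.ceil(leaf_blocks / fanout)

print(f"leaves: ~{leaf_bytes / 2**20:.1f} MB in {leaf_blocks} blocks")
print(f"branching factor ~{fanout}; internal nodes: {internal_blocks * BLOCK // 1024} KB")
```

With these assumptions the leaves come to roughly 21 megabytes, the branching factor falls between 2^8 and 2^9, and a single level of internal nodes fits comfortably under the 128 kilobyte bound.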
5.2.3 Inverted lists
A basic inverted list consists of a series of document identifiers, as illustrated
in Figure 5.2. But such a list does not support the kinds of queries discussed
above; ranking requires word frequencies and proximity requires word positions.
Addition of frequency information to a list is straightforward: each docu-
ment identifier is followed by a frequency count for that word in that document.
Addition of word positions is only a little more difficult, but can add consider-
ably to index size: each document identifier is followed by a frequency count f,
then by f ordinal word positions. Thus the inverted list for uranium might be
3:1(61),
10:2(14,106),
12:1(9),
29:4(22,36,98,202), ...
representing that the word uranium occurs in document 3 once, at position 61;
in document 10 twice, at positions 14 and 106; in document 12 once, at position
9; and so on. The punctuation is of course only for the benefit of the reader;
the list is stored as the sequence
3 1 61 10 2 14 106 12 1 9 29 4 22 36 98 202 ...
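A sketch of how such a flat sequence is parsed back into document-frequency-positions entries, using the uranium list from the text:

```python
def decode_list(flat):
    """Parse a flat inverted list with positions back into
    (document number, frequency, positions) triples."""
    entries, i = [], 0
    while i < len(flat):
        doc, freq = flat[i], flat[i + 1]
        entries.append((doc, freq, flat[i + 2 : i + 2 + freq]))
        i += 2 + freq
    return entries

# The uranium list from the text, as stored
flat = [3, 1, 61, 10, 2, 14, 106, 12, 1, 9, 29, 4, 22, 36, 98, 202]
for doc, freq, pos in decode_list(flat):
    print(f"{doc}:{freq}({','.join(map(str, pos))})")
```

The frequency field doubles as the count of position entries to consume, which is what makes the punctuation-free representation decodable.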
For the 3 gigabytes of TREC data discussed above the index would contain
220 x 10^6 document identifiers, 220 x 10^6 frequencies, and 480 x 10^6
positions.
Query processing (explained in detail below) involves retrieving the inverted
list corresponding to each term in the query, then processing the list to extract
document numbers and, if necessary, frequencies and positions. A typical query
term occurs in up to 1% of the stored documents, and may occur in many more,
so in a larger collection the typical retrieved inverted list will contain thousands
or tens of thousands of document identifiers. Fetching and processing of these
lists is the major bottleneck in query evaluation, and any improvement can
yield big reductions in query evaluation time.
The first issue to address is the physical layout of the inverted lists on disk.
The two costs of accessing data from disk are the head-positioning time (seek
and latency) and the per-bit transfer costs. A programmer cannot directly
improve transfer costs, which on current desktop machines allow transmission
of approximately 10 megabytes per second. But repositioning of the disk head
can be largely avoided by storing each inverted list contiguously, or as close to
contiguously as the operating system will allow. A contiguous file can be fetched
around ten times faster than a file of 8 kilobyte blocks randomly scattered on a
disk, so dramatic gains can result from storing each inverted list so that it can
be fetched with a single read operation. Experimental results have shown that,
despite "interference" by the underlying file system (such as organizing files
into randomly-placed blocks and employing header blocks to locate the parts of
the file), the various optimizations used by operating systems allow large files
to be fetched at close to the maximum dictated by the transfer rate.
TEXT DATABASES 161
In some early implementations of inverted files, each list was stored as a
linked list with one node per document, resulting in both appalling perform-
ance-allowing only a few kilobytes to be fetched each second-and large in-
verted files, because of the additional requirement for pointers. It was imple-
mentations such as these that gave inverted files a reputation for inefficiency;
a related problem was that use of linked lists discouraged programmers from
maintaining inverted lists in sorted order, thus adding further to query evalu-
ation costs. However, the strategy of storing inverted lists contiguously does
present problems for update. These issues are considered further below.
Even with inverted lists stored contiguously they have significant space re-
quirements, with, in a simple implementation, 4 bytes for each word occurrence
(for the in-document position) and a further 8 bytes (for the document number
and frequency) for each word-document pair, giving approximately 4 gigabytes
for the 3 gigabyte collection described above. It is clearly desirable that this
space be reduced, not only to conserve disk usage but because reduction in
size cuts transfer costs and thus, potentially at least, reduces query evaluation
times. As a simple first step to reducing size we could question our assumptions:
why, for example, have 4 bytes for the document number? At around 1,000,000
documents 20 bits is adequate, increasing the complexity of processing the in-
verted list but reducing size significantly. Similarly, 4 bytes is excessive for a
frequency or a word position. Space can also be saved by applying a stoplist,
that is, not indexing the common words that contribute most to index size.
Such ad hoc approaches, however, will at best halve the size of the index, to
perhaps 70% of the size of the indexed data.
Much greater reductions in size-that is, compression-result from more
principled methods for efficient representation of integers [Bell et al., 1993,
Bookstein et al., 1992, Choueka et al., 1988, Moffat and Zobel, 1996, Witten
et al., 1994]. We assume in the following discussion that the numbers to be
compressed are positive integers only, but it is straightforward to adapt these
coding schemes to embrace zero and negative numbers.
One simple family of representations is the Elias codes [Elias, 1975]. The
Elias codes represent integers in a variable number of bits, and contiguous
sequences of Elias codes are uniquely decodable. The basic code is unary, in
which each number x is represented by a string of x bits. For example, below
are some numbers in decimal and their equivalent in unary.
x          unary
1          0
2          10
3          110
20         11111111111111111110
7, 3, 6    1111110110111110
In the last line is shown a sequence of numbers; although no punctuation is
given the sequence can be separated into the constituent numbers-that is, the
sequence is uniquely decodable, an essential property for any such compression
scheme.
Unary is not particularly efficient for large numbers-"large" in this context
means "about 4"-but it provides the first step in the Elias family. The next
step is the gamma code, in which each number x is factored as 2^(p-1) + d.
For example, 1 = 2^(1-1) + 0 and 20 = 2^(5-1) + 4. Storing p in unary, using
p bits, and d in binary, using p - 1 bits, gives another uniquely decodable
representation. (In all but the last line of the following table a comma is used
to separate the unary and binary parts of each gamma code, but no such
separator is required in practice.)
x          gamma
1          0,
2          10,0
3          10,1
20         11110,0100
7, 3, 6    1101110111010
The gamma code for a natural number x requires 2*floor(log2 x) + 1 bits, so
that (decimal) 1,000,000 requires 39 bits. The next Elias code is delta, in which
x is factored as for gamma but p is represented using gamma rather than unary.
x          delta
1          0,
2          100,0
3          100,1
20         11001,0100
7, 3, 6    10111100110110
Using delta, 1,000,000 is represented in 29 bits; as we discuss below this saving
can, in conjunction with other manipulations, yield excellent compression.
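The three Elias codes can be sketched in a few lines. Bitstrings are built as Python character strings for clarity rather than packed bits, so this is a demonstration of the coding rules, not an efficient implementation.

```python
def unary(x):
    # x is represented in x bits: (x - 1) one-bits and a terminating zero
    return "1" * (x - 1) + "0"

def binary(d, width):
    # d in exactly `width` bits; width may be zero
    return format(d, "b").zfill(width) if width else ""

def gamma(x):
    # Factor x as 2^(p-1) + d; store p in unary (p bits), d in binary (p-1 bits)
    p = x.bit_length()
    return unary(p) + binary(x - (1 << (p - 1)), p - 1)

def delta(x):
    # As gamma, but the prefix p is itself gamma-coded
    p = x.bit_length()
    return gamma(p) + binary(x - (1 << (p - 1)), p - 1)

def decode_gamma(bits):
    # Unique decodability: split a concatenation of gamma codes, no separators
    out, i = [], 0
    while i < len(bits):
        p = bits.index("0", i) - i + 1          # length of the unary prefix
        out.append((1 << (p - 1)) + int("0" + bits[i + p : i + 2 * p - 1], 2))
        i += 2 * p - 1
    return out

print(unary(20))                                      # 11111111111111111110
print(gamma(20), delta(20))                           # 111100100 110010100
print(decode_gamma(gamma(7) + gamma(3) + gamma(6)))   # [7, 3, 6]
```

The outputs match the tables above once the commas, which are only for the reader, are removed.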
Another family of representations is the Golomb codes [Golomb, 1966, Gal-
lager and Van Voorhis, 1975]. These codes are of particular interest because, as
we discuss below, for this application they yield optimal whole-bit compression.
In the Golomb codes a single integer parameter b is used to model the
distribution of values to be represented; this value can be approximated as

    b ~ 0.69 x (average x).

Given b, the number x is factored as 1 + (k - 1) x b + d, where 0 <= d < b.
The value k is represented in unary and d in binary; but since b may not be a
power of 2 the number of bits used to represent d can vary between floor(log2 b)
and ceil(log2 b). Computing r = ceil(log2 b) and g = 2^r - b, the value d is
encoded in r - 1 bits if d < g, and as d + g in r bits otherwise.
For example, suppose b is 11, so that r is 4 and g is 5. Then the numbers 1
to 5 are represented by the sequence of codes 0,000 to 0,100 (where the range
of suffixes is 0 to 4, represented in 3 bits each) and 6 to 11 are represented by
0,1010 to 0,1111 (for suffixes 5 to 10, in 4 bits each). The codes are uniquely
decodable and, as for all such codes, all sequences of bits are a valid code.
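A sketch of the Golomb coder just described, checked against the b = 11 example from the text:

```python
from math import ceil, log2

def golomb(x, b):
    """Golomb code for x >= 1 with parameter b (a sketch; b would be
    chosen as roughly 0.69 times the average value to be coded)."""
    k, d = divmod(x - 1, b)          # x = 1 + k*b + d, so the unary part is k+1
    r = ceil(log2(b))
    g = 2 ** r - b
    prefix = "1" * k + "0"           # unary code for the quotient part
    if d < g:                        # short (r-1 bit) suffixes for small d
        return prefix + format(d, "b").zfill(r - 1)
    return prefix + format(d + g, "b").zfill(r)

for x in [1, 5, 6, 11, 12]:
    print(x, golomb(x, 11))   # 0000, 0100, 01010, 01111, 10000
```

With b = 11 the suffixes 0 to 4 take 3 bits and 5 to 10 take 4 bits, exactly as in the worked example above.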
Variable-bit coding is a necessary tool for compression of inverted lists. How-
ever, applying variable-bit codes to inverted lists in their raw form does not
yield particularly good compression; for example, the average document num-
ber only requires one or two bits fewer than the maximum number, and as the
examples above show the coding schemes do not directly result in significant
reductions in size.
A simple property of inverted lists provides the basis for much greater com-
pression. Most of the numbers stored in inverted lists (the document numbers
and the positions) are strictly increasing; by taking the difference between
adjacent numbers of the same kind, the values to be stored become much
smaller. Our example inverted list can be written as
3:1(61),
10-3:2(14,106-14),
12-10:1(9),
29-12:4(22,36-22,98-36,202-98), ...
that is,
3:1(61),
7:2(14,92),
2:1(9),
17:4(22,14,62,104), ...
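The gap transformation and its inverse are trivial to implement; a sketch using the numbers from the example list:

```python
def to_gaps(nums):
    # Replace each value by its difference from the preceding one
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

def from_gaps(gaps):
    # Invert the transformation by running summation
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([3, 10, 12, 29]))      # [3, 7, 2, 17]
print(to_gaps([22, 36, 98, 202]))    # [22, 14, 62, 104]
print(from_gaps([3, 7, 2, 17]))      # [3, 10, 12, 29]
```

The small gap values, rather than the raw document numbers, are what the variable-bit codes above compress so effectively.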
Considering for the moment just the document numbers, the sequence resulting
from taking differences forms a Bernoulli distribution, for which the Golomb
codes are an optimal representation [Bell et al., 1993]. An inverted index con-
sisting of a lexicon and, for each indexed word, an inverted list of Golomb-coded
document numbers occupies under 10% of the size of the indexed data. For
the 3 gigabyte database discussed above such an inverted index requires about
190 megabytes. Delta codes can also be used, at a small loss of compression
efficiency. Using gamma codes for frequencies and delta codes for word posi-
tions, an inverted file typically occupies about 22% of the size of the indexed
data, or under 700 megabytes in our practical example-one sixth of the space
required for the uncompressed index.
This space saving does come at a cost: processing effort required to de-
code inverted lists. However, on current desktop machines the time spent in
decompression is more than offset by the time saved in data transfer [Moffat
and Zobel, 1996], and in new architectures the gap between processor speed
and disk transfer rates is continuing to widen, favoring the use of compression.
Thus inverted file compression saves both space and time. Further refinements
to representation of inverted files are discussed in Section 5.3.
Although the successful application of compression to inverted files is fairly
recent, compression is already used in several commercial text database systems
and some of the Internet search engines. The public-domain MG text database
system was developed to demonstrate the application of compression to this
domain [Bell et al., 1995, Witten et al., 1994].
5.2.4 Index construction
There are several possible approaches to index construction for text databases,
which can be broadly classified as either one-pass or two-pass, that is, according
to the number of times the text is inspected during index construction. We
first outline the possibilities, then describe two of the more efficient methods
in detail.
The concept of indexing has often been described as "inversion"-provision
of access to records according to content. Inversion is often implemented as a
sorting process, and indeed a common algorithm given in textbooks for gener-
ating an inverted file is as follows:
1. For each document d in the collection and each word t in d, write a pair
(t, d) to a file.
2. Sort the file with t as a primary sort key and d as a secondary sort key.
This algorithm is, however, almost absurdly wasteful-the document numbers
are already sorted, but sorting algorithms will gain little advantage from this
partial sorting. Moreover, the volume of index information dictates an expen-
sive external sort.
Better solutions use a dynamic structure containing the distinct words in
the database, where each node in the structure points to a dynamic list of
1. While the internal buffer is not full, get documents; for each docu-
ment d, extract the distinct words and for each word t,
(a) If t has already occurred in a previous document, add d to t's
document list.
(b) Otherwise add t to the structure of distinct words and create a
document list for t containing d.
2. When the internal buffer is full, write it to disk to give a partial index,
with the inverted lists stored according to word order. Clear the buffer
and return to step 1.
3. Merge the partial indexes to give the final inverted file.
Figure 5.3. Single-pass index construction algorithm using temporary files.
the document numbers containing that word. Initially the word structure is
empty; as documents are processed new words are added, and for existing
words new document numbers are added to the words' lists of occurrences
(together with the positions of the word in each document). However, in a
naive implementation the costs will still be high because of the difficulties
of maintaining structures of words and lists without frequent disk accesses.
Minimizing the use of disk is the key to fast index construction.
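The in-memory inversion step shared by both methods can be sketched as a dictionary from word to postings; the sample documents are invented for illustration.

```python
from collections import defaultdict

def invert(documents):
    """Build word -> list of (document number, word positions) postings.
    Document numbers come out sorted for free because documents are
    processed in order; the sample texts are invented for illustration."""
    index = defaultdict(list)
    for docnum, text in enumerate(documents, start=1):
        positions = defaultdict(list)
        for wordpos, word in enumerate(text.lower().split(), start=1):
            positions[word].append(wordpos)     # ordinal word positions
        for word, pos in positions.items():
            index[word].append((docnum, pos))
    return index

docs = ["enriched uranium", "uranium ore", "iron ore mining"]
index = invert(docs)
print(index["uranium"])   # [(1, [2]), (2, [1])]
print(index["ore"])       # [(2, [2]), (3, [2])]
```

The fast methods differ not in this step but in how its output is moved to disk without random accesses.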
There are two fast index construction methods, both of which use a dedi-
cated in-memory buffer as a temporary store. In the first method, shown in
Figure 5.3, the buffer is used to store complete partial indexes and the database
is processed in a single pass. Note that compression is as useful during indexing
as it is in the finished index-if the partial indexes are constructed and stored
compressed, more documents can be indexed before the internal buffer is filled,
and less temporary space is required for the partial indexes.
The main disadvantage of this method in practice is the use of temporary
space for the partial indexes, which will exceed the size of the final index because
the indexed words must be repeated between files; and further space is required
for merging. Note that given a fixed-size internal buffer the asymptotic cost of
the merging grows more quickly than does the volume of data to be indexed.
This is not usually a problem in practice because, at least historically, growth
in database size has been matched by improvements in technology, but the
single-pass algorithm is not suitable for "huge" databases.
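Step 3 of Figure 5.3, merging word-ordered partial indexes into a final index, can be sketched with a multiway merge. On disk each run would be a file read sequentially; here the runs are small in-memory lists.

```python
import heapq
from itertools import groupby

def merge_partial_indexes(runs):
    """Merge partial indexes, each a word-ordered list of (word, postings)
    pairs, into one final index. On disk each run would be a file read
    sequentially; here the runs are small in-memory lists."""
    merged = heapq.merge(*runs)                  # one word-ordered stream
    final = []
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, lst in group:
            postings.extend(lst)                 # earlier runs hold smaller
        final.append((word, postings))           # document numbers, so the
    return final                                 # result stays sorted

run1 = [("ore", [2]), ("uranium", [1, 2])]       # from documents 1-2
run2 = [("mining", [3]), ("ore", [3])]           # from document 3
print(merge_partial_indexes([run1, run2]))
```

Because each run is already word-ordered and covers an increasing range of document numbers, concatenating the matching postings preserves sorted order without further work.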
The alternative efficient method, however, has neither of these problems.
This method is outlined in Figure 5.4. Given memory for a complete lexicon
1. Extract the distinct words from each document, and for each word
count the number of documents in which it appears. (Additional statis-
tics are required if word positions are to be stored.)
2. Use the complete lexicon and occurrence counts to create an empty,
template inverted index, to be progressively filled in during the second
pass. The template index contains each distinct word and, for each
word, contiguous space for the word's document list.
3. Initialize the second pass by creating, in the internal buffer, an empty
document list for each term in the lexicon.
4. While the internal buffer is not full, get documents; for each docu-
ment d, extract the distinct words and, for each word t, add d to t's
document list.
5. When the buffer is full, write the partial index into appropriate parts
of the template index, clear the document lists, and go to step 4.
Figure 5.4. Two-pass index construction algorithm.
and for a fixed buffer to be used as a temporary store, a text database can
be rapidly indexed in two passes using no temporary disk space at all [Witten
et al., 1994]. In this method, the first pass is used to construct the lexicon
and a skeleton for the complete index. The skeleton is progressively filled in
during the second pass, by writing the contents of the buffer when it becomes
full; note that each writing of the buffer requires only a single pass through the
disk, thus minimizing disk head movement.
Both methods are highly efficient in practice, indexing about half a gigabyte
of text per hour on a large desktop machine. Indeed the principal costs tend
not to be the indexing itself but the auxiliary processes such as the parser for
extraction of words from each document.
5.2.5 Index update
Compared to records in conventional databases, each record in a text database
contains a large number of items to be indexed-usually hundreds and often
thousands or more. Index update is therefore expensive: insertion of a single
record involves changing the inverted list of every word occurring in that record.
Since these changes can increase the length of the inverted lists, so that (if
stored contiguously) they may no longer fit at their current location on disk,
update also involves moving lists to allow for such increase. The cost of update
is the most significant technical difficulty faced in implementation of a text
database system. In this section we describe approaches to update of indexes for
text databases, principally considering record insertions, as these are by far the
most common update operation to text databases: in contrast to conventional
databases, in which every record in a table may be modified daily by operations
such as "add interest to every account balance", there are no bulk updates, and
a great many text databases are used to store streams of incoming data such
as newspaper articles, court transcripts, and completed documents of one kind
or another.
There is no single clever strategy that dramatically reduces update costs
(which, for similar reasons, are also a problem for the alternative technology of
signature files). There are however several strategies for ameliorating update
costs, by using temporary space, by trading update time against query evalua-
tion time, and by deferring the availability of new documents. We now outline
some of these strategies.
Updating the index as each record is inserted is costly, but the per-record
cost rapidly diminishes if insertions are batched, say into groups of R records,
and all of the corresponding index updates handled at once. Such aggrega-
tion of updates is effective because records share many words (in particular the
common words, whose inverted lists are the most expensive to access and up-
date), and because the changes to the inverted lists can be handled in order of
appearance on disk, minimizing head movement-net seek time will be almost
unchanged compared to updating the inverted file for a single record. Varying
R trades the per-record cost of update against the delay until the record be-
comes available. In some environments, for example, it may be quite reasonable
to process all insertions overnight, in which case the amortized update cost is
negligible but the database will be unavailable while the index is modified.
In other environments, the downtime and the delay in availability of new
records are unacceptable. However, simple variants of the batching strategy
can still be used. For example, if the new records are not indexed immediately
that does not mean that they are unavailable; they can be held in a pool that
is exhaustively searched during evaluation of each query. If this pool is large
enough that exhaustive searching is an unreasonable expense, the pool can be
treated as a mini-database and indexed accordingly.
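A sketch of querying a main index together with an unindexed pool that is scanned exhaustively; the index contents and pool documents are invented for illustration, and only conjunctive queries are shown.

```python
def query_with_pool(terms, index, pool):
    """Conjunctive query over a main inverted index plus a pool of
    not-yet-indexed documents that is scanned exhaustively. `index` maps
    term -> sorted document-number list; `pool` maps document number ->
    text. All names and contents here are invented for illustration."""
    lists = [set(index.get(t, [])) for t in terms]
    hits = set.intersection(*lists) if lists else set()
    for docnum, text in pool.items():        # brute-force scan of the pool
        words = set(text.lower().split())
        if all(t in words for t in terms):
            hits.add(docnum)
    return sorted(hits)

index = {"uranium": [1, 2], "ore": [2, 5]}
pool = {7: "uranium ore deposits", 8: "iron ore"}
print(query_with_pool(["uranium", "ore"], index, pool))   # [2, 7]
```

New records become searchable immediately on entering the pool, while their index updates can be deferred and batched.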
Once we grant the existence of a pool index, further cost ameliorations are
possible. In particular, the main index can be updated on the fly, with each in-
verted list updated as the opportunity arises-when that inverted list is fetched
as part of query evaluation, for example, or when a moment of inactivity allows
the machine to schedule the update.
A further amelioration is to consider the organization of each inverted list
on disk. Contiguous storage is clearly preferable for fast query evaluation, but
does not allow the fastest update for the reasons discussed above. However, it
does allow reasonable update. A simple free-list of available space can be used
to maintain the index, for example, typically resulting in space utilization of
around 67%-an unfortunate increase in index size, but not a disaster given
the small initial size.
An alternative is to carve each list into blocks in some way. Here again
there is a trade-off, since long blocks are highly wasteful of space-the average
inverted list is kilobytes but the median is only tens of bytes-but short blocks
are in effect a linked list. One approach that has been suggested is to use a
linked list of blocks, each one twice the length of its predecessor [Faloutsos
and Jagadish, 1992]. However, if applied to all the lists this solution does not
reduce storage costs and increases query evaluation costs. To see why, consider
how the individual blocks must be allocated. Either each block size must be
stored in a separate file or blocks must be managed within a single file via a
scheme such as the buddy system; in either case significant head movements
are required to fetch a single inverted list. Moreover, in either case the trailing
block in each list will be only partially used, giving average space utilization of
75%. In the presence of update some of the blocks of each size will be unused,
further reducing space utilization. Thus the scheme uses only slightly less space
than contiguous storage but adversely impacts query evaluation. The volume
of data read and written during update is reduced (in both cases the whole
list must be read; in the contiguous case, if there is no room for expansion the
whole list must be written elsewhere, whereas in the blocked case only the end
of the list must be written), but more separate disk accesses are required for
the blocked lists.
A practical compromise is to partition only the longest lists into fixed- or
variable-length blocks, and use conventional space management strategies to
manage the rest so that these lists are stored contiguously. A block size that
reflects the organization of the underlying file system is likely to give good
performance. Note that maintaining the contents of a contiguous list in sorted
order is not a significant overhead-even if updates (as opposed to insertions)
are frequent, the cost of inserting a number into an array in memory is dwarfed
by the cost of reading or writing the array to disk-and maintaining sorted
order significantly reduces the cost of query evaluation.
5.2.6 Signature files
Our presentation of inverted files has been rather clear-cut, specifying exactly
how text should be indexed with only limited options for variations that might
improve performance. We are able to present the material in this way be-
cause, currently at least, the technology is fairly settled. There is no compet-
ing methodology for indexing text that efficiently supports evaluation of query
types such as ranking and proximity. Inverted files have not always held such
a position, however. An alternative technology for more limited applications is
signature files.
In signature files, each record is represented by a fixed-length bitstring, or
signature [Pfaltz et al., 1980]. The words in the record are hashed to decide
which bits are set to 1; a record is probabilistically likely to contain a given word
if all the bits in its signature that correspond to that word are set. As in all hash-
based methods an explicit vocabulary is not required. Naive query evaluation
requires inspection of all the signatures. However, only those bit positions
corresponding to the query terms need to be inspected, so, by transposing the
array of signatures into an array of bitslices, rapid evaluation of conjunctive
queries is possible [Roberts, 1979]. Further improvements can be obtained by
organizing the slices into a multi-level structure [Kent et al., 1990, Sacks-Davis
et al., 1987]. Once likely matches are identified these records must be retrieved
and post-processed to verify whether they contain the query terms.
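A minimal sketch of signature construction and conjunctive matching follows. The signature width, the number of bits set per word, and the use of SHA-256 are arbitrary choices for illustration; the bitslice transposition used for fast evaluation is not shown, and candidates must still be verified against the actual records.

```python
import hashlib

WIDTH = 64    # bits per signature; real systems use much wider signatures
HASHES = 3    # bits set per word; both values are arbitrary here

def word_bits(word):
    # Derive HASHES bit positions from a stable hash of the word
    digest = hashlib.sha256(word.encode()).digest()
    return {digest[i] % WIDTH for i in range(HASHES)}

def signature(text):
    sig = 0
    for word in text.lower().split():
        for bit in word_bits(word):
            sig |= 1 << bit
    return sig

def candidates(terms, signatures):
    # A record probably contains every query term if all the terms' bits
    # are set in its signature; matches may be false and must be verified.
    qbits = 0
    for term in terms:
        for bit in word_bits(term):
            qbits |= 1 << bit
    return [i for i, sig in enumerate(signatures) if sig & qbits == qbits]

docs = ["enriched uranium", "uranium ore", "iron ore mining"]
sigs = [signature(d) for d in docs]
print(candidates(["uranium"], sigs))   # always includes 0 and 1
```

There are no false negatives, so every true match appears among the candidates; false positives become more likely as signatures fill with set bits, which is one reason long, variable-length records suit this method poorly.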
Signature files are well-suited to many of the older text database appli-
cations, which featured: fixed-length documents such as abstracts; machines
with small memories and large numbers of users; and simple Boolean and adja-
cency queries. Compared to the traditional linked-list inverted files, signature
files are rather smaller and give significantly better evaluation times. How-
ever, signatures are not effective for current text applications, partly because
they are poor at indexing databases whose records vary dramatically in length;
and partly because they do not provide efficient evaluation mechanisms for the
rich query paradigms that users now expect for text databases, including not
only ranked and proximity queries but the structured-based querying discussed
below. Moreover, they are not as compact as the current inverted file imple-
mentations, which radically improve on the implementations of only a few years
ago [Zobel et al., 1992, Zobel et al., 1995a].
5.3 Query evaluation
5.3.1 Boolean queries
Boolean query evaluation is, conceptually, a straightforward application of el-
ementary algorithms. Assuming the inverted lists are stored in sorted order
(and neglecting for the moment queries involving phrases or proximity) each
operation is a simple linear merge of two sorted lists, with intersection for AND
and union for OR. The temporary space required to represent the result of the
merge is at most one slot for each document in the database.
Evaluation is only made slightly more complex by introduction of proximity
queries. An intersecting merge is used to find the documents containing the
words that must be proximate; then a comparison of positions is used to check
that the words are appropriately close within the documents. Note that the
word positions should be represented as ordinal word occurrences rather than
byte positions, or it is not possible to reliably identify whether two words are
actually adjacent.
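The merge operations and the proximity check can be sketched as follows, using ordinal word positions as the text recommends:

```python
def intersect(a, b):
    # Linear merge of two sorted document-number lists (AND)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    # Linear merge for OR
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def adjacent(pos1, pos2):
    # True if some occurrence of the second word immediately follows the
    # first, comparing ordinal word positions rather than byte offsets
    return any(q - p == 1 for p in pos1 for q in pos2)

print(intersect([3, 10, 12, 29], [10, 29, 40]))   # [10, 29]
print(union([3, 12], [10, 12]))                   # [3, 10, 12]
print(adjacent([14, 106], [15, 200]))             # True
```

For a proximity query, `intersect` first finds the common documents and the position comparison is then applied within each.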
5.3.2 Ranked queries
The principle of ranking was sketched out above: a similarity measure such as
cosine is used to allocate a numerical score to each document in the collection
with respect to the query, then the documents with the highest scores are
retrieved for presentation to the user. In this section we explain how an index
can be used to rapidly compute the scores for the highest-ranked documents.
Reformulating the cosine measure as

    C(q, d) = (sum over t in q&d of Sq,d,t) / (Wq . Wd),

where Sq,d,t = wq,t . wd,t, it can be seen that, for any document d, the value
Sq,d,t is non-zero only if t occurs in q, that is, if t is a query term. The
numerator, sum over t in q&d of Sq,d,t, can be computed considering only query
terms; thus all the information required to compute the numerators is available
in an inverted file. (For the remainder of this discussion we assume that each
inverted list consists of document-number and frequency pairs (d, fd,t), and
that position information is either not stored or is ignored by the ranking
process.) The query length Wq is
unnecessary, but the document lengths Wd must be precomputed and stored in
a separate structure; with efficient representations these lengths can be stored
in a few bits each [Moffat et al., 1994].
Using the inverted file, the cosine similarity of a document d and query q can
be computed as in the elementary ranking algorithm in Figure 5.5. An array of
accumulators is used to store, for each document in the database, the running
total of the partial sum over t in q&d of Sq,d,t. For a typical database and query, once
index processing is complete a reasonable fraction of the accumulators will be
non-zero. These accumulators are then normalized by the document lengths,
and a partial sort such as a heapsort is used to identify the k documents with
the highest cosine values.
The elementary ranking algorithm provides reasonable performance, and in-
deed has been employed in many practical information retrieval systems. How-
ever, it does have significant costs that in many environments are unacceptable,
particularly for larger document collections.

1. Create an array A of accumulators, one for each document d in the
   database, and for each d initialize Ad <- 0.
2. For each term t in the query,
   (a) Compute the term weight wq,t.
   (b) Retrieve the inverted list for t from disk.
   (c) For each entry (d, fd,t) in the inverted list, compute wd,t and
       set Ad <- Ad + Sq,d,t.
3. Divide each non-zero accumulator Ad by the document length Wd.
4. Identify the k highest accumulator values (where k is the number of
   documents to be presented to the user) and retrieve the corresponding
   documents.

Figure 5.5. Elementary ranking algorithm using an array of accumulators.

First, ranked queries are often expressed in natural language, and therefore
contain a large number of query terms; from the point of view of effectiveness
this is beneficial because increasing the number of query terms can significantly
improve the likelihood that the query will locate relevant documents. Second,
some of the query terms may
occur in a good fraction of the records in the database. The inverted lists for
these query terms must be retrieved and processed in full, and some of them
may be long. Third, the array of accumulators, which contains a floating point
value for each document in the database, is accessed frequently and randomly
and hence must be stored in memory; and a separate array is required for each
simultaneous query. Fourth, the array of document lengths must be either held
in memory or fetched in full for each query.
In combination, there is substantial use of disk traffic, for inverted list re-
trieval; memory, for accumulators and document lengths; and processor time,
for decompression, accumulator update, and accumulator normalization. We
need to consider ways to reduce all these costs.
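The elementary algorithm of Figure 5.5 can be sketched as follows. The tf x idf weighting used here is one common choice among several, and the tiny index and document lengths are invented for illustration.

```python
import heapq
import math
from collections import Counter

def rank(query_terms, index, doc_lengths, k=2):
    """Elementary accumulator-based ranking in the style of Figure 5.5.
    `index` maps term -> list of (document number, in-document frequency);
    the tf x idf weighting is one common choice, not the only one."""
    N = len(doc_lengths)
    A = [0.0] * (N + 1)                 # one accumulator per document, 1-based
    for t, f_qt in Counter(query_terms).items():
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(1 + N / len(postings))
        w_qt = f_qt * idf
        for d, f_dt in postings:        # Ad <- Ad + Sq,d,t
            A[d] += w_qt * (f_dt * idf)
    for d in range(1, N + 1):           # normalize by document length Wd
        A[d] /= doc_lengths[d - 1]
    return heapq.nlargest(k, range(1, N + 1), key=lambda d: A[d])

index = {"uranium": [(1, 2), (3, 1)], "ore": [(2, 1), (3, 2)]}
lengths = [1.5, 1.0, 2.0]               # invented document lengths
print(rank(["uranium", "ore"], index, lengths))   # [3, 1]
```

The costs listed above are all visible here: one inverted list fetched per query term, one accumulator slot per document, and a length array touched during normalization.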
An observation that allows savings in all of these resources is that a to-
tal ranking is unnecessary-in response to a given query users are only inter-
ested in a tiny subset of the document collection. Thus it is not necessary to
compute the similarity of every document. Using simple heuristics, several of
which are discussed below, it is straightforward to drastically prune the number
of accumulators required without degrading retrieval effectiveness. (However,
note that two methods can highly rank completely different documents. That
is, maintenance of effectiveness does not imply that the same documents are
fetched, but only that the same proportion of fetched documents are relevant.)
Once the number of accumulators is reduced, index reorganizations can be used
to reduce the other resource requirements.
A straightforward approach to reducing the number of accumulators is to
restrict their number to some fixed value Amax where Amax « N, the number
of documents. In simple versions of such algorithms [Moffat and Zobel, 1996],
query terms are processed in order of decreasing importance as measured by
their inverse document frequency; each (d, fd,t) pair is decoded and d, if not
previously encountered, is only allocated an accumulator if the limit Amax has
not yet been met. Thereafter only existing accumulators can be updated, and
(d, fd,t) pairs referring to other documents are ignored. Thus only documents
containing rare (high inverse document frequency) terms are allocated accu-
mulators, on the heuristic assumption that documents without such terms are
unlikely to be relevant.
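This accumulator-limiting strategy can be sketched as follows. The in-memory index layout and the simple TF×IDF partial similarity are illustrative assumptions, not the exact formulation of Moffat and Zobel; all names are hypothetical:

```python
import math

def limited_ranking(query_terms, index, N, A_max):
    """Rank documents using at most A_max accumulators.

    index: dict mapping term -> list of (doc, f_dt) pairs.
    Terms are processed in decreasing order of inverse document
    frequency (rarest first); once A_max accumulators exist,
    pairs for previously unseen documents are ignored.
    """
    # Fewest postings = rarest = highest inverse document frequency.
    terms = sorted(query_terms, key=lambda t: len(index.get(t, [])))
    acc = {}
    for t in terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(N / len(postings))
        for d, f_dt in postings:
            if d in acc:
                acc[d] += f_dt * idf
            elif len(acc) < A_max:
                acc[d] = f_dt * idf
            # else: the (d, f_dt) pair is simply discarded
    return sorted(acc.items(), key=lambda item: -item[1])
```

Documents containing only common terms never obtain an accumulator, mirroring the heuristic described above.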
Experimentally there was no impact on effectiveness with Amax set so that
only around 2% of the documents have an accumulator, thus reducing memory
requirements by about a factor of 15 (although there is only one-fiftieth of the
number of accumulators, each accumulator now requires a document number
and is stored in a sparse data structure), and eliminating some of the computa-
tional requirement for accumulator update. Since most of the (d, fd,t) pairs
in each inverted list are no longer used (particularly in the long inverted lists
of common terms), the decompression of these pairs is wasted effort. Most
of the decompression can be avoided by introducing a small amount of inter-
nal structure into each inverted list to allow the unused (d, fd,t) pairs to be
skipped, slightly increasing disk traffic but halving processing costs. This in-
ternal structure can also be used to accelerate Boolean query processing. With
these improvements the remaining important bottleneck in processing is the
disk traffic.
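The effect of such internal structure can be sketched by storing each inverted list as a sequence of blocks tagged with their document-number range, so that a block whose range contains no accumulator need not be decoded at all. The blocked layout and names below are illustrative assumptions, not the authors' exact skipping design:

```python
def update_from_skipped_list(blocks, acc, weight):
    """Update existing accumulators from a blocked inverted list.

    blocks: list of (min_doc, max_doc, pairs), where pairs is the
    (notionally compressed) run of (doc, f_dt) pairs in the block.
    A block is decoded only if its document-number range contains
    a document that already holds an accumulator, so most of the
    decompression work is skipped.  Returns the number of blocks
    actually decoded.
    """
    candidates = sorted(acc)
    decoded = 0
    for lo, hi, pairs in blocks:
        # Skip the block outright if no accumulator doc falls inside it.
        if not any(lo <= d <= hi for d in candidates):
            continue
        decoded += 1
        for d, f_dt in pairs:        # the expensive decode step
            if d in acc:
                acc[d] += f_dt * weight
    return decoded
```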
An alternative method further reduces processing costs and also reduces disk
traffic [Persin et al., 1996]. The basic idea is that by only allowing sufficiently
large Sq,d,t values to create an accumulator, the number of accumulators will be
reduced. The principle underlying "sufficiently large" is that, because accumu-
lator values grow as inverted lists are processed and because Sq,d,t values tend
to diminish if inverted lists are processed in decreasing order of inverse docu-
ment frequency, the effect of adding further Sq,d,t terms to the accumulators is
increasingly marginal: further terms are not only unlikely to bring new documents
into the top k but cannot even significantly perturb the ranking. By comparing each
Sq,d,t value to two current thresholds (one to check whether the value should
be considered at all and one to check whether it warrants a new accumulator),
small Sq,d,t values can be filtered and the number of accumulators restricted.
TEXT DATABASES 173
The thresholds are increased as inverted lists are processed. This method, like
the skipping method, drastically reduces memory requirements without de-
grading retrieval effectiveness, but it requires two parameters to control the
degree of filtering.
If the inverted files are designed appropriately disk traffic can also be dra-
matically reduced. The principle of the index design is that inverted lists are
sorted by within-document frequencies rather than by document number. For
example, consider the inverted list
(5,3)(9,2)(12,2)(16,5)(21,1)(25,2)(32,4) ,
representing that the term being indexed occurs three times in document 5,
twice in document 9, and so on. If the list is ordered first by decreasing within-
document frequencies, with a secondary sort by document number, then it
becomes
(16,5)(32,4)(5,3)(9,2)(12,2)(25,2)(21,1).
With this ordering, all of the sufficiently large Sq,d,t values in each inverted
list are at the start; once a small Sq,d,t value is reached then fetching and
processing of that inverted list can terminate. In the experiments of Persin
et al. this allowed a five-fold reduction in disk traffic and processing time.
A potential drawback to this reorganization of inverted lists is that the docu-
ment numbers are no longer sorted, so that the compression strategy described
above is not strictly applicable. However, a straightforward modification of it
yields equally good compression. First, the frequencies are stored in decreasing
order, so the duplicate frequencies are redundant and can be omitted. Sec-
ond, in practice most of the frequencies are either 1 or 2, and compressing
the sorted document numbers of a given frequency yields good space saving.
Overall, frequency sorting slightly reduces index size.
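The reorganization and the redundancy removal can be sketched together: sort by decreasing within-document frequency with a secondary sort by document number, then store each distinct frequency once, followed by the sorted document numbers that share it. This is a minimal illustration of the idea, not the exact compressed representation:

```python
def frequency_sort(postings):
    """Reorder an inverted list of (doc, f_dt) pairs by decreasing
    within-document frequency (secondary sort by document number),
    grouping it into (frequency, sorted-doc-list) runs so that
    duplicate frequencies are stored only once.
    """
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    groups = []
    for d, f in ordered:
        if groups and groups[-1][0] == f:
            groups[-1][1].append(d)
        else:
            groups.append((f, [d]))
    return groups
```

Applied to the example list above, this produces one run per distinct frequency; the sorted document numbers within each run can then be difference-encoded as usual.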
Another alternative, also based on frequency-sorted inverted lists, is to in-
terleave the processing of the inverted lists rather than process them sequen-
tially [Persin, 1996]. In the query evaluation methods described above, each
inverted list is processed sequentially from the beginning until either the list is
exhausted or the frequencies are judged to be sufficiently small that they will
not affect the ranking; once processing of an inverted list is complete, it is not
revisited. But consider two terms t and t' occurring in documents d and d' re-
spectively. Even if t is rarer than t' and has higher inverse document frequency,
so that t's inverted list is processed first, it may well be that Sq,d,t is less than
Sq,d',t' if t is much less frequent in d than t' is in d'. It follows that, if we are
to observe the principle that high Sq,d,t values should be processed first, it is
inappropriate to process the whole of the inverted list for t before commencing
the list for t'.
1. Create an empty set of accumulators.
2. For each term t in the query, identify the highest within-document
   frequency fd,t for that term and compute the partial similarity Sq,d,t.
3. While the largest unprocessed Sq,d,t value is sufficiently large,
   (a) Find the query term t with the largest unprocessed Sq,d,t value.
   (b) If there is an accumulator Ad present in the set of accumulators,
       set Ad ← Ad + Sq,d,t.
   (c) Otherwise, if the number of accumulators is less than Amax, create
       a new accumulator Ad and set Ad ← Sq,d,t.
   (d) Compute the next highest Sq,d,t value for t.
4. Divide each accumulator Ad by the document length Wd.
5. Identify the k highest accumulator values and retrieve the correspond-
   ing documents.

Figure 5.6. Interleaved ranking algorithm using limited accumulators
In interleaved ranking, processing consists of considering the partial simi-
larity values Sq,d,t in order of strictly non-increasing magnitude, independent
of the inverted lists in which they occur. Efficiency gains result from two
heuristics: limiting the number of accumulators so that only the larger Sq,d,t
values can create an accumulator; stopping when the next greatest Sq,d,t value
is sufficiently small and is unlikely to affect the relative order of the high-
est ranked documents. Whether an Sq,d,t value is "sufficiently small" can be
heuristically determined by examining the current accumulator values. An al-
ternative approach is to explicitly bound the time required to evaluate a query,
and terminate processing when the time bound is reached.
Such processing is supported by frequency-sorted indexes, in which the highest
frequencies in each list (and thus the highest Sq,d,t values in each list) are at the
start, and (d, fd,t) pairs can be retrieved from each list in decreasing order.
Interleaved query evaluation is shown in Figure 5.6.
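The algorithm of Figure 5.6 can be sketched as follows, assuming frequency-sorted in-memory inverted lists and a simple TF×IDF partial similarity. The heap-based selection of the largest unprocessed Sq,d,t value, the fixed stopping threshold, and all names are illustrative assumptions:

```python
import heapq
import math

def interleaved_rank(query_terms, index, N, lengths, A_max, threshold, k):
    """Interleaved ranking with limited accumulators (Figure 5.6).

    index maps each term to its frequency-sorted inverted list
    (decreasing f_dt), so the largest partial similarities for a
    term come first; a heap then yields the globally largest
    unprocessed S_qdt value across all lists.
    """
    heap = []
    for t in query_terms:
        postings = index.get(t)
        if not postings:
            continue
        idf = math.log(N / len(postings))
        d, f = postings[0]
        # Negate for Python's min-heap; remember position in the list.
        heapq.heappush(heap, (-(f * idf), t, 0, idf))
    acc = {}
    while heap:
        neg_s, t, i, idf = heapq.heappop(heap)
        s = -neg_s
        if s < threshold:                  # step 3: value too small, stop
            break
        d = index[t][i][0]
        if d in acc:
            acc[d] += s                    # step 3(b)
        elif len(acc) < A_max:
            acc[d] = s                     # step 3(c)
        if i + 1 < len(index[t]):          # step 3(d): next S_qdt for t
            d2, f2 = index[t][i + 1]
            heapq.heappush(heap, (-(f2 * idf), t, i + 1, idf))
    ranked = sorted(acc, key=lambda d: acc[d] / lengths[d], reverse=True)
    return ranked[:k]                      # steps 4 and 5
```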
The main potential disadvantage of interleaved ranking is that inverted lists
are fetched on demand, piecemeal, rather than with a single read. Thus fetching
the whole list at once incurs the overhead of retrieving unnecessary data, while
fetching the list at need can incur the overhead of unnecessary disk activity. In
practice, however, the problem does not appear to be significant: in most cases
all of the required (d, fd,t) pairs are in the first few kilobytes of each inverted list,
so fetching a single disk block from the start of each list is sufficient [Brown,
1995]. Moreover, in some cases not even the first block is required; if the
maximum fd,t value for each term is held with the term in the lexicon, it is
possible to identify that, for some terms, no Sq,d,t value will be sufficiently large.
These are not the only possible approaches for improving the basic ranking
algorithm. Elimination of stopwords can be used to reduce the computation
costs. However, it is sometimes difficult to determine the correct set of stop-
words for a particular document collection. For example, in a database of
articles from the Wall Street Journal within the TREC collection, the word
"text" (not a particularly common word in English) is encountered in every
document in the collection.
Other proposals have been based on dynamic stopping conditions. One is
that the number of accumulators be limited by considering only documents that
contain a term with a sufficiently high inverse document frequency [Harman
and Candela, 1990]. Another possible stopping condition is to reduce the num-
ber of (d, fd,t) pairs by computing an upper bound for the similarity of the
current document being considered, and ignoring Sq,d,t if the computed upper
bound is smaller than the weight of the least important document in the set
of answers [Lucarella, 1988]. The efficiency of the basic ranking algorithm can
also be improved using the assumption that only the k top-ranked documents are
to be retrieved [Buckley and Lewit, 1985]. In this method, query processing is
terminated when the upper bound of the similarity of the (k+1)st document be-
comes less than the similarity of the kth document. However, these schemes do
not provide the dramatic improvements given by the methods discussed above.
5.4 Refinements to text databases
5.4.1 Structure and fields
Traditional text retrieval systems regard each document as an unstructured
sequence or bag of words. However, documents consist of fields such as titles,
sections, and paragraphs. These components often conform to a hierarchical
structure that can be represented by a formal schema such as an SGML docu-
ment type definition [Goldfarb, 1990].
Compared to traditional database applications, text objects conforming to
the same schema can vary widely in both structure and size. Consider, for
example, a collection of documents relating to the technical details for the
products of a manufacturing company. These documents might include mem-
oranda, engineering reports, and surveys of technical literature, all written to
conform to the company's official proforma. They might also include other
memoranda written by office staff without reference to the official forms, letters
that have little structure in common with either of the other classes of
memoranda, documents from external sources, and so on. Yet all these docu-
ments must be searched as a single collection. The lack of uniformity among
the documents in a single collection makes indexing and retrieval more complex
than if the documents had uniform structure and size.
<letter>
<head><from>Mark Twain</from>
<to>W. D. Howells</to>
<date>15 June 1872</date>
</head>
<body><sentence> Friend Howells
</sentence> <sentence>
Could you tell me how I could get a copy of your portrait as
published in Hearth & Home? </sentence><sentence>
I hear so much talk
about it as being among the finest works of art which have
yet appeared in that journal, that I feel a strong desire to see it.
</sentence><sentence> Is it suitable for framing?
</sentence> ... </body>
</letter>
Figure 5.7. SGML document illustrating hierarchical structure.
We illustrate structure by considering a collection of documents in which
markup (such as SGML tags) is included in the text to represent the structural
information. Consider for example the document in Figure 5.7, which is a letter
consisting of a head and body. The head consists of three fields (from, to, and
date) and the body consists of a number of sentences. Each structural unit is
delimited by a start tag and an end tag. For example, a sentence starts with
a <sentence> tag and ends with a </sentence> tag. The document forms a
simple tree, in which the text is in the leaves and each structural unit is a node.
Structured documents can be queried in the traditional way, as if they were
no more than a sequence of words, but query languages can take advantage of
the structure to provide more effective retrieval. A simple example of a query
involving structure is
find documents with a chapter whose title contains the phrase "metal fatigue"
If such queries are to be evaluated efficiently they require support from indexing
mechanisms. One possibility is to use conventional relational or object-oriented
database technology to store and index the leaf elements of the hierarchical
structure, and maintain the relationships between these leaf elements and the
higher level elements of the document structure in other relations (or object
classes). Join operations can then be used to reconstruct the original docu-
ments or document components. The problem with using such technology is
that a large number of database objects may be required to store the infor-
mation from a single document, so that it is expensive both to search across
the document and to retrieve it for presentation. For these reasons specialized
indexing techniques for structured documents have been developed.
Perhaps the simplest method for supporting structure is to index the docu-
ments and process queries as for unstructured documents, so that the result of
query resolution is a set of documents that potentially match the query; these
documents can then be filtered to remove false matches. As a general prin-
ciple, it is always possible to trade the size and complexity of indexes against
post-retrieval processing on fetched documents: there is a tradeoff between the
amount of information in the index and the number of false matches that must
be filtered out at query time, and indeed for just about any class of data and in-
dex type it is possible to conceive of queries that cannot be completely resolved
using the index. It is often the case, however, that addition of a relatively
small amount of information to an index can greatly reduce the number of false
matches to process; consider how adding positional information eliminates the
need to check whether query terms are adjacent in retrieved documents. More-
over, the cost of query evaluation via inverted lists of known length is usually
much more predictable than the cost of processing an (unknown) number of
false matches. We therefore consider query evaluation techniques that involve
increased index complexity and reduced post-retrieval processing.
One approach is to encode document structure in the index. For each doc-
ument containing a given word, rather than storing the document number and
the ordinal positions at which the word occurs, it is possible to store, say, the
document number; the chapter number within the document; the paragraph
within the chapter; and finally the position within the paragraph.
Indexes for hierarchically structured documents require that considerably
more information be stored for each word occurrence, but the magnitudes of
the numbers involved are rather smaller, the "take difference and encode" com-
pression strategies can still be applied, and there is plenty of scope to remove re-
dundancy: if a word occurs twice in a document, the document number is only
stored once; if it occurs twice in a chapter, the chapter number is only stored
once; and so on. Experiments have shown that, compressed, the size of such an
index roughly doubles compared to storing ordinal word positions, from about
22% of the data size to 44% of the data size [Thom et al., 1995]. The resulting
indexes allow much more powerful queries to be evaluated directly, without
recourse to false matching.
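The redundancy removal for hierarchical coordinates can be sketched as prefix omission over sorted (document, chapter, paragraph, word) tuples: each entry records how many leading components it shares with its predecessor, so repeated document and chapter numbers are stored only once. The encoding below is a minimal illustration, not the book's exact compressed format:

```python
def prefix_omit(coords):
    """Compress sorted hierarchical coordinates, e.g. tuples of
    (doc, chapter, paragraph, word), by omitting components shared
    with the previous entry: each entry becomes (n_shared, tail...).
    """
    out, prev = [], ()
    for c in sorted(coords):
        shared = 0
        while shared < len(prev) and prev[shared] == c[shared]:
            shared += 1
        out.append((shared,) + c[shared:])
        prev = c
    return out

def prefix_restore(entries):
    """Invert prefix_omit, recovering the full coordinate tuples."""
    out, prev = [], ()
    for e in entries:
        shared = e[0]
        c = prev[:shared] + e[1:]
        out.append(c)
        prev = c
    return out
```

The omitted prefixes and the small remaining tail values are then well suited to the difference-and-encode compression schemes discussed earlier.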
Rather than encode the structural information within the inverted indexes,
another approach is to maintain simple word position indexes for each term in
the database and record the structural information in separate indexes.
In order to represent the positions of the words and the markup symbols,
the words in each document are given consecutive integer numbers and the
markup symbols are given intermediate rational numbers. Thus, for example,
a certain word might occur at position 66, the start tag for a paragraph at
position 53.5, and the end tag at 69.1, from which it can be deduced that
the word occurs in the paragraph. The positions between a start tag and the
corresponding end tag constitute an interval.
Evaluating Boolean queries with conventional text indexes involves merging
the inverted lists of the query terms. In contrast, the processing of structural
queries involves merging the inverted lists of word positions and inverted lists
of intervals. For example, processing the query
find sentences containing "fatigue"
involves merging the inverted lists of word positions for the term "fatigue" and
the inverted list of intervals for the tag sentence to identify a set of intervals
containing the word.
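Assuming both inverted lists have already been fetched into sorted Python lists (word occurrences as positions, tag occurrences as (start, end) intervals under the numbering scheme above), the containment merge might be sketched as:

```python
from bisect import bisect_left

def intervals_containing(word_positions, intervals):
    """Return the (start, end) intervals that contain at least one
    occurrence of the word.  word_positions is sorted, so the first
    candidate occurrence in each interval is found by binary search.
    """
    hits = []
    for start, end in intervals:
        i = bisect_left(word_positions, start)
        if i < len(word_positions) and word_positions[i] <= end:
            hits.append((start, end))
    return hits
```

A production merge would instead advance through both sorted lists in a single pass, but the containment test is the same.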
An approach to query on structure based on text intervals was formalized
as the GCL (Generalized Concordance Lists) model [Clarke et al., 1995]. The
GCL model includes an algebra that incorporates operators to eliminate inter-
vals that wholly contain (or are wholly contained in) other intervals. These
operators are important for efficient query processing. GCL evolved from two
earlier structured text retrieval languages developed at the University of Wa-
terloo [Burkowski, 1992, Gonnet and Tompa, 1987], one of which, the Pat text
searching system, was developed for use with the New Oxford English Dictio-
nary. Dao et al. [Dao et al., 1996] extended the GCL model to manage recursive
structures (such as lists within lists).
Compared to the approach of incorporating document structure within the
inverted indexes, the GCL model and its variants have two important advan-
tages: queries on structure only (such as "find documents containing lists") can
be evaluated efficiently using the interval index; and the GCL model does not
require that the document structure be hierarchical. On the other hand, it is
expensive to create and manipulate inverted lists of commonly occurring tags
(such as section or paragraph) that are contained in every document so that,
for hierarchical document collections, incorporating document structure within
the inverted index is likely to have performance advantages. For example, a
simple query to find sentences containing two given terms only requires, with a
hierarchical index, that the inverted lists for the query terms be retrieved and
processed; while with the interval approach it is also necessary to fetch and
process the inverted list of sentence tags.
5.4.2 Pattern matching
Standard query languages for text databases include pattern matching con-
structs such as wildcard characters and other forms of partial specification of
query terms. In particular, in both ranking and Boolean queries users often
use query terms such as comput* to match all words starting with the letters
comput, and more general patterns may also be used. A common approach
is to scan the lexicon to find all terms that satisfy the pattern matching con-
struct and then retrieve all the corresponding inverted lists. Since the lexicon
is ordered, prefix queries, where patterns are of the form X*, can be evaluated
efficiently since, with a lexicon structure such as a B-tree, all possible matching
terms are stored contiguously. However, other pattern queries can require a
linear scan of the whole lexicon. The problem, in a large lexicon, is to rapidly
find all terms matching the specified pattern.
A standard solution is to use a trie or a suffix tree [Morrison, 1968, Gonnet
and Baeza-Yates, 1991], which indexes every substring in the lexicon. Tries
provide extremely fast access to substrings but have a serious drawback in this
application: the need for random access means that they must be stored in
memory and, at typically eight to ten times the size of the indexed lexicon,
for TREC this implies up to 100 megabytes of memory. Unless speed
is the only constraint, smaller structures are preferable.
One alternative is to use a permuted dictionary [Bratley and Choueka, 1982,
Gonnet and Baeza-Yates, 1991] containing all possible rotations of each word in
the lexicon, so that, for example, the word range would contribute the original
form |range and the rotations range|, ange|r, nge|ra, ge|ran, and e|rang,
where | indicates the beginning of a word. The resulting set of strings is then
sorted lexicographically. Using this mechanism, all patterns of the form X*,
*X, *X*, and X*Y can be rapidly processed by binary search on the permuted
lexicon. The permuted lexicon can be implemented as an array of pointers, one
to each character of the original lexicon, making it about four times the size of
the indexed data. Update of the structure is fairly slow.
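The rotation and search steps can be sketched as follows. For clarity the sketch stores the rotated strings themselves rather than an array of pointers into the lexicon, and handles only *X (suffix) patterns; the function names are illustrative:

```python
from bisect import bisect_left

def build_permuted(lexicon):
    """All rotations of each '|'-prefixed word, sorted; each
    rotation is paired with the original word."""
    rots = []
    for w in lexicon:
        m = '|' + w                      # '|' marks the word start
        for i in range(len(m)):
            rots.append((m[i:] + m[:i], w))
    rots.sort()
    return rots

def match_suffix(permuted, pattern):
    """Evaluate a '*X' pattern by binary search: rotating so the
    wildcard is at the end means searching for rotations that
    begin with 'X|'.
    """
    key = pattern.lstrip('*') + '|'
    words, i = set(), bisect_left(permuted, (key,))
    while i < len(permuted) and permuted[i][0].startswith(key):
        words.add(permuted[i][1])
        i += 1
    return words
```

Patterns of the other forms (X*, *X*, X*Y) reduce to analogous prefix searches after choosing the appropriate rotation of the pattern.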
Another approach is to index the lexicon with compressed inverted files [Zo-
bel et al., 1993]. The lexicon is treated as a database that can be accessed using
an index of fixed length substrings of length n, or n-grams. To retrieve strings
that match a pattern, all of the n-grams in the pattern are extracted, the words
in the lexicon that contain these substrings are identified via the index; and
these words are checked against the pattern for false matches. This approach
provides general pattern matching at a smaller overhead, with indexes
of around the same size as the indexed data; matching is significantly slower
than with the methods discussed above but still much faster than exhaustive
search. A related approach is to index n-grams with signature files [Owolabi
and McGregor, 1988], which can have similar performance for short strings.
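The n-gram approach can be sketched as a small inverted index over the lexicon: the posting sets of the pattern's n-grams are intersected to produce candidates, which a final direct check filters for false matches. The function names and the choice of bigrams are illustrative assumptions:

```python
import re

def build_ngram_index(lexicon, n=2):
    """Inverted index mapping each n-gram to the lexicon words
    containing it."""
    index = {}
    for w in lexicon:
        for i in range(len(w) - n + 1):
            index.setdefault(w[i:i + n], set()).add(w)
    return index

def match_pattern(index, lexicon, pattern, n=2):
    """Find lexicon words matching a pattern such as 'ran*ge':
    intersect the posting sets of the pattern's n-grams, then
    discard false matches with a direct check.
    """
    grams = []
    for piece in pattern.split('*'):
        grams += [piece[i:i + n] for i in range(len(piece) - n + 1)]
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(lexicon)        # no usable n-gram: full scan
    rx = re.compile('^' + '.*'.join(map(re.escape, pattern.split('*'))) + '$')
    return {w for w in candidates if rx.match(w)}
```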
5.4.3 Phonetic matching
Pattern matching is not the only kind of string matching of value for text
databases. Another kind of matching is by similarity of sound: to identify
strings that, if voiced, may have the same pronunciation. Such matching is
of particular value for databases of names; consider for example a telephone
directory enquiry line.
To provide such matching it is necessary to have a mechanism for determin-
ing whether two strings may sound alike (that is, a similarity measure) and, if
matching is to be fast, an indexing technique. Thus phonetic matching is a form
of ranking. Many phonetic similarity measures have been proposed. The best
known (and oldest) is the Soundex algorithm [Hall and Dowling, 1980, Kukich,
1992] and its derivatives, in which strings are reduced to simple codes and are
deemed to sound alike if they have the same encoding. Despite the popularity of
Soundex, however, it is not an effective phonetic matching method. Far better
matching is given by lexicographic methods such as n-gram similarities, which
use the number of n-grams in common between two strings; edit distances,
which use the number of changes required to transform one string to another;
and phonetically-based edit distances, which make allowance for the similarity
of pronunciation of the characters involved [Zobel and Dart, 1995, Zobel and
Dart, 1996].
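Of these measures, edit distance has the most direct formulation; the sketch below is the classic dynamic-programming version with unit costs. The phonetically-based variants of Zobel and Dart alter only the substitution penalty for similar-sounding characters, which is not reproduced here:

```python
def edit_distance(s, t):
    """Number of single-character insertions, deletions, and
    substitutions needed to turn s into t, computed row by row.
    """
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete from s
                           cur[j - 1] + 1,            # insert into s
                           prev[j - 1] + (cs != ct))) # substitute
        prev = cur
    return prev[-1]
```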
An n-gram index can be used to accelerate matching, by selecting the strings
that have short sequences of characters in common with the query string to be
subsequently checked directly by the similarity measure. The speed-up available
by such indexes is limited, however, because typically 10% of the strings are
selected by the index as candidates.
5.4.4 Passage retrieval
Documents in text databases can be extremely large; one of the documents in
the TREC collection, for example, is considerably longer than Tolstoy's War
and Peace. Retrieval of smaller units of information than whole documents
has several advantages: it reduces disk traffic; small units are more likely to
be useful to the user; and they may represent blocks of relevant material from
otherwise irrelevant text. Such smaller units, or passages, could be logical units
such as sections or series of paragraphs, or might simply be any contiguous
sequence of words.
Passages can be used to determine the most relevant documents in a collec-
tion, on the principle that it is better to identify as relevant a document that
contains at least one short passage of text with a high number of query terms
rather than a document with the query terms spread thinly across its whole
length. Experiments with the TREC collection and other databases show that
use of passages can significantly improve effectiveness [Callan, 1994, Hearst and
Plaunt, 1993, Kaszkiel and Zobel, 1997, Knaus et al., 1995, Mittendorf and
Schauble, 1994, Salton et al., 1993, Wilkinson, 1994, Zobel et al., 1995b]. Use
of passages does increase the cost of ranking, because more distinct items must
be ranked, but the various techniques described earlier for reducing the cost of
ranking are as applicable to passages as they are to whole documents.
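As a minimal illustration of passage scoring, consider the simplest case of fixed-length passages: slide a window across the document and score each window by its count of query-term occurrences. This sketch is an assumption-laden simplification (real passage-ranking measures weight terms, as in the similarity measures above); all names are illustrative:

```python
def best_passage(doc_terms, query_terms, window=50):
    """Slide a window of `window` words across the document,
    counting query-term occurrences, and return the start offset
    and score of the best window (computed incrementally).
    """
    q = set(query_terms)
    hits = [1 if w in q else 0 for w in doc_terms]
    score = sum(hits[:window])
    best, best_start = score, 0
    for start in range(1, max(1, len(hits) - window + 1)):
        score += hits[start + window - 1] - hits[start - 1]
        if score > best:
            best, best_start = score, start
    return best_start, best
```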
5.4.5 Query expansion and combination of evidence
Improvement of effectiveness, that is, finding similarity measures that are better
at identifying relevant documents, is a principal goal of research in information
retrieval. Passage retrieval is one approach to improving effectiveness. Two
other approaches of importance are query expansion and combination of evi-
dence.
The longer a query, the more likely it is to be effective. It follows that it can
be helpful to introduce further query terms, that is, to expand the query. One
such approach is thesaural expansion, in which either users are encouraged
to add new query terms drawn from a thesaurus or such terms are added
automatically. Another approach is relevance feedback: after some documents
have been returned as matches, the user can indicate which of these are relevant;
the system can then automatically extract likely additional query terms from
these documents and use them to identify further matches. A recent innovation
is automatic query expansion, in which, based on the statistical observation that
the most highly-ranked documents have a reasonable likelihood of relevance,
these documents are assumed to be relevant and used as sources of further
query terms. All of these methods can improve performance, with relevance
feedback in particular proving successful [Salton, 1989].
A curious feature of document retrieval is that different approaches to mea-
suring similarity can give very different rankings and yet be equally effective.
That is, different measures identify different documents, because they use differ-
ent forms of evidence to construe relevance. This property can be exploited by
explicitly combining the similarities from different measures, which frequently
leads to improved effectiveness [Fox and Shaw, 1993].
5.5 Summary
We have reviewed querying and indexing for text databases. Since queries to
text databases are inherently approximate, text querying paradigms must be
judged by their effectiveness, that is, whether they allow users to readily locate
relevant documents. Research in information retrieval has identified statistical
ranking techniques, based on similarity measures, that can be used for effective
querying. The task of text query evaluation is to compute these measures
efficiently, or to efficiently compute heuristic approximations to these measures
that allow faster response without compromising effectiveness.
The last decade has seen vast improvements in text query evaluation and
text indexes. First, compression has been successfully applied to inverted files,
reducing the space requirements of an index with full positional information to
less than 25% of that of the indexed data, or less than 10% for an index with
only the document-level information required for ranking. This compares very
favorably with the space required for traditional inverted file or signature file
implementations. Use of compression has no impact on overall query evaluation
time, since the additional processing costs are offset by savings in disk traffic.
Also, compression makes possible new efficient index construction techniques.
Second, improved algorithms have led to further dramatic reductions in the
costs of text query evaluation, and in particular of ranking, giving savings in
memory requirements, processing costs, and disk traffic.
Currently, however, the needs of document database systems are rapidly
changing, driven by the rapid expansion of the Web and in the use of intranets
and corporate databases. We have described some of the new requirements for
text databases, including the need to index and retrieve documents according
to structure and the need to identify relevant passages within text collections.
Improved retrieval methodologies are being proposed and consequently there is
a need to support new evaluation modes such as query expansion and combina-
tion of evidence. These improvements are not yet well understood, and before
they can be used in practice new indexing and query evaluation techniques
are required. Future research in text database indexing will have to meet the
demands of these advanced kinds of querying.
Notes
1. The ongoing TREC text retrieval experiment, involving participants from around
the world, is an NIST-funded initiative that provides queries, large test collections, and
blind evaluation of ranking techniques. Prior to TREC the cost of relevance judgments had
restricted ranking experiments to toy collections of a few thousand documents.
2. Some of the online search engines, such as AltaVista, report the number of occurrences
of each query term. Currently (the start of 1997) these numbers often run up to a million or
so, against a database of around ten million records, showing that meaningful query terms
can indeed occur in a large fraction of the database.
3. Note, however, that text databases are free of some of the costs of traditional databases.
Although text database index processing can seem exorbitantly expensive in comparison to
the cost of processing a query against, say, a file of bank account records, there is no equiv-
alent in the text domain to the concept of join. All queries are to the same table and query
evaluation has linear asymptotic complexity.
4. Fractional-bit codes such as those produced by arithmetic coding require less space,
but are not appropriate for this application because they give relatively slow decompression.
5. The effectiveness of solutions of this kind depends on the overall design of the database
system. Most current text database systems are implemented as some form of client-server
architecture, with the data and server resident on one machine and, to simplify locking, with
a single server process handling all queries and updates (perhaps via multiple threads) and
communicating with multiple clients.
6. The array of document lengths is not strictly necessary. Instead of storing each
document frequency as fd,t and storing the Wd values separately, it would be possible to store
normalized frequencies fd,t/Wd in the inverted lists and dispense with the Wd array. However,
such normalization is incompatible with compression and on balance degrades overall query
evaluation time because of the increased disk traffic. Note that the array of Wd values can
be compacted to a few bits per entry without loss of effectiveness [Moffat et al., 1994].
6 EMERGING APPLICATIONS
Because performance is a crucial issue in database systems, indexing techniques
have always been an area of intense research and development. Advances in
indexing techniques are primarily driven by the need to support different
data models, such as the object-oriented data model, and different data types,
such as image and text data. However, advances in computer architectures
may also require significant extensions to traditional indexing techniques. Such
extensions are required to fully exploit the performance potential of new archi-
tectures, such as in the case of parallel architectures, or to cope with limited
computing resources, such as in the case of mobile computing systems. New
application areas also play an important role in dictating extensions to indexing
techniques and in offering wider contexts in which traditional techniques can
be used.
In this chapter we cover a number of additional topics, some of which are in
an early stage of research. We first discuss extensions to index organizations
required by advances in computer system architectures. In particular, in Sec-
tion 6.1 we discuss indexing techniques for parallel and distributed database
systems. We outline the main issues and present two techniques, based on B-
tree and hashing, respectively. In Section 6.2 we discuss indexing techniques
E. Bertino et al., Indexing Techniques for Advanced Database Systems
© Kluwer Academic Publishers 1997
for databases on mobile computing systems. In this section, we first briefly de-
scribe a reference architecture for mobile computing systems and then discuss
two indexing approaches. Following those two sections, we focus on extensions
required by new application areas. In particular, Section 6.3 and Section 6.4
discuss indexing issues for data warehousing systems and for the Web, respectively. Data warehousing and the Web are currently "hot" areas in the database
field and have interesting requirements with respect to indexing organizations.
We then conclude this chapter by discussing in Section 6.5 indexing techniques
for constraint databases. Constraint databases are able to store and manipulate infinite relations and are, therefore, particularly suited for applications such as spatial and temporal ones.
6.1 Indexing techniques for parallel and distributed databases
Parallel and distributed systems represent an important architectural approach to efficiently supporting mission-critical applications that require fast processing of very large amounts of data. The availability of fast networks, like 10 Mb/sec
Ethernet or 100 Mb/sec to 1 Gb/sec Ultranet [Litwin et al., 1993a], makes it
possible to process in parallel large volumes of data without any communication
bottleneck.
In a distributed or parallel database system, a set-oriented database object
such as a relation may be horizontally partitioned and each partition stored at a
database node. Such a node is called the store node for the data object [Choy and
Mohan, 1996] and the number of nodes storing partitions of the data object is
called the partitioning degree. Data are accessed from application programs and
users residing on client nodes. A client node may or may not reside on the same physical node as a store node. A query addressed to a given data
object can be executed in parallel over the partitions into which the data object
has been decomposed, thus achieving substantial performance improvements.
In practice, however, efficient parallel query processing entails many issues, such
as parallel join execution techniques, optimal processor allocation, and suitable
indexing techniques. In particular, if indexing techniques are not designed
properly, they may undermine the performance gains of parallel processing.
Data structures for distributed and parallel database systems should satisfy
several requirements [Litwin et al., 1993a]. Data structures should gracefully
scale up with the partitioning degree. The addition of a new store node to a data
object should not require extensive reorganization of the data structure. There
should be no central node through which searches and updates to the data
structure must go. Therefore, no central directories or similar notions should
exist. Finally, maintenance operations on the data structure, like insertions or
deletions, should not require updates to the client nodes.
In the remainder of this section, we present two data structures. The first is
based on organizing the access structure on two levels. Given a query, the top-
most global level is used to detect the nodes where data relevant to the query
are stored; the lowest local level of the access structure is used to retrieve the
actual data satisfying the query. There is one local level of the data structure
for each partition node of the indexed data object. The second data structure
is a distributed extension of the well-known linear hashing technique [Litwin,
1980]. This data structure does not require any global component. A query is sent by the issuing client to the store node that, according to the information the client has, contains the required data. If the data are not found
at that store node, the query is forwarded by that node to the appropriate store
node.
6.1.1 Two-tier indexing technique
Two simple approaches to indexing data in a distributed database can be de-
vised based, respectively, on the notions of local index and global index [Choy
and Mohan, 1996]. Under the first approach, a separate local index is main-
tained at each store node of a given data object. Therefore, each local index
is maintained for the respective partition like a conventional index on a non-
partitioned object. This approach requires a number of local indexes equal to
the number of partitions. A key lookup requires sending the key value to all the
local indexes to perform local searches. Such an approach is therefore convenient
when qualifying records are found in most partitions. If, however, qualifying
records are only found in a small fraction of partitions, this approach is very
inefficient and in particular does not scale up for a large number of partitions.
The main advantages of this approach are that no centralized structure exists,
and updates are efficient because an update to a record in a partition only
involves modifications to the local index associated with the partition.
Under the global index approach, a single, centralized index exists that in-
dexes all records in all partitions. This approach requires that globally unique record identifiers (RIDs) be stored in the index entries. Indeed, two different records in
two different partitions may happen to have the same (local) RID and there-
fore at a global level, a mechanism to uniquely identify such records must be in
place. A simple approach is to concatenate each local RID with the partition
identifier [Choy and Mohan, 1996]. The global index can be stored at any node
and may be partitioned.
The global approach allows the direct identification, without requiring use-
less local searches, of the records having a given key value. However, it has sev-
eral disadvantages. First, remote updates are required whenever a partition is
modified. Remote updates are expensive because of the two-phase commit protocols that must be applied whenever distributed transactions are performed.
Second, a remote shared lock must be acquired on the index, whenever a par-
tition is read, to ensure serializability. Third, the global index approach is
not efficient for complex queries requiring intersection or union of lists of RIDs
returned by searches on different global indexes, if these global indexes are lo-
cated at different sites. In such a case, long lists of RIDs must be exchanged
among sites. Storing all the global indexes at the same site would not be a
viable solution. The site storing all the global indexes would become a hot
spot, thus reducing parallelism.
An alternative approach, called two-tier index, has been proposed [Choy and Mohan, 1996], which tries to combine the advantages of the above two approaches.
Under the two-tier index approach, a local index is maintained for each parti-
tion. An additional coarse global index is superimposed on the local indexes.
Such a global index keeps for each key value the identifier of the partition stor-
ing records with this key value. The coarse global index is, however, optional.
Its allocation may or may not be required by the database administrator depending on the query patterns. The coarse global index may be located at any
site and may be partitioned.
An important requirement is that the overall index structure should be main-
tained consistent with respect to the indexed objects. Therefore, updates to
any of the local indexes have to be propagated, if needed, to the coarse global
index. However, compared to the global index approach, the two-tier index
approach is much more efficient with respect to updates. Whenever a record
having a key value v is removed from a partition, the global coarse index needs
to be modified only if the removed record is the last one in its partition having
v as key value. By contrast, if other records with key value v are stored in the
partition, the coarse global index need not be modified. Of course, the local
index needs to be modified in both cases. Insertions are handled according to
the same principle. Whenever a new record is inserted into a partition, the
coarse global index needs to be modified only if the newly inserted record has
a key value which is not already in the local index. Algorithms for efficient
maintenance operations and locking protocols have also been proposed [Choy
and Mohan, 1996].
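The maintenance principle just described can be sketched in a few lines. The following is a hypothetical in-memory model (the class and its structure are illustrative, not taken from [Choy and Mohan, 1996]); its point is that the coarse global index is touched only on the first insertion, or last deletion, of a key value within a partition.

```python
# Illustrative in-memory sketch of two-tier index maintenance.
# The coarse global index maps each key value to the set of partitions
# holding records with that value; each local index maps a key value to
# the local RIDs of matching records within one partition.

class TwoTierIndex:
    def __init__(self, num_partitions):
        self.local = [dict() for _ in range(num_partitions)]  # key -> set of local RIDs
        self.coarse = {}                                      # key -> set of partition ids

    def insert(self, part, key, rid):
        bucket = self.local[part].setdefault(key, set())
        first_in_partition = not bucket   # key not yet in this local index
        bucket.add(rid)
        if first_in_partition:
            # Only now must the coarse global index be updated.
            self.coarse.setdefault(key, set()).add(part)

    def delete(self, part, key, rid):
        bucket = self.local[part][key]
        bucket.discard(rid)
        if not bucket:                    # last record with this key in the partition
            del self.local[part][key]
            self.coarse[key].discard(part)
            if not self.coarse[key]:
                del self.coarse[key]

    def lookup(self, key):
        # Route the search only to the partitions named by the coarse index.
        return {p: self.local[p][key] for p in self.coarse.get(key, set())}
```

For example, a second insertion of the same key value into the same partition updates only the local index, leaving the coarse global index untouched.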
With respect to query performance, the two-tier index approach has the
same advantage as the global index approach. The coarse global index allows
the direct identification of the partitions containing records with the searched
key value. Then, the search is routed to the identified partitions where the
local indexes are searched to determine the records containing the key value.
However, unlike the global index approach, the two-tier approach maximizes
opportunity for parallelism. Once the partitions are identified from the coarse
global index, the search can be performed in parallel on the local indexes of
the identified partitions. In addition, the two-tier approach provides more
opportunities for optimization. For example, if a search condition is not very
selective with respect to the number of partitions, the coarse global index can be
bypassed and the search request simply broadcast to all the local indexes (as in the local index approach).
It has been shown that the two-tier index represents a versatile and scalable
indexing technique for use in distributed database systems [Choy and Mohan,
1996]. Many issues are still open to investigation. In particular, the two-tier
index structure can be extended to a multi-tier index structure, where the index
organization consists of more than two levels. Query optimization strategies
and cost models need to be developed and analyzed.
6.1.2 Distributed linear hashing
The distributed linear hashing technique, also called LH*, has been proposed
in a precise architectural framework. Basically, the availability of very fast
networks makes it more efficient to retrieve data from the RAM of another
processor than from a local disk [Litwin et al., 1993a]. A system consisting of
hundreds, or even thousands, of processors interconnected by a fast network
would be able to provide a large, distributed RAM store adequate for large amounts of data. By exploiting parallelism in query execution, such a system would be much more efficient than systems based on more traditional architectures. Such an architecture may be highly dynamic, with new nodes added
as more storage is required. Therefore, there is a need for access structures for use in systems with a very large number of nodes, hundreds or thousands, that are able to scale gracefully. A given file, in such a system, may be shared by
several clients. Clients may issue both retrieval and update operations.
Distributed linear hashing has been proposed with the goal of addressing
the above requirements. An important feature of this organization is that it
does not require any centralized directory and is rather efficient. It has been
proved [Litwin et al., 1993a] that retrieval of a data item given its key value
usually requires two messages, and four in the worst case. In the remainder of
this section, we first briefly review the linear hashing technique and then we
discuss the distributed linear hashing in more detail.
Linear hashing. Linear hashing organizes a file into a collection of buckets.
The number of buckets linearly increases as the number of data items in the
file grows. In particular, whenever a bucket b overflows, an additional bucket
is allocated. Because of the dynamic bucket allocation, the hash function must
be dynamically modified to be able to address also the newly allocated buckets.
Therefore, as in other hashing techniques, different hashing functions need to be
used because more bits of the hashed value are used as the address space grows.
In particular, linear hashing uses two functions h_i and h_{i+1}, i = 0, 1, 2, ....
Function h_i generates addresses in the range [0, N x 2^i - 1], where N is the
number of buckets that are initially allocated (N can also be equal to 1). A
commonly used function [Litwin et al., 1993a] is:

    h_i(C) = C mod (N x 2^i)

where C is the key value. Each bucket has a parameter called the bucket level,
denoting which hash function, between h_i and h_{i+1}, must be used to address
the bucket.
Whenever a bucket overflows, a new bucket is added and a split operation is
performed. However, the bucket which is split is not usually the bucket which
generated the overflow. Rather, another bucket is split. The bucket to split
is determined by a special parameter n, called split pointer. Once the split is
performed, the split pointer is properly modified. It always denotes the leftmost
bucket which uses function h_i. Once a bucket is split, the bucket level of the two buckets involved in the split is incremented by one, thus replacing function h_i with h_{i+1} for these two buckets.
Consider the example in Figure 6.1(a) adapted from [Litwin et al., 1993a].
In the example, we assume that N = 1. Suppose that the key value 145 is
added. The insertion of such a key results in an overflow for the second bucket
and in the addition of a third bucket. However, the bucket which is split is not
the second one; it is the first one. Figure 6.1(b) illustrates the structure after
the insertion and splitting. Note that a special overflow bucket is added to the
second bucket to store the record with key value 145. Because n is equal to 0,
the first bucket is split; the hash function to use for the first and third buckets
(the newly allocated one) is h_2. Figure 6.1(c) illustrates the organization after
the insertion of records with key values 6, 12, 360, and 18. Those insertions
do not cause any overflow. Suppose now that a record with key value 7 is
inserted. Such an insertion results in an overflow for bucket 1. Because n is equal to 1, bucket number 1 is split. Figure 6.1(d) illustrates the resulting organization. Note that the hash function to use for the second and fourth buckets is now h_2. Because all buckets have the same bucket level, that is, 2, the split pointer is assigned 0.
Retrieval of a record, given its key, is very efficient. It is performed according
to the following simple algorithm (A1).

    Let C be the key to be searched, then
    a ← h_i(C);
    if a < n then a ← h_{i+1}(C).    (A1)
[Figure 6.1, panels (a)-(d), not reproduced: they show the bucket contents at each stage. In (a) both buckets use h_1 and the split pointer is 0; in (b) and (c) the three buckets use h_2, h_1, h_2 and the split pointer is 1; in (d) all four buckets use h_2 and the split pointer is 0.]
Figure 6.1. Organization of a file under linear hashing.
Basically, the second step checks whether the bucket obtained by applying function h_i to the key has already been split. If so, function h_{i+1} is to be used. The index (i or i + 1) to be used for a bucket is the bucket level, whereas i + 1 is the file level.
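The scheme above can be sketched as follows. This is a minimal illustration: the bucket capacity and the initial number of buckets are illustrative parameters, and overflow chains are simplified to lists that may temporarily exceed capacity.

```python
# Minimal sketch of linear hashing. Capacity and n_initial are
# illustrative choices, not fixed by the technique.

class LinearHashFile:
    def __init__(self, n_initial=1, capacity=2):
        self.N = n_initial
        self.i = 0          # file level: h_i and h_{i+1} are in use
        self.n = 0          # split pointer: leftmost bucket still using h_i
        self.capacity = capacity
        self.buckets = [[] for _ in range(n_initial)]

    def _h(self, level, key):
        # h_level(C) = C mod (N x 2^level)
        return key % (self.N * 2 ** level)

    def _address(self, key):
        # Algorithm (A1): buckets below the split pointer use h_{i+1}.
        a = self._h(self.i, key)
        if a < self.n:
            a = self._h(self.i + 1, key)
        return a

    def insert(self, key):
        a = self._address(key)
        self.buckets[a].append(key)
        if len(self.buckets[a]) > self.capacity:
            self._split()

    def _split(self):
        # Split the bucket designated by the split pointer, which is
        # generally NOT the bucket that overflowed.
        b = self.n
        self.buckets.append([])
        old, self.buckets[b] = self.buckets[b], []
        self.n += 1
        if self.n >= self.N * 2 ** self.i:
            self.n, self.i = 0, self.i + 1
        for key in old:     # redistribute under the new addressing
            self.buckets[self._address(key)].append(key)

    def search(self, key):
        return key in self.buckets[self._address(key)]
```

Inserting the keys 216, 32, 153, 321, 145 (from Figure 6.1) into a file with N = 1 and capacity 2 triggers two splits, after which every key is found by a single application of the address calculation.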
LH*. In the distributed version of linear hashing, each bucket of the distributed file is actually the RAM of a node in the system. Therefore, the hash
function returns identifiers of store nodes. Note that LH* could be used also
if the data were stored in the disks of the various nodes rather than in RAM.
However, LH* is particularly suited for systems with a very large number of
nodes, as is the case when using RAM for storing a (large) database.
Data stored at the various nodes are directly manipulated by clients. A
client can perform searches or updates. Whenever a client issues an operation,
for example a search, the first step to perform is the address calculation to determine the store node affected by the operation. Calculating such addresses
requires, according to algorithm (A1), that the client be aware of the up-to-date
values of nand i. Satisfying such constraints in an environment where there is
a large number of clients and store nodes is quite difficult. Propagating those
values, whenever they change, is not feasible given the large number of clients.
Therefore, LH* does not require that clients have a consistent view of i and n.
Rather, each client may have its own view for such parameters, and therefore
each client may have an image of the file that may differ from the actual file.
Also, the image of a file a client has may differ from the images other clients
have. We denote by i' and n' the view that a client has of the file parameters
i and n.
The basic principle of LH* is to let a client use its own local parameters
for computing the identifier of the node affected by the operation the client
wishes to perform on the file. Therefore, the address calculation is performed
by using algorithm (A1) with the difference that the client's local parameters
are used. That is, the address is computed in terms of parameters i' and n'
instead of i and n. The request is then forwarded to the store node, whose
address is returned by the address calculation step. Because a client may not
have correct values for the file parameters, the store node may not be the correct
one. An addressing error thus arises. In order to handle such an error, another
basic principle is that each store node performs its own address calculation;
such step is called server address calculation. Note that each store node knows
the level of the bucket it stores; however, it does not know the current value of
n. The server address calculation is thus performed according to the following
algorithm (A2).
    Let C be the key to be searched
    Let a be the address of store node s
    Let j be the level of the bucket stored at s, then
    a' ← h_j(C);
    if a ≠ a' then
        a'' ← h_{j-1}(C);
        if a'' > a and a'' < a' then a' ← a''.    (A2)
The address a' returned by the above algorithm is the address of the store
node to which the request should be forwarded if an addressing error has oc-
curred.
Therefore, whenever a store node receives a request, it performs its own
address calculation. If the calculated address is its own address, the address
calculated by the client is the correct one (therefore, the client has an up-to-
date image of the file). If not, the server forwards the request to the store node
whose address has been returned by the server address calculation, according to
the above algorithm. The recipient of the forwarded operation checks again the
address, by performing again the server address calculation, and may perhaps
forward the request to a third store node. It has been, however, formally
proved [Litwin et al., 1993a] that the third recipient is the final one. Therefore,
delivering the request to the correct store node requires forwarding the request
at most twice.
As a final step, a client image adjustment is performed by the store node first
contacted by the client, if an addressing error occurred. The store node simply
returns to the client its own values for i and n, so that the client image becomes
closer to the actual image.
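Algorithms (A1), (A2), and the forwarding chain can be sketched as follows. The mapping from node addresses to bucket levels is an illustrative stand-in for the distributed file state, and N = 1 is assumed, as in the examples.

```python
def h(level, key, N=1):
    # Hash function h_level(C) = C mod (N x 2^level).
    return key % (N * 2 ** level)

def client_address(key, i_prime, n_prime):
    # Algorithm (A1), computed on the client's possibly stale image
    # (i', n') of the file parameters.
    a = h(i_prime, key)
    if a < n_prime:
        a = h(i_prime + 1, key)
    return a

def server_address(key, a, j):
    # Algorithm (A2): j is the level of the bucket stored at address a.
    a_new = h(j, key)
    if a_new != a:
        a_alt = h(j - 1, key)
        if a < a_alt < a_new:
            a_new = a_alt
    return a_new

def route(key, i_prime, n_prime, bucket_levels):
    """Deliver a request: returns (final address, number of forwards).
    bucket_levels maps each store-node address to its bucket level."""
    a = client_address(key, i_prime, n_prime)
    forwards = 0
    while True:
        a_next = server_address(key, a, bucket_levels[a])
        if a_next == a:
            return a, forwards
        a, forwards = a_next, forwards + 1
```

With the file states of Figure 6.2, a client holding i' = n' = 0 and inserting key 7 reaches store node 1 after one forward in case (a), and store node 3 after two forwards in case (b), consistent with the at-most-two-forwards guarantee.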
To illustrate, consider the example in Figure 6.2(a). The example includes
a client having 0 as value for both n' and i'. Suppose that the client wishes
to insert a new record with key value 7. The client address calculation returns 0 as the store node. The request is then sent to store node 0. This store node
Figure 6.2. Message exchanges in distributed linear hashing when performing insertion of
a new key.
performs the address calculation according to algorithm (A2). The first step of
the calculation returns 3 (as can easily be verified by computing 7 mod 4).
Note, however, that sending the request to store node 3 would result in an error
because there is no such store node. The check performed by the other steps of
the algorithm prevents such a situation by generating the address of store node
1 (by applying function h_{j-1}). The request is then forwarded to store node 1.
Store node 1 again performs the calculation. The calculation returns 1 and the
record can therefore be inserted at store node 1.
To illustrate a situation where two forwards are performed, consider the
example in Figure 6.2(b) where four store nodes are allocated and each store
node has a local level equal to 2. As in the above case, the request is forwarded
from store node 0 to store node 1. Store node 1 performs the address calculation
which returns 3. The request is then forwarded again to store node 3 where
the key is finally stored.
Whenever an overflow occurs at one store node, a split operation must be
performed. As for linear hashing, the store node to split is not necessarily the
one where the overflow occurs. To determine the store node to split the values
of n and i must be known. One of the proposed approaches to splitting [Litwin et al., 1993a] is based on maintaining such information at a fixed store node
called the split coordinator. Whenever an overflow occurs at a store node, the node notifies the coordinator, which then starts the splitting of the proper node
and calculates the new values for n and i, as follows:

    n ← n + 1;
    if n ≥ 2^i then n ← 0, i ← i + 1.
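As a minimal sketch of the coordinator's bookkeeping, assuming N = 1 initial bucket as in the examples above:

```python
def advance_split_pointer(n, i, N=1):
    # After each split: advance the split pointer; once every bucket at
    # level i has been split, reset it and move to the next file level.
    n += 1
    if n >= N * 2 ** i:
        n, i = 0, i + 1
    return n, i
```

For instance, starting from (n, i) = (0, 0), successive splits yield (0, 1), then (1, 1), then (0, 2).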
Retrieval in LH* is extremely efficient. It takes a minimum of two messages (one for sending the request and the other for receiving the reply) and a maximum of four. The worst case, with a cost of four messages, arises when two
forward messages are required. Extensive simulation experiments have shown,
however, that the average performance is very close to the optimal performance.
Other indexing techniques have also been proposed, as variations of the same principles of LH*, to support order-preserving indexing [Litwin et al., 1994] and multi-attribute indexing [Litwin and Neimat, 1996].
6.2 Indexing issues in mobile computing
Cellular communications, wireless LAN, radio links, and satellite services are
rapidly expanding technologies. Such technologies will make it possible for
mobile users to access information independently of their actual locations.
Mobile computing refers to this new emerging technology extending computer
networks to deal with mobile hosts, which retain their network connections even while moving. This kind of computation is expected to be very useful for
mail-enabled applications, by which, using personal communicators, users will
be able to receive and send electronic mail from any location, as well as be
alerted about certain predefined conditions (such as a train being late or traffic
conditions on a given route), irrespective of time and location [Imielinski and
Badrinath, 1994].
The typical architecture of a mobile network (see Figure 6.3) consists of two
distinct sets of entities: mobile hosts (MHs) and fixed hosts (FHs). Some of the
fixed hosts, called Mobile Support Stations (MSSs) are equipped with a wireless
interface. By using such a wireless interface, a MSS is able to communicate with MHs residing in the same cell. A cell is the area in which the signal sent by a MSS can be received by MHs. The diameter of a cell, as well as the
available bandwidth, may vary according to the specific wireless technology.
For example, the diameter of a cell ranges from a few meters for infrared technology to 1 or 2 miles for radio or satellite networks. With respect to the bandwidth,
LANs using infrared technology have transfer rates of the order of 1-2 Mb/sec,
whereas WANs have poorer performance [Lee, 1989, Salomone, 1995].
The message sent by a MSS is broadcasted within a cell. The MHs filter
the messages according to their destination address. On the other hand, MHs
Figure 6.3. Reference architecture of a mobile network.
located in the same cell can communicate only by sending messages to the MSS
associated with that cell. MSSs are connected to other FHs through a fixed
network, used to support communication among cells. The fixed network is
static, whereas the wireless network is mobile, since MHs may change their position (and therefore the cell in which they reside) over time.
MSSs provide commonly used application software, so that a mobile user
can download the software from the closest MSS and run it on the palmtop or
execute it remotely on the MSS. Each MH is associated with a specific MSS,
called Home MSS. A Home MSS for a MH maintains specific information about
the MH itself, such as the user profile, logic files, access rights, and user private
files. The association between a MH and a MSS is replicated through the
network. Additionally, a user may register as a visitor under some other MSSs.
Thus, a MSS is responsible for keeping track of the addresses of users who are
currently residing in the cell supervised by the MSS itself.
MHs can be classified into dumb terminals and walkstations [Imielinski and Badrinath, 1994]. In the first case, they are diskless hosts (such as palmtops) with reduced memory and computing capabilities. Walkstations are
comparable to classical workstations, and can both receive and send messages
on the wireless network. In any case, MHs are not usually connected to any direct power source; they run on small batteries and communicate on narrow-bandwidth wireless channels.
The communication channel between a MSS and MHs consists of a downlink, by which information flows from the MSS to MHs, and an uplink, by which information flows from MHs to the MSS. In general, information can be acquired by a MH under two different modes:
• Interactive/On-demand: The client requests a piece of data on the uplink
channel and the MSS responds by sending these data to the client on the
downlink channel.
• Data broadcasting: Periodic broadcasting of data is performed by the MSS on the downlink channel. This type of communication is unidirectional. The MHs do not send any specific data requests to the MSS. Rather, they filter data coming from the downlink channel, according to user-specified filters.
In general, combined solutions are used. Typically, the most frequently demanded items are periodically broadcasted, creating a sort of storage on the air [Imielinski et al., 1994a]. The main advantage of data broadcasting is that it scales well as the number of MHs grows, since its cost is independent of the number of MHs. The on-demand mode should be used for data items that are seldom required.
The main problem of broadcasting is related to energy consumption. Indeed, MHs are in general powered by a battery. The lifetime of a battery is very short and is expected to increase only 20% over the next 10 years [Sheng et al., 1992]. When a MH is listening to the channel, the CPU must be in active mode to examine data packets. This operation is very expensive from an energy point of view, because often only a few data packets are of interest for a particular MH.
It is therefore important for the MH to run under two different modes:
• Doze mode: The MH is not disconnected from the network but it is not
active.

• Active mode: The MH performs its usual activities; when the MH is listening
to the channel, it should be in active mode.
Clearly, an important issue is how to switch from doze mode to active mode in a clever way, so that energy dissipation is reduced without incurring a loss of information. Indeed, if a MH is in doze mode when the information of interest is being broadcasted, such information is lost by the MH.
Figure 6.4. MH and MSS interaction.
Approaches to reduce energy dissipation are therefore important for several
reasons. First of all, they make it possible to use smaller and less powerful
batteries to run the same applications for the same time. Moreover, the same
batteries can also run for a longer time, resulting in a monetary saving. In order to develop such efficient solutions, allowing MHs to switch in a timely fashion from doze mode to active mode and vice versa, indexing approaches have been proposed.
In the next subsection, the general issues related to the development of an index structure for data broadcasting are described, whereas Subsection 6.2.2
illustrates some specific indexing data structures. The discussion follows the
approaches presented in [Imielinski et al., 1994a].
6.2.1 A general index structure for broadcasted data
We assume that, without losing the generality of the discussion, broadcasted
data consist of a number of records identified by a key. Each MSS periodically
broadcasts the file containing such data, on the downlink channel (also called
broadcast channel). Clients receive the broadcasted data and filter them. Fil-
tering is performed by a simple pattern matching operation against the key
value. Thus, clients remain in doze mode most of the time and tune in periodi-
cally to the broadcast channel, to download the required data (see Figure 6.4).
To provide selective tuning, the server must broadcast, together with data, also
a directory that indicates the points in time on the broadcast channel at which particular records are broadcasted. The first issue to address is how MHs access
the directory. Two solutions are possible:
1. MHs cache a copy of the directory.
This solution has several disadvantages. First of all, when MHs change the
cell where they reside, the cached directory may no longer be valid
and the cache must be refreshed. This problem, together with the fact
that broadcasted data can change between successive broadcasts, with a
consequent change of the directory, may generate excessive traffic between
clients and the server. Moreover, if many different files are broadcasted on
different channels, the storage occupancy at clients may become too high,
and storage in MHs is usually a scarce resource.
Figure 6.5. A general organization for broadcasted data.
2. The directory is broadcasted in the form of an index on the broadcast channel.
This solution has several advantages. When no index is used, the client,
in order to filter the required data records, has to tune into the channel
for, on average, half the time it takes to broadcast the file. This is not
acceptable, because the MH, in order to tune into the channel, must be
in active mode, thus consuming scarce battery resources. Broadcasting the
directory together with the data allows the MH to selectively tune into the
channel, becoming active only when data of interest are being broadcasted.
For the above reasons, broadcasting the directory together with the data
is the preferred solution. It is usually assumed that only one channel exists.
Multiple channels can always be regarded as a single channel with capacity equivalent to the combined capacity of the corresponding channels.
Figure 6.5 shows a general organization for broadcasted data (including the directory). Each broadcasted version of the file, together with all the interleaved index information, is called a bcast. A bcast consists of a certain number of buckets, each representing the smallest unit that can be read by a MH (thus, a bucket is equivalent to the notion of a block in disk organizations). Pointers to specific buckets are specified as an offset from the bucket containing the pointer to the bucket to which the pointer points. The time to get the data pointed to by an offset s is given by (s - 1) x T, where T is the time to broadcast a bucket.
Figure 6.6 shows the general protocol for retrieving broadcasted data:
1. The MH tunes into the channel and looks for the offset pointing to the
next index bucket. During this operation, the MH must be in active mode.
A common assumption is that each bucket contains the offset to the next
index bucket. Thus, this step requires only one bucket access. Let n be the
determined offset.
EMERGING APPLICATIONS 199
Figure 6.6. The general protocol for retrieving broadcasted data.
2. The MH switches to doze mode until time (n - 1) x T. At that time, the
MH tunes into the channel (thus, it is again in active mode) and, following a
chain of pointers, determines the offset m, corresponding to the first bucket
containing data of interest (with respect to the considered key value).
3. The MH switches to doze mode until time (m - 1) x T. At that time, the
MH tunes into the channel (thus, it is again in active mode) and retrieves
data of interest.
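The timing arithmetic of the three steps above can be made concrete with a small simulation. The bucket layout, field names and the value of T below are illustrative assumptions, not taken from any actual system.

```python
T = 1.0  # time (arbitrary units) to broadcast one bucket

def wait_for(offset, T=T):
    # Doze time before the bucket at relative offset s is broadcast: (s - 1) x T.
    return (offset - 1) * T

# The channel is modeled as a list of buckets at positions 0..n-1.  Every
# bucket stores the relative offset of the next index bucket; an index bucket
# additionally maps key values to relative offsets of matching data buckets.
channel = [
    {"next_index": 3},                              # pos 0: data bucket
    {"next_index": 2},                              # pos 1: data bucket
    {"next_index": 1},                              # pos 2: data bucket
    {"next_index": 6, "index": {"a": 1, "b": 2}},   # pos 3: index bucket
    {"data": ("a", "record-1")},                    # pos 4: data for key "a"
    {"data": ("b", "record-2")},                    # pos 5: data for key "b"
]

def retrieve(key, start_pos):
    tuned, clock = 0, 0.0
    # Step 1: read the current bucket to learn where the next index bucket is.
    n = channel[start_pos]["next_index"]; tuned += 1
    # Step 2: doze until the index bucket, then read it.
    clock += wait_for(n); pos = start_pos + n
    m = channel[pos]["index"][key]; tuned += 1
    # Step 3: doze until the data bucket, then read it.
    clock += wait_for(m); pos += m
    tuned += 1
    return channel[pos]["data"], tuned, clock

record, tuned, waited = retrieve("b", start_pos=0)
```

Here `tuned` counts the buckets actually read (the tuning time, in bucket units), while `waited` accumulates the (s - 1) x T doze intervals.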
In general, no new indexing structures are required to implement the pre-
vious protocol. Rather, existing data structures can be extended to efficiently
support the new data organization. The main issues are therefore how to
define efficient data organizations, that is, how data and index buckets
must be interleaved and which parameters to use in order to compare
different data organizations. The considered parameters are the following:
• Access time: It is the average duration from the instant in which a client
wants to access records with a specific key value to the instant when all
required records are downloaded by the client. The access time is based on
the following two parameters:
Probe time: The duration from the instant in which a client wants to
access records with a specific key value to the instant when the nearest
index information related to the relevant data is obtained by the client.
Bcast wait: The duration from the point the index information related
to the relevant data is encountered to the point when all required records
are downloaded.
Note that if one parameter is reduced, the other increases.
• Tuning time: It is the time spent by a client listening to the channel. Thus
it measures the time during which the client is in active mode and therefore
determines the power consumed by the client to retrieve the relevant data.
The use of a directory reduces the tuning time, increasing at the same time
the access time. It is therefore important to determine a good bucket interleaving
in order to obtain a good trade-off between access time (thus reducing the time
the client has to wait for relevant data) and tuning time (thus reducing battery
consumption).
With respect to disk organization, the tuning time corresponds to the access
time, in terms of block accesses. However, the tuning time is fixed for each
bucket, whereas the disk access time depends on the position of the head. There
is no disk parameter corresponding to the access time. Finally, we recall that
other indexing techniques, based on hash functions, have also been proposed
[Imielinski et al., 1994b]. However, in the remainder of this chapter we do not
consider such techniques.
6.2.2 Specific solutions to indexing broadcasted data
With respect to the general data organization proposed in Subsection 6.2.1,
several specific indexing approaches have been proposed. In the following,
we survey some of these approaches [Imielinski et al., 1994a, Imielinski et al.,
1994b].
With respect to how parameters are chosen, index organizations can be
classified into configurable indexes and non-configurable indexes. In the latter
case, parameter values are fixed. In the former case, the organizations are
parameterized: by changing the parameter values, the trade-off between the
costs changes. This allows the same organization to be used to satisfy different
user requirements.
Index organizations can also be classified into clustered and non-clustered or-
ganizations. In the first case, all records with the same value for the key
attribute are stored consecutively in the file. Non-clustered organizations are
often obtained from clustered organizations, by decomposing the file into clus-
tered subcomponents. For this reason, in the following, we do not consider
organizations for non-clustered files.
Non-configurable indexing. Non-configurable index organizations can be
classified according to their behavior with respect to access and tuning time.
An optimal strategy with respect to the access time can be simply obtained
by not broadcasting the directory. On the other hand, an optimal strategy
with respect to the tuning time is obtained by broadcasting the complete index
at the beginning of the bcast. Since in practice both access and tuning time
are of interest, the above algorithms have only theoretical significance. Several
intermediate solutions have therefore been devised.

Figure 6.7. Bcast organization in the (1, m) indexing method.

Figure 6.8. Bcast organization in the distributed indexing method.
The (1, m) indexing [Imielinski et al., 1994a] is an index allocation method
in which the complete index is broadcasted m times during a bcast (see Fig-
ure 6.7). All buckets have an offset to the beginning of the next index segment.
The first bucket of each index segment has a tuple containing, in the first field,
the attribute value of the last record broadcasted before that segment and, in
the second field, an offset pointing to the beginning of the next bcast.
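As a rough sketch, the layout of a (1, m) bcast can be generated by interleaving m copies of the complete index with m equal data segments; the bucket contents below are placeholder strings, not a real index format.

```python
# Hypothetical sketch of a (1, m) bcast layout: the full index is broadcast
# m times, once before each of m equal slices of the data file.

def one_m_bcast(data_buckets, index_buckets, m):
    """Interleave the complete index m times with m data segments."""
    assert len(data_buckets) % m == 0, "assume the data divides evenly"
    seg = len(data_buckets) // m
    bcast = []
    for i in range(m):
        bcast.extend(index_buckets)                     # the full index, again
        bcast.extend(data_buckets[i * seg:(i + 1) * seg])
    return bcast

data = [f"d{i}" for i in range(6)]
idx = ["i0", "i1"]
layout = one_m_bcast(data, idx, m=3)
# layout: ['i0','i1','d0','d1','i0','i1','d2','d3','i0','i1','d4','d5']
```

The replication visible in `layout` is exactly what the distributed index described next tries to avoid.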
The main problem of the (1, m) index organization is related to the repli-
cation of the index buckets. The distributed indexing [Imielinski et al., 1994a]
is a technique in which the index is partially replicated (see Figure 6.8). In-
deed, there is no need to replicate the complete index between successive data
blocks. Rather, it is sufficient to make available only the portion of the index re-
lated to the data buckets which follow it. Thus, the distributed index, with
respect to the (1, m) index, interleaves data buckets with relevant index buckets
only. Several distributed indices can be defined by changing the degree of
replication [Imielinski et al., 1994a].
The distributed index guarantees performance comparable to that of the
optimal algorithms, with respect to both the access time and the tuning time.
Figure 6.9. Bcast organization in the flexible indexing method.
The (1, m) index has a good tuning time. However, due to the index replication,
the access time is high.
Configurable indexing. Configurable index organizations are parameter-
ized in such a way that, depending on the values of the parameters, the ratio
between the access and tuning time can be modified. The first configurable in-
dex that has been proposed is called flexible indexing [Imielinski et al., 1994b].
In such an organization, data records are assumed to be sorted in ascending (or
descending) order and the data file is divided into p data segments. It is as-
sumed that each bucket contains the offset to the beginning of the next data
segment. Depending on the chosen value for p, the trade-off between access
time and tuning time changes. The first bucket of each data segment contains
a control part, consisting of the control index, as well as some data records
(see Figure 6.9). The control index is a binary index which helps locate data
buckets containing records with a given key value.
Each index entry is a pair, consisting of a key value and an offset to a data
bucket. The control index is divided in two parts, the binary control index and
the local index. The binary control index supports searches for keys preceding
the ones stored in the current data segment and in the following ones. It
contains flog2 il tuples, where i is the number of data segments following the
one under consideration. The first tuple of the binary control index consists of
the key of the first data record in the current data bucket and an offset to the
beginning of the next bcast. The k-th of the following tuples consists of the key
of the first data record of the (⌊i/2^(k-1)⌋ + 1)-th data segment, followed by the
offset to the first data bucket of that data segment.
The local index supports searches inside the data segment in which it is
contained. It consists of m tuples, where m is a parameter which depends on
several factors, including the number of tuples a bucket can hold. The local
index partitions the data segment into m + 1 subsegments. Each tuple contains
the key of the first data record of a subsegment and the offset to the first data
bucket of that subsegment.
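The shape of the binary control index can be illustrated with a small helper, under the assumption (made here purely for illustration) that the k-th tuple points to the (⌊i/2^(k-1)⌋ + 1)-th data segment; treat both the rule and the resulting numbers as a sketch rather than a definitive reading of the method.

```python
import math

def binary_control_targets(i):
    """For i following data segments, return the assumed target segment of
    each of the ceil(log2(i)) tuples, for k = 1, 2, ...  The halving of the
    distance at each step is what enables a binary-search-like descent."""
    if i <= 1:
        return []
    n_tuples = math.ceil(math.log2(i))
    # k-th tuple -> (floor(i / 2**(k-1)) + 1)-th data segment (an assumption)
    return [i // 2 ** (k - 1) + 1 for k in range(1, n_tuples + 1)]

targets = binary_control_targets(8)   # 3 tuples, progressively closer targets
```

For i = 8 this yields three tuples whose targets halve in distance, which is what keeps the tuning time logarithmic in the number of segments.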
The access protocol is the following:
1. First, the offset of the next data segment is retrieved and the MH switches
to doze mode.
2. The MH tunes in again at the beginning of the designated next data segment
and performs the following steps:
• If the search key k is lower than the value contained in the first field
of the first tuple of the binary control index, the MH switches to doze
mode, waiting for the offset specified by the tuple, and again executes
step (2).
• If the previous condition is not satisfied, the MH scans the other tuples
of the binary control index, from top to bottom, until it reaches a tuple
whose key value is lower than k. If such tuple is reached, the MH
switches to doze mode, waiting for the offset specified by the tuple, and
again executes step (2).
• If the previous condition is not satisfied, the MH scans the local index, to
determine whether records with key value k are contained in the current
data segment. If this search succeeds, the offset is used to determine
the bucket in the current data subsegment from which the
retrieval of the data records starts. The retrieval terminates when the
last bucket of the searched subsegment is reached.
6.3 Indexing techniques for data warehousing systems
Recent years have witnessed an increasing interest in database systems able
to support efficient on-line analytical processing (OLAP). OLAP is a crucial
element of decision support systems, in that essential decisions are often taken on
the basis of information extracted from very large amounts of data. In most cases,
such data are stored in different, possibly heterogeneous, databases. Examples
of typical queries are [Chaudhuri and Dayal, 1996]:
• What are the sales volumes by region and product category for the last
year?
• How did the share price of computer manufacturers correlate with quarterly
profits over the past 10 years?
Because the requirements of OLAP applications are quite different from those of
traditional, transaction-oriented applications, specialized systems, known as
data warehousing systems, have been developed to effectively support these
applications. A data warehouse is a large, special-purpose database containing
data integrated from a number of independent sources and supporting users in
analyzing the data for patterns and anomalies [O'Neil and Quass, 1997]. Unlike
in traditional database systems, historical data, and not only current
data values, must be stored in a data warehouse. Moreover, data are updated
off-line and therefore no transactional issues are relevant here. By contrast,
typical OLAP queries are rather complex, often involving several joins and
aggregation operations. OLAP queries are in most cases "ad-hoc" queries, as
opposed to the repetitive transactions typical of traditional applications. It is
therefore important to develop sophisticated, complex indexing techniques to
provide adequate performance, also exploiting the fact that the update cost of
indexing structures is not a crucial problem.
A possible approach to efficiently process OLAP queries is to use material-
ization techniques to precompute queries. The main inconvenience of this ap-
proach is that precomputing all possible queries along all possible dimensions
is not feasible, especially if there is a very large number of dynamically vary-
ing selection predicates. Therefore, even though more frequent queries may be
precalculated, techniques are required to efficiently execute non-precalculated
queries.
In the remainder of this section, we first briefly review logical data organi-
zations in data warehousing systems and exemplify typical OLAP queries. We
then discuss a number of techniques supporting efficient query execution for
data warehousing systems. Some of those techniques, namely the join index and
the domain index, had initially been developed for traditional DBMSs. They
have, however, recently found a relevant application scope in data warehousing
systems. Other techniques, namely bitmap and projection indexes, have been
specifically developed for data warehousing systems. Some of them have been
incorporated in commercial systems [Edelstein, 1995, French, 1995]. Another
relevant technique which we do not discuss here is the bit-sliced index, whose
aim is the efficient computation of aggregate functions. We refer the reader
to [O'Neil and Quass, 1997] for a description of this technique.
6.3.1 Logical data organization
In a data warehouse, data are often organized according to a star schema
approach. Under this approach, for each group of related data there exists
a central fact table, also called a detail table, and several dimension tables.
fact table is usually very large, whereas each dimension table is usually smaller.
Every tuple (fact) in the fact table references a tuple in each of the dimension
tables, and may have additional attributes. References from the fact table to
the dimension tables are modeled through the usual mechanism of foreign
keys. Therefore, each tuple in the fact table is related to one tuple from each
of the dimension tables. Vice versa, each tuple from a dimension table may
be related to more than one tuple in the fact table. Dimension tables may, in
turn, be organized into several levels. A data warehouse may contain additional
summary tables containing pre-computed aggregate information.
As an example, consider a (classical) example of data concerning product
sales [O'Neil and Quass, 1997]. Such data are organized around a central
fact table, called Sales, and the following dimension tables: Time, contain-
ing information about the dates of the sales; Product, containing informa-
tion on the products sold; and finally, Customer, containing information about
the customers involved in the sales. The schema is graphically represented
in Figure 6.10. Alternative schema organization approaches exist, including
the snowflake schema and the fact constellation schema [Chaudhuri and Dayal,
1996]. The following discussion is, however, quite independent of the specific
schema approach adopted.
Many typical OLAP queries are based on placing restrictions on the dimen-
sion tables that result in restrictions on the tuples of the fact table. As an
example, consider the query asking for all sales of products, with price higher
than $50,000, to customers residing in California during July 1996. This
type of query is often referred to as a star-join query because it involves the join
of the same central fact table with several dimension tables. Another important
characteristic of OLAP queries is that aggregates must often be computed on
the results of a star-join query and aggregate functions may also be involved
in selecting relevant groups of tuples. An example of a query including aggre-
gate calculation is the query asking for the total dollar sales that were made
for a brand of products during the past 4 weeks to customers residing in New
England [O'Neil and Quass, 1997].
6.3.2 Join index and domain index
The join index technique [Valduriez, 1987] aims at optimizing relational joins
by precalculating them. This technique is optimal when the update frequency
is low.

Figure 6.10. An example of a star-schema database with a central fact table (SALES) and
several dimension tables.

Because in OLAP applications joins are very frequent and the update
frequency is low, the join index technique can be profitably used here.
There are several variations of join index. The basic one is the binary join
index which is formally defined as follows:
Given two tables R and S, and attributes A and B, respectively from R and S,
a binary equijoin index is

BJI = {(ri, sk) | ri.A = sk.B}

where ri (sk) denotes the row identifier (RID) of a tuple of R (S), and ri.A
(sk.B) denotes the value of attribute A (B) of the tuple whose RID is ri (sk).
Note that comparison operators other than equality can be used in a join
index. However, because most joins in OLAP queries are equijoins on
foreign keys, we restrict our discussion to the binary join index. Moreover, in
some variants of the join index technique, the primary key values of tuples in
one table can be used instead of the RIDs of these tuples.
A BJI can be implemented as a binary relation, and two copies may be kept,
one clustered on RIDs of R and the other clustered on RIDs of S. A BJI
may also include the actual values of the join columns, thus resulting in a set
of triples {(ri.A, ri, sk) | ri.A = sk.B}. This alternative is useful when, given a
value of the join column, the tuples from R and from S that join on that value
must be determined.
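A binary equijoin index can be sketched as a plain precomputed list of RID pairs; the tables, RIDs and attribute names below are invented for illustration, and a real system would of course store and cluster the pairs on disk rather than recompute them.

```python
# A sketch of a binary equijoin index (BJI) over two toy tables; each table
# maps a RID to a tuple represented as a dict of attribute values.

R = {"r1": {"A": 10}, "r2": {"A": 20}, "r3": {"A": 10}}
S = {"s1": {"B": 10}, "s2": {"B": 30}}

def binary_join_index(R, S, a, b):
    """Precompute all RID pairs (ri, sk) such that ri.a = sk.b."""
    return sorted((ri, sk)
                  for ri, rt in R.items()
                  for sk, st in S.items()
                  if rt[a] == st[b])

bji = binary_join_index(R, S, "A", "B")
# bji == [('r1', 's1'), ('r3', 's1')]: answering the equijoin later requires
# no comparison of attribute values at all, only a scan of the index.
```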
Join indexes are particularly suited to relate a tuple from a given dimen-
sion table to all the tuples in the fact table. For example, suppose that a
join index is allocated on relations Sales and Customer for the join predicate
Customer.customer_id = Sales.customer_id. Such a join index would list, for each
tuple of relation Customer (that is, for each customer), the RIDs of the tuples of
Sales verifying the join predicate (that is, the sales of that customer). Join
indexes may also be extended to support precomputed joins along several di-
mensions [Chaudhuri and Dayal, 1996].
Another relevant generalization of the join index notion is represented by
the domain index. A domain index is defined on a domain (for example, the
zip code) and it may index tuples from several tables. It associates with a value
of the domain the RIDs of the tuples, from all the indexed tables, having this
value in the indexed column. Therefore, a domain index may support equality
joins among any number of tables in the set of indexed tables.
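A domain index can be sketched as a map from each domain value to per-table RID lists; the table names, RIDs and zip codes below are invented for illustration.

```python
from collections import defaultdict

# A sketch of a domain index on a shared domain (zip codes): for every value
# it records, per indexed table, the RIDs of tuples holding that value.

def build_domain_index(tables, column):
    """tables: {table_name: {rid: row_dict}}; returns value -> {table: [rids]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, rows in tables.items():
        for rid, row in rows.items():
            index[row[column]][name].append(rid)
    return index

tables = {
    "Customer": {"c1": {"zip": "90210"}, "c2": {"zip": "10001"}},
    "Store":    {"t1": {"zip": "90210"}},
}
dix = build_domain_index(tables, "zip")
# dix["90210"] holds {"Customer": ["c1"], "Store": ["t1"]}: an equality join
# on zip between any pair of indexed tables can be read off the index alone.
```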
6.3.3 Bitmap index
In a traditional index, each key value is associated with the list of RIDs of
the tuples having this value for the indexed column. RID lists can be quite long.
Moreover, when using multiple indexes for the same table, intersection, union or
complement operations must be performed on such lists. Therefore, alternative,
more efficient implementations of RID lists are relevant.
The notion of bitmap index has been proposed as an efficient implementation
of RID lists. Basically, the idea is to represent the list of RIDs associated with
a key value through a vector of bits. Such a vector, usually referred to as a bitmap,
has a number of elements equal to the number of tuples in the indexed table.
Each tuple in the indexed table is assigned a distinct, unique bit position in
the bitmap; this position is called the ordinal number of the tuple in the relation.
Different tuples have different bit positions, that is, different ordinal numbers.
The ith element of the bitmap associated with a key value is equal to 1 if the
tuple, whose ordinal number is i, has this value for the indexed column; it is
equal to 0 otherwise. Figure 6.11 presents an example of a bitmap index entry
for an index allocated on the column package_type of relation Product. Because
the Product relation has 150 tuples, the bitmap consists of 150 bits. Consider
the entry related to key value equal to A; the bitmap contains 1 in position 1
to denote that the tuple, whose ordinal number is 001, has such value for the
indexed column. By contrast, the bitmap contains 0 in position 2 to denote
that the tuple, whose ordinal number is 002, does not have such value for the
indexed column.
Figure 6.11. An example of a bitmap index entry.

The bitmap representation is very efficient when the number of key values
in the indexed column is low (as an example, consider a column sex of a table
Person having only two values: Female and Male) [O'Neil and Quass, 1997]. In
such a case, the number of 0's in each bitmap is not high. By contrast, when the
number of values in the indexed column is very high, the number of 1's in each
bitmap is quite low, thus resulting in sparsely populated bitmaps. Compression
techniques must then be used. The main advantage of bitmaps is that they
result in significant improvements in processing time, because operations such
as intersection, union and complement of RID lists can be performed very
efficiently by using bit arithmetic. Operations required to compute aggregate
functions, typically counting the number of RIDs in a list, are also performed
very efficiently on bitmaps. Another important advantage of bitmaps is that
they are suitable for parallel implementation [O'Neil and Quass, 1997].
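The bit-arithmetic advantage can be illustrated with Python integers standing in for bitmaps; the column values are invented, and a real implementation would use fixed-size, possibly compressed bit vectors rather than arbitrary-precision integers.

```python
# A sketch of a bitmap index: one integer bitmap per key value, with bit i-1
# set when the tuple with ordinal number i has that key value.

def bitmap_index(values):
    """values[i] is the indexed column of the tuple with ordinal number i+1."""
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

package = bitmap_index(["A", "B", "A", "A", "C", "C"])
size    = bitmap_index([30, 30, 20, 30, 20, 30])

# Tuples with package_type = 'A' AND size = 30: a single bitwise AND replaces
# an intersection of two RID lists.
hits = package["A"] & size[30]
matching = [i + 1 for i in range(6) if hits >> i & 1]   # ordinal numbers
count_a = bin(package["A"]).count("1")                  # COUNT via popcount
```

Here `matching` recovers the ordinal numbers [1, 4], and `count_a` shows how a counting aggregate reduces to a population count on the bitmap.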
Note that the bitmap representation can be combined with the join index
technique, thus resulting in a bitmap join index [O'Neil and Graefe, 1995]. An
entry in a bitmap join index, allocated on a fact table and a dimension table,
will associate the RID of a tuple t from the dimension table with the bitmap of
the tuples in the fact table that join with t.

Figure 6.12. An example of a bitmap join index entry.
6.3.4 Projection index
A projection index is an access structure whose aim is to reduce the cost of
projections. The basic idea of this technique is as follows. Consider a column
C of a table T. A projection index on C consists of a vector having a number of
elements equal to the cardinality of T. The ith element of the vector contains
the value of C for the ith tuple of T. This technique is thus based, as is
the bitmap representation, on assigning ordinal numbers to tuples in tables.

Figure 6.13. An example of a projection index.

Determining the value of column C for a tuple, given the ordinal number of
this tuple, is very efficient. It only requires accessing the ith entry of the
vector. When the key values have a fixed length, the secondary storage page
containing the relevant vector entry is determined by a simple offset calculation.
This calculation is a function of the number of entries of the vector that can be
stored per page and of the ordinal number of the tuple. When the key values have
varying lengths, alternative approaches are possible. A maximum length can
be fixed for the key values. Alternatively, a B-tree can be used, having as key
values the ordinal numbers of tuples and associating with each ordinal number
the corresponding value of column C. Figure 6.13 presents an example of a
projection index.
Projection indexes are very useful when very few columns of the fact table
must be returned by the query and the tuples of the fact table are very large or
not well clustered. For typical OLAP queries, projection indexes are typically
best used in combination with bitmap join indexes. Recall that a typical query
restricts the tuples in the fact table through selections on the dimension tables.
The ordinal numbers of the fact tuples satisfying the restrictions on the dimension
tables are retrieved from the bitmap join indexes. By using these ordinal num-
bers, projection indexes can then be accessed to perform the actual projection.
Note that the actual tuples of the fact table need not be accessed at all.
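This combination can be sketched as follows; the bitmap, ordinal numbers and unit_sales values are all invented, with the integer bitmap standing in for one entry of a hypothetical bitmap join index.

```python
# Answering a query from a bitmap join index entry plus a projection index,
# without ever touching the fact-table tuples themselves.

# Projection index on SALES.unit_sales: entry i-1 holds the value of the
# fact tuple with ordinal number i.
unit_sales = [50, 20, 30, 70, 50, 50, 70, 20]

# Bitmap from a bitmap join index entry: bit i-1 set for every fact tuple
# (ordinal number i) that joins with the restricted dimension tuple.
bitmap = 0b01000101          # fact tuples 1, 3 and 7

ordinals = [i + 1 for i in range(len(unit_sales)) if bitmap >> i & 1]
projected = [unit_sales[i - 1] for i in ordinals]
total = sum(projected)       # an aggregate computed straight off the index
```

Here the restriction yields ordinal numbers [1, 3, 7], the projection index supplies their unit_sales values [50, 30, 70], and the aggregate (150) is computed without a single fact-table access.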
6.4 Indexing techniques for the Web
In the past five years, the World Wide Web has completely reshaped the
world of communication, computing and information exchange. By introduc-
ing graphical user interfaces and an intuitively simple concept of navigation,
the Web facilitated access to the Internet, which for about ten years was re-
stricted to a few universities and research laboratories. The appearance of advanced
navigation tools like Netscape and Microsoft Explorer made it easy for everyone
on the Internet to roam, browse and contribute to the Web information space.
With the rapid explosion of the amount of data available through the Inter-
net, locating and retrieving relevant information becomes more difficult. To
facilitate retrieval of information, many Internet providers (for example, stock
markets, private companies, universities) offer users the possibility of using so-
called search engines, which facilitate the search process. Search engines offer a
simple interface for query formulation and refinement, and a wide range of
search options and result reporting.
Moreover, with the growth of data on the Web, a number of special services
have appeared on the Internet whose major goal is searching through many differ-
ent information sources. Even the raw information they return to users becomes
the starting point for retrieval of relevant information (for example, e-mail ad-
dresses, phone numbers, Frequently Asked Questions files). Popular general-
purpose searching tools, such as Altavista (http://www.altavista.com/),
Webcrawler (http://www.webcrawler.com/), InfoSeek (http://www.infoseek.com/),
and Excite (http://www.excite.com/) have become indispensable in the toolkit
of everybody working with Internet information sources.
Internet technology poses specific requirements on such tools, both in
terms of time and space. Some indexing techniques used in standard text
databases were adapted to meet those requirements. Also, several new ap-
proaches were developed to overcome some limitations of standard techniques.
In the remainder of this section we present a short overview and classification
of indexing methods used in some Internet information systems such as WAIS,
Gopher, Archie, which became popular in the late 80s and early 90s. Then we
discuss some problems related to search engines on the Web. We conclude the
section with a brief overview of the main ideas underlying the Internet spiders
which combine indexing and navigation techniques on the Web.
6.4.1 WAIS, Gopher, Archie, Whois++
The importance of searching the information available through the Internet
was realized by the Internet community from the very first years. Searching
and retrieval tools were growing in both quantity and quality together with
the growth of the Internet itself. Such popular tools as Archie, Gopher, Whois,
WAIS [Bowman et al., 1994, Cheong, 1996] represented a good starting point for
a new generation of the Internet searching tools. Archie is a tool which searches
for relevant information in a distributed collection of FTP sites.2 Gopher is
a distributed information system which makes available hierarchical campus-
wide data collections and provides a simple text search interface. Whois (and
its advanced version Whois++) is a popular tool to query Internet sources
about people and other entities (for example, domains, networks, and hosts).
WAIS (Wide Area Information Server) is a distributed service with a simple
natural-language interface for looking up information in Internet databases.
Indexing techniques used in those tools are quite different. In particular, the
various tools can be classified into three groups [Bowman et al., 1994], depending
on the amount of information which is included in the indexes. The first group
includes tools which have very space-efficient indexes, but which only represent
the names of the files or menus they index. For example, Archie and Veronica index
the file and menu names of FTP and Gopher servers. Because these indexes
are very compact, a single index is able to support advanced forms of search.
Yet, the range of queries that can be supported by these systems is limited to
file names only, and content-based searches are possible only when the names
happen to reflect some of the contents.
The second group includes systems providing full-text indexing of data lo-
cated at individual sites. For example, a WAIS index records every keyword
in a set of documents located at a single site. Similar indexes are available for
individual Gopher and WWW servers.
The third group includes systems adopting solutions which are a compro-
mise between the approaches adopted by the systems in the other two groups.
Systems in the third group represent some of the contents of the objects they
index, based on selection procedures for including important keywords or ex-
cluding less important keywords. For example, Whois++ indexes templates
that are manually constructed by site administrators wishing to describe the
resources at their sites.
6.4.2 Search engines
The two main types of search against text files are based on sequential searching
and inverted indexes. The sequential search works well only when the search
is limited to a small area. Most pattern-based search tools like Unix's grep
use the sequential search. Inverted indexes (see Chapter 5 for an extensive
presentation) are a common tool in information retrieval systems [Frakes and
Baeza-Yates, 1992]. An inverted index stores in a table all word occurrences in
the set of indexed documents and indexes the table using a hash method
or a B-tree structure. Inverted indexes are very efficient with respect to query
evaluation but have a storage occupancy which, in the worst case, may be
equal to the size of the original text. To reduce the size of the table storing
the word occurrences, advanced inverted indexes use the trie indexing method
[Mehlhorn and Tsakalidis, 1990], which stores together words with common
EMERGING APPLICATIONS 213
initial characters (like "call" and "capture"). Moreover, various compression
methods can reduce the index size to 10%-30% of the text size (see
Chapter 5).
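As an illustration, the table of word occurrences can be sketched with a Python dict standing in for the hash or B-tree structure; the whitespace tokenization here is a deliberate simplification:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents
    containing it. `docs` is a dict {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "call the capture routine", 2: "index the call"}
index = build_inverted_index(docs)
print(index["call"])     # [1, 2]
print(index["capture"])  # [1]
```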
Another drawback of standard inverted indexes is that their basic data
structure requires the exact spelling of the words in the query. Any misspelling
(for example, when typing "Bhattacharya" or "Clemençon") would result in an
empty result set. To find the correct spelling, users must try different
possibilities by hand, which is frustrating and time consuming.
An example of a search engine that tolerates misspellings is Glimpse
[Manber and Wu, 1994]. Glimpse is based on the agrep search program [Wu
and Manber, 1992], which is similar in use to Unix's grep. Essentially,
Glimpse is a hybrid of the sequential-search and inverted-index techniques.
It is index-based, but it uses sequential search (the agrep program) for
approximate matching when the search area is small. To accommodate possible
misspellings, it allows a specified number of errors, which can be insertions,
deletions, or substitutions of characters in a word. It also supports
wild cards, regular expressions and Boolean queries like OR and AND. In most
cases, Glimpse requires a very small index, 2%-4% of the original text. How-
ever, the cost of the combination of indexing and sequential search is a longer
response time. For most queries, the search in Glimpse takes 3-15 seconds.
Such response times are unacceptable for classical database applications but
are quite tolerable in most personal applications, like navigating through the
Web.
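The error model used by agrep and Glimpse — a bounded number of insertions, deletions, or substitutions — can be sketched with the classical edit-distance recurrence. This is only an illustration of the matching criterion; agrep itself uses much faster bit-parallel algorithms:

```python
def edit_distance(a, b):
    # Standard dynamic-programming edit distance: insertions,
    # deletions and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_search(query, words, k):
    """Return the indexed words within k errors of the query."""
    return [w for w in words if edit_distance(query, w) <= k]

vocabulary = ["Bhattacharya", "Clemencon", "capture"]
print(approx_search("Bhatacharya", vocabulary, 1))  # ['Bhattacharya']
```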
Intensive development of different techniques for indexing Web documents
has resulted in the appearance of a number of advanced search engines. They
offer a wide range of features for query formulation and provide small index
sizes along with fast response times. However, building metasearchers which
provide unified query interfaces to multiple search engines is still a hard task.
This is because most search engines are largely incompatible. They provide
different query languages and use proprietary algorithms for ranking documents,
which makes it hard to merge data from different sources. Moreover, they do
not export enough information about a source's contents, which could support
better query evaluation. All these problems have led to the Stanford protocol
proposal for Internet retrieval and search (STARTS) [Gravano et al., 1997]. This
proposal is a group effort involving 11 companies and organizations. The proto-
col addresses and analyzes metasearch requirements and describes the facilities
that a source needs to provide in order to help a metasearcher. If implemented,
STARTS can significantly streamline the implementation of metasearchers, as
well as enhance the functionality they can offer.
214 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
6.4.3 Internet spiders
Users usually navigate through the Web to find information and resources by
following hypertext links. As the Web continues to grow, users may need to
traverse more and more links to locate what they are looking for. Indexing
tools like search engines only help when searching on a single site or predefined
set of sites. Therefore, a new family of programs, often called Web robots or
spiders, has been developed with the aim of providing more powerful search
facilities. Web spiders combine browsing and indexing [Cheong, 1996]. They
traverse the Web space by following hypertext links and retrieve and index new
Web documents. The most well-known Internet spiders are WWW Worm, Web
Crawler and Harvest.
The World Wide Web Worm (http://wwww.cs.colorado.com/wwww/) was
the first widely used Internet spider. It navigates through Web pages and
builds an index of titles and hypertext links of over 100,000 Web documents.
It provides users with a search interface. Similar to the systems in the first
group in our classification, the WWW Worm does not index the content of
documents.
WebCrawler (http://www.webcrawler.com/) is a resource discovery tool which
can speedily search for resources on the Web. It is able to build indexes
on Web documents and to navigate automatically on demand. WebCrawler
uses an incomplete breadth-first traversal to create an index (on both titles and
data content) and relies on an automatic navigation mechanism to find the rest
of the information.
The Harvest project [Bowman et al., 1995] addresses the problem of how
to make effective use of the Web information in the face of a rapid growth
in data volume, user base and data diversity. One of the Harvest goals is
to coordinate retrieval of information among a number of agents. Harvest
provides a very efficient means of gathering and distributing index information
and supports the construction of very different types of indexes customized
to each particular information collection. In addition, Harvest also provides
caching and replication support and uses Glimpse as a search engine.
6.5 Indexing techniques for constraint databases
The main idea of constraint languages is to state a set of relations (constraints)
among a set of objects in a given domain. It is the task of the constraint
satisfaction system (or constraint solver) to find a solution satisfying these
relations. An example of a constraint is F = 1.8C + 32, where C and F are,
respectively, the Celsius and Fahrenheit temperatures. The constraint defines
the relation between F and C. Constraints have been used for different purposes;
for example, they have been successfully integrated with logic programming
[Jaffar and Lassez, 1987]. The constraint programming paradigm is fully declar-
ative, since it specifies computations by specifying how these computations are
constrained. Moreover, it is very attractive as often constraints represent the
communication language of several high-level applications.
Although constraints have been used in several fields, only recently has this
paradigm been applied to databases. Traditionally, constraints have been used
to express conditions on the semantic correctness of data. Those constraints are
usually referred to as semantic integrity constraints. Integrity constraints have
no computational implications. Indeed, they are not used to execute queries
(even if they can be used to improve execution performance); they are only
used to check the validity of the database.
Constraints intended in a broader sense have lately been used in database
systems. Constraints can be added to relational database systems at different
levels [Kanellakis et al., 1995]. At the data level, they finitely represent
infinite relational tuples. Different logical theories can be used to model
different information. For example, the constraint X < 2 ∧ Y > 3, where X and Y
are integer variables, represents the infinite set of tuples having the X attribute
lower than 2 and the Y attribute greater than 3. A quantifier-free conjunction
of constraints is called a generalized tuple, and the possibly infinite set of
relational tuples it represents is called the extension of the generalized tuple. A
finite set of generalized tuples is called a generalized relation. Thus, a general-
ized relation represents a possibly infinite set of relational tuples, obtained as
the union of the extension of the generalized tuples contained in the relation.
A generalized database is a set of generalized relations. When constraints are
used to retrieve data, they restrict the search space of the computation,
increasing the expressive power of simple relational languages by allowing
arithmetic computations.
Constraints are a powerful mechanism for modeling spatial [Paredaens, 1995,
Paredaens et al., 1994] and temporal concepts [Kabanza et al., 1990, Koubarakis,
1994], where often infinite information should be represented. Consider for ex-
ample a spatial database consisting of a set of rectangles in the plane. A
possible representation of this database in the relational model is that of hav-
ing a relation R, containing a tuple of the form (n, a, b, c, d) for each rectangle.
In such a tuple, n is the name of the rectangle with corners (a, b), (a, d), (c, b)
and (c, d). In the generalized relational model, rectangles can be represented by
generalized tuples of the form (Z = n) ∧ (a ≤ X ≤ c) ∧ (b ≤ Y ≤ d), where X
and Y are real variables. The latter representation is more suitable for a larger
class of operations. Figure 6.14 shows the rectangles representing the extension
of the generalized tuples contained in a generalized relation r1 (white) and in a
generalized relation r2 (shaded). r1 contains the following generalized tuples:
Figure 6.14. Relations r1 (white) and r2 (shaded).
r1,1 : 1 ≤ X ≤ 4 ∧ 1 ≤ Y ≤ 2
r1,2 : 2 ≤ X ≤ 7 ∧ 2 ≤ Y ≤ 3
r1,3 : 3 ≤ X ≤ 6 ∧ −1 ≤ Y ≤ 1.5.
r2 contains the following tuples:
r2,1 : −3 ≤ X ≤ −1 ∧ 1 ≤ Y ≤ 3
r2,2 : 5 ≤ X ≤ 6 ∧ −3 ≤ Y ≤ 0.
Usually, spatial data are represented using the linear constraint theory.
Linear constraints have the form p(X1, ..., Xn) θ 0, where p is a linear polynomial
with real coefficients in the variables X1, ..., Xn and θ ∈ {=, ≠, ≤, <, ≥, >}. This
class of constraints is of particular interest. Indeed, a wide range of applications
use linear polynomials. Moreover, linear polynomials have been investigated in
various fields (linear programming, computational geometry) and therefore
several techniques have been developed to deal with them [Lassez, 1990].
From a temporal perspective, constraints are very useful to represent
situations that are infinitely repeated in time. For example, we may think of a
train that leaves each day at the same time. In such cases, dense-order constraints
are often used. Dense-order constraints are formulas of the form X θ Y
or X θ c, where X, Y are variables, c is a constant, and θ ∈ {=, ≠, ≤, <, ≥, >}.
The domain D is a countably infinite set (for example, the rational numbers)
with a binary relation which is a dense linear order.
It has been recognized [Kanellakis et al., 1995] that the integration of con-
straints in traditional databases must not compromise the efficiency of the sys-
tem. In particular, constraint query languages should preserve all the good fea-
tures of relational languages. For example, they should be closed and bottom-
up evaluable. With respect to relational databases, constraint databases should
also preserve efficiency. Thus, data structures for querying and updating con-
straint databases must be developed, with time and space complexities com-
parable to those of data structures for relational databases. Complexity of the
various operations is expressed in terms of input-output (I/O) operations. An
I/O operation is the operation of reading or writing one block of data from or
to a disk. The other parameters are: B, the number of items (generalized
tuples) that can be stored in one page; n, the number of pages needed to store
N generalized tuples (thus, n = N/B); and t, the number of pages needed to
store the T generalized tuples representing the result of a query evaluation
(thus, t = T/B).
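As a small numeric illustration of these parameters, with hypothetical values for N, T and B:

```python
import math

def blocking_parameters(N, T, B):
    """Compute n and t, the page counts used in the I/O bounds:
    n pages hold the N generalized tuples and t pages hold the
    T tuples of the query result (rounded up to whole pages)."""
    n = math.ceil(N / B)
    t = math.ceil(T / B)
    return n, t

# Hypothetical relation: 10,000 generalized tuples, 50 per page,
# and a query whose result contains 250 of them.
print(blocking_parameters(10_000, 250, 50))  # (200, 5)
```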
At least two constraint language features should be supported by index struc-
tures:
• ALL selection. It retrieves all generalized tuples contained in a specified
generalized relation whose extension is contained in the extension of a given
generalized tuple, specified in the query (called query generalized tuple).
From a spatial point of view, such selection corresponds to a range query.
• EXIST selection. It retrieves all generalized tuples contained in a specified
generalized relation whose extension has a non-empty intersection with the
extension of a query generalized tuple. Equivalently, it finds a generalized
relation that represents all relational tuples, implicitly represented by the
input generalized relation, that satisfy the query generalized tuple.
From a spatial point of view, such selection corresponds to an intersection
query.
Consider for example the generalized tuples representing the objects presented
in Figure 6.14. The EXIST selection with respect to the query generalized
tuple Y ≤ X − 1 and relation r1 returns all three generalized tuples
r1,1, r1,2 and r1,3. The ALL selection with respect to the query generalized
tuple Y ≤ X − 1 and relation r1 returns only the generalized tuple r1,3.
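When generalized tuples describe axis-aligned rectangles and the query is a half-plane, both selections reduce to simple corner tests. The following sketch (the corner choice assumes a non-negative slope a) reproduces the example above:

```python
def exist_select(rects, a, b):
    """EXIST selection for a half-plane query Y <= a*X + b (a >= 0):
    a rectangle (xmin, xmax, ymin, ymax) meets the half-plane iff its
    most favorable corner does, i.e. ymin <= a*xmax + b."""
    return [name for name, (x1, x2, y1, y2) in rects.items()
            if y1 <= a * x2 + b]

def all_select(rects, a, b):
    """ALL selection: the rectangle lies entirely inside the half-plane
    iff its least favorable corner does, i.e. ymax <= a*xmin + b."""
    return [name for name, (x1, x2, y1, y2) in rects.items()
            if y2 <= a * x1 + b]

# Relation r1 from Figure 6.14, queried with Y <= X - 1 (a = 1, b = -1).
r1 = {"r1,1": (1, 4, 1, 2), "r1,2": (2, 7, 2, 3), "r1,3": (3, 6, -1, 1.5)}
print(exist_select(r1, 1, -1))  # ['r1,1', 'r1,2', 'r1,3']
print(all_select(r1, 1, -1))    # ['r1,3']
```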
As constraints support the representation of infinite information, data struc-
tures defined to index relations (such as B-trees and B+-trees [Bayer and Mc-
Creight, 1972, Comer, 1979]) cannot be used in constraint databases, since they
rely on the assumption that the number of tuples is finite. For this reason, spe-
cific classes of constraints for which efficient indexing data structures can be
provided must be determined.
Due to the analogies between constraint databases and spatial databases,
efficient indexing techniques developed for spatial databases can often be ap-
plied to (linear) constraint databases. Efficient data structures are usually
218 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
required to process queries in O(log_B n + t) I/O operations, use O(n) blocks
of secondary storage, and perform insertions and deletions in O(log_B n) I/O
operations (this is the case for B-trees and B+-trees). Note that all complexities
are worst-case. For spatial problems, by contrast, data structures with optimal
worst-case complexity have been proposed only for some specific problems, in
general dealing with 1- or 2-dimensional spatial objects. Nevertheless, several
data structures proposed for management of spatial data behave quite well on
the average for different source data. Examples of such data structures are grid
files [Nievergelt et al., 1984], various quad-trees [Samet, 1989], z-orders [Oren-
stein, 1986], hB-trees [Lomet and Salzberg, 1990a], cell-trees [Gunther, 1989],
and various R-trees [Guttman, 1984, Sellis et al., 1987] (see Chapter 2).
Symmetrically, in the context of constraint databases two different classes of
techniques have been proposed, the first consisting of techniques with optimal
worst-case complexity, and the second consisting of techniques with good aver-
age bounds. Techniques belonging to the first class apply to (linear) generalized
tuples representing 1- or 2-dimensional spatial objects and often optimize only
the EXIST selection. Techniques belonging to the second class make it possible
to index more general generalized tuples by applying some approximation. In
the following, both approaches are surveyed.
6.5.1 Generalized 1-dimensional indexing
In relational databases, the 1-dimensional searching problem on a relational
attribute X is defined as follows:
Find all tuples such that their X attribute satisfies the condition a1 ≤ X ≤ a2.
The problem of 1-dimensional searching on a relational attribute X can be
reformulated in constraint databases, defining the problem of 1-dimensional
searching on the generalized relational attribute X, as follows:
Find a generalized relation that represents all tuples of the input generalized
relation such that their X attribute satisfies the condition a1 ≤ X ≤ a2.
A first trivial, but inefficient, solution to the generalized 1-dimensional
searching problem is to add the query range condition to each generalized tuple. In
this case, the new generalized tuples represent all the relational tuples whose
X attribute is between a1 and a2. This approach introduces a high level of
redundancy in the constraint representation. Moreover, several inconsistent
(with empty extension) generalized tuples can be generated.
A better solution can be defined for convex theories. A theory Φ is convex
if the projection of any generalized tuple defined using Φ on each variable X is
a single interval b1 ≤ X ≤ b2. This is true when the extension of the generalized
tuple represents a convex set. The dense-order theory and the real polynomial
inequality constraint theory are examples of convex theories. The solution is
based on the definition of a generalized 1-dimensional index on X as a set of
intervals, where each interval is associated with a set of generalized tuples and
represents the value of the search key for those tuples. Thus, each interval in
the index is the projection on the attribute X of a generalized tuple. By using
the above index, the determination of a generalized relation, representing all
tuples from the input generalized relation such that their X attribute satisfies a
given range condition a1 ≤ X ≤ a2, can be performed by adding the condition
to only those generalized tuples whose associated interval has a non-empty
intersection with a1 ≤ X ≤ a2. Insertion (deletion) of a given generalized tuple
is performed by computing its projection and inserting (deleting) the obtained
interval into (from) a set of intervals.
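A minimal sketch of such a generalized 1-dimensional index, with a plain list standing in for the external-memory interval structure:

```python
class Generalized1DIndex:
    """Index each generalized tuple by its X-projection interval
    (the convexity assumption guarantees a single interval per tuple).
    A list with a linear scan stands in for the interval structure a
    real implementation would use."""
    def __init__(self):
        self.entries = []  # (lo, hi, tuple_id)

    def insert(self, lo, hi, tuple_id):
        self.entries.append((lo, hi, tuple_id))

    def range_select(self, a1, a2):
        # Return the tuples to which the condition a1 <= X <= a2 must
        # be added: exactly those whose projection meets [a1, a2].
        return sorted(t for lo, hi, t in self.entries
                      if lo <= a2 and hi >= a1)

idx = Generalized1DIndex()
idx.insert(1, 4, "t1")
idx.insert(6, 9, "t2")
idx.insert(3, 7, "t3")
print(idx.range_select(5, 6))  # ['t2', 't3']
```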
From the previous discussion it follows that the generalized 1-dimensional
indexing problem reduces to the dynamic interval management problem on
secondary storage. Dynamic interval management is a well-known problem in
computational geometry, with many optimal solutions in internal memory
[Chiang and Tamassia, 1992]. Secondary storage solutions for the same problem
are, however, non-trivial, even for the static case. In the following, we survey
some of the proposed solutions for secondary storage.
Reduction to stabbing queries. A first class of proposals is based on the
reduction of the interval intersection problem to the stabbing query problem
[Chiang and Tamassia, 1992]. Given a set of 1-dimensional intervals, to answer
a stabbing query with respect to a point x, all intervals that contain x must be
reported.
The main idea of the reduction is the following [Kanellakis and Ramaswamy,
1996]. Intervals that intersect a query interval fall into four categories (see
Figure 6.15). Categories (1) and (2) can be easily located by sorting all the
intervals with respect to their left endpoint and using a B+-tree to locate all
intervals whose first endpoint lies in the query interval. Categories (3) and (4)
can be located by finding all data intervals which contain the first endpoint of
the query interval. This search represents a stabbing query.
By regarding an interval [x1, x2] as the point (x1, x2) in the plane, a stabbing
query reduces to a special case of the 2-dimensional range searching problem.
Indeed, all points (x1, x2) corresponding to intervals lie above the line X = Y.
An interval [x1, x2] belongs to a stabbing query with respect to a point x if and
only if the corresponding point (x1, x2) is contained in the region of the plane
represented by the constraint X ≤ x ∧ Y ≥ x. Such 2-sided queries have their
corner on the line X = Y. For this reason, they are called diagonal corner
queries (see Figure 6.16).
Figure 6.15. Categories of possible intersections of a query interval with a
database of intervals.
Figure 6.16. Reduction of the interval intersection problem to a diagonal-corner
searching problem with respect to x.
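The interval-to-point reduction can be sketched as follows; a linear scan over the transformed points stands in for a dedicated index structure such as the meta-block tree discussed next:

```python
def stabbing_query(intervals, x):
    """Answer a stabbing query through the point transformation: an
    interval [x1, x2] becomes the point (x1, x2), and the query becomes
    the diagonal corner query X <= x AND Y >= x, whose corner (x, x)
    lies on the line X = Y."""
    points = [(x1, x2) for x1, x2 in intervals]
    return [(x1, x2) for x1, x2 in points if x1 <= x and x2 >= x]

print(stabbing_query([(1, 5), (4, 9), (6, 8)], 5))  # [(1, 5), (4, 9)]
```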
The first data structure proposed to solve diagonal-corner queries is the
meta-block tree; it does not support deletions (it is semi-dynamic)
[Kanellakis and Ramaswamy, 1996]. The meta-block tree is fairly
complicated, has optimal worst-case space O(n) and optimal I/O query time
O(log_B n + t). Moreover, it has O(log_B n + (log_B^2 n)/B) amortized insert
I/O time.
A dynamic (thus, also supporting deletions) optimal solution to the stabbing
query problem [Arge and Vitter, 1996] is based on the definition of an
external memory version of the internal memory interval tree. The interval
tree for internal memory is a data structure to answer stabbing queries and
to store and update a set of intervals in optimal time [Chiang and Tamassia,
1992]. It consists of a binary tree over the interval endpoints. Intervals are
stored in secondary structures associated with internal nodes of the binary
tree. The extension of such a data structure to secondary storage entails two
issues. First, the fan-out of nodes must be increased. The fan-out that has
been chosen is √B [Arge and Vitter, 1996]. This fan-out makes it possible to
store all the needed information in internal nodes, increasing the height of the
tree only by a factor of two. If interval endpoints belong to a fixed set E, the
binary tree is replaced by a balanced tree, having √B as branching factor, over
the endpoints E. Each leaf represents B consecutive points from E. Segments
are associated with nodes, generalizing the idea of the internal memory data
structure. However, since a node now contains more endpoints, more than two
secondary structures are required to store the segments associated with a node.
The main problem of the previous structure is that it requires the interval
endpoints to belong to a fixed set. In order to remove this assumption, the
weight-balanced B-tree has been
introduced [Arge and Vitter, 1996]. The main difference between a B-tree and
a weight-balanced B-tree is that in the first case, for each internal node, the
number of children is fixed; in the second case, only the weight, that is, the
number of items stored under each node, is fixed. The weight-balanced B-tree
makes it possible to remove the assumption on the interval endpoints, while
still retaining optimal worst-case bounds for stabbing queries.
Revisiting Chazelle's algorithm. The solutions described above to solve
stabbing queries in secondary storage are fairly complex and rely on reducing
the interval intersection problem to special cases of the 2-dimensional
range searching problem. A different and much simpler approach to solve the
static (thus, not supporting insertions and deletions) generalized 1-dimensional
searching problem [Ramaswamy, 1997] is based on an algorithm developed by
Chazelle [Chazelle, 1986] for interval intersection in main memory and uses
only B+-trees, achieving optimal time and using linear space.
The proposed technique relies on the following consideration. A straightfor-
ward method to solve a stabbing query consists of identifying the set of unique
endpoints of the set of input intervals. Each endpoint is associated with the set
of intervals that contain such endpoint. These sets can then be indexed using
a B+-tree, taking endpoints as key values. To answer a stabbing query it is
sufficient to look for the endpoint nearest to the query point, on the right, and
examine the intervals associated with it, reporting those intervals that intersect
the query point.
This method is able to answer stabbing queries in O(log_B n). However, it
requires O(n^2) space. It has been shown [Ramaswamy, 1997] that the space
complexity can be reduced to O(n) by appropriately choosing the considered
endpoints. More precisely, let e1, e2, ..., e2n be the ordered list of all endpoints.
A set of windows W1, ..., Wp is constructed over endpoints w1 = e1, ...,
wp+1 = e2n such that Wj = [wj, wj+1), j = 1, ..., p. Thus, the windows
partition the interval between e1 and e2n into p contiguous intervals. Each
window Wj is associated with the list of intervals that intersect Wj.
The window-lists can be stored in a B+-tree, using their starting points as key
values. A stabbing query at point p can be answered by searching for the
query point and retrieving the window-list associated with the window that
p falls into. Each interval contained in that list is then examined, reporting
only the intervals intersecting the query point. Algorithms have been
proposed [Ramaswamy, 1997] to construct the windows appropriately, so that
queries can be answered by applying the previous algorithm in O(log_B n),
using only O(n) pages.
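A sketch of the simple variant described first — one elementary window per gap between consecutive distinct endpoints, so without the window-merging step that gives Chazelle's construction its O(n) space bound:

```python
import bisect

def build_window_lists(intervals):
    """Build one elementary window per gap between consecutive
    distinct endpoints, each associated with the intervals that
    intersect it. (The O(n)-space construction would merge windows;
    keeping every elementary window is the simple variant.)"""
    endpoints = sorted({e for iv in intervals for e in iv})
    windows = list(zip(endpoints, endpoints[1:]))
    lists = [[iv for iv in intervals if iv[0] <= hi and iv[1] >= lo]
             for lo, hi in windows]
    return endpoints, windows, lists

def stab(endpoints, windows, lists, p):
    # Locate the window containing p (a B+-tree search in the real
    # design), then report the intervals in its list that contain p.
    j = bisect.bisect_right(endpoints, p) - 1
    j = min(max(j, 0), len(windows) - 1)
    return [iv for iv in lists[j] if iv[0] <= p <= iv[1]]

eps, ws, ls = build_window_lists([(1, 5), (4, 9), (6, 8)])
print(stab(eps, ws, ls, 7))  # [(4, 9), (6, 8)]
```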
6.5.2 Indexing 2-dimensional linear constraints
The approaches briefly illustrated in Subsection 6.5.1 rely on the assumption
that index values are represented by intervals. Thus, they are able to index
generalized tuples using information about only one variable. Less work has
been done on defining techniques for 2-dimensional generalized tuples with
optimal worst-case complexity. One of these techniques [Bertino et al., 1997]
deals with index values represented by generalized tuples with two variables,
say X and Y, having the form C1 ∧ ... ∧ Cn, where each Ci, i = 1, ..., n,
has the form Ci ≡ Y θ aiX + bi, θ ∈ {≤, ≥}.
Besides the application to different types of generalized tuples, the main dif-
ference of this technique with respect to the ones presented in Subsection 6.5.1
is that it is defined for solving not only EXIST selection but also ALL selection.
In both cases, the query generalized tuple must represent a half-plane.
The main novelty of the approach is the reduction of both EXIST and ALL
selection problems, under the above assumptions, to a point location problem
from computational geometry [Preparata and Shamos, 1985]. The proof of such
reduction is based on the transformation of the extension of generalized tuples
from a primal plane to a dual plane. In particular, each generalized tuple is
transformed into a pair of non-intersecting, but possibly touching, open polygons³
in the plane, whereas a half-plane Y θ aX + b, θ ∈ {≤, ≥}, is translated into
the point (a, b).
This translation satisfies an interesting property. Indeed, the EXIST and the
ALL selection problems with respect to a half-plane query Y θ aX + b reduce
to the point location problem of the point (a, b) with respect to the constructed
open polygons. In particular, it can be shown that the point (a, b) belongs to
one of the open polygons constructed for a generalized tuple t iff the
line Y = aX + b does not intersect the interior of the figure representing the
extension of t (see Figure 6.17). Using this property, point location algorithms
for the dual plane, equivalent to the EXIST and ALL selections in the Euclidean
plane, have been proposed.
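The stated property can be checked directly in the primal plane: the line Y = aX + b misses the interior of a convex extension iff all of its vertices lie on one closed side of the line. A sketch, using rectangle r1,3 from Figure 6.14 (the dual-plane point location answers the same test in logarithmic time):

```python
def line_misses_interior(vertices, a, b):
    """True iff the line Y = a*X + b does not cut the interior of the
    convex polygon given by its vertices: all vertices must lie on one
    closed side of the line."""
    sides = [y - (a * x + b) for x, y in vertices]
    return all(s >= 0 for s in sides) or all(s <= 0 for s in sides)

rect = [(3, -1), (6, -1), (6, 1.5), (3, 1.5)]  # rectangle r1,3
print(line_misses_interior(rect, 1, -1))   # True: Y = X - 1 stays outside
print(line_misses_interior(rect, 0, 0))    # False: Y = 0 cuts through it
```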
The same open polygons have then been used to show that an optimal
dynamic solution to the ALL and EXIST selection problems exists, using simple
data structures such as B+-trees, if the angular coefficient of the line associated
with the half-plane query belongs to a predefined set.
6.5.3 Filtering
To facilitate the definition of indexing structures for arbitrary objects in spatial
databases, a filtering approach is often used. The same approach can be used
in constraint databases to index generalized tuples with complex extension.
Figure 6.17. (a) A polygon p representing the extension of a linear generalized tuple;
(b) a pair of open polygons representing p in the dual plane, together with the points
representing the lines q1, q2, q3, q4 in the dual plane.
Under the filtering approach, an object is approximated by using some other
object, having a simpler shape. The approximated objects are then used as
index objects. The evaluation of a query under such approach consists of two
steps, filtering and refinement. In the filtering step, an index is used to retrieve
only relevant objects, with respect to a certain query. To this purpose, the
approximated figures are used instead of the objects themselves. During the
refinement step, the set of objects retrieved by the filtering step is directly tested
with respect to the query, to determine the exact result. The main issue here
is the definition of "good" approximating objects, ensuring a specific degree of
filtering.
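A minimal sketch of filter-and-refine, with hypothetical circle objects approximated by bounding boxes (the object names and the exact test are illustrative only):

```python
def boxes_intersect(a, b):
    """Axis-aligned boxes (x1, y1, x2, y2) intersect iff they overlap
    on both axes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def filter_and_refine(objects, query_box, exact_test):
    """Two-step evaluation: filtering keeps only the objects whose
    approximation meets the query box; refinement runs the exact,
    possibly expensive, geometric test on the survivors."""
    candidates = [o for o in objects if boxes_intersect(o["box"], query_box)]
    return [o["name"] for o in candidates if exact_test(o)]

# Hypothetical objects: a disk approximated by its bounding box.
objects = [
    {"name": "c1", "box": (0, 0, 2, 2), "center": (1, 1), "r": 1},
    {"name": "c2", "box": (5, 5, 7, 7), "center": (6, 6), "r": 1},
]
query = (1.5, 1.5, 3, 3)

def circle_meets_query(o):
    # Exact test: the distance from the circle's center to the nearest
    # point of the query box must not exceed the radius.
    cx, cy = o["center"]
    qx1, qy1, qx2, qy2 = query
    nx = min(max(cx, qx1), qx2)
    ny = min(max(cy, qy1), qy2)
    return (cx - nx) ** 2 + (cy - ny) ** 2 <= o["r"] ** 2

print(filter_and_refine(objects, query, circle_meets_query))  # ['c1']
```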
The minimum bounding box (MBB) is commonly used in spatial databases to
filter objects. In 2-dimensional space, the MBB of a given object is
the smallest rectangle that encloses the object and whose edges are perpendicu-
lar to the standard coordinate axes. The previous definition can be generalized
to higher dimensions in a straightforward manner.
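Computing the MBB of a 2-dimensional object given by its vertices is straightforward; a sketch:

```python
def mbb(points):
    """Minimum bounding box of a point set (for a polygon, pass its
    vertex list): the smallest axis-aligned rectangle enclosing it,
    returned as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

triangle = [(1, 1), (4, 1), (2, 5)]
print(mbb(triangle))  # (1, 1, 4, 5)
```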
The filtering method based on MBB is simple and has a number of advan-
tages over index methods working directly on objects:
• It has a low storage cost, because only a small number of intervals are main-
tained in addition to each object.
• There is a clear separation between the complexity of the object geometry
and the complexity of the search. Index structures for (multidimensional) in-
tervals have better worst-case performance with respect to index techniques
working on arbitrary objects. Indeed, several index structures having close to
optimal worst-case bounds for managing (multidimensional) intervals have
been proposed (see Chapter 2). However, similar approaches have not been
defined yet for arbitrary objects.
The filtering approach based on MBBs, although appealing, has some
drawbacks. In particular, it may be ineffective if the set of objects returned by
the filtering step is too large: this means that there are too many intersecting
MBBs. Moreover, it does not scale well to high dimensions. The issue of
handling objects in spaces of high dimension is less crucial for spatial databases,
where we can generally rely on a dimension of 3 or less, but it is critical for
constraint databases.
In order to improve the selectivity of filtering, an approach has been pro-
posed, based on the notion of minimum bounding polybox [Brodsky et al., 1996].
A minimum bounding polybox for an object O is the minimum convex
polyhedron that encloses O and whose facets are normal to preselected axes. These
axes are not necessarily the standard coordinate axes and, furthermore, their
number is not determined by the dimension of the space. Algorithms for com-
puting optimal axes (according to specific optimality criteria with respect to
storage overhead or filtering rate) in d-dimensions have also been proposed
[Brodsky et al., 1996].
Notes
1. We assume that buckets are numbered starting from 0.
2. FTP is the Internet standard high-level protocol for file transfer.
3. An open polygon is a finite chain of line segments with the first and last segments
approaching ∞. An open polygon is upward (downward) open if both segments approach
+∞ (−∞).
References
Abel, D. J. and Smith, J. L. (1983). A data structure and algorithm based
on a linear key for a rectangle retrieval problem. International Journal of
Computer Vision, Graphics and Image Processing, 24(1):1-13.
Abel, D. J. and Smith, J. L. (1984). A data structure and query algorithm for
a database of areal entities. Australian Computing Journal, 16(4):147-154.
Achyutuni, K. J., Omiecinski, E., and Navathe, S. (1996). Two techniques for
on-line index modification in shared-nothing parallel systems. In Proc. 1996
ACM SIGMOD International Conference on Management of Data, pages
125-136.
Ang, C. and Tan, K. (1995). The Interval B-tree. Information Processing Let-
ters, 53(2):85-89.
Arge, L. and Vitter, J. (1996). Optimal dynamic interval management in
external memory. In Proc. 37th Symposium on Foundations of Computer
Science, pages 560-569.
Aslandogan, Y. A., Yu, C., Liu, C., and Nair, K. R. (1995). Design, implemen-
tation and evaluation of SCORE. In Proc. 11th International Conference on
Data Engineering, pages 280-287.
Bancilhon, F. and Ferran, G. (1994). ODMG-93: The object database standard.
IEEE Bulletin on Data Engineering, 17(4):3-14.
Banerjee, J. and Kim, W. (1986). Supporting VLSI geometry operations in a
database system. In Proc. 3rd International Conference on Data Engineer-
ing, pages 409-415.
Bartels, D. (1996). ODMG93 - The emerging object database standard. In
Proc. 12th International Conference on Data Engineering, pages 674-676.
Bayer, R. and McCreight, E. (1972). Organization and maintenance of large
ordered indices. Acta Informatica, 1(3):173-189.
Bayer, R. and Schkolnick, M. (1977). Concurrency of operations on B-trees.
Acta Informatica, 9:1-21.
Beck, J. (1967). Perceptual grouping produced by line figures. Perception and
Psychophysics, 2:491-495.
Becker, B., Gschwind, S., Ohler, T., Seeger, B., and Widmayer, P. (1993). On op-
timal multiversion access structures. In Proc. 3rd International Symposium
on Large Spatial Databases, pages 123-141.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985a). Empirical com-
parison of associative file structures. In Proc. International Conference on
Foundations of Data Organization, pages 315-319.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985b). An experiment
with balanced and unbalanced k-d trees for associative retrieval. In Proc.
9th International Conference on Computer Software and Applications, pages
256-262.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985c). Multikey retrieval
from k-d trees and quad trees. In Proc. 1985 ACM SIGMOD International
Conference on Management of Data, pages 291-301.
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-
tree: An efficient and robust access method for points and rectangles. In
Proc. 1990 ACM SIGMOD International Conference on Management of
Data, pages 322-331.
Belkin, N. and Croft, W. (1992). Information filtering and information retrieval:
Two sides of the same coin? Communications of the ACM, 35(12):29-38.
Bell, T., Moffat, A., Nevill-Manning, C., Witten, I., and Zobel, J. (1993). Data
compression in full-text retrieval systems. Journal of the American Society
for Information Science, 44(9):508-531.
Bell, T., Moffat, A., Witten, I., and Zobel, J. (1995). The MG retrieval system:
Compressing for space and speed. Communications of the ACM, 38(4):41-42.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative
searching. Communications of the ACM, 18(9):509-517.
Bentley, J. L. (1979a). Decomposable searching problems. Information Process-
ing Letters, 8(5):244-251.
Bentley, J. L. (1979b). Multidimensional binary search trees in database appli-
cations. IEEE Transactions on Software Engineering, 5(4):333-340.
Bentley, J. L. and Friedman, J. H. (1979). Data structures for range searching.
ACM Computing Surveys, 11(4):397-409.
Berchtold, S., Keim, D., and Kriegel, H. (1996). The X-tree: An index structure
for high-dimensional data. In Proc. 22nd International Conference on Very
Large Data Bases, pages 28-39.
Bertino, E. (1990). Query optimization using nested indices. In Proc. 2nd In-
ternational Conference on Extending Database Technology, pages 44-59.
Bertino, E. (1991a). An indexing technique for object-oriented databases. In
Proc. 7th International Conference on Data Engineering, pages 160-170.
Bertino, E. (1991b). Method precomputation in object-oriented databases. In
Proc. ACM-SIGOIS and IEEE-TC-OA International Conference on Orga-
nizational Computing Systems, pages 199-212.
Bertino, E. (1994). On indexing configuration in object-oriented databases.
VLDB Journal, 3(3):355-399.
Bertino, E., Catania, B., and Shidlovsky, B. (1997). Towards optimal two-
dimensional indexing for constraint databases. Technical Report TR-196-97,
Dipartimento di Scienze dell'Informazione, University of Milano, Italy.
Bertino, E. and Foscoli, P. (1995). Index organizations for object-oriented
database systems. IEEE Transactions on Knowledge and Data Engineering,
7(2):193-209.
Bertino, E. and Guglielmina, C. (1991). Optimization of object-oriented queries
using path indices. In Proc. International IEEE Workshop on Research Is-
sues on Data Engineering: Transaction and Query Processing, pages 140-
149.
Bertino, E. and Guglielmina, C. (1993). Path-index: An approach to the effi-
cient execution of object-oriented queries. Data and Knowledge Engineering,
6(1):239-256.
Bertino, E. and Kim, W. (1989). Indexing techniques for queries on nested
objects. IEEE Transactions on Knowledge and Data Engineering, 1(2):196-
214.
Bertino, E. and Martino, L. (1993). Object-Oriented Database Systems - Con-
cepts and Architectures. Addison-Wesley.
Bertino, E. and Quarati, A. (1991). An approach to support method invoca-
tions in object-oriented queries. In Proc. International IEEE Workshop on
Research Issues on Data Engineering: Transaction and Query Processing,
pages 163-169.
Blanken, H., Ijbema, A., Meek, P., and Akker, B. (1990). The generalized grid
file: Description and performance aspects. In Proc. 6th International Con-
ference on Data Engineering, pages 380-388.
Bookstein, A., Klein, S., and Raita, T. (1992). Model based concordance com-
pression. In Proc. IEEE Data Compression Conference, pages 82-91.
Bowman, C., Danzig, P., Hardy, D., Manber, U., and Schwartz, M. (1995). The
harvest information discovery and access system. Computer Networks and
ISDN Systems, 28(1-2):119-125.
Bowman, C., Danzig, P., Manber, U., and Schwartz, M. (1994). Scalable inter-
net discovery: Research problems and approaches. Communications of the
ACM, 37(8):98-107.
Bratley, P. and Choueka, Y. (1982). Processing truncated terms in document
retrieval systems. Information Processing & Management, 18(5):257-266.
Bretl, R., Maier, D., Otis, A., Penney, D., Schuchardt, B., Stein, J., Williams,
E., and Williams, M. (1989). The GemStone data management system.
In Object-Oriented Concepts, Databases, and Applications, pages 283-308.
Addison-Wesley.
Brinkhoff, T., Kriegel, H.-P., Schneider, R., and Seeger, B. (1994). Multi-step
processing of spatial joins. In Proc. 1994 ACM SIGMOD International Con-
ference on Management of Data, pages 197-208.
Brodsky, A., Lassez, C., Lassez, J., and Maher, M. (1996). Separability of poly-
hedra and a new approach to spatial storage. In Proc. 14th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, pages
54-65.
Brown, E. (1995). Fast evaluation of structured queries for information retrieval.
In Proc. 18th ACM-SIGIR International Conference on Research and De-
velopment in Information Retrieval, pages 30-38.
Buckley, C. and Lewit, A. (1985). Optimization of inverted vector searches. In
Proc. 8th ACM-SIGIR International Conference on Research and Develop-
ment in Information Retrieval, pages 97-110.
Burkowski, F. (1992). An algebra for hierarchically organized text-dominated
databases. Information Processing & Management, 28(3):333-348.
Callan, J. (1994). Passage-level evidence in document retrieval. In Proc. 17th
ACM-SIGIR International Conference on Research and Development in In-
formation Retrieval, pages 302-309.
Cattell, R. (1993). The Object Database Standard: ODMG-93 Release 1.2. Mor-
gan Kaufmann Publishers.
Cesarini, F. and Soda, G. (1982). Binary trees paging. Information Systems,
7(4):337-344.
Chan, C., Goh, C., and Ooi, B. C. (1997). Indexing OODB instances based on
access proximity. In Proc. 13th International Conference on Data Engineer-
ing, pages 14-21.
Chan, C. Y., Ooi, B. C., and Lu, H. (1992). Extensible buffer management of
indexes. In Proc. 18th International Conference on Very Large Data Bases,
pages 444-454.
Chang, J. M. and Fu, K. S. (1979). Extended k-d tree database organization:
A dynamic multi-attribute clustering method. In Proc. 3rd International
Conference on Computer Software and Applications, pages 39-43.
Chang, S. K. and Fu, K. S., editors (1980). Pictorial Information Systems.
Springer-Verlag.
Chang, S. K. and Hsu, A. (1992). Image information systems: Where do we
go from here? IEEE Transactions on Knowledge and Data Engineering,
4(5):431-442.
Chang, S. K., Jungert, E., and Li, Y. (1989). Representation and retrieval of
symbolic pictures using generalized 2D strings. In Proc. Visual Communi-
cations and Image Processing Conference, pages 1360-1372.
Chang, S. K., Shi, Q. Y., and Yan, C. W. (1987). Iconic indexing by 2-D strings.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(3):413-
428.
Chang, S. K., Yan, C. W., Dimitroff, D. C., and Arndt, T. (1988). An intel-
ligent image database system. IEEE Transactions on Software Engineering,
15(5):681-688.
Chaudhuri, S. and Dayal, U. (1996). Decision support, data warehousing, and
OLAP (tutorial notes). In Proc. 22nd International Conference on Very Large
Data Bases.
Chazelle, B. (1986). Filtering search: A new approach to query-answering.
SIAM Journal of Computing, 15(3):703-724.
Cheong, C. (1996). Internet agents. New Riders - Macmillan Publishing.
Chiang, Y. and Tamassia, R. (1992). Dynamic algorithms in computational
geometry. Proceedings of the IEEE, 80(9):1412-1434.
Chiu, D. K. Y. and Kolodziejczak, T. (1986). Synthesizing knowledge: A cluster
analysis approach using event-covering. IEEE Transactions on Systems, Man
and Cybernetics, 16(2):462-467.
Choenni, S., Bertino, E., Blanken, H., and Chang, T. (1994). On the selection
of optimal index configuration in 00 databases. In Proc. 10th International
Conference on Data Engineering, pages 526-537.
Choueka, Y., Fraenkel, A., and Klein, S. (1988). Compression of concordances in
full-text retrieval systems. In Proc. 11th ACM-SIGIR International Confer-
ence on Research and Development in Information Retrieval, pages 597-612.
Choy, D. and Mohan, C. (1996). Locking protocols for two-tier indexing of
partitioned data. In Proc. International Workshop on Advanced Transaction
Models and Architectures, pages 198-215.
Chua, T. S., Lim, S. K., and Pung, H. K. (1994). Content-based retrieval of
segmented images. In Proc. 2nd ACM Multimedia Conference, pages 211-
218.
Chua, T. S., Tan, K. L., and Ooi, B. C. (1997). Fast signature-based color-
spatial image retrieval. In Proc. 4th International Conference on Multimedia
Computing and Systems.
Chua, T. S., Teo, K. C., Ooi, B. C., and Tan, K. L. (1996). Using domain
knowledge in querying image database. In Proc. 3rd Multimedia Modeling
Conference, pages 339-354.
Clarke, C., Cormack, G., and Burkowski, F. (1995). An algebra for structured
text search and a framework for its implementation. Computer Journal,
38(1):43-56.
Cluet, S., Delobel, C., Lecluse, C., and Richard, P. (1989). Reloop, an algebra
based query language for an object-oriented database system. In Proc. 1st
International Conference on Deductive and Object Oriented Databases, pages
313-332.
Comer, D. (1979). The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-
137.
Costagliola, G., Tucci, M., and Chang, S. K. (1992). Representing and retrieving
symbolic pictures by spatial relations. In Visual Database Systems II, pages
49-59.
Dao, T., Sacks-Davis, R., and Thom, J. (1996). Indexing structured text for
queries on containment relationships. In Proc. 7th Australasian Database
Conference, pages 82-91.
Deux, O. (1990). The story of O2. IEEE Transactions on Knowledge and Data
Engineering, 2(1):91-108.
Eastman, C. M. and Zemankova, M. (1982). Partially specified nearest neighbor
using kd trees. Information Processing Letters, 15(2):53-56.
Easton, M. (1986). Key-sequence data sets in indelible storage. IBM Journal
of Research and Development, 30(12).
Edelsbrunner, H. (1983). A new approach to rectangular intersection. Interna-
tional Journal of Computational Mathematics, 13:209-219.
Edelstein, H. (1995). Faster data warehouses. In Information Week, pages 77-
88.
Elias, P. (1975). Universal codeword sets and representations of the integers.
IEEE Transactions on Information Theory, IT-21(2):194-203.
Elmasri, R., Wuu, G. T., and Kouramajian, V. (1990). The Time Index: An
access structure for temporal data. In Proc. 16th International Conference
on Very Large Data Bases, pages 1-12.
Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979). Extendible
hashing - A fast access method for dynamic files. ACM Transactions on
Database Systems, 4(3):315-344.
Faloutsos, C. (1988). Gray-codes for partial match and range queries. IEEE
Transactions on Software Engineering, 14(10):1381-1393.
Faloutsos, C., Equitz, W., Flickner, M., Niblack, W., Petkovic, D., and Bar-
ber, R. (1994). Efficient and effective querying by image content. Journal of
Intelligent Information Systems, 3(3):231-262.
Faloutsos, C. and Jagadish, H. (1992). On B-tree indices for skewed distri-
butions. In Proc. 18th International Conference on Very Large Databases,
pages 363-374.
Faloutsos, C. and Roseman, S. (1989). Fractals for secondary key retrieval. In
Proc. 1989 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, pages 247-252.
Finkel, R. A. and Bentley, J. L. (1974). Quad trees: A data structure for retrieval
on composite keys. Acta Informatica, 4:1-9.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani,
M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. (1995). Query
by image and video content: The QBIC system. IEEE Computer, 28(9):23-
32.
Fox, E., editor (1995). Communications of the ACM, volume 38(4). Special
issue on Digital Libraries.
Fox, E. and Shaw, J. (1993). Combination of multiple searches. In Proc. Text
Retrieval Conference (TREC), pages 35-44. National Institute of Standards
and Technology Special Publication 500-215.
Frakes, W. and Baeza-Yates, R., editors (1992). Information Retrieval: Data
Structures and Algorithms. Prentice-Hall.
Francos, J. M., Meiri, A. Z., and Porat, B. (1993). A unified texture model based
on a 2-D Wold-like decomposition. IEEE Transactions on Signal Processing,
pages 2665-2678.
Freeston, M. (1987). The BANG file: A new kind of grid file. In Proc. 1987
ACM SIGMOD International Conference on Management of Data, pages
260-269.
Freeston, M. (1995). A general solution of the n-dimensional B-tree problem.
In Proc. 1995 ACM SIGMOD International Conference on Management of
Data, pages 80-91.
French, C. (1995). One size fits all. In Proc. 1995 ACM SIGMOD International
Conference on Management of Data, pages 449-450.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1987). An algorithm for
finding best matches in logarithmic expected time. ACM Transactions on
Mathematical Software, 3(3):209-226.
Gallager, R. and Van Voorhis, D. (1975). Optimal source codes for geometrically
distributed integer alphabets. IEEE Transactions on Information Theory,
IT-21(2):228-230.
Gargantini, I. (1982). An effective way to represent quadtrees. Communications
of the ACM, 25(12):905-910.
Goh, C. H., Lu, H., Ooi, B. C., and Tan, K. L. (1996). Indexing temporal data
using B+-tree. Data and Knowledge Engineering, 18:147-165.
Goldfarb, C. (1990). The SGML Handbook. Oxford University Press.
Golomb, S. (1966). Run-length encodings. IEEE Transactions on Information
Theory, IT-12(3):399-401.
Gong, Y., Chua, H. C., and Guo, X. (1995). Image indexing and retrieval based
on color histograms. In Proc. 2nd Multimedia Modeling Conference, pages
115-126.
Gonnet, G. and Baeza-Yates, R. (1991). Handbook of data structures and algo-
rithms. Addison-Wesley, second edition.
Gonnet, G. and Tompa, F. (1987). Mind your grammar: A new approach
to modeling text. In Proc. 13th International Conference on Very Large
Databases, pages 339-346.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Com-
puting Surveys, 25(2):73-170.
Gravano, L., Chang, C., Garcia-Molina, H., and Paepcke, A. (1997). STARTS:
Stanford proposal for internet meta-searching. In Proc. 1997 ACM SIGMOD
International Conference on Management of Data.
Greene, D. (1989). An implementation and performance analysis of spatial data
access methods. In Proc. 5th International Conference on Data Engineering,
pages 606-615.
Gudivada, V. and Raghavan, R. (1995). Design and evaluation of algorithms
for image retrieval by spatial similarity. ACM Transactions on Information
Systems, 13(1):115-144.
Gunadhi, H. and Segev, A. (1993). Efficient indexing methods for temporal
relations. IEEE Transactions on Knowledge and Data Engineering, 5(3):496-
509.
Gunther, O. (1988). Efficient Structures for Geometric Data Management.
Springer-Verlag.
Gunther, O. (1989). The design of the cell tree: An object-oriented index struc-
ture for geometric databases. In Proc. 5th International Conference on Data
Engineering, pages 598-605.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching.
In Proc. 1984 ACM SIGMOD International Conference on Management of
Data, pages 47-57.
Hall, P. and Dowling, G. (1980). Approximate string matching. Computing
Surveys, 12(4):381-402.
Harman, D. (1991). How effective is suffixing? Journal of the American Society
for Information Science, 42(1):7-15.
Harman, D., editor (1992). Proc. TREC Text Retrieval Conference. National
Institute of Standards Special Publication 500-207.
Harman, D., editor (1995a). Information Processing & Management, volume
31(3). Special Issue: The Second Text Retrieval Conference (TREC-2).
Harman, D. (1995b). Overview of the second text retrieval conference (TREC-
2). Information Processing & Management, 31(3):271-289.
Harman, D. and Candela, G. (1990). Retrieving records from a gigabyte of
text on a minicomputer using statistical ranking. Journal of the American
Society for Information Science, 41(8):581-589.
Hearst, M. and Plaunt, C. (1993). Subtopic structuring for full-length document
access. In Proc. 16th ACM-SIGIR International Conference on Research and
Development in Information Retrieval, pages 59-68.
Henrich, A., Six, H.-W., and Widmayer, P. (1989a). The LSD tree: spatial access
to multidimensional point and non-point objects. In Proc. 15th International
Conference on Very Large Data Bases, pages 45-53.
Henrich, A., Six, H.-W., and Widmayer, P. (1989b). Paging binary trees with
external balancing. In Proc. International Workshop on Graphtheoretic Con-
cepts in Computer Science.
Hinrichs, K. (1985). Implementation of the grid file: Design concepts and ex-
perience. BIT, 25:569-592.
Hinrichs, K. and Nievergelt, J. (1983). The grid file: A data structure designed
to support proximity queries on spatial objects. In Proc. International Work-
shop on Graphtheoretic Concepts in Computer Science, pages 100-113.
Hirata, K., Hara, Y., Takano, H., and Kawasaki, S. (1996). Content-oriented
integration in hypermedia systems. In Proc. 1996 ACM Conference on Hy-
pertext, pages 11-21.
Hoel, E. and Samet, H. (1992). A qualitative comparison study of data struc-
tures for large line segment databases. In Proc. 1992 ACM SIGMOD Inter-
national Conference on Management of Data, pages 205-214.
Hsu, W., Chua, T. S., and Pung, H. K. (1995). An integrated color-spatial
approach to content-based image retrieval. In Proc. 3rd ACM Multimedia
Conference, pages 305-313.
Hutflesz, A., Six, H.-W., and Widmayer, P. (1990). The R-file: An efficient
access structure for proximity queries. In Proc. 6th International Conference
on Data Engineering, pages 372-379.
Iannizzotto, G., Vita, L., and Puliafito, A. (1996). A new shape distance for
content-based image retrieval. In Proc. 3rd Multimedia Modeling Conference,
pages 371-386.
Imielinski, T. and Badrinath, B. (1994). Mobile wireless computing: solutions
and challenges in data management. Communications of the ACM, 37(10):18-
28.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994a). Energy efficient
indexing on air. In Proc. 1994 ACM SIGMOD International Conference on
Management of Data, pages 25-36.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994b). Power efficient
filtering of data on air. In Proc. 4th International Conference on Extending
Database Technology, pages 245-258.
Ioka, M. (1989). A method of defining the similarity of images on the basis of
color information. Technical Report RT-0030, IBM Tokyo Research Lab.
Jaffar, J. and Lassez, J. (1987). Constraint logic programming. In Proc. 14th
Annual ACM Symposium on Principles of Programming Languages, pages
111-119.
Jagadish, H. V. (1991). A retrieval technique for similar shape. In Proc. 1991
ACM SIGMOD International Conference on Management of Data, pages
208-217.
Jea, K. F. and Lee, Y. C. (1990). Building efficient and flexible feature-based
indexes. Information Systems, 16(6):653-662.
Jenq, P., Woelk, D., Kim, W., and Lee, W. (1990). Query processing in dis-
tributed ORION. In Proc. 2nd International Conference on Extending Data-
base Technology, pages 169-187.
Jensen, C. S., editor (1994). A consensus glossary of temporal database concepts.
Jensen, C. S., Mark, L., and Roussopoulos, N. (1991). Incremental implemen-
tation model for relational databases with transaction time. IEEE Transac-
tions on Knowledge and Data Engineering, 3(4):461-473.
Jensen, C. S. and Snodgrass, R. (1994). Temporal specialization and generaliza-
tion. IEEE Transactions on Knowledge and Data Engineering, 6(6):954-974.
Jhingran, A. (1991). Precomputation in a complex object environment. In Proc.
7th IEEE International Conference on Data Engineering, pages 652-659.
Jiang, P., Ooi, B. C., and Tan, K. L. (1996). An experimental study of
temporal indexing structures. Unpublished manuscript, available at
http://www.iscs.nus.sg/ooibc/tp.ps.
Kabanza, F., Stevenne, J., and Wolper, P. (1990). Handling infinite temporal
data. In Proc. 9th ACM SIGACT-SIGMOD-SIGART Symposium on Prin-
ciples of Database Systems, pages 392-403.
Kanellakis, P., Kuper, G., and Revesz, P. (1995). Constraint query languages.
Journal of Computer and System Sciences, 51(1):26-52.
Kanellakis, P. and Ramaswamy, S. (1996). Indexing for data models with con-
straints and classes. Journal of Computer and System Sciences, 52(3) :589-
612.
Kaszkiel, M. and Zobel, J. (1997). Passage retrieval revisited. In Proc. 20th
A CM-SIGIR International Conference on Research and Development in In-
formation Retrieval.
Kemper, A., Kilger, C., and Moerkotte, G. (1994). Function materialization
in object bases: Design, realization and evaluation. IEEE Transactions on
Knowledge and Data Engineering, 6(4):587-608.
Kemper, A. and Kossmann, D. (1995). Adaptable pointer swizzling strategies in
object bases: Design, realization, and quantitative analysis. VLDB Journal,
4(3):519-566.
Kemper, A. and Moerkotte, G. (1992). Access support relations: An indexing
method for object bases. Information Systems, 17(2):117-145.
Kent, A., Sacks-Davis, R., and Ramamohanarao, K. (1990). A signature file
scheme based on multiple organizations for indexing very large text databases.
Journal of the American Society for Information Science, 41(7):508-534.
Kilger, C. and Moerkotte, G. (1994). Indexing multiple sets. In Proc. 20th
International Conference on Very Large Data Bases, pages 180-191.
Kim, K., Kim, W., Woelk, D., and Dale, A. (1988). Acyclic query processing in
object-oriented databases. In Proc. 7th International Conference on Entity-
Relationship Approach, pages 329-346.
Kim, W. (1989). A model of queries for object-oriented databases. In Proc. 15th
International Conference on Very Large Data Bases, pages 423-432.
Kim, W., Kim, K., and Dale, A. (1989). Indexing techniques for object-oriented
databases. In Object-Oriented Concepts, Databases, and Applications, pages
371-394. Addison-Wesley.
Knaus, D., Mittendorf, E., Schauble, P., and Sheridan, P. (1995). Highlighting
relevant passages for users of the interactive SPIDER retrieval system. In
Proc. 4th Text Retrieval Conference (TREC), pages 233-243.
Knuth, D. E. (1973). Fundamental Algorithms: The art of computer program-
ming, Volume 1. Addison-Wesley.
Knuth, D. E. and Wegner, L. M., editors (1992). Proc. IFIP TC2/WG2.6 2nd
Working Conference on Visual Database Systems. North-Holland.
Kolovson, C. (1993). Indexing techniques for historical databases. In Temporal
Databases: Theory, Design and Implementation, Chapter 17, pages 418-432.
A. Benjamin/Cummings.
Kolovson, C. and Stonebraker, M. (1991). Segment indexes: Dynamic indexing
techniques for multi-dimensional interval data. In Proc. 1991 ACM SIGMOD
International Conference on Management of Data, pages 138-147.
Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., and Protopapas, Z. (1996).
Fast nearest neighbor search in medical image databases. In Proc. 22nd In-
ternational Conference on Very Large Data Bases, pages 215-226.
Koubarakis, M. (1994). Database models for infinite and indefinite temporal
information. Information Systems, 19(2):141-173.
Kriegel, H. (1984). Performance comparison of index structures for multi-key
retrieval. In Proc. 1984 ACM SIGMOD International Conference on Man-
agement of Data, pages 186-196.
Kriegel, H. and Seeger, B. (1986). Multidimensional order preserving linear
hashing with partial expansion. In Proc. 1st International Conference on
Database Theory, pages 203-220.
Kriegel, H. and Seeger, B. (1988). PLOP-Hashing: A grid file without directory.
In Proc. 4th International Conference on Data Engineering, pages 369-376.
Kroll, B. and Widmayer, P. (1994). Distributing a search tree among a growing
number of processors. In Proc. 1994 ACM SIGMOD International Confer-
ence on Management of Data, pages 265-276.
Kukich, K. (1992). Techniques for automatically correcting words in text. Com-
puting Surveys, 24(4):377-440.
Kumar, A., Tsotras, V. J., and Faloutsos, C. (1995). Access methods for bi-
temporal databases. In Proc. International Workshop on Temporal Databases,
pages 235-254.
Kunii, T., editor (1989). Proc. IFIP TC2/WG2.6 1st Working Conference on
Visual Database Systems. North-Holland.
Larson, P. (1978). Dynamic hashing. BIT, 18:184-201.
Lassez, J. (1990). Querying constraints. In Proc. 9th ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems, pages 288-298.
Lee, D. T. and Wong, C. K. (1977). Worst-case analysis for region and partial
region searches in multidimensional binary search trees and balanced quad
trees. Acta Informatica, 9(1):23-29.
Lee, S. Y. and Hsu, F. J. (1990). 2D C-String: A new spatial knowledge repre-
sentation for image database system. Pattern Recognition, 23(10):1077-1087.
Lee, S. Y. and Leng, C. (1989). Partitioned signature files: Design issues and
performance evaluation. ACM Transactions on Office Information Systems,
7(2):158-180.
Lee, S. Y., Yang, M. C., and Chen, J. W. (1992). Signature file as a spatial
filter for iconic image database. Journal of Visual Languages and Computing,
3(4):373-397.
Lee, W. (1989). Mobile cellular telecommunication systems. McGraw-Hill.
Lin, K., Jagadish, H., and Faloutsos, C. (1995). The TV-tree: An index struc-
ture for high-dimensional data. VLDB Journal, 3(4):517-542.
Litwin, W. (1980). Linear hashing: A new tool for file and table addressing. In
Proc. 6th International Conference on Very Large Data Bases, pages 212-
223.
Litwin, W. and Neimat, M. (1996). k-RP*S: A scalable distributed data struc-
ture for high-performance multi-attribute access. In Proc. 4th Conference on
Parallel and Distributed Information Systems, pages 35-46.
Litwin, W., Neimat, M., and Schneider, D. (1993a). LH* - Linear hashing for
distributed files. In Proc. 1993 ACM SIGMOD International Conference on
Management of Data, pages 327-336.
Litwin, W., Neimat, M., and Schneider, D. (1994). RP*: A family of order-
preserving scalable data structures. In Proc. 20th International Conference
on Very Large Data Bases, pages 342-353.
Litwin, W., Neimat, N. A., and Schneider, D. A. (1993b). LH* - Linear hashing
for distributed files. In Proc. 1993 ACM SIGMOD International Conference
on Management of Data, pages 327-336.
Lomet, D. (1992). A review of recent work on multi-attribute access methods.
ACM SIGMOD Record, 21(3):56-63.
Lomet, D. and Salzberg, B. (1989). Access methods for multiversion data.
In Proc. 1989 ACM SIGMOD International Conference on Management of
Data, pages 315-324.
Lomet, D. and Salzberg, B. (1990a). The hB-tree: A multiattribute indexing
method with good guaranteed performance. ACM Transactions on Database
Systems, 15(4):625-658.
Lomet, D. and Salzberg, B. (1990b). The performance of a multiversion ac-
cess method. In Proc. 1990 ACM SIGMOD International Conference on
Management of Data, pages 353-363.
Lomet, D. and Salzberg, B. (1993). Transaction time databases. In Temporal
Databases: Theory, Design and Implementation, Chapter 16, pages 388-417.
A. Benjamin/Cummings.
Lovins, J. (1968). Development of a stemming algorithm. Mechanical Transla-
tion and Computational Linguistics, 11(1-2):22-31.
Low, C. C., Ooi, B. C., and Lu, H. (1992). H-trees: A dynamic associative search
index for OODB. In Proc. 1992 ACM SIGMOD International Conference on
Management of Data, pages 134-143.
Lu, H. and Ooi, B. C. (1993). Spatial indexing: Past and future. IEEE Bulletin
on Data Engineering, 16(3):16-21.
Lu, H., Ooi, B. C., and Tan, K. L. (1994). Efficient image retrieval by color con-
tents. In Proc. 1994 International Conference on Applications of Databases,
pages 95-108.
Lu, W. and Han, J. (1992). Distance-associated join indices for spatial range
search. In Proc. 8th International Conference on Data Engineering, pages
284-292.
Lucarella, D. (1988). A document retrieval system based upon nearest neighbor
searching. Journal of Information Science, 14:25-33.
Maier, D. and Stein, J. (1986). Indexing in an object-oriented database. In
Proc. IEEE Workshop on Object-Oriented DBMSs, pages 171-182.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet
representation. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 11(7):2091-2110.
Manber, U. and Wu, S. (1994). GLIMPSE: A tool to search through entire file
systems. In Proc. 1994 Winter USENIX Technical Conference, pages 23-32.
Maragos, P. (1989). Pattern spectrum and multiscale shape representation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):701-
716.
Maragos, P. and Schafer, R. W. (1986). Morphological skeleton representation
and coding of binary images. IEEE Transactions on Acoustics, Speech, and
Signal Processing, 34:1228-1244.
Matsuyama, T., Hao, L., and Nagao, M. (1984). A file organization for geo-
graphic information systems based on spatial proximity. International Jour-
nal on Computer Vision, Graphics, and Image Processing, 26(3):303-318.
Mehlhorn, K. and Tsakalidis, A. (1990). Data structures. In Handbook of The-
oretical Computer Science, Volume A, pages 301-341. Elsevier Publisher.
Mehrotra, R. and Gary, J. E. (1993). Feature-based retrieval of similar shapes.
In Proc. 9th International Conference on Data Engineering, pages 108-115.
Melton, J. (1996). An SQL3 snapshot. In Proc. 12th International Conference
on Data Engineering, pages 666-672.
Mittendorf, E. and Schauble, P. (1994). Document and passage retrieval based
on hidden Markov models. In Proc. 17th ACM-SIGIR International Confer-
ence on Research and Development in Information Retrieval, pages 318-327.
Miyahara, M. and Yoshida, Y. (1989). Mathematical transform of (R,G,B) color
data to Munsell (H,Y,C) color data. Journal of the Institute of Television
Engineers, 43(10):1129-1136.
Moffat, A. and Zobel, J. (1996). Self-indexing inverted files for fast text re-
trieval. ACM Transactions on Information Systems, 14(4):349-379.
Moffat, A., Zobel, J., and Sacks-Davis, R. (1994). Memory efficient ranking.
Information Processing & Management, 30(6):733-744.
Morrison, D. (1968). PATRICIA - Practical algorithm to retrieve information
coded in alphanumeric. Journal of the ACM, 15(4):514-534.
Morton, G. (1966). A computer oriented geodetic data base and a new technique
in file sequencing. Technical report, IBM Ltd., Ottawa, Canada.
Moss, J. (1992). Working with the persistent objects: to swizzle or not to swiz-
zle. IEEE Transactions on Software Engineering, 18(8):657-673.
Nabil, M., Ngu, A. H. H., and Shepherd, J. (1996). Picture similarity re-
trieval using the 2D projection interval representation. IEEE Transactions
on Knowledge and Data Engineering, 8(4):533-539.
Nagy, G. (1985). Image databases. Image and Vision Computing, 3(3):111-117.
Nascimento, M. A. (1996). Efficient Indexing of Temporal Database via B+-
trees. PhD thesis, School of Engineering and Applied Science, Southern
Methodist University.
Nelson, R. and Samet, H. (1987). A population analysis for hierarchical data
structures. In Proc. 1987 ACM SIGMOD International Conference on Man-
agement of Data, pages 270-277.
Ng, V. and Kameda, T. (1993). Concurrent accesses to R-trees. In Proc. 3rd
International Symposium on Advances in Spatial Databases, pages 142-161.
Niblack, W., Equitz, R. B. W., Glasman, M. F. E., Petkovic, D., Yanker, P.,
and Faloutsos, C. (1993). The QBIC project: Query images by content using
color, texture and shape. In Storage and Retrieval for Image and Video
Databases, Volume 1908, pages 173-187.
REFERENCES 239
Nievergelt, J. and Hinrichs, K. (1985). Storage and access structures for geo-
metric data bases. In Proc. International Conference on Foundations of Data
Organization, pages 335-345.
Nievergelt, J., Hinterberger, H., and Sevcik, K. C. (1984). The grid file: An
adaptable, symmetric multikey file structure. ACM Transactions on Database
Systems, 9(1):38-71.
Nievergelt, J. and Widmayer, P. (1997). Spatial data structures: Concepts and
design choices. In Algorithmic Foundations of GIS, pages 1-61. Springer-
Verlag.
Nori, A. (1996). Object relational database management systems (tutorial notes).
In Proc. 22nd International Conference on Very Large Data Bases.
ObjectStore (1995). ObjectStore C++ - User Guide Release 4.0.
Ogle, V. E. and Stonebraker, M. (1995). Chabot: Retrieval from a relational
database of images. IEEE Computer, 28(9):40-48.
Ohsawa, Y. and Sakauchi, M. (1983). The BD-tree: A new n-dimensional data
structure with highly efficient dynamic characteristics. In Proc. IFIP Congress,
pages 539-544.
Ohsawa, Y. and Sakauchi, M. (1990). A new tree type data structure with
homogeneous nodes suitable for a very large spatial database. In Proc. 6th
International Conference on Data Engineering, pages 296-303.
O'Neil, P. and Graefe, G. (1995). Multi-table joins through bitmapped join
indices. ACM SIGMOD Record, 24(3):8-11.
O'Neil, P. and Quass, D. (1997). Improved query performance with variant
indexes. In Proc. 1997 ACM SIGMOD International Conference on Man-
agement of Data.
Ooi, B. C. (1990). Efficient Query Processing in Geographic Information Systems. Springer-Verlag.
Ooi, B. C., McDonell, K. J., and Sacks-Davis, R. (1987). Spatial kd-tree: An
indexing mechanism for spatial databases. In Proc. 11th International Con-
ference on Computer Software and Applications.
Ooi, B. C., Sacks-Davis, R., and Han, J. (1993). Spatial indexing structures,
unpublished manuscript, available at http://www.iscs.nus.edu.sg/ooibc/.
Ooi, B. C., Sacks-Davis, R., and McDonell, K. J. (1991). Spatial indexing by bi-
nary decomposition and spatial bounding. Information Systems, 16(2):211-
237.
Ooi, B. C., Tan, K. L., and Chua, T. S. (1997). Fast image retrieval using color-
spatial information. Technical report, Department of Information Systems
and Computer Science, NUS, Singapore.
Orenstein, J. A. (1982). Multidimensional tries for associative searching. Infor-
mation Processing Letters, 14(4):150-157.
Orenstein, J. A. (1986). Spatial query processing in an object-oriented database
system. In Proc. 1986 ACM SIGMOD International Conference on Manage-
ment of Data, pages 326-336.
Orenstein, J. A. (1990). A comparison of spatial query processing techniques
for native and parameter spaces. In Proc. 1990 ACM SIGMOD International
Conference on Management of Data, pages 343-352.
Orenstein, J. A. and Merrett, T. H. (1984). A class of data structures for
associative searching. In Proc. 1984 ACM-SIGACT-SIGMOD Symposium
on Principles of Database Systems, pages 181-190.
Ouksel, M. and Scheuermann, P. (1981). Multidimensional B-trees: Analysis of
dynamic behavior. BIT, 21:401-418.
Overmars, M. H. and Leeuwen, J. V. (1982). Dynamic multi-dimensional data
structures based on Quad- and KD- trees. Acta Informatica, 17:267-285.
Owolabi, O. and McGregor, D. (1988). Fast approximate string matching. Soft-
ware - Practice and Experience, 18:387-393.
Papadias, D., Theodoridis, Y., Sellis, T., and Egenhofer, M. J. (1995). Topological
relations in the world of minimum bounding rectangles: A study with
R-trees. In Proc. 1995 ACM SIGMOD International Conference on Management
of Data, pages 92-103.
Paredaens, J. (1995). Spatial databases, the final frontier. In Proc. 5th Inter-
national Conference on Database Theory, pages 14-31.
Paredaens, J., Van den Bussche, J., and Van Gucht, D. (1994). Towards a theory
of spatial database queries. In Proc. 13th ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, pages 279-288.
Persin, M. (1996). Efficient implementation of text retrieval techniques. Mas-
ter's thesis, Department of Computer Science, RMIT, Melbourne, Australia.
Persin, M., Zobel, J., and Sacks-Davis, R. (1996). Filtered document retrieval
with frequency-sorted indexes. Journal of the American Society for Infor-
mation Science, 47(10):749-764.
Pfaltz, J., Berman, W., and Cagley, E. (1980). Partial-match retrieval using
indexed descriptor files. Communications of the ACM, 23(9):522-528.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
Preparata, F. and Shamos, M. (1985). Computational Geometry: An Introduc-
tion. Springer-Verlag.
Rabitti, F. and Savino, P. (1991). Image query processing based on multi-level
signatures. In Proc. 14th ACM-SIGIR International Conference on Research
and Development in Information Retrieval, pages 305-314.
Rabitti, F. and Stanchev, P. (1989). GRIM-DBMS: A graphical image database
management system. In Proc. IFIP TC2/WG2.6 1st Working Conference on
Visual Database Systems, pages 415-430.
Ramaswamy, S. (1997). Efficient indexing for constraints and temporal databases.
In Proc. 6th International Conference on Database Theory, pages 419-431.
Ramaswamy, S. and Kanellakis, P. (1995). OODB indexing by class-division.
In Proc. 1995 ACM SIGMOD International Conference on Management of
Data, pages 139-150.
Roberts, C. (1979). Partial-match retrieval via the method of superimposed
codes. Proceedings of the IEEE, 67(12):1624-1642.
Robinson, J. T. (1981). The K-D-B-tree: A search structure for large multi-dimensional
dynamic indexes. In Proc. 1981 ACM SIGMOD International
Conference on Management of Data, pages 10-18.
Rosenberg, J. B. (1985). Geographical data structures compared: A study of
data structures supporting region queries. IEEE Transactions on Computer
Aided Design, 4(1):53-67.
Rotem, D. (1991). Spatial join indices. In Proc. 7th International Conference
on Data Engineering, pages 500-509.
Rotem, D. and Segev, A. (1987). Physical organization of temporal data. In
Proc. 3rd International Conference on Data Engineering, pages 547-553.
Sacks-Davis, R., Kent, A., and Ramamohanarao, K. (1987). Multi-key access
methods based on superimposed coding techniques. ACM Transactions on
Database Systems, 12(4):655-696.
Sagiv, Y. (1986). Concurrent operations on B*-trees with overtaking. Journal
of Computer and System Sciences, 33(2):275-296.
Salomone, S. (1995). Radio days. Byte, Special Issue on Mobile Computing,
page 107.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis,
and Retrieval of Information by Computer. Addison-Wesley.
Salton, G., Allan, J., and Buckley, C. (1993). Approaches to passage retrieval
in full text information systems. In Proc. 16th ACM-SIGIR International
Conference on Research and Development in Information Retrieval, pages
49-58.
Salton, G. and McGill, M. (1983). Introduction to Modern Information Re-
trieval. McGraw-Hill.
Salzberg, B. (1994). On indexing spatial and temporal data. Information Sys-
tems, 19(6):447-465.
Samet, H. (1989). The Design and Analysis of Spatial Data Structures. Addison-Wesley.
Scheuermann, P. and Ouksel, M. (1982). Multidimensional B-trees for associa-
tive searching in database systems. Information Systems, 7(2):123-137.
Seeger, B. and Kriegel, H. (1988). Techniques for design and implementation of
efficient spatial access methods. In Proc. 14th International Conference on
Very Large Data Bases, pages 360-371.
Sellis, T., Roussopoulos, N., and Faloutsos, C. (1987). The R+-tree: A dynamic
index for multi-dimensional objects. In Proc. 13th International Conference
on Very Large Data Bases, pages 507-518.
Serra, J. (1988). Image Analysis and Mathematical Morphology, Volume 2, The-
oretical Advances. Academic Press.
Shamos, M. I. and Bentley, J. L. (1978). Optimal algorithm for structuring
geographic data. In Proc. 1st International Advanced Study Symposium on
Topological Data Structure for Geographic Information Systems.
Sharma, K. D. and Rani, R. (1985). Choosing optimal branching factors for
k-d-B trees. Information Systems, 10(1):127-134.
Shaw, G. and Zdonik, S. (1989). An object-oriented query algebra. In Proc.
2nd International Workshop on Database Programming Languages, pages
103-112.
Shen, H., Ooi, B. C., and Lu, H. (1994). The TP-index: A dynamic and ef-
ficient indexing mechanism for temporal databases. In Proc. 10th Interna-
tional Conference on Data Engineering, pages 274-281.
Sheng, S., Chandrasekaran, A., and Broderson, R. (1992). A portable multimedia
terminal for personal communications. IEEE Communications
Magazine, pages 64-75.
Shidlovsky, B. and Bertino, E. (1996). A graph-theoretic approach to indexing
in object-oriented databases. In Proc. 12th International Conference on Data
Engineering, pages 230-237.
Snodgrass, R. (1987). The temporal query language TQuel. ACM Transaction
on Database Systems, 12(2):247-298.
Sreenath, B. and Seshadri, S. (1994). The hcC-tree: An efficient index structure
for object oriented databases. In Proc. 20th International Conference on
Very Large Data Bases, pages 203-213.
Straube, D. and Ozsu, M. T. (1995). Query optimization and execution plan
generation in object-oriented data management systems. IEEE Transactions
on Knowledge and Data Engineering, 7(2):210-227.
Swain, M. J. (1993). Interactive indexing into image database. In Storage and
Retrieval for Image and Video Databases, Volume 1908, pages 95-103.
Tamminen, M. (1982). Efficient spatial access to a data base. In Proc. 1982
ACM SIGMOD International Conference on Management of Data, pages
200-206.
Tamura, H., Mori, S., and Yamawaki, T. (1978). Textural features corresponding
to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6):460-472.
Tamura, H. and Yokoya, N. (1984). Image database systems: A survey. Pattern
Recognition, 17(1):29-43.
Thom, J., Zobel, J., and Grima, B. (1995). Design of indexes for structured
document databases. Technical Report TR-95-8, Collaborative Information
Technology Research Institute, RMIT and The University of Melbourne.
Treisman, A. and Paterson, R. (1980). A feature integration theory of attention.
Cognitive Psychology, 12:97-136.
Tsay, J. J. and Li, H. C. (1994). Lock-free concurrent tree structures for mul-
tiprocessor systems. In Proc. 1994 International Conference on Parallel and
Distributed Systems, pages 544-549.
Valduriez, P. (1986). Optimization of complex database queries using join in-
dices. IEEE Bulletin on Data Engineering, 9(4):10-16.
Valduriez, P. (1987). Join indices. ACM Transactions on Database Systems,
12(2):218-246.
van Rijsbergen, C. (1979). Information Retrieval. Butterworths, second edition.
Whang, K. and Krishnamurthy, R. (1985). Multilevel grid files. Technical Report
RC-11516, IBM Thomas J. Watson Research Center.
Wilkinson, R. (1994). Effective retrieval of structured documents. In Proc. 17th
ACM-SIGIR International Conference on Research and Development in In-
formation Retrieval, pages 311-317.
Witten, I., Moffat, A., and Bell, T. (1994). Managing Gigabytes: Compressing
and Indexing Documents and Images. Van Nostrand Reinhold.
Wu, S. and Manber, U. (1992). Agrep - A fast approximate pattern-matching
tool. In Proc. 1992 Winter USENIX Technical Conference, pages 153-162.
Xie, Z. and Han, J. (1994). Join index hierarchy for supporting efficient navi-
gation in object-oriented databases. In Proc. 20th International Conference
on Very Large Data Bases, pages 522-533.
Zdonik, S. and Maier, D. (1989). Fundamentals of object-oriented databases.
In Readings in Object-Oriented Database Management Systems.
Zhou, Z. and Venetsanopoulos, A. N. (1988). Morphological skeleton represen-
tation and shape recognition. In Proc. IEEE 2nd International Conference
on ASSP, pages 948-951.
Zobel, J. and Dart, P. (1995). Finding approximate matches in large lexicons.
Software - Practice and Experience, 25(3):331-345.
Zobel, J. and Dart, P. (1996). Phonetic string matching: Lessons from infor-
mation retrieval. In Proc. 19th ACM-SIGIR International Conference on
Research and Development in Information Retrieval, pages 166-173.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1995a). Inverted files versus
signature files for text indexing. Technical Report TR-95-5, Collaborative
Information Technology Research Institute, RMIT and The University of
Melbourne.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1996). Guidelines for pre-
sentation and comparison of indexing techniques. ACM SIGMOD Record,
25(3):10-15.
Zobel, J., Moffat, A., and Sacks-Davis, R. (1992). An efficient indexing technique
for full-text database systems. In Proc. 18th International Conference
on Very Large Data Bases, pages 352-362.
Zobel, J., Moffat, A., and Sacks-Davis, R. (1993). Searching large lexicons for
partially specified terms using compressed inverted files. In Proc. 19th International
Conference on Very Large Data Bases, pages 290-301.
Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995b). Efficient
retrieval of partial documents. Information Processing & Management,
31(3):361-377.
About the Authors
Elisa Bertino is full professor of computer science in the Department of Com-
puter Science of the University of Milan. She has also been on the faculty
in the Department of Computer and Information Science of the University of
Genova, Italy. She has been a visiting researcher at the IBM Research Labo-
ratory (now Almaden) in San Jose, and at the Microelectronics and Computer
Technology Corporation in Austin, Texas. She is or has been on the editorial
board of the following scientific journals: IEEE Transactions on Knowledge and
Data Engineering, Theory and Practice of Object Systems Journal, Journal of
Computer Security, Very Large Database Systems Journal, Parallel and Distributed
Databases, and the International Journal of Information Technology. She
is currently serving as Program co-chair of the 1998 International Conference
on Data Engineering.
Beng Chin Ooi received his B.Sc. and Ph.D. in computer science from
Monash University, Australia, in 1985 and 1989 respectively. He was with
the Institute of Systems Science, Singapore, from 1989 to 1991 before joining
the Department of Information Systems and Computer Science at the National
University of Singapore, Singapore. His research interests include database
performance issues, database UI, multi-media databases and applications, and
GIS. He is the author of a monograph "Efficient Query Processing in Geographic
Information Systems" (Springer-Verlag, 1990). He has published many confer-
ence and journal papers and serves as a PC member in a number of international
conferences. He is currently on the editorial board of the following scientific
journals: International Journal of Geographical Information Systems, Journal
on Universal Computer Science, Geoinformatica and International Journal of
Information Technology.
Ron Sacks-Davis obtained his Ph.D. from the University of Melbourne in
1977. He currently holds the position of Professor and Institute Fellow at
RMIT. He has published widely in the areas of database management and
information retrieval and is an editor-in-chief of the International Journal on
Very Large Databases (VLDB) and a member of the VLDB Endowment Board.
Kian-Lee Tan received his Ph.D. in computer science, from the National
University of Singapore in 1994. He is currently a lecturer in the Depart-
ment of Information Systems and Computer Science, National University of
Singapore. He has published numerous papers in the areas of multimedia in-
formation retrieval, wireless computing, query processing and optimization in
multiprocessor and distributed systems.
Justin Zobel obtained his Ph.D. in computer science from the University
of Melbourne, where he was a member of staff from 1984 to 1990. He then
joined the Department of Computer Science at RMIT, where he is now a senior
lecturer. He has published widely in the areas of information retrieval, text
databases, indexing, compression, string matching, and genomic databases.
Boris Shidlovsky received his M.Sc. in applied mathematics and Ph.D. in
computer science from the University of Kiev, Ukraine, in 1984 and 1990 respec-
tively. He was an assistant professor in the Department of Computer Science
at the University of Kiev. From 1993 to 1996, he was with the Department of
Computer Engineering at the University of Salerno, Italy, and is currently a
member of the scientific staff at the RANK XEROX Research Center, Grenoble, France.
His research interests include the design and analysis of algorithms, indexing and
query optimization in advanced database systems, and the processing of
semistructured data on the Web.
Barbara Catania has been enrolled in a Ph.D. program in computer science at the
University of Milano, Italy, since November 1993. She received the Laurea
degree in computer science with honours from the University of Genova, Italy,
in 1993. She has also been a visiting researcher at the European Computer-Industry
Research Center, Munich, Germany, where she joined the ESPRIT
project IDEA, sponsored by the European Economic Community. Her main
research interests include constraint databases, deductive databases, and indexing
techniques for constraint and object-oriented databases.
Index
O2, 4
x-tree, 25
(1, m) index, 201
1-dimensional generalized tuple, 218
2-dimensional generalized tuple, 218, 222
access support relation, 16, 19
access time, 199, 200, 202
active mode, 196
address calculation, 191
adjacency
querying on, 154
aggregation, 7, 29
aggregation graph, 3
agrep, 213
ALL selection, 217, 222
Altavista, 211
AP-tree, 125-127
Archie, 211
B+-tree, 9, 20, 30
of color-spatial index, 91
with linear order, 129-132
B-tree, 2
for lexicons, 159
battery, 196, 198, 200
bcast wait, 199
BD-tree, 54-55
binary join index, 10, 206
bitemporal database, 114
bitemporal interval tree, 140
bitemporal relation, 118
bitmap, 207
bitmap join index, 209
bitslices, 169
Boolean queries
for text, 154-155
Boolean query evaluation
for text, 169-170
bounding rectangle, 40
bounding structure, 41
broadcast channel, 197
broadcasted data, 196
bucket, 198
BV-tree, 63-64
caching, 36
CG-tree, 24
CH-tree, 21
color, 90
CIE L*u*v, 108
color histogram, 90
Munsell HYC, 92
color index
of color-spatial index, 94
color-spatial index
for image, 91
compression
of inverted lists, 161-164
configurable index, 200, 202
constraint, 214
constraint programming, 214
constraint theory, 216, 218
content-based index
for image, 80
content-based retrieval
for image, 78
convex theory, 218
cosine measure, 155-156
data warehouse, 204
decision support system, 203
delta code, 162
detail table, 205
diagonal corner query, 219
dimension table, 205
distributed index, 201
distributed RAM, 189
doze mode, 196
dual plane, 222
dual R-tree, 140
dumb terminal, 195
dynamic interval management, 219
effectiveness
of ranking, 152
Elias codes, 161-162
emerging applications, 185-224
Excite, 211
EXIST selection, 217, 218
extension, 215
fact constellation schema, 205
fact table, 205
feature
color, 90
color-spatial, 91
semantic object, 87
shape, 84
spatial relationship, 88
texture, 89
feature extraction, 78
feature-based indexing, 78
file image, 191
file image adjustment, 192
filtering, 222
for ranking, 172
fixed host, 194
flexible indexing, 202
gamma code, 162
GBD-tree, 54-55
GemStone, 4
generalized 1-dimensional indexing, 218
generalized concordance lists
for text, 178
generalized database, 215
generalized relation, 215
generalized relational model, 215
generalized tuple, 215
Glimpse, 213
global index, 187
Golomb codes, 162-163
Gopher, 211
grid file, 64-67
H-tree, 23
Harvest, 214
hashing, 2
hB-tree, 49-51
hcC-tree, 24
image database, 77-112
image database system, 78
architecture, 79
index construction
for text, 164-166
index update
for text, 166-168
indexing
of documents, 153
indexing graph, 9
information retrieval, 152, 155-157
InfoSeek,211
infrared technology, 194
inheritance, 5, 20, 29
inheritance graph, 4
inheritance hierarchy, 20
interleaving
for ranking, 173
interval B-tree, 127-129
interval tree, 220
inverse document frequency, 156
inverted file
for image, 83
inverted index, 212
for text, 157-168
inverted lists
for text, 158, 160-164
join explicit, 5
join implicit, 5
join index, 10
join index hierarchy, 19
K-D-B-tree, 48-49
kd-tree, 46-48
non-homogeneous, 47
lexicons, 158-160
limiting accumulators
for ranking, 172
linear hashing, 189
local index, 187
locational keys, 70-71
LSD-tree, 55-56
mapping table, 158
materialization technique, 204
meta-block tree, 220
metasearcher, 213
method invocation, 3, 36
minimum bounding polybox, 224
minimum bounding rectangle, 41, 223
mobile host, 194
mobile network, 194
multi-index, 9, 17
navigational access, 2
nested attribute, 3
nested index, 14, 17
nested predicate, 5, 10, 29
nested-inherited index, 29
non-configurable index, 200
NST-tree, 126
object identifier, 3
object query language, 2, 5
object-oriented data model, 1, 3
object-oriented database, 1-38
object-relational database, 1
ObjectStore, 4
OLAP, 203
OQL, 2
ordinal number, 207
palmtop, 195
partition, 186
partitioning degree, 186
passage retrieval, 180-181
path, 7
path index, 15, 17
path instantiation, 7, 15
path splitting, 18
path-expression, 5
pattern matching
for text, 179-180
perceptually similar color, 108
phonetic matching
for text, 180
PLOP-hashing, 68-69
point location, 222
pointer swizzling, 2, 36
precomputed join, 207
probe time, 199
projection, 16
proximity
querying on, 154
query expansion
for text, 181
query graph, 6
query precomputation, 204
R+-tree, 25, 60-63
R*-tree, 59-60
R-file, 67-68
R-tree, 25, 56-59, 132-137
2-D R-tree, 133
3-D R-tree, 133
ranked query evaluation
for text, 170-175
ranking, 155-157
relevance
judgments, 152
of documents, 152
satellite network, 194
SC-index, 21
search engine, 211
semantic object, 87
sequential search, 212
set-oriented access, 2
SGML,175
shape, 84
signature file
for image, 84
for text, 168-169
of color-spatial index, 105
similarity, 155, 156
measures, 79, 82, 155
approximate match, 82
Euclidean distance, 83
exact match, 82
signature-based, 107
signature-based (weighted), 109
skd-tree, 51-54
SMAT
of color-spatial index, 96
snowflake schema, 205
spatial access method
for image, 83
spatial database, 39-75, 215
spatial index taxonomy, 42
non-overlapping, 43
overlapping, 44
transformation approach, 43
spatial operators, 39
adjacency, 40
containment, 40
intersection, 39, 41
spatial query processing, 40
approximation, 40
multi-step strategy, 42
spatial relationship, 88
SQL, 1
SQL-3, 2
stabbing query, 219
star schema, 205
stemming
of words, 154
stopwords, 156, 175
storage on the air, 196
structured documents, 175-178
indexing of, 177-178
suffixing
of words, 154
summary table, 205
temporal database, 113-149, 215
temporal index, 121-142
B+-tree with linear order, 129
temporal query, 119-121
bitemporal key-range time-slice, 120
bitemporal time-slice, 120
key, 120
key-range time-slice, 120
time-slice, 119
inclusion, 119
intersection, 119
point, 120
time-slice query
containment, 120
text database, 151-182
text indexing, 157-169
text passage retrieval, 180-181
texture, 89
time
lifespan, 115
time span, 115
transaction time, 114
valid time, 114
time index, 123-125
TP-index, 137-139
transaction time, 114-116
traversal strategy, 6
TREC, 159
TSB-tree, 122-123
tuning time, 200, 202
unary code, 161-162
valid time, 114, 116-117
variable-bit codes, 161-163
WAIS,211
walkstation, 195
Web Crawler, 214
Web navigation, 210
Web robot, 214
Webcrawler, 211
weight, 221
weight-balanced B-tree, 220
Whois,211
Whois++, 211
wireless interface, 194
WWW Worm, 214

More Related Content

What's hot (20)

PPTX
Object relational and extended relational databases
Suhad Jihad
 
PPTX
Data Redundancy & Update Anomalies
Jens Patel
 
ODP
Introduction to MongoDB
Dineesha Suraweera
 
PPTX
oops concept in java | object oriented programming in java
CPD INDIA
 
PPT
9. Document Oriented Databases
Fabio Fumarola
 
PDF
Dbms interview questions
Soba Arjun
 
PDF
Introduction to HBase
Avkash Chauhan
 
PPTX
ORDBMS.pptx
Anitta Antony
 
PPTX
Data Vault Overview
Empowered Holdings, LLC
 
PDF
librados
Patrick McGarry
 
PPT
Object Oriented Database Management System
Ajay Jha
 
PPT
Database Chapter 1
shahadat hossain
 
PDF
Paper: Oracle RAC and Oracle RAC One Node on Extended Distance (Stretched) Cl...
Markus Michalewicz
 
PPTX
PPL, OQL & oodbms
ramandeep brar
 
PPTX
Understanding LINQ in C#
MD. Shohag Mia
 
PDF
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
PgDay.Seoul
 
PDF
NoSQL et Big Data
acogoluegnes
 
PPTX
Inheritance in JAVA PPT
Pooja Jaiswal
 
PPTX
OOPs in Java
Ranjith Sekar
 
Object relational and extended relational databases
Suhad Jihad
 
Data Redundancy & Update Anomalies
Jens Patel
 
Introduction to MongoDB
Dineesha Suraweera
 
oops concept in java | object oriented programming in java
CPD INDIA
 
9. Document Oriented Databases
Fabio Fumarola
 
Dbms interview questions
Soba Arjun
 
Introduction to HBase
Avkash Chauhan
 
ORDBMS.pptx
Anitta Antony
 
Data Vault Overview
Empowered Holdings, LLC
 
librados
Patrick McGarry
 
Object Oriented Database Management System
Ajay Jha
 
Database Chapter 1
shahadat hossain
 
Paper: Oracle RAC and Oracle RAC One Node on Extended Distance (Stretched) Cl...
Markus Michalewicz
 
PPL, OQL & oodbms
ramandeep brar
 
Understanding LINQ in C#
MD. Shohag Mia
 
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
PgDay.Seoul
 
NoSQL et Big Data
acogoluegnes
 
Inheritance in JAVA PPT
Pooja Jaiswal
 
OOPs in Java
Ranjith Sekar
 

Similar to Indexing techniques for advanced database systems (20)

PDF
11.challenging issues of spatio temporal data mining
Alexander Decker
 
PDF
10.1.1.118.1099
Suresh Nannuri
 
PDF
Web_Mining_Overview_Nfaoui_El_Habib
El Habib NFAOUI
 
PDF
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
cscpconf
 
PDF
Algorithm for calculating relevance of documents in information retrieval sys...
IRJET Journal
 
PDF
An Advanced IR System of Relational Keyword Search Technique
paperpublications3
 
PDF
Research on ontology based information retrieval techniques
Kausar Mukadam
 
PDF
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr...
IJEACS
 
PDF
Az31349353
IJERA Editor
 
PPTX
2. DATABASE MODELING_Database Fundamentals.pptx
Javier Daza
 
PDF
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
IJMER
 
DOC
View the Microsoft Word document.doc
butest
 
DOC
View the Microsoft Word document.doc
butest
 
DOC
View the Microsoft Word document.doc
butest
 
DOC
Chapter1_C.doc
butest
 
PDF
A Systems Approach To Qualitative Data Management And Analysis
Michele Thomas
 
PDF
Spatio-Temporal Database and Its Models: A Review
IOSR Journals
 
PDF
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
PDF
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
PDF
Urm concept for sharing information inside of communities
Karel Charvat
 
11.challenging issues of spatio temporal data mining
Alexander Decker
 
10.1.1.118.1099
Suresh Nannuri
 
Web_Mining_Overview_Nfaoui_El_Habib
El Habib NFAOUI
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
cscpconf
 
Algorithm for calculating relevance of documents in information retrieval sys...
IRJET Journal
 
An Advanced IR System of Relational Keyword Search Technique
paperpublications3
 
Research on ontology based information retrieval techniques
Kausar Mukadam
 
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr...
IJEACS
 
Az31349353
IJERA Editor
 
2. DATABASE MODELING_Database Fundamentals.pptx
Javier Daza
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
IJMER
 
View the Microsoft Word document.doc
butest
 
View the Microsoft Word document.doc
butest
 
View the Microsoft Word document.doc
butest
 
Chapter1_C.doc
butest
 
A Systems Approach To Qualitative Data Management And Analysis
Michele Thomas
 
Spatio-Temporal Database and Its Models: A Review
IOSR Journals
 
Indexing techniques for advanced database systems

  • 2. The Kluwer International Series on ADVANCES IN DATABASE SYSTEMS Series Editor Ahmed K. Elmagarmid Purdue University West Lafayette, IN 47907 Other books in the Series: DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis by Alexander Thomasian ISBN: 0-7923-9741-X TIME-CONSTRAINED TRANSACTION MANAGEMENT: Real-Time Constraints in Database Transaction Systems by Nandit R. Soparkar, Henry F. Korth, Abraham Silberschatz ISBN: 0-7923-9752-5 SEARCHING MULTIMEDIA DATABASES BY CONTENT by Christos Faloutsos ISBN: 0-7923-9777-0 REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS by Abdelsalam A. Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava ISBN: 0-7923-9800-9 VIDEO DATABASE SYSTEMS: Issues, Products, and Applications by Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed ISBN: 0-7923-9872-6 DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS by Nabil R. Adam and Aryya Gangopadhyay ISBN: 0-7923-9924-2 INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES by Thomas A. Mueck and Martin L. Polaschek ISBN: 0-7923-9971-4
  • 3. INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS by Elisa Bertino University of Milano, Italy Beng Chin Ooi National University of Singapore, Singapore Ron Sacks-Davis RMIT, Australia Kian-Lee Tan National University of Singapore, Singapore Justin Zobel RMIT, Australia Boris Shidlovsky Grenoble Laboratory, France Barbara Catania University of Milano, Italy SPRINGER SCIENCE+BUSINESS MEDIA, LLC
  • 4. Library of Congress Cataloging-in-Publication Data A C.I.P. Catalogue record for this book is available from the Library of Congress. ISBN 978-1-4613-7856-3 ISBN 978-1-4615-6227-6 (eBook) DOI 10.1007/978-1-4615-6227-6 Copyright © 1997 Springer Science+Business Media New York Originally published by Kluwer Academic Publishers in 1997 Softcover reprint of the hardcover 1st edition 1997 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC. Printed on acid-free paper.
  • 5. Contents Preface VII 1. OBJECT-ORIENTED DATABASES 1 1.1 Object-oriented data model and query language 3 1.2 Index organizations for aggregation graphs 7 1.3 Index organizations for inheritance hierarchies 20 1.4 Integrated organizations 29 1.5 Caching and pointer swizzling 36 1.6 Summary 38 2. SPATIAL DATABASES 39 2.1 Query processing using approximations 40 2.2 A taxonomy of spatial indexes 42 2.3 Binary-tree based indexing techniques 46 2.4 B-tree based indexing techniques 56 2.5 Cell methods based on dynamic hashing 64 2.6 Spatial objects ordering 70 2.7 Comparative evaluation 71 2.8 Summary 73 3. IMAGE DATABASES 77 3.1 Image database systems 78 3.2 Indexing issues and basic mechanisms 80 3.3 A taxonomy on image indexes 84 3.4 Color-spatial hierarchical indexes 91 3.5 Signature-based color-spatial retrieval 105 3.6 Summary 109
  • 6. INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS 4. TEMPORAL DATABASES 113 4.1 Temporal databases 114 4.2 Temporal queries 119 4.3 Temporal indexes 121 4.4 Experimental study 142 4.5 Summary 148 5. TEXT DATABASES 151 5.1 Querying text databases 152 5.2 Indexing 157 5.3 Query evaluation 169 5.4 Refinements to text databases 175 5.5 Summary 181 6. EMERGING APPLICATIONS 185 6.1 Indexing techniques for parallel and distributed databases 186 6.2 Indexing issues in mobile computing 194 6.3 Indexing techniques for data warehousing systems 203 6.4 Indexing techniques for the Web 210 6.5 Indexing techniques for constraint databases 214 References 225 Index 247
  • 7. Preface Database management systems are widely accepted as a standard tool for manipulating large volumes of data on secondary storage. To enable fast access to stored data according to its content, databases use structures known as indexes. While indexes are optional, as data can always be located by exhaustive search, they are the primary means of reducing the volume of data that must be fetched and processed in response to a query. In practice large database files must be indexed to meet performance requirements. Recent years have seen explosive growth in use of new database applications such as CAD/CAM systems, spatial information systems, and multimedia information systems. The needs of these applications are far more complex than traditional business applications. They call for support of objects with complex data types, such as images and spatial objects, and for support of objects with wildly varying numbers of index terms, such as documents. Traditional indexing techniques such as the B-tree and its variants do not efficiently support these applications, and so new indexing mechanisms have been developed. As a result of the demand for database support for new applications, there has been a proliferation of new indexing techniques. The need for a book addressing indexing problems in advanced applications is evident. For practitioners and database and application developers, this book explains best practice, guiding selection of appropriate indexes for each application. For researchers, this book provides a foundation for development of new and more robust indexes. For newcomers, this book is an overview of the wide range of advanced indexing techniques. The book consists of six self-contained chapters, each handled by area experts: Chapters 1 and 6 by Bertino, Catania, and Shidlovsky, Chapters 2, 3 and 4 by Ooi and Tan, and Chapter 5 by Sacks-Davis and Zobel.
Each of the first five chapters discusses indexing problems and techniques for a different
  • 8. database application; the last chapter discusses indexing problems in emerging applications. In Chapter 1 we discuss indexes and query evaluation for object-oriented databases. Complex objects, variable-length objects, large objects, versions, and long transactions cannot be supported efficiently by relational database systems. The inadequacy of relational databases for these applications has provided the impetus for database researchers to develop object-oriented database systems, which capture sophisticated semantics and provide a close model of real-world applications. Object-oriented databases are a confluence of two technologies: databases and object-oriented programming languages. However, the concepts of object, method, message, aggregation and generalization introduce new problems to query evaluation. For example, aggregation allows an object to be retrieved through its composite objects or based on the attribute values of its component objects, while generalization allows an object to be retrieved as an instance of its superclass. Spatial data is large in volume and rich in structures and relationships. Queries that involve the use of spatial operators (such as spatial intersection and containment) are common. Operations involving these operators are expensive to compute, compared to operations such as join, and indexes are essential to reduction of query processing costs. Indexing in a spatial database is problematic because spatial objects can have non-zero extent and are associated with spatial coordinates, and many-to-many spatial relationships exist between spatial objects. Search is based not only on attribute values, but also on spatial properties. In Chapter 2, we address issues related to spatial indexing and analyze several promising indexing methods. Conventional databases only store the current facts of the organization they model.
Changes in the real world are reflected by overwriting out-of-date data with new facts. Monitoring these changes and past values of the data is, however, useful for tracking historical trends and time-varying events. In temporal databases, facts are not deleted but instead are associated with times, which are stored with the data to allow retrieval based on temporal relationships. To support efficient retrieval based on time, temporal indexes have been proposed. In Chapter 4, we describe and review temporal indexing mechanisms. In large collections of images, a natural and useful way to retrieve image data is by queries based on the contents of images. Such image-based queries can be specified symbolically by describing their contents in terms of image features such as color, shape, texture, objects, and the spatial relationships between them; or pictorially using sketches or example images. Supporting content-based retrieval of image data is a difficult problem and embraces technologies including image processing, user interface design, and database management.
  • 9. To provide efficient content-based retrieval, indexes based on image features are required. We consider feature-based indexing techniques in Chapter 3. Text data without uniform structure forms the main bulk of data in corporate repositories, digital libraries, legal and court databases, and document archives such as newspaper databases. Retrieval of documents is achieved through matching words and phrases in document and query, but for documents Boolean-style matching is not usually effective. Instead, approximate querying techniques are used to identify the documents that are most likely to be relevant to the query. Effectiveness can be enhanced by use of transformations such as stemming and methodologies such as feedback. To support fast text searching, however, indexing techniques such as special-purpose inverted files are required. In Chapter 5, we examine indexes and query evaluation for document databases. In the first five chapters we cover the indexing topics of greatest importance today. There are however many database applications that make use of indexing but do not fall into one of the above five areas, such as data warehousing, which has recently become an active research topic due to both its complexity and its commercial potential. Queries against warehouses require large numbers of joins and the calculation of aggregate functions. Another example is the use of indexes to minimize energy consumption in portable equipment used in a highly mobile environment. In Chapter 6 we discuss indexing mechanisms for several such emerging database applications. We are grateful to the many people and organizations who helped with this book, and with the research that made it possible. In particular we thank Timothy Arnold-Moore, Tat Seng Chua, Winston Chua, Cheng Hian Goh, Peng Jiang, Marcin Kaszkiel, Alan Kent, Ramamohanarao Kotagiri, Wan-Meng Lee, Alistair Moffat, Michael Persin, Yong Tai Tan, and Ross Wilkinson.
Dave Abel, Jiawei Han and Jürg Nievergelt read earlier drafts of several chapters, and provided helpful comments. We are also grateful to the Multimedia Database Systems group at RMIT, the RMIT Department of Computer Science, the Australian Research Council and the Department of Information Systems and Computer Science at the National University of Singapore. Elisa Bertino Barbara Catania Beng Chin Ooi Ron Sacks-Davis Boris Shidlovsky Kian-Lee Tan Justin Zobel
  • 10. 1 OBJECT-ORIENTED DATABASES There has been a growing acceptance of the object-oriented data model as the basis of next generation database management systems (DBMSs). Both pure object-oriented DBMSs (OODBMSs) and object-relational DBMSs (ORDBMSs) have been developed based on object-oriented concepts. Object-relational DBMSs, in particular, extend the SQL language by incorporating all the concepts of the object-oriented data model. A large number of products for both categories of DBMS are today available. In particular, all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996]. The widespread adoption of the object-oriented data model in the database area has been driven by the requirements posed by advanced applications, such as CAD/CAM, software engineering, workflow systems, geographic information systems, telecommunications, and multimedia information systems, just to name a few. These applications require effective support for the management of complex objects. For example, a typical advanced application requires handling text, graphics, bitmap pictures, sounds and animation files. Other crucial requirements derive from the evolutionary nature of applications and include multiple versions of the same data and long-lived transactions. The use of an object-oriented data model satisfies many of the above requirements. For E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
  • 11. example, an application's complex objects can be directly represented by the model, and therefore there is no need to flatten them into tuples, as when relational DBMSs are used. Moreover, the encapsulation property supports the integration of packages for handling complex objects. However, because of the increased complexity of the data model, and of the additional operational requirements, such as versions or long transactions, the design of an OODBMS or an ORDBMS poses several issues, both on the data model and languages, and on the architecture [Kim et al., 1989, Nori, 1996, Zdonik and Maier, 1989]. An important issue is related to the efficient support of both navigational and set-oriented accesses. Both types of accesses occur in applications typical of OODBMSs and ORDBMSs and both must be efficiently supported. Navigational access is based on traversing object references; a typical example is represented by graph traversal. Set-oriented access is based on the use of a high-level, declarative query language. Object query languages have today reached a certain degree of consolidation. A standard query language, known as OQL (Object Query Language), has been proposed as part of the ODMG standardization effort [Bartels, 1996, Cattell, 1993], whereas the SQL-3 standard, still under development, is expected to include all major object modeling concepts [Melton, 1996]. The two means of access are often complementary. A query selects a set of objects. The retrieved objects and their components are then accessed by using navigational capabilities [Bertino and Martino, 1993]. A brief summary of query languages is presented in Section 1.1. Different strategies and techniques are required to support the two access modalities above.
Efficient navigational access is based on caching techniques and transformation of navigation pointers into main-memory addresses (swizzling), whereas efficient execution of queries is achieved by the allocation of suitable access structures and the use of sophisticated query optimizers. Access structures typically used in relational DBMSs are based on variations of the B-tree structure [Comer, 1979] or on hashing techniques. An index is maintained on an attribute or combination of attributes of a relation. Since an object-oriented data model has many differences from the relational model, suitable indexing techniques must be developed to efficiently support object-oriented query languages. In this chapter we survey some of the issues associated with indexing techniques and we describe proposed approaches. Also, we briefly discuss caching and pointer swizzling techniques; for more details on these techniques we refer the reader to [Kemper and Kossmann, 1995]. In the remainder of this chapter, we cast our discussion in terms of the object-oriented data model typical of OODBMSs, because most of the work on indexing techniques has been developed in the framework of OODBMSs. However, most of the discussion applies to ORDBMSs as well.
  • 12. The remainder of the chapter is organized as follows. Section 1.1 presents an overview of the basic concepts of object-oriented data models, query languages, and query processing. For the purpose of the discussion, we consider an object-oriented database organized along two dimensions: aggregation and inheritance. Indexing techniques for each of those dimensions are discussed in Sections 1.2 and 1.3, respectively. Section 1.4 presents integrated organizations, supporting queries along both aggregation and inheritance graphs. Section 1.5 briefly discusses method precomputation, caching and swizzling. Finally, Section 1.6 presents some concluding remarks. 1.1 Object-oriented data model and query language An object-oriented data model is based on a number of concepts [Bertino and Martino, 1993, Cattell, 1993, Zdonik and Maier, 1989]: • Each real-world entity is modeled by an object. Each object is associated with a unique identifier (called an OID) that makes the object distinguishable from any other object in the database. OODBMSs provide objects with persistent and immutable identifiers: an object's identifier does not change even if the object modifies its state. • Each object has a set of instance attributes and methods (operations). The value of an attribute can be an object or a set of objects. The set of attributes of an object and the set of methods represent the object structure and behavior, respectively. • The attribute values represent the object's state. This state is accessed or modified by sending messages to the object to invoke the corresponding methods. • Objects sharing the same structure and behavior are grouped into classes. A class represents a template for a set of similar objects. Each object is an instance of some class. A class definition consists of a set of instance attributes (or simply attributes) and methods. The domain of an attribute may be an arbitrary class.
The definition of a class C results in a directed graph (called aggregation graph) of the classes rooted at C. An attribute of any class on an aggregation graph is a nested attribute of the class at the root of the graph. Objects, instances of a given class, have a value for each attribute defined by the class. All methods defined in a class can be invoked on the objects, instances of the class. • A class can be defined as a specialization of one or more classes. A class defined as a specialization is called a subclass and inherits attributes and methods from its superclasses.
  • 13. Figure 1.1. An object-oriented database schema. The specialization relationship among classes organizes them in an inheritance graph which is orthogonal to the aggregation graph. An example of an object-oriented database schema, which will be used as a running example, is graphically represented in Figure 1.1. In the graphical representation, a box represents a class. Within each box there are the names of the attributes of the class. Names labeled with a star denote multi-valued attributes. Two types of arcs are used in the representation. A simple arc from a class C to a class C' denotes that C' is the domain of an attribute of C. A bold arc from a class C to a class C' indicates that C is a superclass of C'. In the remainder of the discussion, we make the following assumptions. First, we consider classes as having the extensional notion of the set of their instances. Second, we make the assumption that the extent of a class does not include the instances of its subclasses. Queries are therefore made against classes. Note that in several systems, such as for example GemStone [Bretl et al., 1989], O2 [Deux, 1990], and ObjectStore [ObjectStore, 1995], classes do not have mandatory associated extensions. Therefore, applications have to use collections, or sets, to group instances of the same class. Different collections may be defined on the same class. Therefore, increased flexibility is achieved, even if the data model becomes more complex. When collections are the basis for queries, indexes are allocated on collections and not on classes [Maier and Stein, 1986]. In some cases, even though indexes are on collections, the definitions of the classes of the indexed objects must verify certain constraints for the index to be allocated on the collections. For example, in GemStone an attribute with
  • 14. an index allocated on it must be defined as a constrained attribute in the class definition, that is, a domain must be specified for the attribute. Similarly, ObjectStore requires that an attribute on which an index has to be allocated be declared as indexable in the class definition. As we discussed earlier, most OODBMSs provide an associative query language [Bancilhon and Ferran, 1994, Cluet et al., 1989, Kim, 1989, Shaw and Zdonik, 1989]. Here we summarize those features that most influence indexing techniques: • Nested predicates Because of objects' nested structures, most object-oriented query languages allow objects to be restricted by predicates on both nested and non-nested attributes of objects. An example of a query against the database schema of Figure 1.1 is: Retrieve the authors of books published by Kluwer. (Q1) This query contains the nested predicate "published by Kluwer". Nested predicates are usually expressed using path-expressions. For example, the nested predicate in the above query can be expressed as Author.books.publisher.name = "Kluwer". • Inheritance A query may apply to just a class, or to a class and to all its subclasses. An example of a query against the database schema of Figure 1.1 is: Retrieve all instances of class Book and all its subclasses published in 1991. (Q2) The above query applies to all the classes in the hierarchy rooted at class Book. • Methods A method can be used in a query as a derived attribute method or a predicate method. A derived attribute method has a function comparable to that of an attribute, in that it returns an object (or a value) to which comparisons can be applied. A predicate method returns the logical constants True or False. The value returned by a predicate method can then participate in the evaluation of the Boolean expression that determines whether the object satisfies the query.
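Path-expressions such as Author.books.publisher.name lend themselves to a simple recursive evaluation. As a rough illustration only (the classes and data below are hand-written Python stand-ins in the spirit of the Figure 1.1 schema, not the API or syntax of any OODBMS), query Q1 can be evaluated by walking the path from each Author object:

```python
# Illustrative sketch: evaluating the nested predicate of query Q1
# (Author.books.publisher.name = "Kluwer") by navigating object references.
# Class and attribute names follow Figure 1.1; the instances are invented.

class Publisher:
    def __init__(self, name):
        self.name = name

class Book:
    def __init__(self, title, publisher):
        self.title = title
        self.publisher = publisher

class Author:
    def __init__(self, name, books):
        self.name = name
        self.books = books  # multi-valued attribute

def eval_path(obj, path):
    """Resolve a path expression like 'books.publisher.name', fanning out
    over multi-valued attributes; returns the list of reachable values."""
    values = [obj]
    for attr in path.split("."):
        next_values = []
        for v in values:
            v = getattr(v, attr, None)
            if v is None:  # null link: a right-partial instantiation, dropped
                continue
            next_values.extend(v if isinstance(v, list) else [v])
        values = next_values
    return values

kluwer = Publisher("Kluwer")
addison = Publisher("Addison-Wesley")
authors = [
    Author("A[3]", [Book("B[2]", kluwer), Book("B[3]", kluwer)]),
    Author("A[4]", [Book("B[1]", addison)]),
]

# Q1: retrieve the authors of books published by Kluwer.
result = [a.name for a in authors
          if "Kluwer" in eval_path(a, "books.publisher.name")]
print(result)  # ['A[3]']
```

This is a forward traversal: each Author is instantiated and the path is followed toward Publisher, the opposite of the index-based reverse strategies discussed later in the chapter.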
A distinction often made in object-oriented query languages is between implicit joins (also called functional joins), deriving from the hierarchical nesting of objects, and explicit joins, similar to the relational join, where two objects are
  • 15. explicitly compared on the values of their attributes. Note that some query languages only support implicit joins. The motivation for this limitation is based on the argument that in relational systems joins are mostly used to recompose entities that were decomposed for normalization [Bretl et al., 1989] and to support relationships among entities. In object-oriented data models there is no need to normalize objects, since these models directly support complex objects and multivalued attributes. Moreover, relationships among entities are supported through object references; thus the same function that joins provide in the relational model to support relationships is provided more naturally by path-expressions. It therefore appears that in OODBMSs there is no strong need for explicit joins, especially if path-expressions are provided. An example of a path-expression (or simply path) is "Book.publisher.name", denoting the nested attribute "publisher.name" of class Book. The evaluation of a query with nested predicates may require the traversal of objects along aggregation graphs [Bertino, 1990, Jenq et al., 1990, Kim et al., 1988, Graefe, 1993, Straube and Ozsu, 1995]. Because in OODBMSs most joins are implicit joins along aggregation graphs, it is possible to exploit this fact by defining techniques that precompute implicit joins. We discuss these techniques in Section 1.2. In order to discuss the various index organizations, we need to summarize some topics concerning query processing and execution strategies. A query can be conveniently represented by a query graph [Kim et al., 1989]. The query execution strategies vary along two dimensions. The first dimension concerns the strategy used to traverse the query graph. Two basic class traversal strategies can be devised: • Forward traversal: the first class visited is the target class of the query (root of the query graph).
The remaining classes are traversed starting from the target class in any depth-first order. The forward traversal strategy for query Q1 is (Author Book Publisher). • Reverse traversal: the traversal of the query graph begins at the leaves and proceeds bottom-up along the graph. The reverse traversal strategy for query Q1 is (Publisher Book Author). The second dimension concerns the technique used to retrieve instances of the classes that are traversed for evaluating the query. There are two basic strategies for retrieving data from a visited class. The first strategy, called nested-loop, consists of instantiating separately each qualified instance of a class. The instance attributes are examined for qualification, if there are simple predicates on the instance attributes. If the instance qualifies, it is passed to its parent node (in the case of reverse traversal) or to its child node (in the case of forward traversal). The second strategy, called sort-domain, consists of instantiating all qualified instances of a class at once. Then all qualifying instances
  • 16. are passed to their parent or child node (depending on the traversal strategy used). The combination of the graph traversal strategies with instance retrieval strategies results in different query execution strategies. We refer the reader to [Bertino, 1990, Graefe, 1993, Jenq et al., 1990, Kim et al., 1988, Straube and Ozsu, 1995] for details on query processing strategies for object-oriented databases. 1.2 Index organizations for aggregation graphs In this section, we first present some preliminary definitions. We then present a number of indexing techniques that support efficient executions of implicit joins along aggregation graphs. Therefore, these indexing techniques can be used to efficiently implement class traversal strategies. Definition. Given an aggregation graph H, a path P is defined as C1.A1.A2.....An (n ≥ 1) where: • C1 is a class in H; • A1 is an attribute of class C1; • Ai is an attribute of a class Ci in H, such that Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n; len(P) = n denotes the length of the path; class(P) = {C1} ∪ {Ci | Ci is the domain of attribute Ai-1 of class Ci-1, 1 < i ≤ n} denotes the set of the classes along the path; dom(P) denotes the class domain of attribute An of class Cn; two classes Ci and Ci+1, 1 ≤ i ≤ n-1, are called neighbor classes in the path. □ A path is simply a branch in a given aggregation graph. Examples of paths in the database schema in Figure 1.1 are: • P1: Author.books.publisher.name len(P1)=3, class(P1)={Author, Book, Publisher}, dom(P1)=string • P2: Book.year len(P2)=1, class(P2)={Book}, dom(P2)=integer • P3: Organization.staff.books.publisher.name len(P3)=4, class(P3)={Organization, Author, Book, Publisher}, dom(P3)=string The concept of path is closely associated with that of path instantiation. A path instantiation is a sequence of objects found by instantiating a given path.
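The path properties len(P), class(P), and dom(P) defined above can be computed mechanically once the aggregation graph is known. A minimal sketch, assuming a hand-encoded dictionary form of the Figure 1.1 schema (both the encoding and the helper name are ours, introduced only for illustration):

```python
# Sketch: computing len(P), class(P), and dom(P) for a path over an
# aggregation graph. The SCHEMA encoding is an assumption made for
# illustration; class and attribute names follow Figure 1.1.

SCHEMA = {
    "Organization": {"staff": "Author"},
    "Author":       {"name": "string", "books": "Book"},
    "Book":         {"title": "string", "year": "integer",
                     "publisher": "Publisher"},
    "Publisher":    {"name": "string"},
}

def path_properties(path):
    """path is 'C1.A1.A2...An'; returns (len(P), class(P), dom(P))."""
    first, *attrs = path.split(".")
    classes, current = [first], first
    for attr in attrs:
        domain = SCHEMA[current][attr]
        if domain in SCHEMA:   # the domain is itself a class on the graph
            classes.append(domain)
        current = domain       # otherwise a primitive domain (string, ...)
    return len(attrs), classes, current

print(path_properties("Author.books.publisher.name"))
# (3, ['Author', 'Book', 'Publisher'], 'string')
print(path_properties("Book.year"))
# (1, ['Book'], 'integer')
print(path_properties("Organization.staff.books.publisher.name"))
# (4, ['Organization', 'Author', 'Book', 'Publisher'], 'string')
```

The three calls reproduce the properties of paths P1, P2, and P3 above.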
  • 17. The objects in Figure 1.2 are instances of the classes shown in Figure 1.1. The following are example instantiations of the path P3: • PI1 = O[1].A[4].B[1].P[2].Addison-Wesley (PI1 is shown in Figure 1.2 by arrows connecting the instances in PI1) • PI2 = O[2].A[3].B[2].P[4].Kluwer • PI3 = O[2].A[3].B[3].P[4].Kluwer Figure 1.2. Instances of classes of the database schema in Figure 1.1. The above path instantiations are all complete, that is, they start with an instance belonging to the first class of path P3 (that is, Organization), contain an instance for each class found along the path, and end with an instance of the class domain of the path (Publisher.name). Besides the complete instantiations, a path may have also partial instantiations. For example, A[2].B[4].P[2].Addison-Wesley is a left-partial instantiation, that is, its first component is not an instance of the first class of the path (Organization in the example), but rather an instance of a class following the first class along the path (Author in the example). Similarly, a right-partial instantiation of a path ends with an object which is not an instance of the class domain of the path. In other words, a right-partial instantiation is such that the last object in the instantiation contains a null value for the attribute referenced in the path. O[4] is a right-partial instantiation of path P3.
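The distinction between complete and right-partial instantiations can be made concrete with a small sketch. The object identifiers and links below are invented in the spirit of Figure 1.2 (the exact links are assumptions; a None-valued attribute plays the role of a null link):

```python
# Sketch: enumerating instantiations of a path over linked objects.
# OIDs and links are invented in the spirit of Figure 1.2; a None link
# makes the enumeration stop early, yielding a right-partial instantiation.

DB = {
    "O[1]": {"staff": ["A[4]"]},
    "O[4]": {"staff": None},            # null staff attribute
    "A[4]": {"books": ["B[1]"]},
    "B[1]": {"publisher": "P[2]"},
    "P[2]": {"name": "Addison-Wesley"},
}

def instantiations(oid, attrs):
    """Yield instantiations of the path starting at object oid."""
    if not attrs:
        yield [oid]
        return
    value = DB.get(oid, {}).get(attrs[0])
    if value is None:                   # right-partial: stops at this object
        yield [oid]
        return
    for v in (value if isinstance(value, list) else [value]):
        for tail in instantiations(v, attrs[1:]):
            yield [oid] + tail

# Path P3 = Organization.staff.books.publisher.name
P3_ATTRS = ["staff", "books", "publisher", "name"]
for inst in instantiations("O[1]", P3_ATTRS):
    print(".".join(inst))   # O[1].A[4].B[1].P[2].Addison-Wesley  (complete)
for inst in instantiations("O[4]", P3_ATTRS):
    print(".".join(inst))   # O[4]  (right-partial)
```

Starting the enumeration from an object other than an Organization instance (say, an Author) would produce the left-partial instantiations described above.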
  • 18. The last relevant concept we introduce here is the concept of indexing graph. The concept of indexing graphs (IG) was introduced in [Shidlovsky and Bertino, 1996] as an abstract representation of a set of indexes allocated along a path P. Given a path P = C1.A1.A2.....An, an indexing graph contains n + 1 vertices, one for each class Ci in the path plus an additional vertex denoting the class domain Cn.An of the path, and a set of directed arcs. A directed arc from vertex Ci to vertex Cj indicates that the indexing organization supports a direct association between each instance of Ci and instances of Cj obtained by traversing the path from the instance of Ci to class Cj. Note that if Ci and Cj are neighbor classes, the indexing organization materializes an implicit join between the classes. 1.2.1 Basic techniques Multi-index This organization was the first proposed for indexing aggregation graphs. It is based on allocating a B+-tree index on each class traversed by the path. Therefore, given a path P = C1.A1.A2.....An, a multi-index [Maier and Stein, 1986] is defined as a set of n simple indexes (called index components) I1, I2, ..., In, where Ii is an index defined on Ci.Ai, 1 ≤ i ≤ n. All indexes I1, I2, ..., In-1 are identity indexes, that is, they have OIDs as key values. Only the comparison operators == (identical to) and ~~ (not identical to) are supported on an identity index. The last index In can be either an identity index or an equality index, depending on the domain of An. An equality index is a regular index, like the ones used in relational DBMSs, whose key values are primitive objects, such as numbers or characters. An equality index supports comparison operators such as = (equal to), ~ (different from), <, ≤, >, ≥. As an example consider path P1=Author.books.publisher.name. There will be three indexes allocated for this path, as illustrated in Figure 1.3.
In the figure, each index is represented in tabular form. An index entry is represented as a row in the table. The first element of such a row is a key value (given in boldface), and the second element is the set of OIDs of objects holding this key value for the indexed attribute. The first index, I1, is allocated on Author.books; similarly, indexes I2 and I3 are allocated on Book.publisher and Publisher.name, respectively. Note that in the first index (I1) the special key value Null is used to record a right-partial instantiation. Therefore, the multi-index allows determining all path instantiations having null values for some attributes along the path. By contrast, determining left-partial instantiations does not require any special key value.
Figure 1.3. Multi-index for path P1 = Author.books.publisher.name (I1 on Author.books, I2 on Book.publisher, I3 on Publisher.name).

Under this organization, solving a nested predicate requires scanning a number of indexes equal to the path length. For example, to select all authors whose books were published by Kluwer (query Q1), the following steps are executed:

1. A look-up of index I3 with key value "Kluwer"; the result is {P[4]}.
2. A look-up of index I2 with key value P[4]; the result is {B[2], B[3]}.
3. A look-up of index I1 with key values B[2] and B[3]; the result is {A[3]}, which is the result of the query.

Therefore, under this organization the retrieval operation is performed by first scanning the last index allocated on the path. Then the results of this index lookup are used as keys for a search on the index preceding the last one in the path, and so forth until the first index is scanned. Therefore, this organization only supports reverse traversal strategies. Its major advantage, compared to others we describe later on, is the low update cost.

The indexing graph for the multi-index is as follows. Let P be a path of length n. The graph contains an arc from class Ci+1 to class Ci, for i = 1, ..., n. The IG for P3 = Organization.staff.books.publisher.name is shown in Figure 1.4.a.

Join index

The notion of join index was introduced to efficiently perform joins in relational databases [Valduriez, 1987]. However, the join index has also been used to efficiently implement complex objects. A binary equijoin index is defined as follows: given two relations R and S and attributes A and B, respectively from R and S, a binary equijoin index is the set BJI = {(ri, sk) | the value of A in tuple ri equals the value of B in tuple sk}, where
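The three retrieval steps above amount to a reverse traversal through the index components. A minimal sketch (ours, with plain dicts standing in for B+-trees; the entries mirror the chapter's running example):

```python
# Reverse traversal through a multi-index; dicts play the role of B+-trees.
I3 = {"Kluwer": {"P[4]"}, "Addison-Wesley": {"P[2]"}}      # Publisher.name
I2 = {"P[4]": {"B[2]", "B[3]"}, "P[2]": {"B[1]", "B[4]"}}  # Book.publisher
I1 = {"B[1]": {"A[4]"}, "B[2]": {"A[3]"},
      "B[3]": {"A[3]"}, "B[4]": {"A[2]"}}                  # Author.books

def reverse_traverse(key, indexes):
    """Scan the last index with the query key, then feed each resulting set
    of OIDs as keys into the preceding index, back to the first one."""
    current = indexes[-1].get(key, set())
    for index in reversed(indexes[:-1]):
        current = set().union(*(index.get(oid, set()) for oid in current))
    return current

# Q1: authors whose books were published by Kluwer
assert reverse_traverse("Kluwer", [I1, I2, I3]) == {"A[3]"}
```

Each step performs one batch of index look-ups, so a nested predicate over a path of length n costs n index scans, as the text notes.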
Figure 1.4. Indexing graphs: a) multi-index; b) join indexes; c) nested index; d) path index; e) access support relation.

• ri (sk) denotes the surrogate of a tuple of R (S);
• tuple ri (tuple sk) refers to the tuple having ri (sk) as surrogate.

A BJI is implemented as a binary relation, and two copies may be kept, one clustered on ri and the other on sk; each copy is implemented as a B+-tree. In aggregation graphs, a sequence of BJIs can be used in a multi-index organization to implement the various index components along a given path. We refer to such a sequence of join indexes as the JI organization. Consider path P1 = Author.books.publisher.name. The join indexes allocated for this path are listed below. They are illustrated together with some example index entries in Figure 1.5.

• The first join index BJI1 is on Author.books. The copy denoted as BJI1(a) in Figure 1.5 is clustered on OIDs of instances of Author, whereas the copy denoted as BJI1(b) is clustered on OIDs of instances of Book.
• The second join index BJI2 is on Book.publisher. The copy denoted as BJI2(a) in Figure 1.5 is clustered on OIDs of instances of Book, whereas the copy denoted as BJI2(b) is clustered on OIDs of instances of Publisher.
• The third join index BJI3 is on the attribute Publisher.name. The copy denoted as BJI3(a) in Figure 1.5 is clustered on OIDs of instances of Publisher,
whereas the copy denoted as BJI3(b) is clustered on values of attribute "name".

Figure 1.5. JI organization for path P1 = Author.books.publisher.name.

A JI organization supports both forward and reverse traversal strategies when both copies are allocated for each join index. Reverse traversal is suitable for solving queries such as query Q1 ("Retrieve the authors of books published by Kluwer."). Forward traversal arises when, given an object, all objects must be determined that are referenced directly or indirectly by this object. An example is the query "Determine the publishers of the books written by author A[3]". Reverse traversal is already supported by the multi-index. However, that technique does not support forward traversal, which must, therefore, be executed by directly accessing the objects. The usage of a sequence of JIs may make forward traversal faster when object accesses are expensive (for example, very large objects or non-optimal clustering). Moreover, forward traversal supported by a
sequence of JIs may be useful in complex queries when objects at the beginning of the path have already been selected as the effect of another predicate in the query. An example of a more complex query is "Select all books written by an author from AT&T Lab". Suppose that an index is allocated on attribute "Organization.name" and, moreover, a JI organization is allocated on the path P = Organization.staff.books. A possible query strategy could be to first select the OID of the organization named "AT&T Lab" using the index on attribute "Organization.name", and then use the JI organization in forward traversal to determine the books written by authors of the organization O[1] selected by the first index scan.

The IG for a JI organization along a path P is constructed as follows. For each pair of neighbor classes Ci and Ci+1 along path P, the graph contains two arcs (Ci, Ci+1) and (Ci+1, Ci). The former arc corresponds to the copy of the binary join index between Ci and Ci+1 clustered on class Ci, while the latter arc corresponds to the copy clustered on class Ci+1. The IG for the path P3 is presented in Figure 1.4.b.

Note that when the JI organization is used for forward traversal, the sequence of B+-trees searched in the traversal corresponds to a chain of arcs in the IG. Moreover, such a chain consists of left-to-right directed arcs only. By contrast, the use of the JI organization in a reverse traversal corresponds to a chain of arcs in the IG containing only right-to-left directed arcs.

The usage of join indexes in optimizing complex queries has been discussed in [Valduriez, 1986]. A major conclusion is that the most complex part (that is, the joins) of a query can be executed through join indexes, without accessing the base data. However, there are cases when traditional indexing (selection indexes on join attributes) is more efficient than the usage of a join index.
For example, a traditional index is more efficient than a join index when the query simply consists of a join preceded by a highly selective selection. The major conclusion is that join indexes are more suitable for complex queries, that is, queries involving several joins. The update costs for the JI organization are in general double the costs for the multi-index organization, since in the JI organization there are two copies of each join index. The update costs of the JI organization can, however, be reduced by allocating a single copy for one or more join indexes in the organization, rather than two copies. Allocating a single copy, however, makes forward or reverse traversal more expensive, depending on which copy is allocated, and therefore the correct allocation decision must be based on the expected query and update patterns and frequencies.
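The two clustered copies of each BJI, and the forward traversal they enable, can be sketched as follows (our own illustration; copy (a) is keyed on the left class, copy (b) on the right, and entries follow the running example):

```python
from collections import defaultdict

def make_bji(pairs):
    """Build the two copies of a binary join index from (left, right) pairs."""
    a, b = defaultdict(set), defaultdict(set)   # (a): forward copy, (b): reverse copy
    for left, right in pairs:
        a[left].add(right)
        b[right].add(left)
    return a, b

BJI1_a, BJI1_b = make_bji([("A[4]", "B[1]"), ("A[3]", "B[2]"),
                           ("A[3]", "B[3]"), ("A[2]", "B[4]")])   # Author.books
BJI2_a, BJI2_b = make_bji([("B[1]", "P[2]"), ("B[2]", "P[4]"),
                           ("B[3]", "P[4]"), ("B[4]", "P[2]")])   # Book.publisher

def forward(oids, copies):
    """Forward traversal: chain the (a) copies left to right along the path."""
    for copy in copies:
        oids = set().union(*(copy.get(o, set()) for o in oids))
    return oids

# "Determine the publishers of the books written by author A[3]"
assert forward({"A[3]"}, [BJI1_a, BJI2_a]) == {"P[4]"}
```

Dropping one copy of a BJI halves its update cost but removes the corresponding arc from the IG, disabling traversal in that direction, exactly the trade-off described above.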
Nested index

Both the previous organizations require, when solving a nested predicate, access to a number of indexes proportional to the path length. Different organizations have been proposed to reduce the number of indexes accessed. The first of these organizations is the nested index [Bertino and Kim, 1989], providing a direct association between an object of a class at the end of a path and the corresponding instances of the class at the beginning of the path. Consider path P1 = Author.books.publisher.name. A nested index allocated on this path contains as key values names of publishers. It associates with each publisher name the OIDs of authors that have written a book published by this publisher. Figure 1.6 shows some example entries for a nested index allocated on path P1.

Academic Press    Null
Addison-Wesley    A[2], A[4]
Elsevier          Null
Kluwer            A[3]
Microsoft         Null

Figure 1.6. Nested index for path P1 = Author.books.publisher.name.

Retrieval under this organization is quite efficient. A query such as Q1 is solved with only one index lookup. The major problem of this indexing technique is update operations, which require access to several objects in order to determine the index entries to be updated. For example, suppose that book B[4] is removed from the database. To update the index, the following steps must be executed:

1. Access object B[4] and determine the value of nested attribute "Book.publisher.name"; result: "Addison-Wesley".
2. Determine all instances of class Author having B[4] in the list of authored books; result: {A[2]}.
3. Remove A[2] from the index entry with key value equal to "Addison-Wesley"; after the removal the index entry for "Addison-Wesley" is {A[4]}.

As this example shows, update operations in general require both forward and backward traversals of objects.
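The three update steps above can be sketched as follows. This is our own illustration: the auxiliary dicts stand in for the forward references (book to publisher name) and reverse references (book to authors) that the update procedure must traverse, and we add the check that an author is dropped only when none of its remaining books carries the same key.

```python
# Nested-index maintenance on deletion of a book (sketch, our own data layout).
nested_index = {"Addison-Wesley": {"A[2]", "A[4]"}, "Kluwer": {"A[3]"}}
book_pub_name = {"B[4]": "Addison-Wesley", "B[1]": "Addison-Wesley"}  # forward refs
book_authors = {"B[4]": {"A[2]"}, "B[1]": {"A[4]"}}                   # reverse refs
author_books = {"A[2]": {"B[4]"}, "A[4]": {"B[1]"}, "A[3]": {"B[2]", "B[3]"}}

def remove_book(book):
    key = book_pub_name[book]              # step 1: forward traversal to the key value
    for author in book_authors[book]:      # step 2: reverse traversal to the authors
        author_books[author].discard(book)
        # step 3: drop the author unless another of its books keeps this key
        if not any(book_pub_name.get(b) == key for b in author_books[author]):
            nested_index[key].discard(author)

remove_book("B[4]")
assert nested_index["Addison-Wesley"] == {"A[4]"}
```

The sketch makes the cost structure visible: one forward chain per update plus one reverse chain, which is why the nested index becomes impractical when reverse references are absent.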
Forward traversal is required to determine the value of the indexed attribute (that is, the value of the attribute at the end of the path) for the modified object. Reverse traversal is required to determine the instances at the beginning of the path. The OIDs of those instances will
be removed from (added to) the entry associated with the key value determined by the forward traversal. Note that reverse traversal is very expensive when there are no reverse references among objects. In such a case, the nested index organization may not be usable.

Note that a nested index as defined above can only be used for reverse traversal. However, it would be possible, as for the JI organization, to allocate two copies of a nested index: the first having as key values the values of attribute An at the end of the path (examples of entries of this copy for path P1 are the ones we have shown earlier); the second having as key values the OIDs of the instances of the class at the beginning of the path. Therefore, for path P1 this second copy would have the entries illustrated in Figure 1.7.

A[1]    Null
A[2]    Addison-Wesley
A[3]    Kluwer
A[4]    Addison-Wesley
A[5]    Null

Figure 1.7. A nested index for path P1 = Author.books.publisher.name clustered on OIDs of instances of the class at the beginning of the path.

The use of the above nested index would be more efficient than forward traversal using the objects themselves. The IG for a nested index allocated on a path P contains only two arcs, namely (C1, Cn+1) and (Cn+1, C1). The former arc, however, is only inserted in the IG if the second copy of the nested index, supporting forward retrieval, is allocated. The IG for a nested index allocated on path P3 is shown in Figure 1.4.c.

Path index

A path index [Bertino and Kim, 1989] is based on a single index, like the nested index. The difference is that a path index provides an association between an object O at the end of a path and all instantiations ending with O. For a path of length n, the leaf-node records of a path index contain the instantiations implemented as records of n components. Example index entries for path P3 are given in Figure 1.8.
Note that a path index records, in addition to complete instantiations, left-partial and right-partial instantiations. Unlike the nested index, a path index can be used to solve nested predicates against all classes along the path. For example, the path index on P3 can be used to determine all authors of books published by Kluwer, or simply to find the books published by Kluwer.
Publisher.name      Path instantiations
Academic Press      Null
Addison-Wesley      O[1].A[4].B[1].P[2], A[2].B[4].P[2]
Elsevier            Null
Kluwer              O[2].A[3].B[2].P[4], O[2].A[3].B[3].P[4]
Null                O[4]

Figure 1.8. Path index for path P3 = Organization.staff.books.publisher.name.

This feature is also very useful when dealing with complex queries. It supports a special kind of projection, called projection on path instantiation [Bertino and Guglielmina, 1991, Bertino and Guglielmina, 1993]. This operation allows retrieving OIDs of several classes along the path with a single index lookup. For example, suppose we wish to determine all authors who have their books published by Kluwer in 1991. This query can be solved by first performing an index lookup with key value equal to Kluwer and then performing a projection on the positions of classes Author (pos=1) and Book (pos=2) on the selected index entries. That is, the first and second elements of each path instantiation verifying the nested predicate are extracted from the index. Therefore, the results of this projection in the above example are: {(A[3], B[2]), (A[3], B[3])}. Then the second element of each pair is extracted. The corresponding object is accessed and the predicate on attribute "year" is evaluated. If this predicate is satisfied, the first element of the pair is returned as the query result. For example, given the two pairs above, instances B[2] and B[3] of class Book would be accessed to verify whether the value of attribute "year" is 1991. Since only B[3] verifies the predicate, A[3] is returned as the query result. An analysis of query processing strategies using this operation is presented in [Bertino and Guglielmina, 1993].

Updates on a path index are expensive, since forward traversals are required, as in the case of the nested index. However, no reverse traversals are required.
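As an aside, the projection on path instantiations just described can be sketched as follows (our own illustration, using the P3-shaped entries of Figure 1.8; the "year" values are taken from the running example, where only B[3] was published in 1991):

```python
# Projection on path instantiations (sketch): index lookup, projection on the
# Author and Book positions, then a residual predicate on the Book objects.
path_index = {  # key value -> path instantiations (as in Figure 1.8)
    "Kluwer": [("O[2]", "A[3]", "B[2]", "P[4]"),
               ("O[2]", "A[3]", "B[3]", "P[4]")],
}
book_year = {"B[2]": 1986, "B[3]": 1991}   # illustrative "year" attribute values

def authors_published_by_in(publisher, year):
    # one index lookup, then projection on the Author and Book components
    pairs = {(inst[1], inst[2]) for inst in path_index.get(publisher, [])}
    # access each projected Book object and evaluate the "year" predicate
    return {author for author, book in pairs if book_year.get(book) == year}

assert authors_published_by_in("Kluwer", 1991) == {"A[3]"}
```

A single index lookup thus yields OIDs for two classes at once; only the residual predicate on "year" requires touching the objects themselves.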
Therefore, the path index organization can be used even when no reverse references among objects on the path are present.

The IG for a path index allocated on a path P contains n arcs, namely (Cn+1, Ci) for all i in the range 1, ..., n. The IG for a path index allocated on path P3 is shown in Figure 1.4.d.

Access support relation (ASR)

This approach is very similar to the path index in that it involves calculating all instantiations along a path and storing them in a relation. Given a path P = C1.A1.A2. ... .An, all path instantiations are stored as records in an (n+1)-ary relation. The ith attribute of that relation corresponds to the class Ci. Also, both complete and partial instantiations are represented in the table. Example index entries for path P3 are given in Figure 1.9. Two B+-trees are allocated on the first and last attributes (classes C1 and Cn+1) of the access relation to accelerate forward and reverse traversals. Like the path index, the ASR has a low retrieval cost and a quite high update cost.

Org     Author    Book    Publisher    Publisher.name
O[1]    A[4]      B[1]    P[2]         Addison-Wesley
O[2]    A[3]      B[2]    P[4]         Kluwer
O[2]    A[3]      B[3]    P[4]         Kluwer
O[4]    Null      Null    Null         Null
Null    A[2]      B[4]    P[2]         Addison-Wesley
Null    Null      Null    P[1]         Academic Press
Null    Null      Null    P[3]         Elsevier

Figure 1.9. Access support relation for path P3 = Organization.staff.books.publisher.name.

In the IG for an ASR allocated on a path P, any vertex for class Ci, i = 2, ..., n−1, has two incoming arcs (C1, Ci) and (Cn.An, Ci). Figure 1.4.e presents the indexing graph for the ASR for path P3. It contains arcs outgoing from the first and last classes in the path, on which the two B+-trees are allocated.

Comparison

A comparison among three of the basic indexing techniques, namely multi-index, nested index and path index, has been presented in [Bertino and Kim, 1989]. An important parameter in the evaluations is the degree of reference sharing. Two objects share a reference if they reference the same object as the value of an attribute. Therefore, this degree models the topology of references among objects. A more accurate model of reference topology was developed in [Bertino and Foscoli, 1995]. The main results of the comparison can be summarized as follows. For retrieval the nested index has the lowest cost, as expected, and the path index has lower cost than the multi-index.
The nested index has better performance than the path index for retrieval, because a path index contains OIDs of instances of all classes along the path, while the nested index contains OIDs of instances of only the first class in the path. However, a single path index allows predicates to be solved for all classes along the path, while the nested index does not. For update the multi-index has the lowest cost. The nested index has a slightly lower cost than the path index for path length 2. For paths
longer than 2, the nested index has a slightly lower cost than the path index if updates are on the first two classes of the path; otherwise the nested index has significantly higher cost than the path index. Note, however, that the update costs for the nested index are computed under the hypothesis that there are reverse references among objects. When there are no reverse references, update operations for the nested index become much more expensive.

1.2.2 Advanced index organizations

Each of the basic organizations described in the previous subsection is biased towards a specific kind of operation (retrieval or update). No organization supports retrieval and update operations equally well. In this subsection, we present some advanced approaches which are characterized by a customization component. Such a component allows tailoring the organizations with respect to specific query and update patterns and frequencies. The customization requires detecting an index configuration which is optimal for a given set of operations along the indexed path.

Path splitting

The path splitting approach [Bertino, 1994, Choenni et al., 1994] overcomes the problem of biased performance of the three basic techniques, namely high update costs in the nested and path index and high retrieval costs in the multi-index. The approach is based on splitting a path into several shorter subpaths, and allocating on each subpath one among the following basic organizations: multi-index, nested index, path index. For example, path P3 = Organization.staff.books.publisher.name could be split into two subpaths:

• P31 = Organization.staff.books with a multi-index allocated
• P32 = Book.publisher.name with a path index allocated.

An algorithm determining optimal configurations for paths has been developed [Bertino, 1994]. The algorithm takes as input the frequency of retrieval, insert, and delete operations for classes along the path.
Moreover, it takes into account whether reverse references exist among objects, as well as all logical and physical data characteristics. The algorithm determines the optimal splitting of a path into subpaths, and the organization to use for each subpath. The algorithm also considers, for each subpath, the choice of allocating no index. An interesting result obtained by running the algorithm is that when the degrees of reference sharing along a path are very low (that is, close to 1) and reverse references are allocated among objects, the best index configuration consists of allocating no index on the path.
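The splitting algorithm itself is not reproduced here; purely as a hypothetical illustration of the search space it explores, the sketch below enumerates every way to cut a path into contiguous subpaths and picks the cheapest configuration under a caller-supplied cost function. The cost model is a toy placeholder, not the one of [Bertino, 1994], which weighs operation frequencies and physical data characteristics.

```python
from itertools import combinations

def best_split(classes, cost):
    """Exhaustively enumerate cut points between classes and return the
    (total_cost, subpaths) pair minimizing the supplied cost function."""
    n = len(classes)
    best = None
    for k in range(n):                               # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            subpaths = [tuple(classes[bounds[i]:bounds[i + 1]])
                        for i in range(len(bounds) - 1)]
            total = sum(cost(sp) for sp in subpaths)
            if best is None or total < best[0]:
                best = (total, subpaths)
    return best

# toy cost: longer subpaths cost quadratically more to maintain
total, split = best_split(["Organization", "Author", "Book", "Publisher"],
                          lambda sp: len(sp) ** 2)
```

Under this toy cost the optimum degenerates to one index per class; the real algorithm's cost model is what makes longer subpaths (and hence nested or path indexes) win for retrieval-heavy workloads.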
The overall index configuration obtained according to the path splitting approach can simply be represented by an IG. As an example, consider the IG for the configuration of path P3 consisting of subpaths P31 with a multi-index allocated and P32 with a path index allocated, shown in Figure 1.10.a.

Figure 1.10. Indexing graphs for advanced techniques: a) path splitting; b) ASR decomposition; c) join index hierarchy.

ASR decomposition

Under the ASR organization one table is maintained for all instantiations along the path. Similarly to the path splitting approach, a path may be decomposed and different access relations allocated for each subpath. Even though [Kemper and Moerkotte, 1992] proves some properties of the ASR decomposition, it does not provide any criteria or algorithm for "optimal" partitioning. Figure 1.10.b shows the IG corresponding to a case where the ASR allocated on path P3 is decomposed into two partitions.

Join index hierarchy

This is another approach based on the join index [Valduriez, 1987]. A complete join index hierarchy (JIH) consists of basic join indexes and derived join indexes [Xie and Han, 1994]. Basic indexes, which form the base of the JI hierarchy, are supported for pairs of neighbor classes in a path P, whereas derived indexes are supported for pairs of non-neighbor classes. Derived join indexes are built from basic join indexes and, possibly, other derived join indexes. For the path P3, Figure 1.11 shows the derived join index between class Author (pos=2) and attribute Publisher.name (pos=5). Maintenance of the complete JI hierarchy is expensive in terms of both storage and update costs. Therefore, a partial JI hierarchy which contains all basic JIs and only several derived indexes seems to be more efficient for
most real cases. In the partial hierarchy, any derived join index needed for executing a query but not included in the partial JI hierarchy is derived from the indexes in the partial JI hierarchy through a sequence of join operations. The selection of the derived JIs to be included in the partial JI hierarchy is driven by some heuristics and metrics. As the performance tests reported in [Xie and Han, 1994] show, a partial JI hierarchy behaves better than the complete JI hierarchy and the ASR organization.

Author    Publisher.name
A[2]      Addison-Wesley
A[3]      Kluwer
A[4]      Addison-Wesley

Figure 1.11. Derived join index between Author and Publisher.name.

An IG corresponding to a partial JI hierarchy is characterized by the following property. If it contains an arc from class Ci to class Cj, then it contains the arc from Cj to Ci as well. Figure 1.10.c shows the IG of a partial JI hierarchy for path P3. Such a partial JI hierarchy supports basic join indexes for the following pairs of neighbor classes: (Organization, Author), (Author, Book), (Book, Publisher), (Publisher, Publisher.name). It moreover supports an additional derived join index for the pair (Author, Publisher.name).

1.3 Index organizations for inheritance hierarchies

As we discussed in Section 1.1, an object-oriented query may apply to a class only, or to a class and all its direct and indirect subclasses. Since an attribute of a class C is inherited by all its subclasses, a relevant issue concerns how to efficiently evaluate a predicate against such an attribute when the scope of the query is the inheritance hierarchy rooted at C. In this section we discuss indexing techniques addressing this issue. The various approaches are analyzed with respect to storage overhead, update and retrieval costs. Retrieval costs, in particular, depend on whether the query is a point query or a range query.
In a B+-tree index, a point query retrieves one leaf node only; the query predicate is usually an equality predicate. By contrast, a range query specifies an interval (or a set) of values for the search key and may require retrieving several leaf nodes.

Consider an attribute A defined in a class C and inherited by all its subclasses. A query against attribute A is a single-class query (SC-query) if the query scope consists of only one class from the inheritance hierarchy rooted at C. Otherwise, the query is a class-hierarchy query (CH-query) and its scope
includes a subhierarchy of the inheritance hierarchy, that is, some class in the hierarchy with all its subclasses. A CH-query is a rooted CH-query if the root of the subhierarchy in the scope coincides with the root class C. Otherwise, the query is a partial CH-query.

Consider the database schema shown in Figure 1.1. Consider the inheritance hierarchy rooted at class Book and queries against its attribute "year", which is inherited by classes Manual and Handbook. An example of an SC-query is a query which retrieves instances of one of the classes in the hierarchy (Book, Manual or Handbook). The query against the attribute "year" which retrieves instances of all three classes is a rooted CH-query. If the class Manual had a subclass called Manual_on_CD, then a query with classes Manual and Manual_on_CD in the scope would be a partial CH-query.

SC-index and CH-tree

The inheritance hierarchy indexing problem was first addressed in [Kim et al., 1989], where two possible approaches are proposed. The first approach, called single-class index (SC-index), is based on maintaining a separate B+-tree on the indexed attribute for each class in the inheritance hierarchy. Therefore, if the inheritance hierarchy has m classes, the SC-index requires m B+-trees. As an example, consider the inheritance hierarchy rooted at class Book in Figure 1.1. If the attribute "year" is frequently referred to in queries against this hierarchy, the SC-index approach requires building three indexes, one for each class in the hierarchy, namely Book, Manual and Handbook. The evaluation of a predicate against the attribute "year" would then require scanning the three indexes and performing the union of the results.

Book                 Manual             Handbook
1986    B[2]         1990    M[1]       1990    H[1]
1990    B[4]         1993    M[2]
1991    B[1], B[3]

Figure 1.12. SC-index organization for the inheritance hierarchy rooted at class Book.
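The SC-index evaluation just described, one lookup per class followed by a union, can be sketched as follows (our own illustration; index contents follow the chapter's example entries):

```python
# SC-index organization: one per-class index; a CH-query unions lookups over
# every class in its scope. Dicts stand in for the per-class B+-trees.
sc_index = {
    "Book":     {1986: {"B[2]"}, 1990: {"B[4]"}, 1991: {"B[1]", "B[3]"}},
    "Manual":   {1990: {"M[1]"}, 1993: {"M[2]"}},
    "Handbook": {1990: {"H[1]"}},
}

def ch_query(year, scope):
    """Scan one B+-tree per class in the scope and union the results."""
    return set().union(*(sc_index[c].get(year, set()) for c in scope))

# SC-query: a single class in the scope
assert ch_query(1990, ["Book"]) == {"B[4]"}
# rooted CH-query: Book and all its subclasses
assert ch_query(1990, ["Book", "Manual", "Handbook"]) == {"B[4]", "M[1]", "H[1]"}
```

The cost asymmetry is visible directly: an SC-query touches one index, while a CH-query over m classes must scan all m indexes.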
The three indexes on the attribute "year" for the classes in the inheritance hierarchy rooted at class Book are shown in Figure 1.12. This approach is very efficient for SC-queries. However, it is not optimal for CH-queries, because it requires scanning all the indexes allocated on the classes in the queried inheritance hierarchy.

The second approach, called class-hierarchy index (CH-tree), is based on maintaining a unique B+-tree for all classes in the hierarchy. An index entry in a leaf node may thus contain the OIDs of instances of any class in the
indexed inheritance hierarchy. A CH-tree allocated on the attribute "year" for the inheritance hierarchy rooted at class Book is shown in Figure 1.13.

Book, Manual, Handbook
1986    (Book, {B[2]})
1990    (Book, {B[4]}), (Manual, {M[1]}), (Handbook, {H[1]})
1991    (Book, {B[1], B[3]})
1993    (Manual, {M[2]})

Figure 1.13. Entries of CH-tree for the inheritance hierarchy rooted at class Book.

Note, from the figure, that the entry with key value equal to 1990 contains three sets of OIDs. The first set contains the OIDs of the instances of Book (B[4] in the example), whereas the second and third sets contain OIDs of manuals (M[1]) and handbooks (H[1]), respectively. Generally, a leaf node in a CH-tree consists of a key value, a key directory, and, for each class in the inheritance hierarchy, the number of elements in the list of OIDs for instances of this class that hold the key value in the indexed attribute, together with the list of OIDs itself. The key directory contains an entry for each class that has instances with the key value in the indexed attribute. An entry for a class consists of the class identifier and the offset in the index record where the list of OIDs for the class is located.

Under the CH-tree organization, an SC-query is evaluated as follows. Let C be the class against which the query is issued. The index is scanned to find the leaf-node record with the key value satisfying the query predicate. Then the key directory is accessed to determine the offset in the index record where the list of OIDs of instances of C is located. If there is no entry for class C, then there are no instances of C satisfying the predicate. A CH-query is processed in the same way, except that the lookup in the key directory is executed for each class involved in the query. In general, the performance of the CH-tree has an inverse trend with respect to the SC-index.
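The CH-tree leaf-record structure and the two evaluation procedures can be sketched as follows (our own illustration; a dict plays the role of the key directory, and entries follow Figure 1.13):

```python
# CH-tree sketch: one shared index; each leaf entry carries a key directory
# mapping every class holding the key value to its list of OIDs.
ch_tree = {
    1986: {"Book": ["B[2]"]},
    1990: {"Book": ["B[4]"], "Manual": ["M[1]"], "Handbook": ["H[1]"]},
    1991: {"Book": ["B[1]", "B[3]"]},
    1993: {"Manual": ["M[2]"]},
}

def lookup(year, scope):
    """One index scan; the key directory is then probed once per class in the
    scope (an SC-query passes a single class, a CH-query several)."""
    directory = ch_tree.get(year, {})
    return {oid for cls in scope for oid in directory.get(cls, [])}

assert lookup(1990, ["Manual"]) == {"M[1]"}                        # SC-query
assert lookup(1990, ["Book", "Manual", "Handbook"]) == {"B[4]", "M[1]", "H[1]"}
```

Both query kinds cost a single index scan here, which is exactly why the CH-tree wins for CH-queries but drags unnecessary per-class entries through memory when the scope is a single class.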
The CH-tree is more efficient for queries whose access scope involves all classes (or a significant subset of the classes) in the indexed inheritance hierarchy, whereas an SC-index is effective for queries against a single class. By contrast, the CH-tree retrieves many unnecessary leaf-node pages when the query applies to a single class only. Results of an extensive evaluation of the two indexing techniques have been reported in [Kim et al., 1989]. An important parameter in the evaluation is the distribution of key values across the classes in the inheritance hierarchy. In general, if each key value is taken by instances of only one class C (that is, a disjoint distribution), the CH-tree is less efficient than the SC-index. Conversely, if each key value is taken by instances of several classes, the CH-tree performs
better. Also, the update cost for the CH-tree is higher than in the SC-index, because a B+-tree for a single class is expected to be much smaller than a single index maintained for the entire hierarchy.

H-tree

The skewed performance of the SC-index and CH-tree for SC- and CH-queries led to more attempts to overcome the problem. The H-tree [Low et al., 1992] is a variant of the SC-index which aims at improving the performance of the SC-index for CH-queries. Like the SC-index, a separate B+-tree is maintained on the indexed attribute for each class in the inheritance hierarchy. However, unlike the SC-index, in the H-tree the B+-trees are linked based on their class-subclass relationships by pointers in the internal nodes of the B+-trees. For each pair of classes C and C' in the inheritance hierarchy, such that class C' is a direct subclass of C, a set of additional pointers is maintained from the internal nodes of the B+-tree allocated on class C to internal nodes in the B+-tree allocated on class C'. The pointers connect internal-node separators for the same values of the indexed attribute. Figure 1.14 shows a fragment of an H-tree allocated on the inheritance hierarchy rooted at class Book which indexes the "year" attribute.

Figure 1.14. Fragment of the H-tree organization for the inheritance hierarchy rooted at class Book (B+-trees of Book, Manual and Handbook linked by internal-node pointers).

To execute a CH-query, the H-tree performs a complete scan on the B+-tree allocated on the query class, followed by a partial search on each of the B+-trees allocated on the other classes in the subhierarchy rooted at the query class. The partial search is performed by following the additional pointers from the B+-tree allocated on the root class of the queried inheritance hierarchy to the B+-trees of its subclasses. Unfortunately, the usage of those additional pointers solves the problem of low performance only partially.
Although the H-tree reduces the number of accesses to the B+-tree internal
nodes, it still requires accessing more leaf-node pages than are accessed under the SC-index organization. Moreover, the reduced query cost is achieved at the expense of additional storage overhead for the pointers between B+-trees. As a consequence, the update cost in the H-tree is higher than in the SC-index.

CG-tree The CG-tree [Kilger and Moerkotte, 1994] enhances the H-tree by collecting all pointers between the indexes of different classes in special nodes which form one additional level located just before the leaf-node level of the B+-trees. Given an inheritance hierarchy of m classes, the CG-tree maintains m B+-trees, one for each class. In each B+-tree, an additional level between the internal and leaf nodes is included. Each node at this level contains a vector of m elements (called class directory) of leaf-node references. There is one element in the array for each class in the indexed inheritance hierarchy. The ith component of the class directory contains a reference to the leaf node containing those entries of the class Ci that have the same key value. The position i of class Ci is given by the preorder traversal of the inheritance hierarchy. The CG-tree has better performance than the H-tree, as it avoids reading unnecessary internal nodes. However, it may still require reading unnecessary leaf nodes. Moreover, the CG-tree has a high storage overhead and update cost because of the class directories.

hcC-tree The hcC-tree [Sreenath and Seshadri, 1994] is another organization attempting to combine the advantages of the SC-index and the CH-tree. Like the CH-tree, it is based on maintaining a single B+-tree-like data structure to index the entire inheritance hierarchy. In addition to the usual internal and leaf nodes of a standard B+-tree used for indexing the attribute values, it includes a new type of nodes, the so-called OID nodes.
The OID nodes lie one level below the leaf nodes and contain the lists of OIDs related to the attribute values. Given an inheritance hierarchy with m classes, the hcC-tree maintains m + 1 chains of OID nodes: m class chains (one chain for each class) and one hierarchy chain of OID nodes corresponding to the entire inheritance hierarchy. The class chain for a class C groups the OIDs belonging to C, and the hierarchy chain groups the OIDs of all instances of all the classes in the inheritance hierarchy. Practically, a class chain looks like the chain of leaf nodes in an SC-index, whereas the hierarchy chain is similar to the chain of leaf nodes in a CH-tree. The OID nodes are referenced by entries in the leaf nodes. Each leaf-node entry, in addition to key values, contains a bitmap with m bits and a set P of (m + 1) pointers. Each bit in the bitmap corresponds to a class in the
inheritance hierarchy such that, if the ith bit is set, the ith pointer in P points to the first node in the class chain for the class containing OIDs with the key value. Each internal-node entry consists of a key value, a node pointer and an m-bit bitmap. For SC-queries, the performance of the hcC-tree is comparable to that of the SC-index, as it requires searching only one class chain. For rooted CH-queries over a key range, the hcC-tree's performance is comparable to that of the CH-tree, as it requires searching only the hierarchy chain. However, for partial CH-queries over a key range, the hcC-tree behaves like the SC-index, because it requires searching a number of class chains equal to the number of classes in the query class scope. Furthermore, as the hcC-tree stores each OID twice (in one class chain and in the hierarchy chain), it incurs a high storage overhead and update cost.

x-tree All the above approaches basically use one of two mutually exclusive grouping methods. The SC-index, the H-tree and the CG-tree group attribute values in the leaf nodes of a B+-tree on the basis of the class in which the instances with the value appear. By contrast, the CH-tree and the hcC-tree group on the values of the indexed attribute regardless of the class the instances with the value belong to. Because of this dichotomy, the various indexing techniques behave differently for different queries. Indexing techniques based on the first grouping method are always more efficient for SC-queries, whereas techniques based on the second grouping method are always more efficient for CH-queries. The above considerations have led researchers to the insight that the search space for class-hierarchy indexing is actually 2-dimensional, with the indexing attribute values extended along one dimension (attribute-dimension) and the classes in the hierarchy extended along the second dimension (class-dimension).
As a result, grouping of the indexed values should extend in both directions. In such a case, the several techniques supporting multi-dimensional indexing, like the R-tree, quad-tree, grid file, etc. [Ooi, 1990], can be used for indexing an inheritance hierarchy. Figure 1.15 represents data from the inheritance hierarchy rooted at class Book as a 2-dimensional search space. Using such a representation, the query Q2 "Retrieve all instances of class Book and all its subclasses printed in 1991" becomes a rectangular domain in the data plane. The x-tree [Chan et al., 1997] is a dynamic indexing technique similar to the R-tree [Guttman, 1984] and R*-tree [Beckmann et al., 1990]. Data are stored in the leaf nodes, which all appear at the same level of the tree. Each leaf-node entry consists of the key value K, the object identifier oid and the identifier cid of the class the object belongs to. If all entries with the same key value K do not fit in one leaf node, two or more nodes are allocated and all node entries with the same class identifier are grouped together.
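To make the 2-dimensional view concrete, here is a minimal sketch (our own illustrative Python, not the x-tree's actual page layout): each index entry is a point (attribute value, class), and a CH-query such as Q2 is a rectangle over a key range and a set of classes.

```python
# Illustrative sketch of the 2-dimensional search space for class-hierarchy
# indexing. Entry and ch_query are our names, not from [Chan et al., 1997].
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    key: int   # indexed attribute value, e.g. "year"
    cid: str   # class identifier
    oid: str   # object identifier

def ch_query(entries, classes, lo, hi):
    """Return OIDs of instances of `classes` with lo <= key <= hi,
    i.e. the points inside the query rectangle."""
    return [e.oid for e in entries
            if e.cid in classes and lo <= e.key <= hi]

entries = [Entry(1990, "Book", "B[1]"), Entry(1991, "Book", "B[2]"),
           Entry(1991, "Manual", "M[1]"), Entry(1993, "Handbook", "H[1]")]

# Q2: all instances of Book and its subclasses printed in 1991
print(ch_query(entries, {"Book", "Manual", "Handbook"}, 1991, 1991))
# ['B[2]', 'M[1]']
```

A real x-tree organizes such points in a paged tree rather than a flat list; the sketch only shows why a CH-query is a rectangle in this plane.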
Figure 1.15. Objects from the hierarchy rooted at Book as a 2-dimensional search plane (attribute dimension: years 1986-1993; class dimension: Book, Manual, Handbook; objects B[1]-B[4], M[1], M[2], H[1]; query Q2 shown as a rectangle).

The internal nodes contain entries of the form (cidSet, Kmin, Kmax, P), where cidSet is a subset of the classes in the indexed inheritance hierarchy, [Kmin, Kmax] is a subrange of the attribute domain, and P is a pointer to a child node at the next level. In the internal nodes of the x-tree, all node entries with the same set of classes are clustered together into the same record. As the node-splitting strategy in the R-tree is more complicated than in a B+-tree and often depends on the data shape and distribution, the x-tree uses some heuristics for node splitting based on a special proximity cost metric. The heuristic generates a list of candidate node splits along both the class-dimension and the attribute-dimension. The candidates are generated on the basis of a low proximity cost of the split. After the generation step, the best candidate is selected as the final node split. As performance tests show, the x-tree outperforms the CH-tree for most types of query. As can be expected, the only exception is for queries against all the classes in the indexed inheritance hierarchy. In such a case, the x-tree fetches about 80% more pages than the CH-tree. Also, like the R-tree, which has a lower space utilization than the B+-tree, the x-tree is higher and requires larger storage space than the CH-tree.

Good worst-case indexing techniques The x-tree is more efficient than all the previous index organizations for a wide range of queries and data distributions. Yet, it does not have a good worst-case performance, because it uses the R-tree as underlying data structure and some heuristics for node splitting.
An approach with a proven good worst-case performance was proposed in [Kanellakis and Ramaswamy, 1996, Ramaswamy and Kanellakis, 1995]. A key assumption is that the class-dimension in the 2-dimensional data space is static,
that is, no classes in the hierarchy may be removed or inserted even though objects of the classes may be updated.

Figure 1.16. Class-division: a) Example hierarchy with classes {A}, {B}, {C}, {D}, {E}, {F}; b) Binary tree on the class-dimension; c) A CH-query against class C in the 2-dimensional data space.

This reduces indexing the inheritance hierarchy to a special case of external dynamic 2-dimensional range searching in which the data in the 2-dimensional space are points whose y-coordinates belong to a static set corresponding to the set of classes. A given class hierarchy H is preprocessed as follows. We create a family G where each member is a set of classes from H. After the preprocessing, B+-tree indexes are maintained for the union of the classes in each member of G. If a CH-query is against class C in the hierarchy H, a subset of indexes is queried, which exactly covers the subhierarchy of C and which involves at most q indexes, where q is a small integer. On the other hand, a class is allowed to appear in at most a small number r of members of G, so an object can have at most r replicas. Updates are processed by changing all replicas. In other words, the preprocessing solves the following combinatorial problem, which is named class-division of H according to maximal replication factor r and maximal query factor q:

Input: Class hierarchy H with m classes, and positive integers r and q.

Output: A family G, whose members are sets of classes from H, such that (1) No class appears in more than r members of G. (2) For any class C in H, with C' its set of subclasses in H including C itself, there are at most q members of G that exactly cover C' (the union of at most q members of G is C').
The SC-index is an example of class-division with q = m and r = 1. Similarly, class-division is possible for q = 1 and r = m, when B+-tree indexes are maintained for all subhierarchies in H and each object can have up to m replicas. In the general case, there exists the following efficient space-time tradeoff:
For any class hierarchy H with m classes, it is possible to perform class-division of H according to r = ⌈log2 m⌉ + 1 and q = 2⌈log2 m⌉. To prove this, we recall that every CH-query can be represented as a 2-dimensional range query, with two ranges extended along the attribute-dimension and the class-dimension. To make all classes from a subhierarchy contiguous along the class-dimension, the classes of H should be sorted according to the preorder hierarchy traversal. When performing such a traversal of H, we build a binary tree on the class-dimension. The leaves of the tree, scanned from left to right, contain the classes in preorder-traversal order of H, while an internal node contains the union of the classes in all leaves of its subtree. Therefore, the tree has m leaves and ⌈log2 m⌉ + 1 levels. In Figure 1.16.a the class hierarchy consists of six classes and the preorder traversal of the hierarchy is given by {A, B, C, D, E, F}. The binary tree built for the hierarchy is given in Figure 1.16.b. Once the tree is built, the family G is obtained by generating family members for all nodes of the tree. Because each class is present in at most one node on each level and the binary tree has ⌈log2 m⌉ + 1 levels, no object has more than ⌈log2 m⌉ + 1 replicas in G. A CH-query corresponds to a range along the class-dimension in the preorder sort. To minimize the number of members of G (or nodes of the tree) covering the query class range, we select those nodes vi of the binary tree which are completely contained in the query range while their parents are not. The query issued against class C (see Figure 1.16.c) gives the class range {A, B, C}, and the minimal cover for the range is given by nodes {A, B} and {C} (see the shaded nodes in Figure 1.16.b). In the worst case, the query class range has two such nodes vi on each level of the tree and 2⌈log2 m⌉ nodes in total.
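The construction just described, a binary tree over the preorder-sorted classes and the minimal cover of a query range, can be sketched as follows (illustrative Python; the function names and the tuple-of-classes representation are ours, not from the cited papers):

```python
# Sketch of class-division for the six-class hierarchy of Figure 1.16.
# Members of G correspond to nodes of a binary tree over the preorder list.

def build_levels(classes):
    """Build the binary tree bottom-up; returns levels, leaves first."""
    levels, level = [], [(c,) for c in classes]
    while len(level) > 1:
        levels.append(level)
        # pair up adjacent nodes to form the next level
        level = [tuple(sum(level[i:i + 2], ()))
                 for i in range(0, len(level), 2)]
    levels.append(level)  # root level
    return levels

def minimal_cover(levels, lo, hi):
    """Cover the contiguous preorder range [lo, hi] with tree nodes that
    lie inside the range while their parents do not (largest first)."""
    target = {levels[0][i][0] for i in range(lo, hi + 1)}
    cover = []
    for level in reversed(levels):          # root level first
        for node in level:
            covered = set(sum(cover, ()))
            if set(node) <= target - covered:
                cover.append(node)
    return cover

levels = build_levels(["A", "B", "C", "D", "E", "F"])
# CH-query against class C covers the preorder range {A, B, C}:
print(minimal_cover(levels, 0, 2))
# [('A', 'B'), ('C',)]
```

The output reproduces the cover {A, B}, {C} of Figure 1.16.c; at most two nodes per level can enter the cover, which is where the 2⌈log2 m⌉ query bound comes from.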
That is, one can answer class-indexing queries on any class by looking at no more than 2⌈log2 m⌉ indexes. This gives the time-space tradeoff previously stated. As a B+-tree is maintained for each member of G, this tradeoff allows one to construct an efficient data structure in external storage which occupies O(log2 m (N/B)) pages and has worst-case I/O query time O(log2 m logB N + T/B), where B is the size of the external-memory page, m is the number of classes in the inheritance hierarchy, N is the number of objects in the inheritance hierarchy and T is the number of objects the query retrieves. The update time in such a structure is O(log2 m logB N). The above schema provides the worst-case complexities for any class hierarchy. However, for many hierarchies, the values of r and q may be further improved by using heuristics, some of which were discussed in [Ramaswamy and Kanellakis, 1995]. Also, an improvement of the data structure that reduces the query time from O(log2 m logB N + T/B) to O(logB N + log2 B + T/B) was proposed in [Kanellakis and Ramaswamy, 1996].
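As a quick sanity check of the stated bounds for the six-class hierarchy of Figure 1.16 (illustrative arithmetic only):

```python
# Replication and query factors of the class-division tradeoff for m = 6.
import math

m = 6                               # classes of Figure 1.16
r = math.ceil(math.log2(m)) + 1     # maximal replication factor
q = 2 * math.ceil(math.log2(m))     # maximal query factor
print(r, q)
# 4 6
```

So each object has at most 4 replicas, and any CH-query touches at most 6 of the B+-tree indexes.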
1.4 Integrated organizations

Even though we have addressed indexing techniques separately for each dimension along which an object database is organized (namely, aggregation and inheritance), most object-oriented queries involve classes along both dimensions. Such queries typically contain nested predicates and have as a target any number of classes in a given inheritance hierarchy. The query that retrieves all books and manuals written by authors from AT&T Lab. is an example of such queries. Developing integrated indexing techniques able to support such queries is crucial. In principle, every indexing technique defined for one dimension could be combined with any technique defined for the other dimension. However, no integrated indexing technique has been proposed, with the exception of the nested-inherited index [Bertino and Foscoli, 1995], which we describe in the remainder of this section. The nested-inherited index is defined as a combination of concepts from the nested index, the join index and the CH-tree techniques. In order to present this indexing technique, we need some additional definitions. To simplify the following discussion, we make the assumption that a class occurs only once in a path. First we recall that, given a class C, C' denotes the set of classes in the inheritance hierarchy rooted at C. As an example, consider the object-oriented schema in Figure 1.1: Book' = {Book, Manual, Handbook}. Given a path P = C1.A1.A2...An (n ≥ 1), the scope of P is defined as the union of the sets Ci', for Ci ∈ class(P). Class C1 is the root of the scope. Given a class C in the scope of a path, the position of C is given by an integer i such that C belongs to the inheritance hierarchy rooted at class Ci, where Ci ∈ class(P). The scope of a path simply represents the set of all classes along the path and all their subclasses.
For example, consider the path P = Organization.staff.books.publisher.name: scope(P) = {Organization, Author, Book, Manual, Handbook, Publisher}. Class Organization is the root of P. Class Organization has position one, class Author has position two, classes Book, Manual and Handbook have position three, and class Publisher has position four. In the remainder of the discussion, given an object O, we will use the term parent object to denote an object that references O. For example, the parents of the instance M[1] of class Manual are objects A[1] and A[4], instances of class Author. Given a path P = C1.A1.A2...An, the nested-inherited index associates with a value v of attribute An the OIDs of the instances of each class in the scope of P having v as value of the (nested) attribute An. A nested-inherited index on path P = Organization.staff.books.publisher.name associates with a given publisher name all organizations having in their staff authors of books or manuals or
handbooks published by the publisher. Similarly for all the other classes in the scope. Logically, the index will contain the following entries:

Academic Press (Publisher, {P[1]})
Addison-Wesley (Organization, {O[1]}), (Author, {A[1], A[2], A[4]}), (Book, {B[1], B[4]}), (Manual, {M[1]}), (Publisher, {P[2]})
Elsevier (Organization, {O[3]}), (Author, {A[5]}), (Handbook, {H[1]}), (Publisher, {P[3]})
Kluwer (Organization, {O[2]}), (Author, {A[3]}), (Book, {B[2], B[3]}), (Publisher, {P[4]})
Microsoft (Manual, {M[2]}), (Publisher, {P[5]})

Figure 1.17. Nested-inherited index for path P = Organization.Author.Book.Publisher.

The nested-inherited index, as the nested index and the path index, supports efficient retrieval operations. However, unlike those two organizations, the nested-inherited index does not require object traversals for update operations, because of some additional information that is stored in the index. The format of a non-leaf node has a structure similar to that of traditional indexes based on the B+-tree. The record in a leaf node, called primary record, has a different structure. It contains the following information:

• record-length
• key-length
• key-value
• class-directory
• for each class in the path scope, the number of elements in the list of OIDs for the objects that hold the key-value in the indexed attribute, and the list of OIDs.

The class-directory contains a number of entries equal to the number of classes having instances with the key-value in the indexed attribute. For each such class Ci, an entry in the directory contains:

• the class identifier
• the offset in the primary record where the list of OIDs of Ci instances is stored
• the pointer to an auxiliary record where the list of parents is stored for each instance of Ci.

An auxiliary record is allocated for each class, except for the root class of the path and for its subclasses. An auxiliary record consists of a sequence of 4-tuples. A 4-tuple has the form: (oid_i, pointer to primary record, no-oids, {p-oid_i1, ..., p-oid_ij}). There are as many 4-tuples as the number of instances of Ci having the key-value in the indexed attribute. For an object Oi, the tuple contains the identifier of Oi, the pointer to the primary record, the number of parent objects of Oi, and the list of parent objects. In the 4-tuple definition above, no-oids denotes the number of parent objects, and p-oid_ij denotes the j-th parent of Oi. Auxiliary records are stored in different pages than primary records. Given a primary record, there are several auxiliary records that are connected to it. A second B+-tree is superimposed on the auxiliary records. The second B+-tree indexes the 4-tuples based on the OIDs that appear as the first elements of the 4-tuples. Therefore, the index organization actually consists of two indexes. The first, called the primary index, is keyed on the values of attribute An. It associates with a value v of An the set of OIDs of instances of all classes relative to the path that have v as value of the (nested) attribute. The second index, called the auxiliary index, has OIDs as indexing keys. It associates with the OID of an object O the list of OIDs of the parents of O. Leaf-node records in the primary index contain pointers to the leaf-node records in the auxiliary index, and vice versa. The reason for the auxiliary index is to provide all the information for updating the primary index without accessing the objects themselves. Recall that when updates are executed, the nested index may require object forward and reverse traversals, while the path index only requires forward traversals.
By contrast, the nested-inherited index does not require any access to the objects. The reason for this organization will become clearer when discussing the operations. Figure 1.18 provides an example of the partial index contents for the objects shown in Figure 1.2. The IG for a nested-inherited index contains three sets of arcs. First, because the primary index associates each value of attribute Cn.An with the instances of all classes in the scope of the indexed path, the IG contains arcs from vertex Cn.An to classes Ci, where i = 1, ..., n. Second, it contains arcs from Ci to Cn.An, i = 2, ..., n. Finally, the IG contains arcs from Ci+1 to Ci, i = 1, ..., n - 1. The IG for the path P = Organization.staff.books.publisher.name is shown in Figure 1.19. We now discuss how retrieval, insert, and delete operations are performed on the nested-inherited index. For ease of presentation, we will use examples
Figure 1.18. Example of index contents in a nested-inherited index: a non-leaf node record in the primary B+-tree; the primary record for key value "Addison-Wesley" with its class directory (Organization: O[1]; Author: A[1], A[2], A[4]; Book: B[1], B[4]; Manual: M[1]; Publisher: P[2]); an auxiliary record for class Author; and a non-leaf node record in the auxiliary B+-tree.

Figure 1.19. Indexing graph of the nested-inherited index for path P = Organization.Author.Book.Publisher.name.
to describe the operations. Formal algorithms are presented in [Bertino and Foscoli, 1995].

Retrieval The nested-inherited index supports a fast evaluation of predicates on the indexed attribute for queries having as target any class, or class hierarchy, in the scope of the path ending with the indexed attribute. As an example, consider a query that retrieves the organizations whose staff members have published books with Addison-Wesley. This query is executed by first performing a lookup on the primary index with key value equal to "Addison-Wesley". The primary record is then accessed. A lookup in the class directory is executed to determine the offset where the OIDs of Organization instances are stored. Then those OIDs are fetched and returned as the result of the query. For our query, the result is {O[1]}. We now consider a query that retrieves the books published by Addison-Wesley. The same steps as before are executed. The only difference is that the class-directory lookup is executed for classes Book, Manual, and Handbook. Since the entry for class Handbook is empty, only the record portions for classes Book and Manual are accessed, with offsets obtained from the class-directory. The query result, {B[1], B[4], M[1]}, is generated by merging the lists of OIDs returned for classes Book and Manual. Therefore, the retrieval operation is similar to retrieval in a CH-tree [Kim et al., 1989]. The main difference, however, is that a nested-inherited index can be used for queries on all class hierarchies found along a given path. By contrast, the CH-tree is allocated on a single inheritance hierarchy. Therefore, if a path has length n, the number of CH-trees allocated would be n.

Insert Suppose that a new book B[5] with author A[4] is created, with P[2] as value of attribute "publisher". B[5] is therefore a new parent of P[2].
The overall effect of the insertion in the index must be that B[5] is added to the primary record with key-value equal to "Addison-Wesley", and to the parent list of P[2]. The following steps are executed:

1. The auxiliary index is accessed with key-value equal to P[2].
2. The 4-tuple of P[2] is retrieved and modified by adding B[5] to the list of P[2] parents.
3. From the 4-tuple of P[2] the pointer to the primary record is determined.
4. The primary record is accessed.
5. A look-up is executed on the class directory in the primary record to determine the offset where the OIDs of the class Book are stored.
6. B[5] is added to the list of OIDs stored at the offset determined at the previous step.
7. A 4-tuple for B[5] is inserted in the auxiliary index with {A[4]} as the author list.

Note that there is no need to execute a look-up of the primary index, since the address of the primary record can be directly determined from the auxiliary record.

Delete Suppose now that manual M[1] is removed. The overall effect of this operation on the index must be that M[1] and all instances referencing M[1] (that is, O[1], A[1] and A[4]) be eliminated from the primary record with key-value equal to "Addison-Wesley". Moreover, the 4-tuples for instances M[1], O[1], A[1] and A[4] must be eliminated. Finally, M[1] must be eliminated from the parent list of P[2]. Note that the update to the parent list of P[2] may not be needed if P[2] is removed as well; in this case it may be better to accumulate several delete operations on the same index. However, we will include that update to exemplify the algorithm.

1. The value of attribute "publisher" of M[1] is determined. This value is the OID P[2].
2. The auxiliary index is accessed with key-value equal to P[2].
3. The 4-tuple of P[2] is retrieved and modified by removing M[1] from the list of parents of P[2].
4. From the 4-tuple of P[2] the pointer to the primary record is determined.
5. The primary record is accessed.
6. A look-up is executed on the class-directory in the primary record to determine the offset where the OIDs of the class Manual are stored and the pointer to the auxiliary record for class Manual.
7. M[1] is removed from the list of OIDs stored at the offset determined at the previous step.
8. The auxiliary record of class Manual is accessed and the 4-tuple containing as first element the OID M[1] is determined. From this tuple, the OIDs of the M[1] parents are determined. Those are A[1] and A[4]. Then the 4-tuple of M[1] is removed.
9.
The 4-tuples of A[1] and A[4] are accessed to retrieve the parent lists.
10. A lookup is executed on the class-directory in the primary record to determine the offset where the OIDs of the class Author are stored.
11. A[1] and A[4] are removed from the list of OIDs stored at the offset determined at the previous step.
12. A lookup is executed on the class-directory in the primary record to determine the offset where the OIDs of class Organization are stored.
13. O[1] is removed from the list of OIDs stored at the offset determined at the previous step.

The delete operation may appear rather costly. However, note that the primary record is accessed only once from secondary storage. Several modifications may be required on this record. However, the record can be kept in memory and written back after all modifications have been executed. Also note that the algorithm may require accessing several auxiliary records. However, they are all connected to the same primary record. Therefore, they are likely to be in the same page.

A preliminary comparison among the nested-inherited index and two other organizations has been presented in [Bertino, 1991a, Bertino and Foscoli, 1995]. The first of the two organizations is a multi-index organization and simply consists of allocating an index on each class in the scope of the path. In the example of path P = Organization.staff.books.publisher.name, seven indexes would be allocated. The second organization, called inherited-multi-index, consists of allocating an inherited index on each inheritance hierarchy found along the path. Therefore, the inherited-multi-index is a combination of the CH-tree organization (defined for inheritance hierarchies) with the multi-index organization (defined for the aggregation hierarchy). For the same path P, there would be a CH-tree rooted at class Book (thus indexing Book, Manual and Handbook), and three B+-tree indexes on classes Organization, Author and Publisher.
Major results from the comparison are the following:

• The nested-inherited index has the best retrieval performance.
• The nested-inherited index has quite good performance for the insert operation, since it requires an additional cost of at most three I/O operations with respect to the other two organizations.
• The delete operation for the nested-inherited index has in the worst case an additional cost of 4 × i (where i is the position of the class in the path) with respect to the other organizations.

An accurate model of those costs has been recently developed in [Bertino and Foscoli, 1995].
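The interplay of primary and auxiliary records in the insert operation described earlier can be sketched with plain dictionaries standing in for the two B+-trees (illustrative Python; the record layout is greatly simplified and the names are ours, not from [Bertino and Foscoli, 1995]):

```python
# Simplified model of a nested-inherited index for key "Addison-Wesley".
# primary: key value -> class directory (class -> list of OIDs)
primary = {
    "Addison-Wesley": {"Organization": ["O[1]"],
                       "Author": ["A[1]", "A[2]", "A[4]"],
                       "Book": ["B[1]", "B[4]"],
                       "Manual": ["M[1]"],
                       "Publisher": ["P[2]"]},
}
# auxiliary: OID -> (key of its primary record, list of parent OIDs)
auxiliary = {
    "P[2]": ("Addison-Wesley", ["B[1]", "B[4]", "M[1]"]),
}

def insert(new_oid, cls, referenced_oid, parents):
    """Insert without touching the objects: the auxiliary entry of the
    referenced object leads straight to the right primary record."""
    key, parent_list = auxiliary[referenced_oid]      # steps 1-3
    parent_list.append(new_oid)                       # new parent of P[2]
    primary[key].setdefault(cls, []).append(new_oid)  # steps 4-6
    auxiliary[new_oid] = (key, list(parents))         # step 7

insert("B[5]", "Book", "P[2]", ["A[4]"])
print(primary["Addison-Wesley"]["Book"])
# ['B[1]', 'B[4]', 'B[5]']
```

The point of the sketch is that no object is dereferenced: the auxiliary entry of P[2] already stores both its parent list and the location of the primary record.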
The nested-inherited index does not support any customization with respect to the operation profile (see Subsection 1.2.1). Nevertheless, it may be successfully used in the path-splitting approach, together with other basic techniques, as an index allocated on some subpath which contains one or more inheritance hierarchies.

1.5 Caching and pointer swizzling

The indexing techniques we have discussed so far are based on object structures, that is, on object attributes. Another possibility is to provide indexing based on object behavior, that is, on method results [Bretl et al., 1989]. Techniques based on this approach have been proposed in [Bertino, 1991b, Bertino and Quarati, 1991, Jhingran, 1991, Kemper et al., 1994]. Most techniques are based on precomputing or caching the results of method invocations. Moreover, precomputed results can be stored in an index, or other access structures, so that it is possible to efficiently evaluate queries containing the invocation of the method. A major issue of this approach is how to detect when the computed method results are no longer valid. In most approaches some dependency information is kept. This dependency information keeps track of which objects (and possibly which attributes of each object) have been used to compute a given method. When an object is modified, all precomputed method results that have used that object are invalidated. Different solutions to the problem of dependencies can be devised, depending also on the characteristics of the method. In the approach proposed in [Kemper et al., 1994], a special structure (implemented as a relation) keeps track of these dependencies. A dependency records the fact that the object whose identifier is oidj has been used in computing the method of name method_name with input parameters <oid1, oid2, ..., oidk>.
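The dependency bookkeeping described above can be sketched with a relation-like table (illustrative Python; the method name and cached value are invented for the example, and real systems would store the table persistently):

```python
# Relation-like dependency table: each row records that used_oid was read
# while computing method_name on the given input parameters.
dependencies = [
    ("B[1]", "price_with_tax", ("B[1]",)),   # hypothetical method
    ("P[2]", "price_with_tax", ("B[1]",)),   # it also read P[2]
]
# Precomputed method results, keyed by (method, input parameters).
cache = {("price_with_tax", ("B[1]",)): 42.0}

def invalidate(modified_oid):
    """Drop every cached method result computed from modified_oid."""
    for used_oid, method, params in dependencies:
        if used_oid == modified_oid:
            cache.pop((method, params), None)

invalidate("P[2]")
print(cache)
# {} -- the cached result had read P[2], so it is no longer valid
```

Modifying an object unrelated to any dependency row would leave the cache untouched; only results whose computation actually read the modified object are discarded.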
Note that the input parameters also include the identifier of the object to which the message invoking the method has been sent. A more sophisticated approach has been proposed in [Bertino and Quarati, 1991]. If a method is local, that is, uses only the attributes of the object upon which it has been invoked, all dependencies are kept within the object itself. Those dependencies are coded as bit-strings; therefore they require a minimal space overhead. If a method is not local, that is, uses attributes of other objects, all dependencies are stored in a special object. All objects whose attributes have been used in the precomputation of a method have a reference to this special object. This approach is similar to the one proposed in [Kemper
et al., 1994]. The main difference is that in the approach proposed by Bertino and Quarati, dependencies are not stored in a single data structure; rather, they are distributed among several "special objects". The main advantage of this approach is that it provides greater flexibility with respect to object allocation and clustering. For example, a "special object" may be clustered together with one of the objects used in the precomputation of the method, depending on the expected update frequencies. To further reduce the need for invalidation, it is important to determine the actual attributes used in the precomputation of a method. As noted in [Kemper et al., 1994], not all attributes are used in executing all methods. Rather, each method is likely to require a small fraction of an object's attributes. Two basic approaches can be devised exploiting this observation. The first approach is called static and is based on inspecting the method implementation. Therefore, for each method the system keeps the list of attributes used in the method. In this way, when an attribute is modified, the system has only to invalidate a method if the method uses the modified attribute. Note, however, that an inspection of method implementations actually determines all attributes that can possibly be used when the method is executed. Depending on the method execution flow, some attributes may never be used in computing a method on a given object. This problem is solved by the dynamic approach. Under this approach, the attributes used by a method are actually determined only when the method is precomputed. Upon precomputation of the method, the system keeps track of all attributes actually accessed during the method execution. Therefore, the same method precomputed on different objects may use different sets of attributes for each one of these objects.
Performance studies of method precomputation have been carried out in [Jhingran, 1991, Kemper et al., 1994]. Besides caching and precomputing, a closely related class of techniques, commonly referred to as "pointer swizzling" [Kemper and Kossmann, 1995, Moss, 1992], was investigated for managing references among main-memory resident persistent objects. Pointer swizzling is a technique to optimize accesses through such references to objects residing in main memory. Generally, each time an object is referenced through its OID, the system has to determine whether the object is already in main memory by performing a table lookup. If the object is not already in main memory, it must be loaded from secondary storage. The basic idea of pointer swizzling is to materialize the address of a main-memory resident persistent object in order to avoid the table lookup. Thus, pointer swizzling converts database objects from an external (persistent) format containing OIDs into an internal (main-memory) format replacing the OIDs by the main-memory addresses of the referenced objects. Though the choice of a specific swizzling strategy is strongly influenced by the characteristics of the underlying object lookup mechanism, a systematic classification of pointer swizzling techniques, quite independent of system characteristics, has been developed [Moss, 1992]. Later, this classification was extended, and a new dimension of swizzling techniques, concerning whether swizzled objects can be replaced from the main-memory buffer, was proposed [Kemper and Kossmann, 1995].

1.6 Summary

In this chapter, we have discussed a number of indexing techniques specifically tailored for object-oriented databases. We have first presented indexing techniques supporting an efficient evaluation of implicit joins among objects. Several techniques have been developed. None of them, however, is optimal with respect to both retrieval and update costs. Techniques providing lower retrieval costs, such as path indexes or access relations, have greater update costs compared to techniques, such as the multi-index, which however have greater retrieval costs. Then we have discussed indexing techniques for inheritance hierarchies. Finally, we have presented an indexing technique that provides integrated support for queries on both aggregation and inheritance hierarchies [Bertino and Foscoli, 1995]. Overall, an open problem is to determine how all those indexing techniques perform for different types of queries. Studies along that direction have been carried out in [Bertino, 1990, Kemper and Moerkotte, 1992, Valduriez, 1986]. Similar studies should be undertaken for all the other techniques. Another open problem concerns optimal index allocation. In the chapter we have also briefly discussed techniques for an efficient execution of queries containing method invocations. This is an interesting problem that is peculiar to object-oriented databases (and in general, to DBMSs supporting procedures or functions as part of the data model).
However, few solutions have been proposed so far and there is, moreover, the need for comprehensive analytical models.

Notes

1. Note that in GemStone, unlike other OODBMSs, attributes need not necessarily have a domain.
2. For the sake of homogeneity, we will denote the class domain Cn.An as class Cn+1.
3. A set containing class C itself and all classes in the inheritance hierarchy rooted at C is denoted as C'.
4. Note that if a class occurs at several points in a path, the class has a set of positions.
2 SPATIAL DATABASES

Many applications (such as computer-aided design (CAD), geographic information systems (GIS), computational geometry and computer vision) operate on spatial data. Generally speaking, spatial data are associated with spatial coordinates and extents, and include points, lines, polygons and volumetric objects. While it appears that spatial data can be modeled as a record with multiple attributes (each corresponding to a dimension of the spatial data), conventional database systems are unable to support spatial data processing effectively. First, spatial data are large in quantity, complex in structures and relationships, and often represent non-zero sized objects. Take GIS, a popular type of spatial database system, as an example. In such a system, the database is a collection of data objects over a particular multi-dimensional space. The spatial description of objects is typically extensive, ranging from a few hundred bytes in land information system (commonly known as LIS) applications to megabytes in natural resource applications. Moreover, the number of data objects ranges from tens of thousands to millions. Second, the retrieval process is typically based on spatial proximity, and employs complex spatial operators like intersection, adjacency, and containment.
E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
Such spatial operators are much more expensive to compute than the conventional relational join and select operators. This is due to the irregularity in the shape of spatial objects. For example, consider the intersection of two polyhedra. Besides the need to test all points of one polyhedron against the other, the result of the operation is not always a polyhedron but may sometimes consist of a set of polyhedra. Third, it is difficult to define a spatial ordering for spatial objects. The consequence of this is that conventional techniques (such as sort-merge techniques) that exploit ordering can no longer be employed for spatial operations. Efficient processing of queries manipulating spatial relationships relies upon auxiliary indexing structures. Due to the volume of the set of spatial data objects, it is highly inefficient to precompute and store spatial relationships among all the data objects (although there are some proposals that store precomputed spatial relationships [Lu and Han, 1992, Rotem, 1991]). Instead, spatial relationships are materialized dynamically during query processing. In order to find spatial objects efficiently based on proximity, it is essential to have an index over spatial locations. The underlying data structure must support efficient spatial operations, such as locating the neighbors of an object and identifying objects in a defined query region. In this chapter, we review some of the more promising spatial data structures that have been proposed in the literature. In particular, we focus on indexing structures designed for non-zero sized objects. The review of these indexes is organized in two steps: first, the structures are described; second, their strengths and weaknesses are highlighted. The readers are referred to [Nievergelt and Widmayer, 1997, Ooi et al., 1993] for a comprehensive survey on spatial indexing structures.
The rest of this chapter is organized as follows. In Section 2.1, we briefly discuss various issues related to spatial query processing. Section 2.2 presents a taxonomy of spatial indexing structures. In Section 2.3 to Section 2.6, we present representative indexing techniques that are based on binary tree structures, B-tree structures, hashing and space-filling techniques. Section 2.7 discusses the issues in evaluating the performance of spatial indexes and reviews the approaches adopted in the literature. Finally, we summarize in Section 2.8.

2.1 Query processing using approximations

Spatial data such as objects in spatial database systems, and roads and lakes in GIS, do not conform to any fixed shape. Furthermore, it is expensive to perform spatial operations (for example, intersection and containment) on their exact location and extent. Thus, some simpler structure (such as a bounding rectangle) that approximates the objects is usually coupled with a spatial index. Such bounding structures allow efficient proximity query processing by preserving the spatial identification and dynamically eliminating many potential tests efficiently. Consider the intersection operation. If two objects intersect, then their bounding structures intersect. Conversely, if the bounding structures of two objects are disjoint, then the two objects do not intersect. This property reduces the testing cost, since the test on the intersection of two polygons, or of a polygon and a sequence of line segments, is much more expensive than the test on the intersection of two bounding structures. By far, the most commonly used approximation is the container approach. In the container approach, the minimum bounding rectangle/circle (box/sphere), that is, the smallest rectangle/circle (box/sphere) that encloses the object, is used to represent the object, and only when the test on the container succeeds is the actual object examined. The bounding box (rectangle) is used throughout this chapter as the approximation technique for discussion purposes. A k-dimensional bounding box can be easily defined as a single-dimensional array of k entries: (I0, I1, ..., Ik-1), where Ii is a closed bounded interval [a, b] describing the extent of the spatial object along dimension i. Alternatively, the bounding box of an object can be represented by its centroid and its extensions in each of the k directions. Objects extended diagonally may be badly approximated by bounding boxes, and false matches may result. A false match occurs when the bounding boxes match but the actual objects do not match. If the approximation technique is very inefficient, yielding very rough approximations, additional page accesses will be incurred. More effective approximation methods include the convex hull [Preparata and Shamos, 1985] and the minimum bounding m-corner.
The covering polygons produced by these two methods are, however, not axis-parallel and hence incur more expensive testing. The construction cost of the approximations and the storage requirement are higher too. Decomposition of regions into convex cells has been proposed to improve object approximation [Gunther, 1988]. Likewise, an object may be approximated by a set of smaller rectangles/boxes. In the quad-tree tessellation approach [Abel and Smith, 1984], an object is decomposed into multiple sub-objects based on the quad-tree quadrants that contain them. The decomposition has the problem of having to store the object identity in multiple locations in an index. The problems of the redundancy of object identifiers and the cost of object reconstruction can be very severe if the decomposition process is not carefully controlled. They can be controlled to a certain extent by limiting the number of elements generated or by limiting the accuracy of the decomposition [Orenstein, 1990]. The object approximation and the spatial indexes supporting such concepts are used to eliminate objects that could not possibly contribute to the answer of
queries. This results in a multi-step spatial query processing strategy [Brinkhoff et al., 1994]:

1. The indexing structure is used to prune the search space to a set of candidate objects. This set is usually a superset of the answer.

2. Based on the approximations of the candidate objects, some of the false hits can be further filtered away. The effectiveness of this step depends on the approximation techniques.

3. Finally, the actual objects are examined to identify those that match the query.

Clearly, the multi-step strategy can effectively reduce the number of pages accessed and the amount of redundant data to be fetched and tested through the index mechanism, and reduce the computation time through the approximation mechanism. The commonly used conventional key-based range (associative) search, which retrieves all the data falling within the range of two specified values, is generalized to an intersection search. In other words, given a query region, the search finds all objects that intersect it. The intersection search can be easily used to implement point search and containment search. For point search, the query region is a point, and the search finds all objects that contain it. Containment search is a search for all objects that are strictly contained in a given query region, and it can be implemented by ignoring objects that fail such a condition in an intersection search. The search operation supported by an index can be used to facilitate a spatial selection or spatial join operation. While a spatial selection retrieves all objects of the same entity based on a spatial predicate, a spatial join is an operation that relates objects of two different entities based on a spatial predicate.
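The three steps above can be sketched in code. This is a toy, self-contained Python illustration, not from the book: the "index" is a deliberately crude grid-cell match, the approximations are MBRs, and all identifiers are invented.

```python
# Sketch of the multi-step (filter-and-refine) strategy: a coarse grid index
# yields candidates, MBRs filter some false hits, and only the survivors
# undergo the exact geometry test.

def intersects(a, b):
    """Axis-parallel rectangle intersection; rectangles are (x1, y1, x2, y2)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cell_of(box, size=10):
    """Grid cell of a box's lower-left corner (a deliberate simplification)."""
    return (box[0] // size, box[1] // size)

objects = {  # oid -> (MBR, exact vertex list)
    1: ((1, 1, 4, 3), [(1, 1), (4, 1), (4, 3)]),
    2: ((2, 2, 9, 6), [(2, 2), (9, 2), (6, 6)]),
    3: ((12, 12, 15, 14), [(12, 12), (15, 14)]),
    4: ((5, 5, 9, 9), [(5, 9), (9, 5), (9, 9)]),
}
query = (5, 5, 8, 8)

# Step 1: the index prunes the search space to candidates (a superset).
candidates = [oid for oid, (box, _) in objects.items()
              if cell_of(box) == cell_of(query) or intersects(box, query)]
# Step 2: the MBR approximations filter away some false hits.
survivors = [oid for oid in candidates if intersects(objects[oid][0], query)]
# Step 3: the exact objects are examined (here: any vertex inside the query).
answer = [oid for oid in survivors
          if any(query[0] <= x <= query[2] and query[1] <= y <= query[3]
                 for x, y in objects[oid][1])]

assert candidates == [1, 2, 4]   # object 1 is a step-1 false hit
assert survivors == [2, 4]       # object 4's MBR intersects, but the object does not
assert answer == [2]
```

Note how each step discards a false hit the previous, cheaper step let through, which is exactly why the strategy saves both page accesses and computation.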
2.2 A taxonomy of spatial indexes

Various types of data structures, such as B-trees [Bayer and McCreight, 1972, Comer, 1979], ISAM indexes, hashing and binary trees [Knuth, 1973], have been used as a means for efficient access, insertion and deletion of data in large databases. All these techniques are designed for indexing data based on primary keys. To use them for indexing data based on secondary keys, inverted indexes are introduced. However, this technique is not adequate for a database where range searching on secondary keys is a common operation. For this type of application, multi-dimensional structures, such as grid-files [Nievergelt et al., 1984], multi-dimensional B-trees [Kriegel, 1984, Ouksel and Scheuermann, 1981, Scheuermann and Ouksel, 1982], kd-trees [Bentley, 1975] and
quad-trees [Finkel and Bentley, 1974] were proposed to index multi-attribute data. Such indexing structures are known as point indexing structures, as they are designed to index data objects which are points in a multi-dimensional space. Spatial search is similar to non-spatial multi-key search in that coordinates may be mapped onto key attributes and the key values of each object represent a point in a k-dimensional space. However, spatial objects often cover irregular areas in multi-dimensional spaces and thus cannot be solely represented by point locations. Although techniques such as mapping regular regions to points in higher-dimensional spaces enable point indexing structures to index regions, such representations do not help support spatial operators such as intersection and containment. Based on existing classification techniques [Lomet, 1992, Seeger and Kriegel, 1988], the techniques used for adapting existing indexes into spatial indexes can be generally classified as follows:

The transformation approach. There are two categories of transformation approach:

• Parameter space indexing. Objects with n vertices in a k-dimensional space are mapped into points in an nk-dimensional space. For example, a two-dimensional rectangle described by the bottom left corner (x1, y1) and the top right corner (x2, y2) is represented as a point in a four-dimensional space, where each attribute is taken from a different dimension. After the transformation, points can be stored directly in existing point indexes. An advantage of such an approach is that there is no major alteration of the multi-dimensional base structure. The problem with the mapping scheme is that the spatial proximity between the k-dimensional objects may no longer be preserved when they are represented as points in an nk-dimensional space. Consequently, intersection search can be inefficient. Also, the complexity of the insertion operation typically increases with higher dimensionality.
• Mapping to single attribute space. The data space is partitioned into grid cells of the same size, which are then numbered according to some curve-filling method. A spatial object is then represented by a set of numbers or one-dimensional objects. These one-dimensional objects can be indexed using conventional indexes such as B+-trees.

The non-overlapping native space indexing approach. This category comprises two classes of techniques:
• Object duplication. A k-dimensional data space is partitioned into pairwise disjoint subspaces. These subspaces are then indexed. An object identifier is duplicated and stored in all the subspaces it intersects.

• Object clipping. This technique is similar to the object duplication approach. Instead of duplicating the identifier, an object is decomposed into several disjoint smaller objects so that each smaller sub-object is totally included in a subspace.

The most important property of object duplication or clipping is that the data structures used are straightforward extensions of the underlying point indexing structures. Also, both points and multi-dimensional non-zero sized objects can be stored together in one file without having to modify the structure. However, an obvious drawback is the duplication of objects, which requires extra storage and hence more expensive insertion and deletion procedures. Another limitation is that the density (the number of objects that contain a point) in a map space must be less than the page capacity (the maximum number of objects that can be stored in a page).

The overlapping native space indexing approach. The basic idea of this approach to indexing a spatial database is to hierarchically partition its data space into a manageable number of smaller subspaces. While a point object is totally included in an unpartitioned subspace, a non-zero sized object may extend over more than one subspace. Rather than supporting disjoint subspaces as in the non-overlapping space indexing approach, the overlapping native space indexing approach allows overlapping subspaces such that objects are totally included in only one of the subspaces. These subspaces are organized as a hierarchical index and spatial objects are indexed in their native space.
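The two non-overlapping techniques, duplication and clipping, can be sketched on a uniform grid partition. This is a simplified Python stand-in for the pairwise disjoint subspaces; all function names and the integer boundary handling are illustrative, not from the book.

```python
# Non-overlapping native space indexing on a uniform grid: duplication stores
# the object id in every intersecting cell; clipping stores the clipped pieces,
# each totally included in one cell.

def cells_covered(box, size):
    """All grid cells of side `size` that an integer box (x1, y1, x2, y2) intersects."""
    x1, y1, x2, y2 = box
    return [(cx, cy)
            for cx in range(x1 // size, x2 // size + 1)
            for cy in range(y1 // size, y2 // size + 1)]

def duplicate(oid, box, size, index):
    for cell in cells_covered(box, size):
        index.setdefault(cell, []).append(oid)          # id duplicated per cell

def clip(oid, box, size, index):
    x1, y1, x2, y2 = box
    for cx, cy in cells_covered(box, size):
        piece = (max(x1, cx * size), max(y1, cy * size),
                 min(x2, (cx + 1) * size - 1), min(y2, (cy + 1) * size - 1))
        index.setdefault((cx, cy), []).append((oid, piece))  # disjoint sub-objects

idx = {}
duplicate(7, (8, 8, 12, 12), 10, idx)    # a box straddling four 10x10 cells
assert sorted(idx) == [(0, 0), (0, 1), (1, 0), (1, 1)]   # id stored four times

pieces = {}
clip(7, (8, 8, 12, 12), 10, pieces)
assert pieces[(0, 0)] == [(7, (8, 8, 9, 9))]   # sub-object clipped to the cell
```

The extra storage and the duplicated insertions/deletions mentioned above are visible directly: one object produced four index entries in both variants.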
A major design criterion for indexes using the overlapping native space indexing approach is the minimization of both the overlap between bounding subspaces and the coverage of subspaces. A poorly designed partitioning strategy may lead to unnecessary traversal of multiple paths. Further, dynamic maintenance of effective bounding subspaces incurs high overhead during updates. A number of indexing structures use more than one extending technique. Since each extending method has its own weaknesses, the combination of two or more methods may help to compensate for each other's weaknesses. However, an often overlooked fact is that the use of more than one extending method may also produce a counter effect: inheriting the weaknesses of each method. Figure 2.1 shows the evolution of spatial indexing structures, adapted from [Lu and Ooi, 1993]. A solid arrow indicates a relationship between a new structure and the original structures that it is based upon. A dashed arrow
[Figure 2.1. Evolution of spatial index structures: a timeline (1984-1996) of structures such as the LSD-tree, GBD-tree, TV-tree, EXCELL, grid files, quad-tree based location keys, DOT, the X-tree and the filter-tree, derived from binary trees, B-trees and hashing.]
indicates a relationship between a new structure and the structures from which the techniques used in the new structure originated, even though some were proposed independently of the others. In the diagram, and also in the subsequent sections, the indexes are classified into four groups based on their base structures: namely, binary trees, B-trees, hashing, and space-filling methods. Most spatial indexing structures (such as R-trees, R*-trees and skd-trees) are nondeterministic in that different sequences of insertions result in different tree structures, and hence different performance, even though they have the same set of data. The insertion algorithm must be dynamic so that the performance of an index will not be dependent on the sequence of data insertion. During the design of a spatial index, the issues that need to be minimized are:

• The area of the covering rectangles maintained in internal nodes.

• The overlaps between covering rectangles, for indexes developed based on the overlapping native space indexing approach.

• The number of objects being duplicated, for indexes developed based on the non-overlapping native space indexing approach.

• The directory size and its height.

There is no straightforward solution that fulfills all the above conditions. The fulfillment of the above conditions by an index can generally ensure its efficiency, but this may not be true for all applications. The design of an index needs to take computational complexity into consideration as well, although this is a less dominant factor considering the increasing computational power of today's systems. Other factors that affect the performance of information retrieval as a whole include buffer design, buffer replacement strategies, space allocation on disks, and concurrency control methods.

2.3 Binary-tree based indexing
techniques

The binary search tree is a basic data structure for representing data items whose index values are ordered by some linear order. The idea of repetitively partitioning a data space has been adopted and generalized in many sophisticated indexes. In this section, we will examine spatial indexes that originated from the basic structure and concept of binary search trees.

2.3.1 The kd-tree

The kd-tree [Bentley, 1975], a k-dimensional binary search tree, was proposed by Bentley to index multi-attribute data. A node in the tree (see Figure 2.2) serves two purposes: representation of an actual data point and direction of a
search. A discriminator, whose value is between 0 and k-1 inclusive, is used to indicate the key on which the branching decision depends. A node P has two children, a left son LOSON(P) and a right son HISON(P). If the discriminator value of node P is the jth attribute (key), then the jth attribute of any node in LOSON(P) is less than the jth attribute of node P, and the jth attribute of any node in HISON(P) is greater than or equal to that of node P. This property enables the range along each dimension to be defined during a tree traversal such that the ranges are smaller in the lower levels of the tree.

[Figure 2.2. The organization of data in a kd-tree: (a) the planar representation; (b) the structure of a kd-tree, with discriminators alternating between the x-axis and the y-axis.]

Complications arise when an internal node is deleted. When an internal node is deleted, say Q, one of the nodes in the subtree whose root is Q must be obtained to replace Q. Suppose i is the discriminator of node Q; then the replacement must be either a node in the right subtree with the smallest ith attribute value in that subtree, or a node in the left subtree with the biggest ith attribute value. The replacement of a node may also cause successive replacements. To reduce the cost of deletion, a non-homogeneous kd-tree [Bentley, 1979b] was proposed. Unlike a homogeneous index, a non-homogeneous index does not store data in the internal nodes, and its internal nodes are used merely as a directory. When splitting an internal node, instead of selecting a data point, the non-homogeneous kd-tree selects an arbitrary hyperplane (a line for the two-dimensional space) to partition the data points into two groups having almost the same number of data points; all data points reside in the leaf nodes.
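The homogeneous kd-tree just described can be sketched in a few lines of Python. This is a minimal illustration, not the book's code; the sample points are invented, and the discriminator simply cycles through the dimensions level by level.

```python
# A minimal homogeneous kd-tree: each node stores a data point, and the
# discriminator indicates the dimension on which branching depends.

class Node:
    def __init__(self, point, disc):
        self.point, self.disc = point, disc
        self.loson = self.hison = None     # LOSON(P) and HISON(P)

def insert(root, point, k=2):
    if root is None:
        return Node(point, 0)
    node = root
    while True:
        # keys < discriminator value go to LOSON, keys >= go to HISON
        side = 'loson' if point[node.disc] < node.point[node.disc] else 'hison'
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(point, (node.disc + 1) % k))
            return root
        node = child

def range_search(node, lo, hi, out):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if node is None:
        return out
    d = node.disc
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    if lo[d] < node.point[d]:              # the query range reaches into LOSON
        range_search(node.loson, lo, hi, out)
    if hi[d] >= node.point[d]:             # ... and/or into HISON
        range_search(node.hison, lo, hi, out)
    return out

root = None
for p in [(40, 60), (10, 75), (30, 90), (80, 45), (25, 15), (70, 20)]:
    root = insert(root, p)
assert sorted(range_search(root, (0, 0), (50, 80), [])) == [(10, 75), (25, 15), (40, 60)]
```

The pruning in `range_search` is exactly the property stated above: the discriminator bounds the range along one dimension at each level, so whole subtrees outside the query range are never visited.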
The kd-tree has been the subject of intensive research over the past decade [Banerjee and Kim, 1986, Beckley et al., 1985a, Beckley et al., 1985b, Beckley
et al., 1985c, Bentley and Friedman, 1979, Bentley, 1979a, Chang and Fu, 1979, Eastman and Zemankova, 1982, Friedman et al., 1987, Lee and Wong, 1977, Matsuyama et al., 1984, Ohsawa and Sakauchi, 1983, Orenstein, 1982, Overmars and Leeuwen, 1982, Robinson, 1981, Rosenberg, 1985, Shamos and Bentley, 1978, Sharma and Rani, 1985]. Many variants have been proposed in the literature to improve its performance with respect to issues such as clustering, searching, storage efficiency and balancing.

2.3.2 The K-D-B-tree

To improve the paging capability of the kd-tree, the K-D-B-tree was proposed [Robinson, 1981]. The K-D-B-tree is essentially a combination of a kd-tree and a B-tree [Bayer and McCreight, 1972, Comer, 1979], and consists of two basic structures: region pages and point pages (see Figure 2.3). While point pages contain object identifiers, region pages store the descriptions of the subspaces in which the data points are stored and the pointers to descendant pages. Note that in a non-homogeneous kd-tree [Bentley, 1979b], a space is associated with each node: a global space for the root node, and an unpartitioned subspace for each leaf node. In the K-D-B-tree, these subspaces are explicitly stored in a region page. These subspaces (for example, S11, S12 and S13) are pairwise disjoint and together they span the rectangular subspace of the current region page (for example, S1), a subspace in the parent region page. During insertion of a new point into a full point page, a split will occur. The point page is split such that the two resultant point pages will contain almost the same number of data points. Note that a split of a point page requires an extra entry for the new point page; this entry will be inserted into the parent region page.
Therefore, the split of a point page may cause the parent region page to split as well, which may further ripple all the way up to the root; thus the tree is always perfectly height-balanced. When a region page is split, the entries are partitioned into two groups such that both have almost the same number of entries. A hyperplane is used to split the space of a region page into two subspaces, and this hyperplane may cut across the subspaces of some entries. Consequently, the subspaces that intersect the splitting hyperplane must also be split so that the new subspaces are totally contained in the resultant region pages. Therefore, the split may propagate downward as well. If the constraint of splitting a region page into two region pages containing about the same number of entries is not enforced, then downward propagation of the split may be avoided. The dimension for splitting and the splitting point are chosen such that both resultant pages have almost the same number of entries and the number of splits is minimized. However, there is no discussion on the selection of splitting points.
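A point-page split in this spirit can be sketched as follows. This toy Python heuristic only balances the two resulting pages; it deliberately ignores the second criterion (minimizing forced splits of subregions), whose selection rule the original paper leaves open, and all names are illustrative.

```python
# Sketch of a K-D-B-style point-page split: try each dimension, split at the
# median value, and keep the dimension giving the most balanced halves.

def split_point_page(points, k=2):
    best = None
    for d in range(k):
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        low = [p for p in points if p[d] < median]      # new "low" point page
        high = [p for p in points if p[d] >= median]    # new "high" point page
        imbalance = abs(len(low) - len(high))
        if best is None or imbalance < best[0]:
            best = (imbalance, d, median, low, high)
    _, d, median, low, high = best
    return d, median, low, high

d, median, low, high = split_point_page(
    [(1, 9), (2, 4), (5, 6), (7, 1), (8, 8), (9, 3)])
assert abs(len(low) - len(high)) <= 1 and len(low) + len(high) == 6
```

A full implementation would also insert the entry for the new page into the parent region page, possibly triggering the upward (and, for region pages, downward) split propagation described above.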
[Figure 2.3. The K-D-B-tree structure: (a) the planar partition (subspaces such as S1, S2, S11, S21 and S22); (b) a hierarchical K-D-B-tree structure.]

The upward propagation of a split will not cause the underflow of pages, but the downward propagation is detrimental to storage efficiency, because a page may contain less than the usual page threshold, typically half of the page capacity. To avoid unacceptably low storage utilization, local reorganization can be performed. For example, two or more pages whose data space forms a rectangular space and which have the same parent can be merged, followed by a resplit if the resultant page overflows. The K-D-B-tree has incorporated the pagination of the B-tree and the tree is height-balanced as a result. Nevertheless, poorer storage efficiency is the trade-off.

2.3.3 The hB-tree

In the K-D-B-tree, a region node is split by cutting the region with a plane, possibly cutting through some subregions as well. The child nodes whose space is cut must also invoke the splitting process, causing sparse nodes at lower levels. To overcome such a problem, a new multi-attribute index structure called the holey brick B-tree (the hB-tree) [Lomet and Salzberg, 1990a] allows the data space to be holey, enabling the removal of any data subspace from a
data space. The concept of holey bricks is not new: it has been used to improve the clustering of data in a variant of the kd-tree known as the BD-tree [Ohsawa and Sakauchi, 1983]. The hB-tree structure is based on the K-D-B-tree structure and hence preserves the height-balanced property. However, it allows the data space associated with a node to be non-rectangular, and it uses kd-trees for space representation in its internal nodes. In an hB-tree, the leaf nodes are known as data nodes and the internal nodes as index nodes. The data space of an index node is the union of its child node subspaces, which are obtained through recursive kd-tree partitioning.

[Figure 2.4. The hB-tree structure: (a) the internal structure of an hB-tree index node; (b) the resultant pages after a split.]

A k-dimensional data space represented by its boundaries requires 2k coordinates. To obtain a data space of interest to the search, half of the data subspaces in a node have to be searched on average, and for each data space, 2k comparisons are required. For m data spaces, we therefore need on average m·k comparisons. The m data subspaces derived through recursive kd-tree partitioning can be represented by a kd-tree with m - 1 kd-tree nodes. This requires one comparison at each internal node and 2k comparisons for the unpartitioned subspace. The average number of comparisons is much smaller than that of the boundary representation. The use of kd-trees therefore reduces the search time as well as the storage space requirement. Like conventional kd-trees, the internal nodes of the kd-tree structure in an hB-tree index node partition the search space recursively. Its leaf nodes reference
some index nodes of the hB-tree. However, multiple leaves of a kd-tree structure may refer to the same hB-tree index node (see Figure 2.4a), giving rise to the "holey brick" representation. As such, the hB-tree is not truly a tree. During a split, the kd-tree is split into two subtrees, with each having between 1/3 and 2/3 of the nodes. In order to achieve this, a subtree may have to be extracted from the original tree structure. This causes duplication of a portion of the tree close to the root in the parent index node. A leaf node of such a kd-tree references either an hB-tree data node, an index node, or a marker (ext in Figure 2.4b) indicating that a subtree has previously been extracted and is referenced from a higher-level index node. The deletion algorithm is not addressed in the paper. The hB-tree overcomes the problem of sparse nodes in the K-D-B-tree. However, this is achieved at the expense of more expensive node splitting and node deletion. The multiple references to an hB-tree node may cause a path to be traversed more than once. Of course, this can be avoided by checking the list of traversed hB-tree nodes. Deletion may result in the kd-tree being collapsed to remove the duplicated portion of kd-trees, followed by a resplit if necessary.

2.3.4 The skd-tree

Ooi et al. [Ooi et al., 1987, Ooi et al., 1991] developed an indexing structure called the spatial kd-tree (the skd-tree) in an attempt to avoid object duplication and object mapping. At each node of a kd-tree, a value (the discriminator value) is chosen in one of the dimensions to partition a k-dimensional space into two subspaces. The two resultant subspaces, HISON and LOSON, normally have almost the same number of data objects. Point objects are totally included in one of the two resultant subspaces, but non-zero sized objects may extend over into the other subspace.
To avoid the division of objects into, and the duplication of identifiers in, several subspaces, and yet to be able to retrieve all the wanted objects, a virtual subspace for each original subspace was introduced such that all objects are totally included in one of the two virtual subspaces [Ooi et al., 1987]. With this method, the placement of an object in a subspace is based solely upon the value of its centroid. Since a space is always divided into two, an additional value for each subspace is required: the maximum of the objects in the LOSON subspace (maxLOSON), and the minimum of the objects in the HISON subspace (minHISON), along the dimension defined by the discriminator. Thus, the structure of an internal node of the skd-tree consists of two child pointers, a discriminator (0 to k-1 for a k-dimensional space), a discriminator-value, and maxLOSON and minHISON along the dimension specified by the discriminator. The maximum range value
of LOSON (maxLOSON) is the nearest virtual line that bounds the data objects whose centroids are in the LOSON subspace, and the minimum range value of HISON (minHISON) is the nearest virtual line that bounds the data objects whose centroids are in the HISON subspace. Leaf nodes contain min-range and max-range (in place of maxLOSON and minHISON of an internal node), describing the minimum and maximum values of objects in the data page along the dimension specified by bound, and a pointer to the secondary page which contains the object bounding rectangles and identifiers. The minimum and maximum values could be kept for all k dimensions. However, for storage efficiency, the range along the one dimension that results in the smallest bounding rectangle is chosen. It has been shown [Ooi, 1990] that keeping the ranges along all k dimensions increases the height of the tree when it is stored as a multiway tree, and hence the improvement is fairly marginal. Figure 2.5 shows the structure of a two-dimensional skd-tree and illustrates the virtual boundary (dotted line), minHISON or maxLOSON, of each resultant subspace. An implicit rectangular space is associated with each node and is materialized during traversal. This rectangle is tested against the query region, and the subtree is examined if they intersect. Since the virtual boundary may sometimes bound the objects more tightly than the partitioning line, the intersection search takes advantage of the existing virtual boundary to prune the search space efficiently. To further exploit the virtual boundaries, a containment search, which retrieves all spatial objects contained in a given query rectangle, was proposed. During tree traversal, the algorithm always selects the boundaries that yield the smaller search space. The direct support of containment search is useful to operators like within and contain.
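A minimal sketch of an skd-tree internal node and the intersection search that prunes with the virtual boundaries may help fix the idea. The layout and names below are illustrative, not the authors' code; rectangles are assumed to be (lo, hi) corner-tuple pairs.

```python
# Illustrative sketch of an skd-tree node (assumed representation, not the
# original implementation). Internal nodes hold a discriminator dimension,
# the partitioning value, and the two virtual boundaries maxLOSON / minHISON.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)

@dataclass
class SkdNode:
    disc: int                      # discriminator dimension (0 .. k-1)
    value: float                   # discriminator value (partitioning line)
    max_loson: float               # virtual boundary bounding LOSON objects
    min_hison: float               # virtual boundary bounding HISON objects
    loson: Optional["SkdNode"] = None
    hison: Optional["SkdNode"] = None
    bucket: Optional[List[Tuple[Rect, str]]] = None  # leaf: (rect, object id)

def intersect_search(node: SkdNode, q_lo, q_hi, result: List[str]) -> None:
    """Collect ids of objects whose bounding rectangles intersect [q_lo, q_hi]."""
    if node.bucket is not None:                      # leaf: test each object
        for (lo, hi), oid in node.bucket:
            if all(l <= qh and h >= ql
                   for l, h, ql, qh in zip(lo, hi, q_lo, q_hi)):
                result.append(oid)
        return
    d = node.disc
    # Objects with centroids in LOSON never extend beyond max_loson, so the
    # subtree needs visiting only if the query reaches below that boundary
    # (and symmetrically for HISON with min_hison).
    if node.loson is not None and q_lo[d] <= node.max_loson:
        intersect_search(node.loson, q_lo, q_hi, result)
    if node.hison is not None and q_hi[d] >= node.min_hison:
        intersect_search(node.hison, q_lo, q_hi, result)
```

When the virtual boundary is tighter than the partitioning line, the test `q_lo[d] <= max_loson` prunes subtrees that the plain partitioning line would admit.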
The search rapidly eliminates all objects that are not totally contained in the query region. Inserting index records for new data objects is similar to insertion into a point kd-tree. As new index records are added to a bucket, the bucket is split if it overflows. At each node, the algorithm uses the centroid of the bounding rectangle of the new object to determine in which subspace the object will be placed, and updates the virtual boundary if necessary. To delete an object, the centroid of its bounding rectangle is used to determine where the object resides. The removal of an object may cause a bucket to underflow, and merging or reinsertion is then required. If the neighboring node is a leaf node, then the two buckets are merged and the resultant bucket is resplit if overflow occurs. Otherwise, the records are required to be inserted into the neighboring subtree, and the neighboring node is promoted to replace the parent node. The merging follows the principle of the buddy system [Nievergelt et al., 1984]: the region of two merged nodes is rectangular and a proper subspace derivable from discriminator values in parent nodes. The major problem with deletion occurs when an object that contributes to the boundary of a virtual space is deleted.

Figure 2.5. The structure of a spatial kd-tree. (a) A 2-d directory of the skd-tree. (b) A 2-d space coordinate representation.

A new, tighter boundary needs to replace the old boundary, which may not be as effective. The operation can be expensive, as several pages whose space is adjacent to the deleted boundary need to be searched. The operation cost can be reduced by periodically sweeping the subtrees that are affected by deletions. It should be noted that the delay in finding replacements does not result in any invalid answer. The directory of the skd-tree is stored in secondary memory. The bottom-up approach for binary tree paging [Cesarini and Soda, 1982] is modified to store the skd-tree as a multiway tree. When such a page splits, one of the subtrees
is migrated to an existing page that can accommodate the subtree or to a new page, and the root of the subtree is promoted to the parent page. It was shown that the containment search is insensitive to the different sizes and distributions of objects, and it is always more efficient than the intersection search due to a smaller search space [Ooi et al., 1991]. It can be noticed that the leaf nodes of the skd-tree take up about half of the storage requirement for the directory. The main objective of having such a layer of leaf nodes is to reduce the fetching of data pages. Experiments have been conducted to evaluate the performance of skd-trees with and without the leaf nodes, under different data distributions [Ooi, 1990]. The experiments show that for uniform distributions of spatial objects, the leaf nodes can reduce the page accesses. However, when the distributions are skewed, the extra layers are not effective and the large directory sizes incur more page reads than the modified skd-tree. The modified skd-tree, which has fewer nodes, saves up to 40% of the directory storage space.

2.3.5 The BD- and GBD-trees

The BD-tree [Ohsawa and Sakauchi, 1983], a variant of kd-trees, allows a more dynamic partitioning of space. Each non-leaf node in the BD-tree contains a variable-length string, called the discriminator zone (DZ) expression, consisting of 0s and 1s. The 0 means "<" and the 1 means "≥", with the leftmost digit corresponding to the first binary division, and the nth bit corresponding to the nth binary division. The string describes the left subspace while the right subspace is its complement. Each string uniquely describes a space. A data space whose DZ expression (for example, 0100) is the initial substring of a longer DZ expression (for example, 010001) encloses the data space of the latter. A BD-tree is different from a kd-tree in the following aspects.
One, the data space of a BD-tree node is not a hyper-rectangle; the use of complements makes the space holey. Two, unlike the conventional kd-tree, the use of DZ expressions enables rotation, achieving a greater degree of balancing. Three, the partition divides a space into two equal sized subspaces. Four, the discriminators are used cyclically so that each bit of a DZ expression can be correctly associated with a dimension. The BD-tree is expanded to a balanced multi-way tree called the GBD-tree (generalized BD-tree) [Ohsawa and Sakauchi, 1990]. In addition to a DZ expression, a bounding rectangle is used to describe a data space that bounds the objects whose centroids fall inside the region defined by the DZ expression. Centroids of objects are used to determine the placement of objects in the correct bucket. While a DZ expression is used to determine the position in the tree
structure where an entity is located based on its centroid, a bounding rectangle is used in intersection search. In an internal node, each entry describes a data space obtained through binary decomposition. The union of these data spaces forms the data space of the node. While the data spaces described by the entries' DZ expressions do not overlap, their associated bounding rectangles may overlap. During point search of an entity, an inclusion check of the DZ expression of the entity is performed against the DZ expression of a node. For the data space that includes the entity, its subtree is traversed. For the intersection search, the bounding rectangles stored in a node are used instead to select subtrees for traversal. When a leaf node overflows, it is split into two. A recursive binary decomposition on alternating axes is performed on the overflowed data space until a subspace contains at least 2(M+1)/3 entries, where M is the maximum number of entries a node can contain. While the smaller space has a new DZ expression, the other subspace takes the DZ expression of the space before splitting. We call such a space a complementary subspace. A new entry is inserted into the parent node and the affected bounding rectangles are re-adjusted accordingly. In an internal node splitting, the subspaces are checked in decreasing order of their sizes to find a data space that contains almost (M+1)/2 entries. A data space described by the DZ expression e1 contains the data space described by the DZ expression e2 if e1 forms the initial substring of e2. In the testing, all DZ expressions must be checked. The worst case is when a node is split into two nodes respectively having M entries and one entry. The DZ expression obtained is used as the DZ expression of a new node. The other new node, which reuses the original node, is assigned the DZ expression of the original space. When an entry is deleted, a node may underflow.
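The two DZ-expression operations used above are simple to state in code: containment is a string-prefix test, and the DZ expression of a point's cell follows from the cyclic equal-division scheme. A minimal sketch (the function names are ours, not the authors'):

```python
def dz_contains(e1: str, e2: str) -> bool:
    """The space described by DZ expression e1 contains that of e2
    iff e1 is an initial substring (prefix) of e2."""
    return e2.startswith(e1)

def dz_of_point(point, lo, hi, depth):
    """DZ expression of the cell containing `point` after `depth` binary
    divisions of the box [lo, hi], cycling through the dimensions.
    Bit 0 means the '<' half, bit 1 the '>=' half."""
    lo, hi = list(lo), list(hi)
    bits = []
    k = len(point)
    for i in range(depth):
        d = i % k                         # discriminators used cyclically
        mid = (lo[d] + hi[d]) / 2         # split into two equal subspaces
        if point[d] < mid:
            bits.append("0")
            hi[d] = mid
        else:
            bits.append("1")
            lo[d] = mid
    return "".join(bits)
```

For instance, `dz_contains("0100", "010001")` is true, matching the 0100 / 010001 example in the text.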
Like B-trees, tree collapsing is required. Conceptually, the GBD-tree is similar to the BANG file [Freeston, 1987]. The use of bounding rectangles can be applied to the BANG file. The GBD-tree has been shown to have better efficiency than the R-tree in terms of tree construction time for a small set of data [Ohsawa and Sakauchi, 1990].

2.3.6 The LSD-tree

As an improvement to the fixed size space partitioning of grid files, a binary tree, called the Local Split Decision tree (LSD-tree), that supports arbitrary split positions was proposed [Henrich et al., 1989a]. A split position can be chosen such that it is optimal with respect to the current cell. The directory of an LSD-tree is similar to that maintained by the kd-tree [Bentley, 1975]. Each node of the LSD-tree represents one split and stores the split dimension
(cf. the discriminator of kd-trees) and position (cf. the discriminator value of kd-trees), and each leaf node points to a data bucket. In an LSD-tree, the nodes in a directory T are divided into two directories: the internal directory and the external directory. The internal directory consists of a subtree that contains the root and is stored in main memory. The external directory consists of multiway trees and is stored in secondary memory. In an external directory page, the subtree is organized as a heap. When a directory page is split, the root node of that directory page is inserted into the directory T and the left and right subtrees are stored in two distinct directory pages. The main objective of the paging algorithm [Henrich et al., 1989b] is to ensure that the heights of the multiway trees differ by at most one directory page. The proposed paging strategy is similar to the binary paging strategy [Cesarini and Soda, 1982], although the latter makes no distinction between the external and internal directories. The major difference is that the internal directory is restructured such that the heights of the multiway trees in the external directory always differ by at most one page. To achieve this, nodes close to the boundary that separates the internal and external directories must be moved between these two directories. Note that the size of the internal directory depends on the allocated internal memory. Like kd-trees, rotation of the tree is not possible. If the data is very skewed, the property of height differences of at most one cannot be upheld. The deletion algorithm is not presented. We believe that the deletion algorithm of [Cesarini and Soda, 1982] can be applied here.

2.4 B-tree based indexing techniques

B+-trees have been widely used in data intensive systems to facilitate query retrieval.
The wide acceptance of the B+-tree stems from its elegant height-balanced characteristic, which makes it ideal for disk I/O, where data is transferred in units of a page. It has become an underlying structure for many new indexes. In this section, we discuss indexes based on the hierarchical structure of B+-trees.

2.4.1 The R-tree

The R-tree [Guttman, 1984] is a multi-dimensional generalization of the B-tree that preserves height-balance. Like the B-tree, node splitting and merging are required for inserting and deleting objects. The R-tree has received a great deal of attention due to its well defined structure and the fact that it is one of the earliest proposed tree structures for non-zero sized spatial object indexing. Many papers have used the R-tree as a model against which to measure the performance of their structures.
An entry in a leaf node consists of an object-identifier of the data object and a k-dimensional bounding rectangle which bounds its data objects. In a non-leaf node, an entry contains a child-pointer pointing to a lower level node in the R-tree and a bounding rectangle covering all the rectangles in the lower nodes of the subtree. Figure 2.6 illustrates the structure of an R-tree.

Figure 2.6. The structure of an R-tree. (a) A planar representation. (b) The directory of an R-tree.

In order to locate all objects which intersect a query rectangle, the search algorithm descends the tree from the root. The algorithm recursively traverses down the subtrees of bounding rectangles that intersect the query rectangle. When a leaf node is reached, bounding rectangles are tested against the query rectangle and their objects are fetched for testing whether they intersect the query rectangle. To insert an object, the tree is traversed and all the rectangles in the current non-leaf node are examined. The constraint of least coverage is employed to insert an object: the rectangle that needs the least enlargement to enclose the new object is selected; the one with the smallest area is chosen if more than
one rectangle meets the first criterion. The nodes in the subtree indexed by the selected entry are examined recursively. Once a leaf node is reached, a straightforward insertion is made if the leaf node is not full. However, the leaf node needs splitting if it overflows after the insertion is made. For each node that is traversed, the covering rectangle in the parent is readjusted to tightly bound the entries in the node. For a newly split node, an entry with a covering rectangle that is large enough to cover all the entries in the new node is inserted in the parent node if there is room in the parent node. Otherwise, the parent node is split and the process may propagate to the root. To remove an object, the tree is traversed and each entry of a non-leaf node is checked to determine if the object overlaps its covering rectangle. For each such entry, the entries in the child node are examined recursively. The deletion of an object may cause the leaf node to underflow. In this case, the node needs to be deleted and all the remaining entries of that node are reinserted from the root. The deletion of an entry may also cause further deletion of nodes in the upper levels. Thus, entries belonging to a deleted ith level node must be reinserted into the nodes at the ith level of the tree. Deletion of an object may change the bounding rectangles of entries in the ancestor nodes; hence readjustment of these entries is required. In searching, the decision to visit a subtree depends on whether the covering rectangle overlaps the query region. It is quite common for several covering rectangles in an internal node to overlap the query rectangle, resulting in the traversal of several subtrees. Therefore, the minimization of the overlaps of covering rectangles, as well as the coverage of these rectangles, is of primary importance in constructing the R-tree.
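The least-coverage rule used during insertion can be sketched in a few lines. The sketch below assumes 2-d rectangles represented as ((x1, y1), (x2, y2)) corner pairs; the helper names are illustrative.

```python
# Sketch of Guttman's least-coverage choice for R-tree insertion:
# pick the entry needing the least area enlargement; break ties by
# smallest area. Rectangles are ((x1, y1), (x2, y2)) pairs (an assumption).

def area(r):
    (x1, y1), (x2, y2) = r
    return (x2 - x1) * (y2 - y1)

def enlarge(r, s):
    """Smallest rectangle covering both r and s."""
    (ax1, ay1), (ax2, ay2) = r
    (bx1, by1), (bx2, by2) = s
    return ((min(ax1, bx1), min(ay1, by1)), (max(ax2, bx2), max(ay2, by2)))

def choose_entry(entries, new_rect):
    """Entry whose covering rectangle needs the least enlargement to
    enclose new_rect; ties resolved by the smaller current area."""
    return min(entries,
               key=lambda r: (area(enlarge(r, new_rect)) - area(r), area(r)))
```

The tuple key encodes the two-level criterion directly: `min` compares enlargement first and falls back to area only on a tie.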
The heuristic optimization criterion used in the R-tree is the minimization of the area of the covering rectangles of internal nodes. Two algorithms are involved in this minimization: the insertion algorithm and its node splitting algorithm. Of the two, the splitting algorithm affects the index efficiency more. Guttman [Guttman, 1984] presented and studied splitting algorithms with exponential, quadratic and linear cost, and showed that the performance of the quadratic and linear algorithms was comparatively similar. In a node splitting, the quadratic algorithm first locates the two entries that are furthest apart, that is, the pair of entries that would waste the largest area if they were put in the same group. These two rectangles are known as the seeds, and the pair chosen tends to be small relative to the others. Two groups are formed, each with one seed. For each remaining entry, the entry rectangle is used to calculate the area enlargement required in the covering rectangle of each group to include the entry. The difference of the two area enlargements is calculated, and the entry that has the maximum difference is selected as the next entry to be included into the group whose covering rectangle needs the least enlargement. As the selection
is mainly based on the minimal enlargement of covering rectangles, and the rectangle that has been enlarged before requires less expansion to include the next rectangle, it is quite often the case that a single covering rectangle is enlarged until the group has M - m + 1 rectangles (M is the maximum number of entries per node). The two resultant groups will respectively contain M - m + 1 and m rectangles. The linear algorithm chooses the first two objects based on the separation between the objects in relation to the width of the entire group along the same dimension. Greene proposed a slightly different splitting algorithm [Greene, 1989]. In her splitting algorithm, the two most distant rectangles are selected, and for each dimension the separation is calculated. Each separation is normalized by dividing it by the interval of the covering rectangle on the same dimension, instead of by the total width of the entire group [Guttman, 1984]. Along the dimension with the largest normalized separation, rectangles are ordered on the lower coordinate. The list is then divided into two groups, with the first (M + 1)/2 rectangles in the first group and the rest in the other.

2.4.2 The R*-tree

Minimization of both coverage and overlap is crucial to the performance of the R-tree. It is, however, impossible to minimize the two at the same time. A balancing criterion must be found such that a near optimum of both minimizations can produce the best result. Beckmann et al. introduced an additional optimization objective concerning the margin of the covering rectangles: squarish covering rectangles are preferred [Beckmann et al., 1990]. Since clustering rectangles with little variance of the lengths of the edges tends to reduce the area of the cluster's covering rectangle, a criterion that favors squarish covering rectangles is used in the insertion and splitting algorithms. This variant of the R-tree is referred to as the R*-tree.
In the leaf nodes of the R*-tree, a new record is inserted into the page whose entry covering rectangle, if enlarged, has the least overlap with the other covering rectangles. A tie is resolved by choosing the entry whose rectangle needs the least area enlargement. However, in the internal nodes, an entry whose covering rectangle needs the least area enlargement is chosen to include the new record, and a tie is resolved by choosing the entry with the smallest resultant area. The improvement is particularly significant when both the query rectangles and data rectangles are small, and when the data is non-uniformly distributed. In the R*-tree splitting algorithm, along each axis, the entries are sorted by the lower value, and also sorted by the upper value, of the entry rectangles. For each sort, M - 2m + 2 distributions of splits are considered; in the kth distribution (1 ≤ k ≤ M - 2m + 2), the first group contains the first m - 1 + k
entries and the other group contains the remaining M - m - k + 2 entries. For each split, the total area, the sum of the edges (the margin), and the overlap-area of the two new covering rectangles are used to determine the split. Note that not all three can be minimized at the same time. Three selection criteria were proposed, based on the minimum over one dimension, the minimum of the sum of the three values over one dimension or one sort, and the overall minimum. In the algorithm, the minimization of the edges is used. Dynamic hierarchical spatial indexes are sensitive to the order of insertion of the data. A tree may behave differently for the same data set under a different sequence of insertions. Data rectangles inserted previously may result in a bad split in the R-tree after some insertions. Hence it may be worthwhile to do some local reorganization, which is however expensive. The R-tree deletion algorithm provides reorganization of the tree to some extent, by forcing the entries in underflowed nodes to be reinserted from the root. The performance study shows that deletion and reinsertion can improve the R-tree performance quite significantly [Beckmann et al., 1990]. Using this idea of reinsertion, Beckmann et al. proposed a reinsertion algorithm for when a node overflows. The reinsertion algorithm sorts the entries in decreasing order of the distance between the centroid of the entry rectangle and the centroid of the covering rectangle, and reinserts the first p (a variable for tuning) entries. In some cases, the entries are reinserted back into the same node and hence a split is eventually necessary. The reinsertion increases the storage utilization, but it can be expensive when the tree is large.
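The enumeration of candidate distributions along one sorted axis can be sketched as follows. This is a simplified 2-d illustration with helper names of our own choosing; the published algorithm additionally repeats the enumeration for each axis and for both sort orders before applying the selection criteria.

```python
# Sketch of the R*-tree candidate-split enumeration for one sorted axis.
# Rectangles are ((x1, y1), (x2, y2)) pairs (an assumed representation).

def mbr(rects):
    """Minimum bounding rectangle of a group."""
    return (tuple(min(r[0][d] for r in rects) for d in range(2)),
            tuple(max(r[1][d] for r in rects) for d in range(2)))

def margin(r):
    return sum(r[1][d] - r[0][d] for d in range(2))

def area(r):
    return (r[1][0] - r[0][0]) * (r[1][1] - r[0][1])

def overlap(r, s):
    w = [min(r[1][d], s[1][d]) - max(r[0][d], s[0][d]) for d in range(2)]
    return w[0] * w[1] if w[0] > 0 and w[1] > 0 else 0.0

def distributions(sorted_rects, m):
    """Yield the M - 2m + 2 candidate splits of the M + 1 sorted entries:
    the kth puts the first m - 1 + k entries in one group (1 <= k <= M - 2m + 2),
    together with the margin sum, area sum and overlap of the two groups."""
    M = len(sorted_rects) - 1
    for k in range(1, M - 2 * m + 3):
        g1, g2 = sorted_rects[:m - 1 + k], sorted_rects[m - 1 + k:]
        b1, b2 = mbr(g1), mbr(g2)
        yield g1, g2, margin(b1) + margin(b2), area(b1) + area(b2), overlap(b1, b2)
```

Minimizing the margin sum over these candidates is the "minimization of the edges" criterion mentioned above.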
Experimental studies indicate that the R*-tree is more efficient than the other variants studied, and that the R-tree using the linear splitting algorithm is substantially less efficient than the one with the quadratic splitting algorithm [Beckmann et al., 1990].

2.4.3 The R+-tree

The R+-tree [Sellis et al., 1987] is a compromise between the R-tree and the K-D-B-tree [Robinson, 1981] and was proposed to overcome the problem of the overlapping covering rectangles of internal nodes of the R-tree. The R+-tree differs from the R-tree in the following constraints: nodes of an R+-tree are not guaranteed to be at least half filled; the entries of any internal node do not overlap; and an object identifier may be stored in more than one leaf node. The duplication of object identifiers leads to the non-overlapping of entries. In a search, the subtrees are examined only if the corresponding covering rectangles intersect the query region. The disjoint covering rectangles avoid the multiple search paths of the R-tree for point queries. For the space in Figure 2.7, only one path is traversed to search for all objects that contain point p7, whereas for the R-tree, two search paths exist. However, for certain query
rectangles, searching the R+-tree is more expensive than searching the R-tree. For example, suppose the query region is the left half of object r8. To retrieve all objects that intersect the query region using the R-tree, two leaf nodes have to be searched, through R5 and R8 respectively, incurring five page accesses. To evaluate the same query, three leaf nodes of the R+-tree have to be searched, through R6, R9, and R10 respectively, and a total of six page accesses is incurred.

Figure 2.7. The structure of an R+-tree. (a) A planar representation. (b) The directory of an R+-tree.

To insert an object, multiple paths may be traversed. At a node, the subtrees of all entries with covering rectangles that intersect the object bounding rectangle must be traversed. On reaching the leaf nodes, the object identifier is stored there; multiple leaf nodes may store the same object identifier.
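The multi-path descent just described can be sketched as follows. Nodes are plain dictionaries and overflow handling is deliberately omitted; this is an illustration of the traversal only, not the full insertion algorithm.

```python
# Sketch of R+-tree insertion descent: the object identifier is stored in
# every leaf whose covering rectangle intersects the object's bounding
# rectangle. Node layout (dicts) and names are illustrative assumptions.

def intersects(r, s):
    """2-d rectangles as ((x1, y1), (x2, y2)) pairs."""
    return all(r[0][d] <= s[1][d] and r[1][d] >= s[0][d] for d in range(2))

def rplus_insert(node, rect, oid):
    """Descend every subtree whose covering rectangle intersects `rect`;
    the identifier may therefore end up in several leaves."""
    if node.get("leaf"):
        node["entries"].append((rect, oid))
        return
    for cover, child in node["entries"]:     # internal: (cover rect, child)
        if intersects(cover, rect):
            rplus_insert(child, rect, oid)
```

An object straddling the boundary between two covering rectangles is stored in both corresponding leaves, which is exactly the duplication the R+-tree trades for disjoint internal entries.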
Three cases of insertion need to be handled with care [Gunther, 1988, Ooi, 1990]. The first is when an object is inserted into a node where the covering rectangles of all entries do not intersect the object bounding rectangle. The second is when the bounding rectangle of the new object only partially intersects the bounding rectangles of entries; this requires a bounding rectangle to be updated to include the new object bounding rectangle. Both cases must be handled properly such that the coverage of bounding rectangles and the duplication of objects are minimized. The third case is more serious, in that the covering rectangles of some entries can prevent each other from expanding to include the new object. In other words, some space ("dead space") within the current node cannot be covered by any of the covering rectangles of the entries in the node. If the new object occupies such a region, it cannot be fully covered by the entries. To avoid this situation, it is necessary to look ahead to ensure that no dead space will result when finding the entries to include an object. Alternatively, the criterion proposed by Guttman [Guttman, 1984] can be used to select the covering rectangles to include a new node. When a new object cannot be fully covered, one or more of the covering rectangles are split. This means that the split may cause the children of the entries to be split as well, which may further degrade the storage efficiency. During an insertion, if a leaf node is full and a split is necessary, the split attempts to reduce identifier duplication. Like the K-D-B-tree, the split of a leaf node may propagate upwards to the root of the tree, and the split of a non-leaf node may propagate downwards to the leaves. The split of a node involves finding a partitioning hyperplane to divide the original space into two.
The selection of a partitioning hyperplane was suggested to be based on the following four criteria: the clustering of entry rectangles, minimal total x- and y-displacement, minimal total space coverage of the two new subspaces, and minimal number of rectangle splits. While the first three criteria aim to reduce search cost by tightening the coverage, the fourth criterion confines the height expansion of the tree. The fourth criterion can only minimize the number of covering rectangles of the next lower level that must be split as a consequence; it cannot guarantee that the total number of rectangles being split is minimal. Note that all four criteria cannot possibly be satisfied at the same time. While the R+-tree overcomes the problem of the overlapping rectangles of the R-tree, it inherits some problems of the K-D-B-tree [Robinson, 1981]. Partitioning a covering rectangle may cause the covering rectangles in the descendant subtree to be partitioned as well. Frequent downward splits tend to partition already under-populated nodes, and hence the nodes in an R+-tree may contain fewer than M/2 entries. Object identifiers are duplicated in the leaf nodes; the extent of duplication is dependent on the spatial distribution and the size of
the objects. To delete an object, it is necessary to delete all identifiers that refer to that object. Deletion may necessitate major reorganization of the tree.

2.4.4 The BV-tree

The BV-tree, proposed by Freeston, is a generalization of the B-tree to higher dimensions [Freeston, 1995]. While the BV-tree guarantees that it specializes to (and hence preserves the properties of) a B-tree in the one-dimensional case, at higher dimensions it may not be height-balanced, and its storage utilization is reduced to no worse than 33% (instead of 50% in the B-tree). Despite foregoing these two properties, it is able to maintain logarithmic access and update times. As in the BANG file [Freeston, 1987], a subspace S is split into two regions S1 and S2 such that the boundary of S1 encloses that of S2. Each region is uniquely identified by a key, and the key is used to direct the search in the BV-tree. Although the physical boundaries of regions may be recursively nested, there is no correspondence between the level of nesting of a region and the index tree hierarchy which represents it. In fact, whenever a region r1 whose boundary directly encloses the boundary of a region r2 results from a split, r1 is "promoted" closer to the root. To facilitate searching correctly, the actual level to which r1 belongs (called a guard) is stored. Figure 2.8 illustrates a BV-tree. As shown in the figure, the boundary of region a0 encloses that of region b0, which in turn encloses the boundaries of regions c0, d0 and e0. In this example, region b0 has been promoted to the root as it serves as a guard for region b1.

Figure 2.8. The structure of a BV-tree. (a) A planar representation. (b) The BV-tree.

The search begins at the root and descends the tree. At each node, every entry is checked to identify a guard set that represents the regions that best
match the search region. Two types of entries can be found in the guard set: those that correspond to the set of guards of an unpromoted entry, and the best-match unpromoted entry that encloses the best-match guard. As the tree is descended from level h to level h-1, the guard sets found at levels h-1 and h are merged, in the process of which some entries may be pruned away. Once the leaf node is reached, the guard set contains the regions where the search region may be found. The data corresponding to the regions of the guard set are searched to answer the query. During insertion, a complication arises when a promoted region is to be split into two such that one region encloses higher-level regions while the other does not. In this case, the entry for the second region has to be demoted to its unpromoted position in the tree. Deletion may require merging and resplitting. This requires finding a region to merge with, and finding a way to split the merged region again.

2.5 Cell methods based on dynamic hashing

Both extendible hashing [Fagin et al., 1979] and linear hashing [Kriegel and Seeger, 1986, Larson, 1978] lend themselves to an adaptable cell method for organizing k-dimensional objects. The grid file [Nievergelt et al., 1984] and the EXtendible CELL (EXCELL) method [Tamminen, 1982] are extensions of dynamic hashed organizations incorporating a multi-dimensional file organization for multi-attribute point data. We shall restrict our discussion to the grid file and its variants.

2.5.1 The grid file

The grid file structure [Nievergelt et al., 1984] consists of two basic structures: k linear scales and a k-dimensional directory (see Figure 2.9). The fundamental idea is to partition a k-dimensional space according to an orthogonal grid. The grid on a k-dimensional data space is defined by the scales, which are represented by k one-dimensional arrays.
Each boundary in a scale forms a (k-1)-dimensional hyperplane that cuts the data space into two subspaces. The boundaries form k-dimensional unpartitioned rectangular subspaces, which are represented by a k-dimensional array known as the grid directory. The correspondence between directory entries and grid cells (blocks) is one-to-one. Each grid cell in the grid directory contains the address of a secondary page, the data page, where the data objects that are within the grid cell are stored. As the structure does not have the constraint that each grid cell must contain at least m objects, a data page is allowed to store objects from several grid cells as long as the union of these grid cells forms a rectangular region, which is known as the storage region. These regions are pairwise disjoint, and together they span the
data space. For most applications, the size of the directory dictates that it be stored on secondary storage; the scales, however, are much smaller and may be cached in main memory.

Figure 2.9. The grid file layout.

Like other tree structures, splitting and merging of data pages are respectively required during insertion and deletion. Insertion of an object entails determining the correct grid cell and fetching the corresponding page, followed by a simple insertion if the data page is not full. In the case where the page is full, a split is required. The split is simple if the storage region covers more than one grid cell and not all the data in the region fall within the same cell: the grid cells are allocated between the existing data page and a new page, with the data objects distributed accordingly. However, if the storage region covers only one grid cell, or all the data of a region fall within only one cell, then the grid has to be extended by a (k-1)-dimensional hyperplane that partitions the storage region into two subspaces. A new boundary is inserted into one of the k grid scales and, to maintain the one-to-one correspondence between the grid and the grid directory, a (k-1)-dimensional cross-section is added to the grid directory. The resulting two storage regions are disjoint and, to each region, a corresponding data page is attached. The objects stored in the overflowing page are distributed between the two pages, one new and one existing. Other
grid cells that are partitioned by the new hyperplane are unaffected, since both parts of each old grid cell will now share the same data page.

Deletions may cause the occupancy of a storage region to fall below an acceptable level, and these trigger merging operations. When the joint occupancy of a storage region whose records have been deleted and an adjacent storage region drops below a certain threshold, the data pages are merged into one. Based on the average bucket occupancy obtained from simulation studies, Nievergelt et al. [Nievergelt et al., 1984] suggested that 70% is an appropriate occupancy threshold for the resulting bucket. Two different methods were proposed for merging: the neighbor system and the buddy system. The neighbor system allows two data pages whose storage regions are adjacent to merge so long as the new storage region remains rectangular; this may lead to "dead space", where neighboring pages prevent any merging for a particular under-populated page. A more restrictive merging policy like the buddy system is required to prevent dead space. Under the buddy system, two pages can be merged provided their storage regions can be obtained from the next larger storage region using the splitting process. However, total elimination of dead space for a k-dimensional space is not always possible. The merging process may also make the boundary between the two old pages redundant when there are no storage regions adjacent to that boundary. In this case, the redundant boundary is removed from its scale, and the one-to-one correspondence is maintained by removing the redundant entries from the grid directory.

The grid file has also been proposed as a means for spatial indexing of non-point objects [Nievergelt and Hinrichs, 1985]. To index k-dimensional data objects, a mapping from the k-dimensional space to an nk-dimensional space where objects exist as points is necessary.
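The two-disk-access exact-match lookup supported by the scales and the grid directory can be sketched as follows. This is a minimal in-memory sketch: the nested Python list stands in for the disk-resident directory array, and the page names ("P0", "P4", ...) are invented placeholders for data-page addresses.

```python
import bisect

def grid_lookup(point, scales, directory):
    """Exact-match lookup in a grid file: one pass over the cached scales
    to locate the grid cell, then one directory access for the data page.

    scales    -- k sorted lists of partitioning boundaries, one per dimension
    directory -- k-dimensional nested list of data-page addresses
    """
    cell = directory
    for coord, scale in zip(point, scales):
        # Binary-search the scale for the interval containing the coordinate.
        cell = cell[bisect.bisect_right(scale, coord)]
    return cell

# Two dimensions: a boundary at x = 50 and boundaries at y = 30, 60
# give a 2 x 3 grid (in a real grid file, cells may share data pages).
scales = [[50], [30, 60]]
directory = [["P0", "P1", "P2"],   # x < 50
             ["P3", "P4", "P5"]]   # x >= 50
print(grid_lookup((70, 40), scales, directory))  # -> P4
```

In a disk-based implementation the directory access costs one page read, giving the two-access guarantee discussed below for the array-based directory.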
One disadvantage of the mapping scheme is that it is harder to perform directory splitting in the higher-dimensional space [Whang and Krishnamurthy, 1985]. To index a rectangle, it is represented as (cx, cy, dx, dy), where (cx, cy) is the centroid of the object and (dx, dy) are the extensions of the object from the centroid. The (cx, cy, dx, dy) representation causes objects to cluster close to the x-axis, while objects cluster along the line x = y under the (x1, x2, y1, y2) representation. For ease of grid partitioning, the former representation is therefore preferred. For an object (cx, cy, dx, dy) to intersect the query region (qcx, qcy, qdx, qdy), the following conditions must be satisfied:

cx - dx < qcx + qdx and cx + dx > qcx - qdx and
cy - dy < qcy + qdy and cy + dy > qcy - qdy
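The intersection condition translates directly into a predicate over the centroid/extension representation. A small illustrative sketch, with hypothetical coordinate values:

```python
def intersects(obj, query):
    """Test whether a rectangle in centroid/extension form (cx, cy, dx, dy)
    intersects the query region (qcx, qcy, qdx, qdy)."""
    cx, cy, dx, dy = obj
    qcx, qcy, qdx, qdy = query
    return (cx - dx < qcx + qdx and cx + dx > qcx - qdx and
            cy - dy < qcy + qdy and cy + dy > qcy - qdy)

# A 4x2 rectangle centred at (5, 5) against two 2x2 query windows:
print(intersects((5, 5, 2, 1), (7, 6, 1, 1)))    # -> True (they overlap)
print(intersects((5, 5, 2, 1), (10, 10, 1, 1)))  # -> False (disjoint)
```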
Consider Figure 2.10a, where rectangle q is the query rectangle. The intersection search region on the cx-dx hyperplane, the shaded region in Figure 2.10b, is obtained from the first two inequalities of the above intersection condition. Note that the search region can be very large if the global space is large and the largest rectangle extension along the x-axis is not bounded. In Figure 2.10b, the known upper bound, udx, on any rectangle extension along the x-axis reduces the search region to the enclosed shaded region. The same argument applies to the other coordinate. Objects that fall in both search regions satisfy the intersection condition.

Figure 2.10. Intersection search region in the grid file: (a) object distribution; (b) search regions on the cx-dx hyperplane; (c) search regions on the cy-dy hyperplane.

The mapping of regions from a k-dimensional space to points in an nk-dimensional space undesirably changes the spatial neighborhood properties. Regions that are spatially close in the k-dimensional space may be far apart when they are represented as points in the nk-dimensional space. Consequently, the intersection search may not be efficient.

2.5.2 The R-file

The grid file structure was originally designed to guarantee two disk accesses for exact match queries, one to access the directory and the other to access the data page. The "two disk access" property can only be ensured if the directory is stored as an array and all grid cells are of the same size. However, with such an implementation, the size of the directory is doubled whenever a new boundary is introduced.
Most of these directory entries correspond to empty grid cells that do not contain any data objects. Simulation results [Nievergelt et al., 1984] indicate that the size of the directory grows approximately linearly with the size of the file. To alleviate this problem, multi-level directories [Blanken et al., 1990, Hinrichs, 1985, Hutflesz et al., 1990, Freeston, 1987, Whang and Krishnamurthy, 1985], in which grid cells are organized in a hierarchical structure,
have been suggested. We shall present the R-file approach, which is designed for non-zero sized objects.

In the R-file [Hutflesz et al., 1990], cells are partitioned using the partitioning strategy of the grid file, and a cell is split when it overflows. In order for cells to tightly contain the spatial objects, cells are partitioned recursively by repeated halving until the smallest cell that encloses the spatial objects is obtained. Spatial objects that are totally contained in a cell are stored in its corresponding data page, and those that intersect the partitioning line are stored in the original cell. If the number of spatial objects that intersect a partitioning line is more than can be stored in a data page, partitioning lines along the other dimensions are used. If all records lie on the cross point of the partitioning lines, they cannot be separated by any partitioning line, and in such a case a chain of buckets is used. After a split, the original cell and the two new cells overlap; to keep the directory small, empty cells are not maintained. Both the original and the new cells have almost the same number of spatial objects after a split. Figure 2.11 illustrates a case in point. Even so, a high number of cells will be inspected for intersection queries, especially the original large cells. The fact that spatial objects stored in the original unpartitioned cells tend to intersect the partitioning lines of those cells suggests a clustering property of these objects. In order to make intersection search more efficient, two extra values that bound the objects in the partitioning dimension are kept with the original cells. Due to the overlapping cells, the directory is potentially large. To avoid storing the cell boundaries, a z-ordering scheme [Orenstein, 1986] is used to number the cells. With such a scheme, cells are partitioned cyclically.
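At its core, the z-ordering used to number cells is bit interleaving of the cell coordinates. The sketch below shows plain two-dimensional bit interleaving for same-sized cells; the R-file's actual numbering must additionally distinguish cells of different sizes, which this simplified version omits.

```python
def z_order(x, y, bits=8):
    """Z-order (Morton) number of a grid cell: interleave the bits of the
    two cell coordinates, x supplying the even bit positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bit i of x -> bit 2i of z
        z |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y -> bit 2i+1 of z
    return z

# Tracing the four cells of a 2x2 grid yields the Z-shaped ordering:
print([z_order(x, y) for y in (0, 1) for x in (0, 1)])  # -> [0, 1, 2, 3]
```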
For each cell, the directory stores the cell number, the bounding interval, and the data bucket reference. Experiments conducted in [Hutflesz et al., 1990] strongly indicate that the bounding information leads to substantial savings in page accesses.

2.5.3 PLOP-hashing

In [Kriegel and Seeger, 1988], the grid file was extended for the storage of non-zero sized objects. The method is a multi-dimensional dynamic hashing scheme based on Piecewise Linear Order Preserving (PLOP) hashing. Like the grid file, the data space is partitioned by an orthogonal grid. However, instead of using k arrays to store the scales that define the partitioning hyperplanes, k binary trees are used to represent the linear scales. Each internal node of a binary tree stores a (k-1)-dimensional partitioning hyperplane. Each leaf node of a binary tree is associated with a k-dimensional subspace (a slice), in which the interval along its associated axis is a sub-interval and the other k-1 intervals assume the intervals of the global space. Each slice is addressed by an index i stored in its leaf node. To each cell, a page is allocated to store all points that fall in the
unpartitioned subspace. From the indexes stored in the k binary trees, the address of a page can be computed. Adopting a bounding scheme similar to that of the skd-tree, two extra values are stored in a leaf node to bound the objects whose centroids fall in the corresponding slice, along the axis with which the binary tree is associated. Hence, an object is inserted into the grid cell that contains its centroid. The regions defined by the two extra values may overlap, and they are used for intersection search.

Figure 2.11. The R-file: (a) original space; (b) first bucket; (c) second and third buckets; (d) fourth bucket.

The file organizations based on hashing are generally designed for multi-dimensional point data. To use them for spatial indexing, the mapping of objects from a k-dimensional space to an nk-dimensional space, or the duplication of object identifiers, is generally required. Indexing in a parameter space is not efficient for general spatial query retrievals [Guttman, 1984, Whang and Krishnamurthy, 1985].
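The PLOP-hashing computation of a page address from the k slice indexes is not spelled out in the text; one plausible scheme, shown here as an assumption rather than the authors' exact formula, is a row-major (mixed-radix) mapping over the grid of cells.

```python
def page_address(slice_indexes, slices_per_dim):
    """Map the k slice indexes (one from each binary tree) to a linear
    page number, row-major over the grid of cells."""
    addr = 0
    for i, n in zip(slice_indexes, slices_per_dim):
        addr = addr * n + i
    return addr

# A 2-d grid with 4 slices along x and 3 along y: cell (2, 1) -> page 7.
print(page_address((2, 1), (4, 3)))  # -> 7
```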
2.6 Spatial objects ordering

Existing DBMSs support efficient one-dimensional indexes and provide fast access to one-dimensional data. If multi-dimensional objects can be converted to one-dimensional objects, such indexes can be used directly without alteration. The mapping functions used must preserve the proximity between data well enough to yield reasonably good spatial search. The idea is to assign a number to each representative grid cell in a space, and these numbers are then used to obtain a representative number for the spatial objects. Techniques for ordering multi-dimensional objects using single-dimensional values have been proposed. These include the Peano curve [Morton, 1966], locational keys [Abel and Smith, 1983], Z-ordering [Orenstein and Merrett, 1984], the Hilbert curve [Faloutsos and Roseman, 1989], and gray ordering [Faloutsos, 1988].

We discuss the method based on locational keys proposed by Abel and Smith [Abel and Smith, 1983]. A space is recursively divided into four equal-sized subspaces, forming a hierarchy of quadrants. To each subspace, a unique numeric key of base 5 is attached. All objects falling within a given subspace are assigned the subspace's key. The key k for a subspace at level h (> 1) can be derived from the key k' of its ancestor subspace by the following formula:

k = k' + 5^(m-h)       if k is the SW son of k'
k = k' + 2 * 5^(m-h)   if k is the NW son of k'
k = k' + 3 * 5^(m-h)   if k is the SE son of k'
k = k' + 4 * 5^(m-h)   if k is the NE son of k'

Here m is an arbitrary maximum number of levels of decomposition, which is at least h. The global space has 5^m as its key. Figure 2.12 illustrates an example of key assignment (base 5), where the maximum level of decomposition is 4. One can notice that, when the locational keys of the same level are traced, the ordering is a form of N- or Z-ordering.
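The key derivation can be exercised directly. The sketch below follows the formula above, with quadrant codes 1-4 for the SW, NW, SE and NE sons.

```python
def child_key(parent_key, quadrant, h, m):
    """Locational key of a level-h subspace from its parent's key.

    quadrant: 1 = SW, 2 = NW, 3 = SE, 4 = NE (as in the formula above);
    m is the maximum decomposition depth, the global space having key 5**m.
    """
    return parent_key + quadrant * 5 ** (m - h)

m = 4
root = 5 ** m                       # 625: key of the global space
sw = child_key(root, 1, 1, m)       # its SW son at level 1
nw_of_sw = child_key(sw, 2, 2, m)   # that son's NW son at level 2
print(root, sw, nw_of_sw)  # -> 625 750 800
```

Read in base 5, these keys are the digit strings 10000, 11000 and 11200, so a key directly encodes the path of quadrant choices from the root.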
To assign a key to a rectangle, the smallest block that completely covers the rectangle is used. An inherent problem of such an assignment is that an object's bounding rectangle may be very much smaller than the associated quadrant (as a consequence of the bounding rectangle spanning one or more subspace divisions). To alleviate this problem, a decomposition technique [Abel and Smith, 1984] is used, in which a rectangle may be represented by up to four adjacent quadrants. Rectangles B and C in Figure 2.12b illustrate the cases where one and two quadrants are used: key 1300 for rectangle B, and keys 1422 and 1424 for rectangle C. By associating each rectangle with a collection of quadrants, a better approximation of a rectangle is achieved. This form of representation requires an object identifier to be stored in multiple locations.
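Finding the smallest covering quadrant for a rectangle can be sketched by descending the quadrant hierarchy until the rectangle spans a division line. The coordinate convention below (unit square, y growing northwards) is our assumption for illustration, not taken from the paper.

```python
def covering_key(rect, m):
    """Locational key (base 5) of the smallest quadrant fully covering rect.

    rect is (x1, y1, x2, y2) inside the unit square; m is the maximum
    decomposition depth, so the root (whole space) has key 5 ** m.
    """
    x1, y1, x2, y2 = rect
    key, x0, y0, size = 5 ** m, 0.0, 0.0, 1.0
    for h in range(1, m + 1):
        half = size / 2
        if x2 <= x0 + half and y2 <= y0 + half:      # fits in SW son
            quadrant = 1
        elif x2 <= x0 + half and y1 >= y0 + half:    # fits in NW son
            quadrant = 2
            y0 += half
        elif x1 >= x0 + half and y2 <= y0 + half:    # fits in SE son
            quadrant = 3
            x0 += half
        elif x1 >= x0 + half and y1 >= y0 + half:    # fits in NE son
            quadrant = 4
            x0 += half
            y0 += half
        else:
            break  # the rectangle spans a division line: stop descending
        key += quadrant * 5 ** (m - h)
        size = half
    return key

# A small rectangle deep in the south-west corner, with m = 4:
print(covering_key((0.1, 0.1, 0.2, 0.2), 4))  # -> 775 ("11100" in base 5)
```

A rectangle straddling the first division (e.g. one centred on the middle of the space) is assigned the root key, which is exactly the over-approximation problem the decomposition technique addresses.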
Figure 2.12. Ordering based on locational keys: (a) assignment of locational keys; (b) assignment of covering nodes.

However, even if this approach is adopted, the size of the representative quadrant may still be much larger than the size of the object's bounding rectangle. A B+-tree is used to index the objects based on their associated locational keys. For an intersection search, all quadrants that intersect the query region have to be scanned. The major advantage of the use of locational keys is that B+-tree structures are widely supported by conventional DBMSs.

2.7 Comparative evaluation

In this section, we briefly summarize some comparative studies that have been conducted in the literature.

Greene evaluated the performance of R-trees and R+-trees [Greene, 1989]. The comparison found that the R+-tree requires many more splits for large data objects, but fewer splits for smaller data objects. For a uniform distribution of square rectangles that fully covers the map space, 30% of the objects are duplicated. Interestingly, the results show that for the case where the coverage is 100% and the objects are long and narrow along the x-axis, the duplication decreases. This is likely due to the better grouping achieved along the x-axis. In general, the query efficiency tests show that R+-trees perform better for smaller objects and slightly worse for larger objects. The study in fact exhibits a similar pattern of results to that of the kd-trees extended using the overlapping and non-overlapping approaches [Ooi, 1990].
Ooi et al. [Ooi et al., 1991] compared the performance of the skd-tree and the R-tree. The results indicate that the skd-tree is a more efficient structure than the R-tree, with nearly the same storage requirement. The containment search provided by the skd-tree is more efficient than its intersection search and is less sensitive to skewed data.

In [Hoel and Samet, 1992], Hoel and Samet conducted a qualitative comparative study of the performance of three spatial indexes, namely the R*-tree, the R+-tree, and the PMR quadtree [Nelson and Samet, 1987], on large line segment databases. Spatial testing on line segments was conducted. The queries include finding all line segments incident at a given point, the line segments incident at the other endpoint of a line segment through a given point, the nearest line segments to a given point, the line segments whose MBRs contain a given point, and all line segments intersecting a given rectangular window. In their implementation, the execution time of query retrieval is the prime objective, which is sometimes achieved at the expense of somewhat more storage space. The difference in performance is not very great, although the PMR quadtree has a slight edge over the other two, and the R+-tree is slightly better than the R*-tree because of its disjoint decomposition of line segments. The R+-tree required considerably more space than the other two structures. However, the study did not result in claims of convincing superiority for any of the three tested indexes. This could be due to the use of line segments, which are much simpler than non-zero sized and irregularly shaped objects.

In [Ooi, 1990], the efficiency of three extending methods was studied using a family of kd-trees, namely the skd-tree [Ooi et al., 1987], the Matsuyama kd-tree [Matsuyama et al., 1984], and the 4d-tree [Banerjee and Kim, 1986].
Databases of 12,000 objects were generated with different distributions of object sizes and object locations. The average data density used is 3; however, for very skewed object placements, the data density at certain locations could be very high. The study shows that the Matsuyama kd-tree, which adopts the non-overlapping native space indexing approach, performs efficiently in terms of page accesses for small objects. As the object sizes become bigger, its performance degrades. The 4d-tree is the least efficient structure. Its nodes store less information than those of the skd-tree, which accounts for a smaller directory size, but intersection search is not supported efficiently because of its inability to prune the search space effectively.

In [Papadias et al., 1995], the topological relationships of meet, overlap, inside, covered-by, covers, contains, and disjoint between MBRs were studied. The efficiency of the R-tree, R+-tree, and R*-tree was then studied using three databases of 10,000 objects with different sizes of MBRs, and 100 queries. For small MBRs (less than 0.02% of the map area) and medium MBRs (less than 0.1% of the map area), R*-trees and R+-trees outperform the R-tree, with the
R+-tree slightly more efficient than the R*-tree. However, for large MBRs (less than 0.5% of the map area), the R+-tree becomes less efficient than the other two due to the additional levels caused by duplication. The R+-tree does not work well for high data density [Greene, 1989, Papadias et al., 1995].

We also set out to investigate the performance of the R-tree and R*-tree for high-dimensional data. We implemented both structures in C on a Sun SPARC workstation running SunOS 5.5. The size of a disk page used for both trees is 4 Kbytes. The quadratic cost splitting algorithm [Guttman, 1984] is adopted for the R-tree, and the quadratic cost version of evaluating the overlap of a given node is also implemented for the R*-tree. To deal with paging, a priority-based page replacement strategy that adopts a least-useful policy is employed [Chan et al., 1992]. A page is useful if it will be referenced again in the traversal; otherwise, it is useless. The strategy favors useless pages at the higher levels of the tree, and useful pages at the lower levels of the tree. We conducted our experimental study on a real data set consisting of Fourier points in high-dimensional space (2, 4, 8 and 16 dimensions) derived from the contours of industrial parts. The database used is the same one employed in [Berchtold et al., 1996], except that we extracted a subset of 1 million objects only. Figure 2.13 shows some representative results, which are largely consistent with previous work. First, as expected, the R*-tree is more space efficient than the R-tree (see Figure 2.13a). Second, the R*-tree's insertion cost is larger than that of the R-tree, and as the number of dimensions increases, the relative difference widens. This is consistent with the result in [Beckmann et al., 1990]. For point query retrievals, we performed 1000 queries and used the average number of disk accesses as the metric.
The 1000 points are randomly selected from the respective test data for each dimensionality. We observe that when the number of dimensions is small (see Figure 2.13c), both the R*-tree and the R-tree perform equally well (with the R*-tree slightly better). This result is again consistent with the findings in [Papadias et al., 1995] for large databases. However, as the number of dimensions increases, the R*-tree requires more disk accesses than the R-tree during retrieval. We also evaluated 1000 range queries, and the result is shown in Figure 2.13d. The result confirms the observation that the R*-tree outperforms the R-tree only at low dimensions, and is inferior to the R-tree at higher dimensions. Finally, from the results, we note that both the R-tree and the R*-tree do not scale well with the number of dimensions.

2.8 Summary

We have reviewed a number of indexes that are suitable for indexing non-zero sized objects in spatial database systems. These have been categorized based on their extending methods and base structures. We have also discussed
Figure 2.13. Comparison of R-tree and R*-tree: (a) storage cost; (b) insertion cost; (c) point query cost; (d) range query cost.
the strengths and weaknesses of these techniques. Despite the large body of work, we believe the area will remain a very fruitful and challenging one for the next decade, with several promising research directions.

First, there is clearly a lack of benchmarks for evaluating spatial indexes. This can be attributed to the many factors that need to be considered in evaluating a spatial index. Concerning the data, spatial data varies widely in size, spatial objects come in irregular shapes, and objects are not uniformly distributed in the data space. Furthermore, queries range from simple point queries to complex spatial join operations that come in different flavors (intersection, containment and proximity). Designing a suite of benchmarks is an important issue that cannot be ignored.

Second, as pointed out, the evaluation of spatial indexes has been rather limited. Most performance studies used the R-tree as the basis for comparison. Furthermore, most of the work used synthetic data. We believe that more extensive and comprehensive performance studies using real data sets will be necessary and useful for practitioners as well as developers.

Third, the scalability (in terms of the number of dimensions of the data space) of existing indexes has not been adequately addressed. Most of the work is restricted to two-dimensional space. Recent work by Berchtold et al. [Berchtold et al., 1996] addressed the scalability of indexes with respect to the number of dimensions, and showed that the R*-tree does not scale well; rather, it degenerates drastically. The same paper also shows that the TV-tree [Lin et al., 1995] can perform poorly as the number of dimensions increases. While the X-tree [Berchtold et al., 1996] appears to be a promising scalable index, we believe that designing scalable high-dimensional indexes will be highly exciting and rewarding.
3 IMAGE DATABASES

Images have always been an essential and effective medium for presenting visual data. With advances in today's computer technologies, it is not surprising that in many applications much of the data is images. In medical applications, images such as X-rays, magnetic resonance images and computer tomography images are frequently generated and used to support clinical decision making. In geographic information systems, maps, satellite images, demographics and even tourist information are often processed, analyzed and archived. In police department criminal databases, images like fingerprints and pictures of criminals are kept to facilitate the identification of suspects. Even in offices, information may arrive in many different forms (memos, documents, and faxes) that can be digitized electronically and stored as images.

Traditional database management systems, which have been effective in managing structured data, are unable to provide satisfactory performance for images, which are non-alphanumeric and unstructured. The growing need for image information systems has led to the design and implementation of image database systems [Chang and Fu, 1980, Chang and Hsu, 1992, Kunii, 1989, Knuth and Wegner, 1992, Nagy, 1985, Ogle and Stonebraker, 1995, Tamura and Yokoya, 1984].

E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
In this chapter, we focus on content-based retrieval techniques, that is, techniques that retrieve images based on their visual properties such as texture, color and the shape of objects. In particular, we look at the critical issue of speedily finding the correct images in a large image database system based on image features. For a large collection of images, sequentially comparing image features is time-consuming and impractical, if not impossible. Instead, access methods that exploit the image features to narrow the search space are necessary.

We begin our discussion by looking at what constitutes an image database system. Following that, in Section 3.2, we shall discuss some of the issues involved in the design of a content-based index. In the same section, we also review indexing mechanisms that can be used to support content-based retrievals. In Section 3.3, we provide a taxonomy of existing image indexes. The taxonomy is based on the image features used for indexing. Following that, we present four indexes that facilitate speedy retrieval of images based on color-spatial information. In Section 3.4, we examine three hierarchical indexes that integrate multiple existing indexes into a single structure, and in Section 3.5, we present a signature-based technique. Finally, we conclude with a speculation on future trends in Section 3.6.

3.1 Image database systems

An image database system must deal with both structured and unstructured data. Furthermore, an image database system also distinguishes itself by the following additional functionalities:

• Feature extraction. In order to organize the images and their associated information, it is necessary for the system to understand the contents of the images. Thus, the system must be able to analyze an image to extract key features such as the shape of the objects in an image, its color components and its texture.

• Feature-based indexing.
Traditional database systems index their data by key attributes, which are usually numeric or fixed-length text data. For image database systems, the system must build indexes based on the extracted features. Such feature-based indexes can then be used to facilitate efficient search of a large collection of images and other related information based on the features of the images.

• Content-based retrievals. Image database systems should support a wide range of queries. In particular, queries that involve the contents of an image, in words/text or pictorial form, are important and crucial.
• A measure of similarity. Since content-based queries are usually inexact, the system requires a measure to capture what we humans perceive as similarity between two images. However, as the notion of similarity does not necessarily mean correctness, the similarity measure must be carefully designed not to exclude any relevant images, while at the same time minimizing the number of irrelevant images in the results.

Figure 3.1. Architecture of an image database system: a preprocessing module (image input/scanner, feature extraction, and index/database update) feeds the feature/image database, which a query module (interactive query formulation, runtime feature extraction, feature matching, browsing and feedback, and a concurrency control and recovery manager) uses to return retrieved images.

Figure 3.1 shows the (generic) architecture of an image database system. Images are preprocessed to extract the key features used for searching. The images and the feature indexes are then stored in the database. During retrieval, features are extracted from the query image and matched against those stored, to retrieve images that are similar to it. As a consequence of the need to retrieve images based on similarity, the user interface will usually incorporate browsing and feedback mechanisms to facilitate the reformulation of queries to improve accuracy. Like traditional database systems, concurrency control and recovery managers are also critical components of an image database system.

Supporting a fully functional image database system is a difficult problem and embraces different technologies such as image processing, user interface design, and database management. In fact, early systems are largely attribute-based or free-text-based and hardly have any real content-based support. For attribute-based systems, images are treated as binary large objects (BLOBs).
A conventional DBMS, extended with the capability to handle BLOBs, can be used to manage the images. Access to the unstructured images is achieved through the structured attributes of the images. Hence, no special effort is required to design the organization techniques, indexing mechanisms (such as B+-trees and inverted files) and query processing methods of the systems. However, this approach is not capable of handling the more user-friendly content-based queries.

The free-text-based approach applies the concepts of document retrieval techniques to provide "content-based" functionality by manually describing the image and treating the image description as that of a document. Image access is done through the accompanying image description. For example, for the query "Retrieve all images that show a girl skating in an ice rink", the description "a girl skating in an ice rink" is used to retrieve the images. The system attempts to match this description with those of the images stored in the database. Indexing methods that can be used include signature file access methods, inverted file access methods and direct (or sequential) file access methods. Besides being unable to facilitate true content-based queries, the free-text-based approach has other limitations: a free-text description of an image is highly variable, due to the ambiguities of the natural language used to annotate images and the different possible interpretations of the image; an image description is usually incomplete, since an image is semantically richer than a text description; and the vocabularies of the person creating the index and the user, or even of different users, may not match. As such, the effectiveness of this approach is fairly limited. The readers are referred to Chapter 5 for an in-depth discussion of text indexing techniques.
3.2 Indexing issues and basic mechanisms

3.2.1 Key issues in content-based index design

Designing an access method for an image database system is more complex than for a traditional database system. This is because the features to be indexed (hereafter referred to as indexing features) are usually unstructured. Three key issues that must be addressed in designing an index structure for content-based image retrieval are:

• Determine a representation for the indexing feature.

• Determine a similarity measure between two images based on their representations.

• Determine an appropriate index organization.
For the first issue, a suitable representation must be determined and used to represent the indexing feature. Some of the desirable properties of a representation include:

• Exactness. For a representation to be useful, it has to capture the essential details of the indexing feature.

• Space efficiency. The representation should keep the storage cost low. To this end, approximate representations rather than exact representations are often used. For example, instead of representing the shape of an object, its bounding box can be used. As another example, grouping colors that are perceptually similar can reduce the number of colors that need to be maintained by the system without sacrificing retrieval accuracy.

• Computationally inexpensive similarity matching. It should be easier and faster to compute the similarity between the representations than between the features themselves. In general, computing the degree of similarity between approximate representations is less computationally intensive. For example, computing the intersection of two polygons is more costly than computing the intersection of the two rectangles that represent them.

• Preservation of the similarity between the features. Two features that are similar should remain so under their representations.

• Automatic extraction. The representation should be automatically extracted, rather than manually generated.

• Insensitivity to noise, distortion and rotation. Any noise or distortion should not affect the representation drastically. In other words, two features of the same image, one without noise and the other distorted by some noise, should be represented in a similar (if not identical) way. Similarly, the representation of a feature should be the same regardless of whether the image has been rotated.

It is hard to find an effective representation with all the desirable properties. In fact, some of the above properties conflict.
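The cost argument for bounding boxes can be made concrete: an axis-aligned rectangle overlap test takes only four comparisons, so it can serve as a cheap filter before any exact polygon test. The sketch below is our own illustration of this filter-and-refine idea, not an algorithm from the text:

```python
def boxes_intersect(a, b):
    """Constant-time overlap test for axis-aligned boxes given as
    (xmin, ymin, xmax, ymax) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_candidates(query_box, object_boxes):
    """Filter step of filter-and-refine: only objects whose boxes overlap
    the query box need the costly exact polygon-intersection test."""
    return [oid for oid, box in object_boxes.items()
            if boxes_intersect(query_box, box)]
```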
For example, representing the color of an image as a vector (color histogram), which has all the above properties, has been shown to be less effective than a representation that also captures spatial information. However, the latter representation of color incurs more storage, and is more sensitive to the orientation of the image. Before moving on, we would like to look at two methods that can be used to represent image features coarsely. These methods have the advantages of space efficiency as well as reducing the dimensionality of the indexes (for vector-based representations). They can be categorized as follows:
• Partitioning. This method partitions an image space into a fixed-size grid. Each cell is assigned a label and can be used to approximate the size of an object or the spatial location of a feature. For example, the set of cells that contains an object serves as an indication of the size of the object. As another example, the location of an object can be determined by the position of the cell it is in.

• Grouping. This method combines several components of a feature into groups, and represents the image feature in terms of the groups instead of the large number of components. For example, the basic color feature can have over 100 different colors, but these can be grouped into a small number of groups based on the fact that many colors are perceived to be similar by humans. As another example, the shape of an object can be described by a small number of primitives such as lines and arcs.

A coarse representation can be used as a quick means of pruning away irrelevant images; a finer representation is usually necessary in order to restrict the set of potential candidate images to a manageable size. The second issue follows from the first. The similarity measure between the indexing features of two images, say S1, may no longer be appropriate on the representations. Thus, an appropriate similarity measure on the representations, say S2, has to be derived. The main criterion for such a similarity measure is that two features that are similar under S1 should remain so under S2. In fact, since the representations may be approximate, we expect the number of images that are similar to a query image under S2 to be larger than that under S1. There are several alternatives for determining the similarity between two features through their representations:

• Exact match.
In this approach, the representation of an image feature is usually coarse, in the sense that images with similar features will be mapped to the same representation. As a result, an exact match on the representation can be used to search for similar features.

• Approximate match. Under this approach, the degree of similarity between the image representations is computed based on some approximation technique. One advantage of this category is that the image representation can be exact. Where approximate representations are used, we can expect more irrelevant images to be retrieved as well.

Finally, an appropriate index organization should be determined to organize the representations in such a manner that the similarity measure can be supported efficiently. Other important criteria for the selection of an index structure include storage efficiency and maintenance (update) overhead. To a certain extent, the
representation and similarity measure determine the index structure. For example, if the image feature is represented as a vector, and the similarity measure is the Euclidean distance, then a natural choice is a multi-dimensional point access method. Here, the vector is mapped to a point in a multi-dimensional space, and a region search can be used to search for similar images in the multi-dimensional space. On the other hand, if the image features are represented as rectangles in the image space, then a spatial access method may be employed. In fact, as we shall see in Section 3.3, most of the image indexes are based on existing techniques. As such, we shall review some of these techniques before proceeding to look at the taxonomy.

3.2.2 Basic indexing schemes

Spatial access methods. Spatial access methods are file structures used to organize large collections of multi-dimensional points or geometric objects to facilitate efficient range or nearest neighbor searches. It turns out that we can easily exploit such techniques to speed up the retrieval of images. The basic idea is to extract k image features from each image, thus mapping images into points in a k-dimensional feature space. Once this is done, any spatial access method can be used as the index, and similarity queries will then correspond to nearest neighbor or range searches. As an example, consider the color feature. In general, the color feature can be represented as a k-tuple for a system that supports k colors, where the values of the tuple of an image are the percentages of the colors in the image. Many spatial access methods have been proposed in the literature.
These include methods that transform geometric objects into points in a higher-dimensional space, such as the grid file [Hinrichs and Nievergelt, 1983]; methods that linearize spatial data, such as quad-trees [Gargantini, 1982] and "z-ordering" [Orenstein, 1986]; and methods that are based on trees, such as the family of R-trees [Guttman, 1984]. However, most of these methods suffer from the so-called "high-dimensionality curse", that is, these techniques perform no better than sequential scanning as the number of dimensions becomes sufficiently large [Faloutsos et al., 1994]. For example, for R-trees, performance begins to degrade drastically as the dimensionality reaches 20 and above. We refer the reader to Chapter 2 for a survey of spatial access methods.

Inverted file. In an inverted file index, an inverted list is created for each distinct key (indexed feature). The inverted list essentially consists of a list of pointers to the objects that contain features that are similar to the indexed feature. Given an image feature, the inverted file is scanned, and all images with features that are similar to it can thus be retrieved speedily. However,
the inverted file method incurs high storage overhead and is also expensive to update. Some recent work has been done to address the storage problem [Witten et al., 1994, Moffat and Zobel, 1996].

Signature file. The signature file access method is an efficient access method for objects that can be characterized by a set of descriptors, making it suitable for indexing unstructured data such as textual documents (characterized by a set of keywords) and images (characterized by a set of semantic objects or colors). Each descriptor of an image can be represented as a string of bits, and an image signature can be obtained by superimposing (inclusive-OR) all the descriptors of the image. The signatures of all images can then be maintained in a file called the signature file. During query retrieval, the descriptors of the query image are coded into a signature, and the signature file is then used as a filtering mechanism to eliminate most of the unqualifying data so that only a portion of the data file needs to be accessed. The retrieval performance, however, can be hampered by a high false drop probability (due to irrelevant images' signatures matching the query image). Variations of the signature file access method have been proposed to improve its retrieval efficiency. These include the single-level signature file [Roberts, 1979], the multi-level signature file [Sacks-Davis et al., 1987], and the partitioning approach [Lee and Leng, 1989].

3.3 A taxonomy of image indexes

Existing image indexing mechanisms can be classified based on the image features used for indexing. For each image feature, further classifications can be made with respect to the semantic representations used for the feature. A different type of semantic representation entails a different indexing method. In this section, we provide a taxonomy of image indexing schemes based on such classifications.
These schemes have been reported in the literature. For some features, other schemes which may also be applicable but have not been reported are excluded from our discussion. The taxonomy is summarized in Figure 3.2.

3.3.1 Shape feature

The shape feature is extremely useful for image database systems like an X-ray system or a criminal picture identification system. In an X-ray system, queries like "Retrieve all kidney X-rays with a kidney stone of this shape" are very common. For a criminal picture system, we expect queries like "Retrieve all criminals with a round face shape". The example shape, the shape of a kidney stone in the first case, and round in the second, can be supplied using an example image.
[Figure 3.2 (tree diagram): for each image feature (shape, semantic objects, spatial relationship, texture, color) the figure lists its representations (e.g., rectangular cover, geometric properties, similarity against representative objects, object signatures, 2-D string, Tamura features, color histogram, color-spatial) and the corresponding index structures (multi-dimensional indexes, inverted files, signature files, multi-level signature files, sequential files, the two-level B+-tree, the three-tier color index and the Sequenced Multi-Attribute Tree).]

Figure 3.2. A taxonomy of image indexing schemes.
Shape features can be represented using boundary information by means of 16 primitive shape features. Each primitive feature is either a line or an arc, with a starting point, an ending point, and so on. Moreover, each primitive feature can be denoted by a distinct character. Thus, the boundary information can be compactly stored as a one-dimensional string [Jea and Lee, 1990]. The shape features of a shape boundary can then be represented by substrings of the one-dimensional string. This simple representation allows the exploitation of existing efficient string matching algorithms. Since objects with the same shape will be encoded in the same manner, exact string matching is performed instead. To index the string representation, an inverted file is used.

A closely related work by Mehrotra and Gary [Mehrotra and Gary, 1993] used a set of structural components to represent the shape boundary. These components are modeled as an ordered set of interest points such as locally maximal curvature points or vertices of the polygonal approximation. A shape feature can be obtained by fixing the number of points to be used to represent the shape feature. The feature is then mapped into a point in a multi-dimensional space, where the dimension is given by the number of points used to represent the shape. The similarity measure can then be given by the Euclidean distance between pairs of points in the multi-dimensional space. A multi-dimensional point access method is used for indexing the shape feature.

In [Jagadish, 1991], a collection of rectangles that forms a rectangular cover of the shape is used. Since shapes vary widely from object to object, the number of rectangles can be very large. To reduce the storage requirement, at most k rectangles in the cover are used to represent the shape.
The k rectangles picked must capture the most important features of the shape "sequentially", that is, the k rectangles form a sequence. As each rectangle is represented by two pairs of coordinates, and there are at most k rectangles, the shape feature can be easily mapped into a point in a 4k-dimensional space. Thus, a multi-dimensional point access method can be readily used for indexing the shape feature. Similarity retrieval based on Euclidean distance is performed using a region search query.

Shape can also be represented based on the concept of mathematical morphology [Korn et al., 1996, Maragos and Schafer, 1986, Zhou and Venetsanopoulos, 1988], which employs a primitive shape that interacts with an image to extract useful information about its geometrical and topological structure. A (2M+1)-element vector, called the size distribution of a shape [Serra, 1988], can be used to store the measurements of the area of an image at (2M+1) different scales. The pattern spectrum [Maragos, 1989] turns out to be a compact representation that captures the same information. The advantage of the scheme is that it is essentially invariant to rotation and translation, and can highlight differences at several scales. In [Korn et al., 1996], the pattern spectrum is first employed to
capture the shape information of an image (in the domain of a tumor database). The information is then mapped into the (2M+1)-element vector of the size distribution so that a multi-dimensional point index can be employed to index the shape information. While similarity retrieval is essentially a nearest neighbor search, the paper also presented a distance function, the max-granulometric distance, that guarantees no false dismissals.

Numerical vectors have also been employed to model shape. These include using the coefficients of the 2-D Discrete Fourier Transform or the Discrete Wavelet Transform [Mallat, 1989], as well as the first few moments of inertia [Faloutsos et al., 1994, Flickner et al., 1995]. These techniques usually map the shape feature into a multi-dimensional point access method and use the Euclidean distance for similarity retrieval. Alternatively, the shape features can be represented by the geometric properties of the image, such as shape factors (for example, the ratio of height to width), mesh features, moment features and curved line features. In this case, the inverted file has been used for indexing. For a system that is based on the shape feature, unless the images have very distinct shapes, the performance may suffer. As such, shape is usually employed in specialized domains.

3.3.2 Semantic objects

If objects within an image are prominent and can be easily recognized, retrieval can be achieved based on the objects. Queries can be evaluated by matching the list of objects of a query image against the list of objects of images in the database. Two methods have been adopted in the literature:

• An object in an image may be analyzed to determine its degree of similarity against a set of distinct objects. This degree of similarity is represented as a belief interval (bi) [Rabitti and Stanchev, 1989] that indicates how closely an image object matches the representative object used in the system.
An inverted file is used to maintain, for each distinct object, a list of (bi, ptr) pairs, where ptr is a pointer to an image that contains an object that resembles the indexed object with a belief interval of bi. In this way, given a query image object, one first determines the corresponding distinct object it belongs to, from which one can obtain all objects that are similar to it. By sorting the list in non-ascending order, the system can control the degree of similarity desired.

• An object may also be represented by an object signature. An image signature is obtained by superimposing all the object signatures of the objects in the image [Rabitti and Savino, 1991]. The signature file access method can then be used to speed up the retrieval process. A query image's set of signatures can be obtained, and its image signature is first used to prune away images that are irrelevant. Candidate images are then further examined by comparing their object signatures against those of the query image.

The object-based approach is, however, limited by current image analysis techniques. Unless objects are very well defined, it still requires substantial human intervention in order to ensure that the objects are correctly extracted.

3.3.3 Spatial relationship

In an object-based system, a query image with a ball above a box may also result in images with a ball next to a box, or a box above a ball, being retrieved. A more discriminating way to retrieve images is to facilitate more precise querying that specifies both the semantic objects in the images and the spatial relationships between the objects. As an example, consider the query "Retrieve all paintings with a house and a tree on its left". Here, the house and tree are the objects, while "to the left" is a spatial relationship between the two. In [Chang et al., 1987, Chang et al., 1988], a semantic representation for spatial relationships using a two-dimensional string (2-D string) was proposed. An image is first preprocessed to obtain the symbols that represent the objects it contains. The 2-D string representation is then a projection of the symbols along the x-axis and the y-axis, and consists of a pair of one-dimensional strings (1-D strings), each representing the ordering and spatial relationships of the objects along the projected axis. For example, consider an image with three objects such that O1 is to the left of O2, which is to the left of O3. The projection on the x-axis results in the 1-D string O1 < O2 < O3, where "<" is a spatial operator that denotes "to the west or to the south of". In [Chang et al., 1987], only three spatial operators are used: "=" to mean "at the same spatial location as", ":"
to represent "in the same grid cell as", and "<" as explained. During query processing, the 2-D string representation of the query image is obtained and compared against those in the database. Similarity retrieval is supported using an exact representation and an approximate matching algorithm. Variations and extensions of the 2-D string have been explored [Chang et al., 1989, Lee and Hsu, 1990, Costagliola et al., 1992, Lee et al., 1992]. In particular, a multi-level signature file access method has been adopted as follows. An image can be partitioned into an M x N grid. For each object, an M x N bit object signature can be obtained by setting bit (i-1)·M + j to 1 if the object occurs in cell (i,j); otherwise the bit is cleared. An image signature can then be obtained by superimposing the object signatures. Querying is performed by determining the object and image signatures of the query image, and using them to filter the images to be retrieved.
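The grid-based signature scheme just described can be sketched as follows; this is a simplified illustration with hypothetical helper names (the cited papers differ in details such as bit numbering):

```python
M, N = 4, 4  # grid dimensions (illustrative)

def object_signature(cells):
    """Build an M*N-bit signature for one object. `cells` is a set of
    1-indexed (i, j) grid cells the object occupies; bit (i-1)*M + j is
    set for each occupied cell, as in the text (shifted to 0-based)."""
    sig = 0
    for (i, j) in cells:
        sig |= 1 << ((i - 1) * M + j - 1)
    return sig

def image_signature(object_sigs):
    """Superimpose (inclusive-OR) all object signatures of an image."""
    sig = 0
    for s in object_sigs:
        sig |= s
    return sig

def may_match(query_sig, image_sig):
    """Signature filter: an image survives only if every query bit is set.
    Survivors may still be false drops and must be examined further."""
    return (query_sig & image_sig) == query_sig
```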
The effectiveness of exploiting spatial relationships, as already mentioned, can be drastically affected by the orientation of the images, since the relationships between objects may no longer be preserved.

3.3.4 Texture

Texture is an important property that can be used as a cue for image retrieval. In particular, because it can be extracted from both gray-level images and color images, it can be used in many applications. However, the extraction of texture information is a computationally intensive operation. One of the most popular texture representations is the Tamura features [Tamura et al., 1978]. While texture can be captured by six basic computational forms, coarseness, contrast, directionality, linelikeness, regularity and roughness, it has been shown that the first three suffice to discriminate between texture differences in images. As such, these three forms (coarseness, contrast and directionality) have been widely used in texture recognition. These three components are briefly summarized here:

• Coarseness. The coarseness component measures the scale of the texture (for example, pebbles versus boulders). When two patterns differ only in scale, the magnified one is considered to be coarser. For patterns with different structures, those that have a larger element size or fewer element repetitions are perceived to be coarser by the human eye. Coarseness can be computed using moving windows of different sizes. The essence of the method adopted in [Tamura et al., 1978] is to pick the coarsest texture as the best size. For every region in an image, its coarseness is represented by the largest best-size texture, Sbest. The coarseness of the image can then be obtained by taking the average of Sbest over the image.

• Contrast. The contrast component can be thought of as representing the quality of the image. A good quality image is one that is sharp in contrast, while a low quality image is blurred.
The human eye can easily discriminate between a sharp image and a blurred one. As an image's contrast can be varied by stretching or shrinking its gray scale, the intensity of each pixel of an image can be multiplied by a positive constant to derive different contrast values. The contrast can then be obtained as a function of the variance of the gray-level histogram [Tamura et al., 1978].

• Directionality. Directionality describes whether an image has a favored direction (like grass) or whether it is isotropic (like a smooth object such as glass). The human eye can easily differentiate a directional pattern from one that is non-directional. In [Tamura et al., 1978], the degree of directionality is calculated using a histogram of local edge probabilities against
their directional angle. Although this measure does not categorize images as directional or non-directional, this histogram representation can sufficiently capture the global features of the images, such as long lines and simple curves.

Clearly, texture can be modeled as a 3-tuple (coarseness, contrast, directionality). Moreover, since two images are alike if their coarseness, contrast and directionality are similar, the Euclidean distance can be used as a measure of the degree of similarity between images. To speed up the retrieval process, the texture feature can be represented as a point in a 3-dimensional space, with region search being used to prune the search space. There are other representations of texture, such as the Simultaneous Autoregressive (SAR) model and the Wold features [Francos et al., 1993]. Both methods also represent texture as a vector of numbers, and compare images based on the Euclidean distance. As such, a multi-dimensional indexing mechanism can be used to index these texture features as well.

3.3.5 Color

A natural way to retrieve colorful images is to retrieve them by color. The color composition of an image is a global property which does not require knowledge of the component objects of an image. Moreover, color distribution is independent of view and resolution, and color recognition can be carried out automatically without human intervention. A semantic representation for color is a color histogram that captures the color composition of images [Swain, 1993]. Using the RGB color space, the histogram comprises a set of "bins", each representing a color that is obtained by a range of red, green and blue values. The number of pixels of an image falling into each of these bins can be obtained by counting the pixels with the corresponding color. The histogram is then normalized by dividing its entries by the total number of pixels of the image.
The normalized histogram is size-independent, which enables images of different sizes to be compared meaningfully. The degree of similarity between two images is determined by the extent of the intersection between the histograms. Query by visual example is possible by matching the histograms. Object recognition is also achieved by using the color composition of the object. However, to support indexing using color histograms, a multi-dimensional indexing method is necessary, and the number of dimensions required is of very high order (it equals the number of distinct colors to be supported). The color histogram of an image is mapped into a point in the multi-dimensional space, and a region query can be performed to find matching images. However, it has become clear that color alone is not sufficient to characterize an image. For example, consider two images: one with the top half blue and
bottom half red, while the other's top left and bottom right quadrants are red and its bottom left and top right quadrants are blue. Although these two images are similar in color composition, they are entirely different to a human observer. This is because the ways the colors are clustered, and the positions of the clusters, are very different in the two images. As such, several recent studies have proposed integrating color and its spatial distribution to facilitate image retrieval [Chua et al., 1997, Gong et al., 1995, Hsu et al., 1995, Lu et al., 1994, Ooi et al., 1997]. Most of the indexing mechanisms proposed for color-spatial information are multi-layered: the two-level B+-tree [Gong et al., 1995], the three-tier color index [Lu et al., 1994] and the Sequenced Multi-Attribute Tree (SMAT) [Ooi et al., 1997]. An exception to this trend is based on the signature file approach [Chua et al., 1997].

3.4 Color-spatial hierarchical indexes

In this section, we describe three indexes that have been proposed to integrate color and spatial information for image retrieval. All these schemes are hierarchical indexes in that multiple indexing mechanisms are integrated to form a single index structure. The search process begins at the top level index and moves down to the lowest level index, traversing along the path that satisfies the search criterion.

3.4.1 Two-level B+-tree structure

In [Gong et al., 1995], the color-spatial information of an image is modeled by splitting the image into 9 equal sub-areas (3 x 3), and the color information within each sub-area is represented by a color histogram. In this way, by matching the corresponding color histograms of two images, one can obtain a more accurate similarity (in terms of color-spatial information) between the two images than with the traditional histogram-based approach. Although the color histogram is a multi-dimensional representation, Gong et al.
cleverly mapped it into a numerical key. This not only turns the computationally intensive matching process into simple numerical-key comparisons, it also facilitates the exploitation of existing single-dimensional indexing structures such as the B+-tree. As a result, a two-level B+-tree structure was proposed to speed up the retrieval process. We shall first look at the retrieval technique, followed by the transformation of the color histogram into a numerical key, before proceeding to examine the index structure.

The retrieval technique. Given an image, it is first processed to extract its 9 color histograms. Each histogram is then mapped into two levels of information. The first level describes the composition of colors corresponding to the
histogram of the region. However, instead of using the full set of colors (which is very large), the colors are grouped into just 11 "bins". The grouping of colors is based on the observation that some colors are perceived to be similar by humans. This is accomplished in two steps:

• The RGB color space is transformed into Munsell's HVC color space [Miyahara and Yoshida, 1989]. This is necessary because it is not possible to determine the similarity between two colors in the RGB color space. Instead, the HVC color space describes colors in terms of hue (the color type), value (brightness) and chroma (saturation), and perceptual differences can be determined by geometric distances.

• The HVC color space is grouped coarsely into 11 bins, each of which can be distinguished from the others as a distinct color by subjective perception. The grouping is based on the argument that two images with the same visual content but taken under minor differences in illuminating conditions should not be considered as different images.

Furthermore, instead of the traditional approach of using the normalized pixel count to represent the proportion of a group, each group is assigned a range which bounds the percentage of pixels in the image with colors of the group. A total of 9 disjoint ranges are predetermined and used: [0,5), [5,15), [15,25), ..., [65,75), [75,100]. Because of the groupings, two histograms are considered to be similar if all the corresponding ranges of the 11 bins are the same. This simplifies the histogram matching process, but the coarse grouping increases the probability of retrieving irrelevant images, and of missing relevant images whose color composition falls into neighboring ranges. The second level of information contains the average H, average V, and average C values of all the 11 histogram bins.
As in the color composition, the H, V and C values are grouped into 9, 4 and 4 groups respectively, with intervals of 40°, 2.5 and 7.5. This level is used as a secondary similarity measure to complement the histogram metrics in order to reduce the number of irrelevant images retrieved. During query retrieval, the query image is processed to extract its 9 histograms. For each histogram, the two levels of information are obtained from the sample query. The level 1 information is used to prune away dissimilar images, and candidate images are further examined and compared on their H, V and C group values.

The index: Two-level B+-tree structure. The above retrieval mechanism has the nice property that only exact matches need to be performed: two histograms are similar if they have the same range values for the 11 histogram
bins, and for each pair of bins, the groups for the H, V and C values are the same. As such, the authors proposed that the first level information be mapped into a composite key with 12 attributes: the first attribute indicates the histogram region (1 of the 9 regions), and each of the other 11 attributes corresponds to one histogram bin and has a value that indicates its range (note that instead of keeping the range, since the set of ranges is predetermined, fixed and disjoint, a range is represented by a number). Similarly, the second level information is mapped into a 34-attribute composite key: the first attribute represents the histogram region, and the other 33 attributes are split into 11 groups of 3 attributes, each group for a histogram bin, with one attribute for the group number of the H value, one for the group number of the V value, and one for that of the C value.

[Figure: Level 1 is a B+-tree on the normalized pixel count; Level 2 consists of B+-trees on the average H, V and C values.]

Figure 3.3. The two-level B+-tree structure.
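The construction of the first-level composite key can be sketched as follows; the range table follows the text, while the encoding itself is our own illustration rather than the authors' exact scheme:

```python
# The 9 predetermined, disjoint percentage ranges from the text; a range is
# represented by its position in this table rather than by its boundaries.
RANGES = [(0, 5), (5, 15), (15, 25), (25, 35), (35, 45),
          (45, 55), (55, 65), (65, 75), (75, 100)]

def range_number(pct):
    """Map a bin's pixel percentage to its predetermined range number."""
    for k, (lo, hi) in enumerate(RANGES):
        if lo <= pct < hi or (hi == 100 and pct == 100):
            return k
    raise ValueError(f"percentage out of range: {pct}")

def level1_key(region, bin_percentages):
    """12-attribute composite key: the histogram region (1..9) followed by
    the range numbers of the 11 color bins."""
    assert 1 <= region <= 9 and len(bin_percentages) == 11
    return (region,) + tuple(range_number(p) for p in bin_percentages)
```

Because the key is a fixed tuple of small integers, two region histograms match exactly when their keys are equal, which is what allows an ordinary B+-tree to do the histogram matching.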
A two-level B+-tree can then be exploited to speed up the retrieval process; Figure 3.3 shows the structure. The top level index is a B+-tree built on the 12-attribute key, and is used to facilitate the histogram matching process. Each entry in the leaf nodes of this level is associated with an independent B+-tree that is built on the 34-attribute key. This second level tree is devised to facilitate the comparison of the average H, V and C values. Internal nodes store the maximum values of their child nodes in order to direct the search. Since images with the same histogram configuration will have the same first part of the key, they can be found in the same leaf node of the top level tree, and hence in the same second level tree associated with that leaf node. Thus the images in the second level tree will be fetched only if matching at both levels is successful.

3.4.2 Three-tier color index

To handle speedy image retrieval based on the positional information of color, Lu, Ooi and Tan proposed a three-tier color index [Lu et al., 1994]. While layers 1 and 2 prune away irrelevant images based on colors, layer 3 matches images based on their color positions as well. We shall first look at layers 1 and 3 individually and their motivations before presenting the index structure as a whole. The second layer is the R-tree structure.

Layer 1: Dominant color classification. The first layer is the dominant color classification. For each image, a fixed number of dominant colors is extracted. The dominant colors are those with the largest pixel counts. Based on the dominant colors, the image can be assigned to a partition. In this way, images with the same dominant colors can be found in the same partition. The underlying assumption is that images with the same dominant colors tend to be more similar than images that match on the less dominant colors.
Thus, during the image retrieval process, only a few partitions with similar sets of dominant colors need to be examined, while the other partitions with different dominant colors can be ignored. Let k denote the number of dominant colors. Then the number of classes is given by:

number of classes = nCk = n! / ((n - k)! k!)

where n is the number of colors supported in the system. Figure 3.4 illustrates this layer when k = 3.

Layer 3: Multi-level color histogram. The third layer is a complete quadtree structure, called the multi-level color histogram, used to capture spatial
distribution of colors. The basic idea is to capture the set of histograms for an image by recursively decomposing the image. For an image, its multi-level color histogram comprises several levels. The top level (root) of the tree corresponds to a histogram that gives the color composition of the entire image. The second level consists of four histograms that represent the color composition of the top left, top right, bottom left and bottom right quadrants of the image respectively. At the next level, we have the set of histograms that are obtained from further splitting each quadrant of the image into four equal parts, where each histogram is a description of the color content of each smaller part. This process is repeated for the number of levels desired. In general, at the ith level, the image is subdivided into 4^(i-1) regular regions, and each region has its own histogram to describe its color composition. For example, in Figure 3.4, the third layer is a 3-level color histogram. With multi-level color histograms, since every level captures the color composition of the entire image, any level can be used to compute the similarity between two images. For a level, the degree of similarity is given by the sum of the intersections of the corresponding pairs of histograms at the level. In other words, at the ith level, the similarity value is computed as follows:

S_i = (1 / 4^(i-1)) Σ_{j=1}^{4^(i-1)} Σ_{k=1}^{m} min(NH_k^j(Q), NH_k^j(D))

where m is the number of colors supported by the system, Q and D are the query and database images, and NH_k^j(I) is the normalized pixel count of the kth color in the jth histogram of image I. As the lower levels of the tree reflect more closely the color composition and distribution of the image, it is clear that the similarity value decreases as the tree is traversed downwards. This observation leads to a filtering mechanism during image retrieval.
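The level similarity S_i above amounts to a histogram intersection averaged over the 4^(i-1) regions of level i. A minimal sketch (our own naming; each level is a list of per-region histograms, each a list of normalized pixel counts indexed by color):

```python
def level_similarity(q_hists, d_hists, i):
    """Similarity at level i of the multi-level color histogram: the sum of
    histogram intersections over the 4**(i - 1) regions, averaged."""
    regions = 4 ** (i - 1)
    total = 0.0
    for j in range(regions):
        for qk, dk in zip(q_hists[j], d_hists[j]):
            total += min(qk, dk)  # intersection of the jth pair of histograms
    return total / regions
```

Since each region's histogram is normalized, S_i lies in [0, 1] and can be compared against a per-level threshold during the filtering described below.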
During query processing, the query image and the database images are compared based on their color histograms. The top-level histograms are first compared. If they match within some threshold value, the next level will be searched and compared, and so on. Only when the threshold value at the leaf level is met will the image be retrieved. The target image will be "discarded" if the similarity value fails to meet the threshold at any level of the tree. As it costs less to compute the similarity value at the higher levels of the tree, a significant amount of processing time may be saved and unnecessary accesses to irrelevant images can be minimized.

The index: Three-tier color index. Figure 3.4 shows the three-tier color index, which employs three levels of pruning to speed up retrieval. The first layer is the dominant color classification. It allows us to prune away images
belonging to classes that would never satisfy the query, narrowing the search space to some classes. Layer 2 is a multi-dimensional R-tree structure to further prune away images within the candidate partitions that are not relevant. This is achieved as follows. For each partition, an R-tree is used to organize the images within the class based on the proportion of the dominant colors in the images. Since the dominant colors are sufficient to discriminate between images, the dimensionality required is relatively small. Thus, images that are similar will be spatially close to one another, and a region query will be able to restrict the search to the relevant images within the partition. Finally, the last layer, which is the multi-level color histogram, compares the histograms of the query image with those of the remaining potential candidate images. Images that fail the test need not be retrieved. Thus, we can see that the three-tier color index can minimize accesses to the image collection to only those images that are most likely to satisfy the query.

3.4.3 SMAT: A height-balanced color-spatial index

In the two color-spatial approaches presented above, the spatial distribution of colors is coarsely captured by the various histograms. There is no indication of how the color is distributed in the image space within each region represented by a histogram. Another problem with the two approaches is that though the individual tree structures (B+-tree, R-tree, Dominant Color Classification) employed in the respective layers are height-balanced, the entire hierarchical index structure may not be so. For example, in the two-level B+-tree structure, if the database images are skewed such that many images have similar color compositions, then a small number of the B+-trees at the second layer will be much larger (and taller) than the rest. Retrieving these images will result in longer access times.
The same scenario holds for the three-tier color index. Resolving this problem calls for a new notion of height-balancing, and for new height-balanced index structures to be developed. In this section, we look at a height-balanced color-spatial index developed by Ooi et al. [Ooi et al., 1997]. We shall describe the representation of the color-spatial information, the algorithm to extract it and the retrieval technique before looking at the proposed hierarchical index structure.

Representing the color-spatial information. It has been observed that humans are prone to focus on large patches of colors, rather than on small patches that are scattered around [Beck, 1967, Treisman and Paterson, 1980]. The resultant effect is that given two images, they will appear to be similar
Tier 1: Dominant Color Classification. Tier 3: Multi-Level Color Histogram.

Figure 3.4. The three-tier color index.
if both of them have large patches (referred to as clusters) of similar colors at roughly the same locations in the images. For example, Figure 3.5 shows three images and the corresponding eight largest clusters, sorted in descending order. These clusters have been extracted using the proposed color-spatial technique to be discussed shortly. From the cluster representation of image A (Figure 3.5(b)), it can be seen that several clusters contain color 4 (pink). The cluster representation of image B (Figure 3.5(d)) also shows that there are dominant clusters containing color 4 (pink) that fall in the same region and intersect those clusters in image A. Hence, the two images are "similar" in terms of color and spatial information. Similarly, based on the cluster representation in Figure 3.5(f), it is clear that image C is different from the other two images since there is no common color and location between them. Based on this observation, [Ooi et al., 1997] represented the color-spatial information of an image as a set of single-colored clusters in the image space, and these clusters are used to facilitate image retrieval.

Extracting the color-spatial information. To extract the color and spatial information, a heuristic similar to the one adopted in [Hsu et al., 1995] was employed. The heuristic, which comprises three phases, represents the color-spatial information as a set of k single-colored regions, for some predetermined value k which is expected to be small. In the first phase, a set of k representative colors of an image is selected. The colors selected are those with the largest pixel counts in the image. This set of colors is called the dominant colors. In the second phase, a set of clusters for each of the dominant colors is determined. The algorithm adopted is based on the maximum entropy discretization method [Chiu and Kolodziejczak, 1986].
Briefly, for each color selected in the first phase, the maximum entropy discretization algorithm is applied to the image space to extract the spatial information of the color. Initially, the entire image is regarded as one whole region. In the first pass, the image is partitioned into four regions, and the process is repeated on the four regions recursively. For each region, an evaluation criterion is used to determine whether further partitioning is needed. The result of applying the algorithm is a set of representative regions for each selected color. Each region is represented as a rectangle within the image space. At the end of phase two, a large set of single-colored clusters has been derived. In phase three, these clusters are ranked (regardless of color) in descending order of their sizes (areas of the rectangles). The k largest clusters are picked as the dominant clusters to be used as the color-spatial information of the image.
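The recursive partitioning can be sketched as follows. This is a simplified illustration, not the algorithm of [Ooi et al., 1997]: a plain coverage test stands in for the maximum entropy discretization criterion, and all names and thresholds are ours.

```python
def extract_clusters(mask, rect, min_side=8, coverage=0.5):
    """Recursively partition `rect` = (x0, y0, x1, y1) over a binary color
    `mask` (mask[y][x] == 1 where the pixel has the color) and return the
    rectangles that are dense in that color."""
    x0, y0, x1, y1 = rect
    area = (x1 - x0) * (y1 - y0)
    if area == 0:
        return []
    count = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    if count / area >= coverage:
        return [rect]                      # dense enough: keep as one cluster
    if count == 0 or (x1 - x0) <= min_side or (y1 - y0) <= min_side:
        return []                          # empty, or too small to split further
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
    quads = [(x0, y0, xm, ym), (xm, y0, x1, ym),
             (x0, ym, xm, y1), (xm, ym, x1, y1)]
    return [r for q in quads for r in extract_clusters(mask, q, min_side, coverage)]
```

Running this for every dominant color, pooling the resulting rectangles and keeping the k largest by area yields the color-spatial representation described above.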
(a) Image A and (b) its 8 largest clusters; (c) Image B and (d) its 8 largest clusters; (e) Image C and (f) its 8 largest clusters. Each cluster is listed with its dominant color and its Xmin, Ymin, Xmax, Ymax coordinates and area.

Figure 3.5. Three images and their 8 largest clusters.
The similarity function used for image retrieval computes the degree of overlap between the rectangles of the source and target images. Two rectangles overlap only if they have the same color and they intersect in the image space; the degree of overlap is given by the number of pixels intersected. The retrieval process using the color-spatial information is as follows. The image database is initially preprocessed to determine the clusters (color-spatial information) of the images. Given a sample query image, its k clusters are first extracted. The color-spatial information of each image in the database is then compared with that of the query image using the similarity function described above. The images can then be ranked based on the percentage of overlap, retrieved and displayed in that order.

The index: Sequenced multi-attribute tree. Even though the approach restricts the number of clusters per image to k, the number of cluster comparisons to be performed is still very large, about O(N · k^2) where N is the number of images in the database. Since only a small number of images is likely to match the sample image, a large number of unnecessary comparisons are performed. To minimize the expensive comparisons, an index structure, the Sequenced Multi-Attribute Tree (SMAT), is proposed. SMAT is based on three observations on the similarity function of the color-spatial approach:

• Color must be matched before the spatial property, as color is deemed a more important feature.

• If two clusters of two images share the same spatial property but different color content, then the two clusters will not contribute to the similarity function.

• If two clusters of two images share the same color but have non-overlapping spatial properties, then the two clusters will also not contribute to the similarity function.
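The overlap test above can be sketched directly (a minimal illustration; the Cluster record and function names are ours):

```python
from collections import namedtuple

Cluster = namedtuple("Cluster", "color xmin ymin xmax ymax")

def overlap(c1, c2):
    """Pixels in the intersection of two clusters; zero when the colors
    differ or the rectangles do not intersect."""
    if c1.color != c2.color:
        return 0
    w = min(c1.xmax, c2.xmax) - max(c1.xmin, c2.xmin)
    h = min(c1.ymax, c2.ymax) - max(c1.ymin, c2.ymin)
    return w * h if w > 0 and h > 0 else 0

def similarity(query_clusters, image_clusters):
    """Total overlap between the cluster sets of two images."""
    return sum(overlap(q, c) for q in query_clusters for c in image_clusters)
```

Comparing every query cluster against every cluster of every image is exactly the O(N · k^2) cost that SMAT is designed to avoid.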
SMAT is a multi-tier tree structure, where each layer corresponds to an indexing attribute. For example, the top layer can be based on color, the second is based on color percentage or size of the cluster, and the last is based on spatial property. Each layer can be constructed using any indexing mechanism. For example, the top layer can be implemented using a single dimensional indexing structure such as the B+-tree [Comer, 1979]. On the other hand, the lowest layer can employ a multi-dimensional indexing structure like the R-tree [Guttman, 1984]. Except for the lowest level, entries in the leaf nodes of all levels point to the roots of the trees in the next level. Only the leaf nodes of the lowest level tree contain pointers to the image data. Thus, SMAT essentially consists of multiple trees integrated together in a hierarchical manner. To
reach the lowest layer of the SMAT, where the actual images are pointed to, the query must satisfy the conditions relating to the discriminating keys in all the higher layers. Any condition violated in any layer will terminate the search path prematurely. In [Ooi et al., 1997], a variation of the R-tree structure [Guttman, 1984] was employed to implement a 2-tier SMAT structure. Figure 3.6 shows the structural view of the SMAT structure implemented. The first layer discriminates clusters based on color. Since color is a single-dimensional attribute, the R-tree used at this layer is a single-dimensional R-tree (1-D R-tree). Each entry has a color range that defines the data space of the subtree pointed to by its child pointer. The color ranges of internal nodes do not overlap, unless they are exactly the same range. This occurs only when the data is very skewed. Entries of the leaf nodes of the first layer R-tree are of the form (color-range, BR, PTR), where BR defines the spatial bounding rectangle which contains all the clusters' color rectangles within the image space, and PTR points to an R-tree of the next layer. Spatial information is required at the leaf node for balancing purposes. Suppose, for a given color range, the next layer R-tree pointed to by PTR outgrows the others and the next split involves its root node (PTR). By splitting such a node, the height of SMAT will increase. To enable some form of balancing, the node is split according to the splitting strategy adopted at the second layer, but the entry is inserted into the leaf node of the first layer instead. In other words, two entries with the same color range (at the first layer) are created, but with different bounding rectangles. The second layer is based on the spatial information of the clusters. Each entry of an internal node contains a rectangle that defines its child node's data space and a pointer pointing to the subtree.
The second layer R-tree is like a 2-dimensional spatial R-tree structure. For the leaf nodes, entries are of the form (color, coordinates, PTR). The color attribute contains the color of the cluster, the coordinates attribute contains the four coordinates of the cluster, and PTR is a pointer to the address in the database that contains the image data (see Figure 3.6). The image data contains the ID of the image, and the colors and coordinates of the k dominant clusters. This information is used in computing the similarity function (we shall see how when we discuss the matching algorithm).

Matching and searching a SMAT. The matching algorithm retrieves images that are similar to a sample image. Given a sample image, the algorithm extracts k dominant clusters. For each of the clusters extracted, it determines the set of images that are similar to it. This is done by traversing SMAT to determine the clusters that match the clusters of the sample image. It suffices to know that the search algorithm returns a list of pointers to a file that con-
Level 1: 1-D R-tree (color discriminator). Level 2: 2-D R-tree (spatial discriminator).

Figure 3.6. The SMAT structure.
tains information on potential matching images. Recall that this information includes the image id and the (color, cluster) pairs of the image. From this information, the algorithm proceeds to compute the similarity value of the sample image and the candidate image, and ranks the candidate image accordingly. Since it is possible that other clusters of the sample image may also match the same candidate image at a later iteration, the image ids are maintained in a hash table to avoid subsequent comparisons and retrieval. Finally, all the images can be retrieved based on the image ids. The search algorithm of a SMAT structure is fairly straightforward, and follows from the way an R-tree is searched. The algorithm descends the 1-D R-tree from the root, and at each internal node, entries are checked. For each color range that contains the search color, the subtree is searched. When a leaf node is reached, the color of the search cluster is used to check for any entries whose color range contains the color. For all color ranges that qualify, their spatial bounding rectangles are checked to see if they intersect the search cluster. For qualified entries, the search continues to the corresponding 2-D R-trees at the next layer. While the traversal of the 1-D R-tree often leads to a distinct path (unless there are duplicates), more than one subtree under the 2-D R-tree may need to be searched. Nevertheless, the search algorithm can eliminate irrelevant clusters of the indexed images and examine only clusters near the search area.

Inserting color clusters into SMAT. Inserting image clusters into a SMAT raises some interesting issues concerning the growth of the tree. The first issue concerns the initial loading of SMAT. In this case, the tree is not "mature" in the sense that not all layers may have been constructed. The question of when SMAT grows from one layer to the next arises. The second issue deals with the height-balancing of SMAT.
While the R-tree is height-balanced, SMAT may not be fully height-balanced, as images may be inserted towards one end of the SMAT. The strategy adopted lets SMAT grow downward until some criterion is met, and grow upward when height-imbalance occurs. Initially, the heights of all the layers are predetermined. For a SMAT structure with k layers, L1, L2, ..., Lk, let the predetermined height for layer Li be hi. Note that hi, for all i ∈ [1, k], changes dynamically as SMAT grows. During initial loading, SMAT is not fully developed, and so hi is used to guide the growth of layer Li downward as follows: layer Li+1 will appear only if all the nodes along the path leading to the leaf node of layer Li in which the new record is to be inserted are full, and the length of the path has reached hi. This is to ensure that the height of the SMAT is maintained and not increased further unless necessary. To illustrate, consider the 1-D R-tree in Figure 3.6. Suppose leaf node 1 is full and h1 is set
to 2, and a new cluster is to be inserted into leaf node 1. If node 3 is full, instead of allowing the 1-D R-tree to grow, the tree grows downward by creating the next layer tree, and the record is inserted there. On the other hand, if node 3 is not full, then creating the next layer would undoubtedly increase the height of the search path by one. Instead, leaf node 1 should be split as normal. Once all the layers of SMAT are developed, the issue of height-balancing becomes a concern since it affects the retrieval time of SMAT. Although the R-tree is height-balanced, SMAT may not be so. This happens especially if there are a lot of clusters of a particular color. Thus, there is no guarantee that all the trees in the second layer index will grow and shrink at the same rate. This means that it is possible that a particular tree in a level may grow much faster than the other trees in the same level, causing the SMAT to be skewed to one side. That is to say, the basic SMAT structure can only be locally balanced but not globally height-balanced. Since SMAT is a multi-tier structure, the concept of height-balance is slightly different from that of a single-structure index. A SMAT structure is height-balanced if the following two conditions are met:

• Each tree structure within a layer is height-balanced.

• The difference in the heights of trees within a layer, say Li, is at most ei for some predetermined ei for each layer.

Figure 3.7 illustrates a height-balanced tree. As can be seen, in the worst case, the difference in height between trees within a k-layer SMAT is Σ_{i=2}^{k} ei. To keep SMAT height-balanced, the upper layers are allowed to grow once the lowest layer has been established. The minimum height of the trees at each layer is maintained. If there is an increase in the height of a tree (at a layer) as a result of an insertion, the new height of the tree is compared against the minimum height at that layer.
If the difference between the two is above a certain predetermined threshold, then rebalancing is activated. Rebalancing is performed as follows. Let the layer where rebalancing is needed be Li, and its parent layer be Li-1. Let the root of the tree that causes height imbalance at Li be Ri, and the leaf node of Li-1 that points to Ri be LNi. Let the entry in LNi that points to Ri be Iold. The information at Ri is used to insert a new entry, Inew, into LNi. Iold is set to point to the left child of Ri, and Inew is set to point to the right child of Ri. Ri can then be removed. Note that the corresponding bounding information in Iold needs to be updated too. The insertion algorithm that SMAT adopts within a tree is similar to that used in R-trees in that new clusters are added to the leaves, nodes that overflow are split, and splits are propagated up the tree. The splitting algorithm adopted is based on the quadratic-cost algorithm of the R-tree by Guttman [Guttman, 1984].
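The two balance conditions above can be expressed as a simple check. This sketch is ours, assuming the heights of the trees are collected per layer:

```python
def is_height_balanced(layer_heights, e):
    """SMAT balance test. Each individual tree is assumed height-balanced
    (an R-tree/B+-tree property), so only the second condition is checked:
    within layer i, tree heights may differ by at most e[i]."""
    return all(max(heights) - min(heights) <= e[i]
               for i, heights in enumerate(layer_heights))
```

For example, with two layers whose trees have heights [3, 3] and [2, 4] and bounds e = [0, 2], the structure is balanced; growing one second-layer tree to height 5 would trigger rebalancing.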
Figure 3.7. A height-balanced SMAT (layers 1 to 4, each layer i shown with its minimum height hi and bound hi + ei).

The algorithm attempts to find a small-area split, but is not guaranteed to find one with the smallest area possible. There is, however, the additional task of handling height-balancing.

3.5 Signature-based color-spatial retrieval

In this section, we present a signature-based color-spatial retrieval technique [Chua et al., 1997]. The mechanism involves several components, and we discuss each of them in a subsection. First, the color-spatial information has to be extracted and represented. Next, we describe the retrieval process that is based on the color-spatial information. In particular, the retrieval process requires a measure to compute the similarity between two images (in terms of their color-spatial representation). We also discuss an approach which incorporates the concept of perceptually similar colors and weighting of colors.

3.5.1 Representing the color-spatial information

The proposed color-spatial approach partitions each image into a grid of m x n cells of equal size. Figure 3.8 shows an example of an image being partitioned into a 4 x 8 grid. Instead of obtaining the color-spatial information at pixel level, the colors that can be used to represent a cell are determined. This is done as follows. For a given color, each cell is examined to determine the percentage of the total number of pixels in the cell having that color. If this
percentage is greater than a pre-defined threshold value, then the cell is said to be represented by that color. This approach is equivalent to applying the maximum entropy discretization algorithm [Chiu and Kolodziejczak, 1986] under the assumption of uniform color distribution. Note that, depending on the threshold value, a cell may have no color representative or it may have more than one representative.

(Empty cell: does not satisfy the threshold; filled cell: satisfies the threshold.)

Figure 3.8. An image partitioned into a 4 x 8 grid.

For the approach to be practical and useful, several issues have to be addressed. First, the number of colors can be very large, resulting in a large set of color-spatial information. This is resolved by restricting the number of colors for an image to a set of C colors (called the dominant colors) of the image. C is expected to be small as most images are usually dominated by a few colors. To select the C dominant colors, the heuristic employed in [Hsu et al., 1995] is adapted. It works as follows. Two color histograms, Hi and Hc, representing the color composition of the entire image and the center of the image, are obtained. First, Ci (Ci < C) colors that have the largest number of pixels in Hi are picked. Next, the Ci colors picked are eliminated from consideration when the remaining Ce (= C - Ci) colors are to be picked. The Ce colors are obtained from the remaining colors with the largest number of pixels in Hc. While the first set of colors represents the background colors, the second set represents the object colors (based on the inherent assumption that objects usually appear in the center of an image).
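A minimal sketch of this two-step selection (our naming; histograms are dicts mapping a color to its pixel count):

```python
from collections import Counter

def dominant_colors(whole_hist, center_hist, c_i, c_e):
    """Pick c_i background colors from the whole-image histogram, then c_e
    object colors from the center histogram, excluding those already picked."""
    background = [color for color, _ in Counter(whole_hist).most_common(c_i)]
    remaining = {color: count for color, count in center_hist.items()
                 if color not in background}
    objects = [color for color, _ in Counter(remaining).most_common(c_e)]
    return background, objects
```

Excluding the already-picked background colors before ranking the center histogram is what reduces the chance that a dominant background color is mistaken for an object color.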
Unlike the algorithm in [Hsu et al., 1995], where the background and the object colors are selected alternately, the modification is to reduce the probability that the most dominant color in the center of the image (representing the object) is in fact one of the dominant background colors. This is based on the observation that a significant portion of the center region of an image can be covered by the background colors. The second issue concerns the representation of the color-spatial information. It turns out that the proposed approach has a very nice property - given a
color, a cell is either represented or not represented by it. As such, each cell can be represented by a bit: if the cell satisfies the threshold value, the bit is set; otherwise, it is cleared. Hence, for each color, a bitstream (called the color signature) that captures the spatial distribution of that color is obtained. In the color signature, bit (i · n + j) corresponds to cell (i, j). Referring to Figure 3.8 again, suppose a color qualifies to be the representative of cells 0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30 and 31; its corresponding 32-bit color signature will be 10001111001000110000000001100011. Given an image with k colors, there will be k color signatures. These color signatures can be superimposed (bitwise logical-OR) to obtain an image signature.

3.5.2 The retrieval process

From the human perception point of view, two images are perceived to be alike if the color compositions of the two images are similar, and the distributions of the colors in the images are similar. Under the signature-based representation of color information, the above two points can be translated into the following two conditions to facilitate efficient retrieval:

• The images have the same representative sets of colors.

• The signatures representing both images are similar in that they may only differ in some of the bits. This only requires a simple operation (logical AND) to compute the intersection between two images for a particular color.

We discuss in the next few subsections several similarity measures that have been used [Chua et al., 1997] to indicate the similarity between two images based on their signatures.

Basic similarity function. For the signature-based color-spatial approach, recall that each bit in a signature represents a particular cell in the image. Let Qi and Di denote the signatures of color i for a query image Q and a database image D respectively.
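The signature construction can be reproduced directly; the worked example from the text serves as a check (function names are ours):

```python
def color_signature(cells, total_cells=32):
    """One bit per grid cell in row-major order; a bit is set when the cell
    is represented by the color."""
    return ''.join('1' if c in cells else '0' for c in range(total_cells))

def image_signature(signatures):
    """Superimpose (bitwise logical-OR) the color signatures of an image."""
    return ''.join('1' if any(s[i] == '1' for s in signatures) else '0'
                   for i in range(len(signatures[0])))

sig = color_signature({0, 4, 5, 6, 7, 10, 14, 15, 25, 26, 30, 31})
# sig == "10001111001000110000000001100011", matching the example in the text
```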
Then, the two images have color i at the same particular region (cell) if and only if the corresponding bits in both signatures are set; otherwise the two images are not similar at the region. Let the representative color sets of Q and D be CQ and CD respectively. Then, the similarity measure, SIMbasic, between Q and D for a color i ∈ CQ can be determined as:

SIMbasic(Q, D, i) = BitSet(Qi & Di) / BitSet(Qi)   if color i ∈ CD
                  = 0                              otherwise          (3.1)

where BitSet(BS) denotes the number of bits in the bitstream BS that are set, and '&' represents the bitwise logical-AND operation. Now, if a large part of
cells in Q has the same color as that in D, then the similarity computed will be close to 1. The similarity measure between two images Q and D is then given by:

SIMbasic(Q, D) = Σ_{i ∈ CQ} SIMbasic(Q, D, i)          (3.2)

Similarity function with perceptually similar colors. Because of the effectiveness of using perceptually similar colors [Niblack et al., 1993], Chua et al. also incorporated the contributions of perceptually similar colors in their similarity measure. To determine the degree of similarity between two colors, the method proposed by Ioka [Ioka, 1989] was adopted. The method first transforms colors in the RGB space to the CIE (Commission Internationale de l'Eclairage) L*u*v* space, and the similarity between two colors can be measured from the Euclidean distance between the colors in the CIE L*u*v* space. The Euclidean distance between two colors, i and j, in the L*u*v* space is computed as:

D(i, j) = sqrt((Li - Lj)^2 + (ui - uj)^2 + (vi - vj)^2)          (3.3)

Let M denote the number of L*u*v* colors the system can support. The degree of similarity between two colors, i and j, is given by:

SIM(i, j) = 0                            if D(i, j) > p × Dmax
          = 1 - D(i, j) / (p × Dmax)     otherwise                (3.4)

where Dmax = max D(i, j), i ≠ j, 1 ≤ i, j ≤ M, and p is a predetermined threshold value between 0 and 1 (in our study, we have arbitrarily set p to 0.2). Essentially, p × Dmax represents the tolerance within which two colors are considered to be similar. If SIM(i, j) > 0, then color i is said to be perceptually similar to color j, and vice versa. The larger the value of SIM(i, j), the more similar the two colors are. If SIM(i, j) = 0, it means that the two colors are not perceived to be similar. The similarity values computed for all pairs of colors are stored in an M x M matrix, called the color similarity matrix (denoted SM), where entry (i, j) corresponds to the value of SIM(i, j).
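Equations 3.3 and 3.4 can be sketched as follows (a minimal illustration; the function names are ours):

```python
import math

def luv_distance(c1, c2):
    """Euclidean distance between two colors given as (L*, u*, v*) triples
    (Equation 3.3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def color_similarity(d, d_max, p=0.2):
    """Equation 3.4: zero beyond the tolerance p * d_max, otherwise
    decreasing linearly from 1 as the distance grows."""
    return 0.0 if d > p * d_max else 1.0 - d / (p * d_max)
```

Precomputing color_similarity for all color pairs yields the M x M color similarity matrix SM used during retrieval.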
SM is stored in a flat file and is used frequently during the retrieval process to determine the similarity between two colors. Under the signature approach, the contribution of the perceptually similar colors of color i for query image Q and database image D is computed as follows:

SIM_percept(Q, D, i) = Σ_{j ∈ S_p} [BitSet(Q_i ∧ D_j) / BitSet(Q_i)] × SM(i, j)   (3.5)
where S_p is the set of colors that are perceptually similar to color i, as derived from the color similarity matrix SM; SM(i, j) denotes the (i, j) entry of matrix SM. To take the contributions of perceptually similar colors into consideration, Equations 3.1 and 3.5 can be combined to obtain the perceived similarity between two signatures on color i as follows:

SIM_color-spatial(Q, D, i) = SIM_basic(Q, D, i) + SIM_percept(Q, D, i)   (3.6)

Thus, the similarity measure for query image Q and database image D is the sum of the similarities for each color in the representative set C_Q of image Q, and is given as follows:

SIM_color-spatial(Q, D) = Σ_{i ∈ C_Q} SIM_color-spatial(Q, D, i)   (3.7)

Weighted similarity function. In the above similarity measure, all the dominant colors are implicitly assigned the same weight. However, in some applications, it may be desirable to give the object colors a higher weight. This is particularly useful when the object is at the center and the user is only interested in retrieving images containing similar objects at similar locations. The authors also proposed a weighted similarity measure, given as follows:

SIM_weighted(Q, D) = Σ_{i ∈ C_i} SIM_color-spatial(Q, D, i) + wt × Σ_{i ∈ C_c} SIM_color-spatial(Q, D, i)   (3.8)

where C_i and C_c are the sets of background and object colors of Q respectively, and wt (> 1) is the weight given to the object colors. A weight greater than 1 is assigned to the object colors so as to rank more highly those images whose object colors are similar to the query image's.

3.6 Summary

In this chapter, we have surveyed content-based indexing mechanisms for image database systems. We have looked at various methods of representing and organizing image features such as color, shape and texture in order to facilitate speedy retrieval of images, and at how similarity retrievals can be supported.
In particular, we have provided a more in-depth discussion of color-spatial techniques that exploit colors as well as their spatial distribution for image retrieval. As images will continue to play an important role in many applications, we believe the need for efficient and effective retrieval techniques and access
methods will increase. While we have seen much work done in recent years, much remains to be explored in this field. In what follows, we outline several promising areas (not meant to be exhaustive) that require further research.

Performance evaluation

This chapter has presented a representative set of indexes for content-based image retrieval. Unlike other related areas such as spatial databases, the number of indexes proposed to facilitate speedy retrieval of images is still very small. This is probably because content-based image retrieval has largely been studied by researchers in the pattern recognition and imaging communities, whose focus has been on extracting and understanding features of the image content, and on studying the retrieval effectiveness of the features (rather than on efficiency issues). It is not surprising, then, that the indexes discussed have not been extensively evaluated. Apart from [Ooi et al., 1997], which reported a preliminary performance comparison demonstrating that SMAT outperforms the R-tree in most cases, most of the other works have only compared against the sequential scanning approach. We believe that a comparative study is not only necessary but will be useful for application designers and practitioners in picking the best method for their applications. It will also help researchers design better indexes that overcome the weaknesses and preserve the strengths of existing techniques. Another aspect of performance study, which is applicable to indexes in general, is the issue of scalability. Again, most of the existing work has been performed on small databases. How well such indexes will scale remains unclear until they have been put to the test. Readers are referred to [Zobel et al., 1996] for some guidelines on the comparative performance study of indexing techniques.

More on access methods

The focus of this chapter has been on content-based access methods.
There are many other content-based retrieval techniques that have been proposed in the literature [Aslandogan et al., 1995, Chua et al., 1994, Gudivada and Raghavan, 1995, Hirata et al., 1996, Iannizzotto et al., 1996, Nabil et al., 1996] and shown to be effective (in terms of recall and precision). These works, however, have not addressed the issue of speedy retrieval. Designing efficient access methods for these promising methods will make them more practical and useful. Another promising direction is to further explore color and its spatial distribution. One issue is to exploit colors that are perceptually similar. For example, out of the 16.7 million shades of color displayable on a 24-bit color monitor, the human eye can only differentiate up to 350,000 shades. As such, colors that are perceived to be similar should contribute to the comparison of color similarity. While some work has been done in this direction [Chua et al., 1997, Niblack et al., 1993], perceptually similar colors are considered in the computation of the degree of similarity, rather than being modeled in the feature representation. We believe the latter can be more effective in pruning the search space. Another issue is to exploit texture and color for the segmentation of an image space. Indexing of clusters based on both texture and color may be more effective.

Concurrent access and distributed indexing

Traditionally, image retrieval systems have been used for archival systems that are usually static, in that the images are rarely updated. As such, the issue of supporting concurrent accesses is not critical. Instead, in such applications, the access methods should be designed to exploit this static characteristic. However, as multimedia applications proliferate, we expect to see more real-time applications as well as applications running in parallel or distributed environments. In both cases, existing techniques will have to be extended to support concurrent accesses. Some techniques have been developed for centralized systems [Bayer and Schkolnick, 1977, Sagiv, 1986, Ng and Kameda, 1993] as well as for parallel and distributed environments [Achyutuni et al., 1996, Kroll and Widmayer, 1994, Litwin et al., 1993b, Tsay and Li, 1994]. But we believe more research tailored to image data, especially data involving hierarchical structures, is needed.

Integration and optimization

The retrieval results of an image database system are usually not very precise. The effectiveness of using the content of an image for retrieval depends very much on the image representation and the similarity measure. It has been reported that using colors and textures can achieve a retrieval effectiveness of up to 60% in recall and precision [Chua et al., 1996].
Furthermore, different retrieval models based on different combinations of visual attributes and text descriptions achieve almost similar levels of retrieval effectiveness. Moreover, each model is able to retrieve a different subset of the relevant images. This is because each image feature captures only a part of the image's semantics. The problems then include selecting an "optimal" set of image features that best fits an application, as well as developing techniques that can integrate them to achieve optimal results. One promising method is to use content-based techniques as the basis, but also to exploit the semantic meanings of the images and queries to support concept-based queries. Such techniques are known as semantic-based retrieval techniques. Typically, some form of knowledge base is required, rendering such techniques domain-specific. In [Chua et al., 1996],
the domain knowledge is supplied by users as part of a query. The query is modeled as a hierarchy of concepts through a concept specification language. Concepts are defined in terms of multiple content attributes of the images, such as text, colors and textures. Each concept has three components: its name, its relationships with other concepts, and rules for its identification within the images' contents. In answering queries, the respective indexes are used to speed up the retrievals for concepts at the leaves of the hierarchy, and their results are combined based on the defined hierarchy of concepts. More studies are certainly needed along this direction.
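A minimal sketch of such a concept hierarchy is given below. The concept names, the attribute dictionary and the averaging used to combine sub-concept scores are all illustrative assumptions; the actual combination in [Chua et al., 1996] follows the rules of the concept specification language:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Concept:
    """A node in a query concept hierarchy: a name, relationships to
    sub-concepts, and an identification rule evaluated against an
    image's content attributes (text, colors, textures, ...)."""
    name: str
    children: List["Concept"] = field(default_factory=list)
    # Rule maps an image's attribute dict to a match score in [0, 1].
    rule: Callable[[Dict[str, float]], float] = lambda image: 0.0

    def score(self, image: Dict[str, float]) -> float:
        if not self.children:
            # Leaf concept: answered directly by an index-backed rule.
            return self.rule(image)
        # Internal concept: combine the children's scores (here: average,
        # purely as a placeholder combination strategy).
        return sum(c.score(image) for c in self.children) / len(self.children)
```

For example, a hypothetical "beach" concept built from "sea" and "sand" leaf concepts scores an image by averaging the two leaf rules, mirroring how leaf-level index results are combined up the hierarchy.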
4 TEMPORAL DATABASES

Apart from primary keys and attributes that rarely change, many attributes evolve and take new values over time. For example, in an employee relation, employees' titles may change as they take on new responsibilities, as will their salaries as a result of promotions or increments. Traditionally, when data is updated, its old copy is discarded and only the most recent version is captured. Conventional databases that have been designed to capture only the most recent data are known as snapshot databases. With the increasing awareness of the value of the history of data, maintenance of old versions of records becomes an important feature of database systems. In an enterprise, the history of data is useful not only for control purposes, but also for mining new knowledge to expand its business or to move on to a new frontier. Historical data is increasingly becoming an integral part of corporate databases despite its maintenance cost. In such databases, versions of records are kept and the database grows as time progresses. Data is retrieved based on the time for which it is valid or recorded. Databases that support the storage and manipulation of time-varying data are known as temporal databases. In a temporal database, the temporal data is modeled as collections of line segments. These line segments have a begin time, an end time, a time-invariant

E. Bertino et al., Indexing Techniques for Advanced Database Systems. © Kluwer Academic Publishers 1997
attribute, and a time-varying attribute. Temporal data can be either valid time or transaction time data. Valid time represents the time interval during which the database fact is true in the modeled world, whereas transaction time is the time when a transaction is committed. A less commonly used time is user-defined time, and more than one user-defined time is allowed. A database that supports transaction time may be visualized as a sequence of relations indexed by time and is referred to as a rollback database. The database can be rolled back to a previous state. Here the rollback database is distinguished from the traditional snapshot database, where temporal attributes are not supported and no rollback facility is provided. A database that supports valid time records a history of the enterprise being modeled as it is currently known. Unlike rollback databases, these historical databases allow retroactive changes to be made to the database as errors are identified. A database that supports both time dimensions is known as a bitemporal database. Whereas a rollback database views records as being valid at some time as of that time, and a historical database always views records as being valid at some moment as of now, a bitemporal database makes it possible to view records as being valid at some moment relative to some other moment. One of the challenges for temporal databases is to support efficient query retrieval based on time and key. To support temporal queries efficiently, a temporal index that indexes and manipulates data based on temporal relationships is required. Like most indexing structures, the desirable properties of a temporal index include efficient usage of disk space and speedy evaluation of queries. Valid time intervals of a time-invariant object can overlap, but each interval is usually closed.
On the other hand, transaction time intervals of a time-invariant object do not overlap, and the last interval is usually not closed. Both properties present unique problems for the design of time indexes. In this chapter, we briefly discuss the characteristics of temporal applications, temporal queries, and various promising structures for indexing temporal relations. We also report on an evaluation of some of the indexing mechanisms to provide insights into their relative performance.

4.1 Temporal databases

In this section, we briefly describe some of the terms and data types used in temporal databases. For a complete list of terms and their definitions, please refer to [Jensen, 1994]. An instant is a time point on an underlying time dimension. In the discussions that follow, we use 0 to mark the beginning of time, and time point to mean an instant on the discrete time axis. A time interval [Ts, Te] is the time between two time points, Ts and Te, where Ts ≤ Te, with the inclusion of the
end time. Note that the closed-range representation is equivalent to the half-open representation, since [Ts, Te] = [Ts, Te + 1). A chronon is a non-decomposable time interval of some fixed minimal duration. In some applications, chronons have been used to represent an interval. A span or time span is a directed duration of time; it is a length of time with no specific starting and ending time points. The lifespan of a record is the time over which it is defined. The lifespan of a version (tuple) of a record is the time during which it is defined with certain time-varying key values. For indexing structures that support time intervals, start time and version lifespan are two parameters that may affect their query and storage efficiency.

4.1.1 Transaction time relations

Transaction time refers to the time when a new value is posted to the database by a transaction [Jensen, 1994]. For example, suppose a transaction time relation is created at time Ti, so that Ti is the transaction time value for all the tuples inserted at the creation of the relation. The lifespan of these tuples is [Ti, NOW]. The right end of the lifespan is open at this time, and can be assumed to have the value NOW to indicate a progressing time span. At time Tj, when a new version of an existing record is inserted, the lifespan of the new version is [Tj, NOW], and that of the previous version becomes [Ti, Tj). Transaction times, which are system generated, follow the serialization order of transactions and hence are monotonically increasing. As such, a transaction time database can be rolled back to some previous state along its transaction time dimension. There are two representations for transaction time intervals. One approach is to model transaction time as an interval [Snodgrass, 1987]; the other is to model transaction time using a time point [Jensen et al., 1991, Lomet and Salzberg, 1989, Nascimento, 1996].
The latter approach implicitly models an interval by using the time when a new version is inserted as the start of its transaction time, and the time point immediately before the insertion of the next version as its transaction end time. In what follows, we shall use the single-time-point representation to model transaction time. However, explicit representation of transaction time intervals is often used for performance reasons. To illustrate the concept of temporal relations, we use a tourist relation that keeps track of the movement of tourists in order to study the tourism industry. The relation has a time-invariant attribute, pid, and a time-varying attribute, city. At time 0, the relation is created and the transaction time value for the current tuples is 0 (Table 4.1). The lifespan of these tuples is [0, NOW]. At time 3, the tuple with pid = p1 is updated; the new city value is Los Angeles (Table 4.2).
Table 4.1. A tourist transaction time relation at time 0.

tuple  pid  city        Tt
t1     p1   New York    0
t2     p2   Washington  0
t3     p3   New York    0

Table 4.2. The tourist transaction time relation at time 3.

tuple  pid  city         Tt
t1     p1   New York     0
t2     p2   Washington   0
t3     p3   New York     0
t4     p1   Los Angeles  3
t5     p6   Seattle      3

To keep the history, a new tuple t4 is inserted. Thus, the lifespan of t1 is [0, 3) and the lifespan of t4 is [3, NOW]. In a transaction time relation, there are no retroactive updates (updates that are valid in the past) or predictive updates (updates that will be valid in the future). Each transaction is committed immediately with the current transaction time. For instance, if at time 2 the city for p1 changed to Seattle, this update cannot be committed at time 3. If a tuple will be updated at time 4, this update cannot be reflected in Table 4.2, because predictive updates are not supported in a transaction time relation. Note that time intervals that are still valid at the present time point are not closed; in other words, their end time progresses with the current time.

4.1.2 Valid time relations

The transaction time dimension only represents the history of transactions; it does not model real-world activity. We need a time dimension that models the history of an enterprise, such that the database can be rolled back to the right time-slice with respect to the enterprise's activity. Valid time is the time when a fact is true. In a valid time relation, a time interval [Ts, Te] is used to indicate when the tuple is true. Valid time intervals are usually supplied by the user, and each
new tuple is inserted into the relation with its associated valid time interval. A time-invariant key can have different versions with overlapping valid times, provided the temporal attributes of these versions are different. Time intervals that progress with the current time are open. Since valid times are usually determined by users, new tuples often have closed intervals that end before or after the current time NOW. Tables 4.3 and 4.4 show the valid time relation of tourist.

Table 4.3. The tourist valid time relation at time 0.

tuple  pid  city        Ts  Te
t1     p1   New York    0   3
t2     p2   Washington  0   NOW
t3     p3   New York    0   NOW

Table 4.4. The tourist valid time relation at time 3.

tuple  pid  city         Ts  Te
t1     p1   New York     0   3
t6     p1   Seattle      2   3
t2     p2   Washington   0   NOW
t3     p3   New York     0   NOW
t4     p1   Los Angeles  3   NOW
t5     p6   Seattle      3   6
t7     p5   Washington   4   6

At time 0, the tuples are inserted with their valid time ranges. Assume that in the period [2, 3] the city for p1 is changed from New York to Seattle, and that from time 3 it is changed again to Los Angeles. The relation in Table 4.4 represents these updates. Note also that the valid time relation in Table 4.4 can capture proactive insertions; for example, tuple t7, which has the valid time interval [4, 6], appears in the relation at time 3. Unlike a transaction time relation, a valid time relation supports retroactive and predictive updates. If an error is discovered in an older version of a record, it is modified with the correct value, the old value being substituted by the new one. Hence it is not possible to roll back to the past as in the transaction time database.
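The behavior of Table 4.4 can be mimicked with a small valid-time relation sketch. Representing NOW by an infinite sentinel is our implementation convenience, not the book's notation, and the class and method names are illustrative:

```python
NOW = float("inf")  # stand-in for the progressing end time NOW

class ValidTimeRelation:
    """Minimal valid-time relation: each tuple carries a user-supplied
    valid interval [ts, te]. Retroactive and predictive intervals are
    allowed, and versions of the same key may overlap in valid time."""

    def __init__(self):
        self.tuples = []                       # (pid, city, ts, te)

    def insert(self, pid, city, ts, te=NOW):
        self.tuples.append((pid, city, ts, te))

    def valid_at(self, t):
        """Point time-slice: versions whose valid interval contains t."""
        return [(pid, city) for pid, city, ts, te in self.tuples
                if ts <= t <= te]
```

Loading the tuples of Table 4.4 at time 3, the insertion of t7 with interval [4, 6] is a proactive insertion, and t6 with interval [2, 3] a retroactive one; a point time-slice at instant 2 then returns both the New York and the Seattle versions of p1, exactly the overlap the text permits.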
4.1.3 Bitemporal relations

In some applications, both the transaction time and the valid time must be modeled. This facilitates queries for records that are valid at some valid time point as of some transaction time point. A relation that supports both times is known as a bitemporal relation, which has exactly one system-supported valid time and exactly one system-supported transaction time. Table 4.5 illustrates the tourist bitemporal relation at time 0.

Table 4.5. The tourist bitemporal relation at time 0.

tuple  pid  city        Ts  Te   Tt
t1     p1   New York    0   3    0
t2     p2   Washington  0   NOW  0
t3     p3   New York    0   NOW  0

Table 4.6. The tourist bitemporal relation at time 5.

tuple  pid  city         Ts  Te   Tt
t1     p1   New York     0   3    0
t6     p1   Seattle      2   3    3
t2     p2   Washington   0   NOW  0
t3     p3   New York     0   NOW  0
t4     p1   Los Angeles  3   NOW  3
t5     p6   Seattle      3   6    3
t7     p5   Washington   4   6    3
t8     p5   Washington   5   8    5

From Table 4.6, note that tuples t7 and t8, with the same pid and city values, bear overlapping valid times [Ts, Te]. This is possible because the two tuple versions have different transaction time values; in a valid time relation, this situation cannot be represented. Like a valid time relation, the bitemporal relation supports retroactive and predictive versioning.
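A sketch of querying such a relation follows, assuming the append-only bitemporal tuples of Table 4.6 with the single-time-point representation of transaction time: rolling back to transaction time tt then amounts to ignoring tuples posted after tt, and the valid-time predicate is applied to what remains. The function name is our own:

```python
NOW = float("inf")  # open-ended valid time

# Bitemporal tuples of Table 4.6: (tuple_id, pid, city, Ts, Te, Tt)
tourist = [
    ("t1", "p1", "New York",    0, 3,   0),
    ("t6", "p1", "Seattle",     2, 3,   3),
    ("t2", "p2", "Washington",  0, NOW, 0),
    ("t3", "p3", "New York",    0, NOW, 0),
    ("t4", "p1", "Los Angeles", 3, NOW, 3),
    ("t5", "p6", "Seattle",     3, 6,   3),
    ("t7", "p5", "Washington",  4, 6,   3),
    ("t8", "p5", "Washington",  5, 8,   5),
]

def valid_as_of(rel, tv, tt):
    """Tuples valid at instant tv as of transaction time tt.

    Because the stored relation is append-only, its state as of tt is
    exactly the tuples posted no later than tt (r[5] <= tt); the valid
    time filter (r[3] <= tv <= r[4]) is then applied to that state."""
    return [r for r in rel if r[5] <= tt and r[3] <= tv <= r[4]]
```

For instance, as of transaction time 4, tuple t8 (posted at time 5) is invisible, so a query at valid instant 5 sees only t7 among the p5 versions.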
4.2 Temporal queries

Various types of queries for temporal databases have been discussed in the literature [Gunadhi and Segev, 1993, Salzberg, 1994, Shen et al., 1994]. Like any other application, temporal indexing structures must be able to support a common set of simple and frequently used queries efficiently. In this section, we describe a set of common temporal queries. These queries should be used to benchmark the efficiency of a temporal index. We use the tourist relation shown in Table 4.7 as a running example in the discussion that follows. We assume that the time granularity for this application is one day for both valid and transaction time. Consider the first tuple: the object with pid p1 is in New York from day 0 to day 2 inclusive. Its transaction time starts at day 1 and ends when there is an update to the tuple. A set of canonical queries was initially proposed by Salzberg [Salzberg, 1994]. We extend this set of queries by further classifying the temporal queries of each query type based on the search predicate: intersection, inclusion, containment or point. Such finer classification can provide insight into the effectiveness of the indexes for different kinds of search predicates. For queries that involve only one time and one key, the key can be either a time-invariant attribute or a time-varying attribute, and the time can be either valid time or transaction time. Single-time-dimensional queries are more meaningful for valid time databases; they can nevertheless be applied to transaction time, where the search remains the same although the semantics of time may be different. The following constitutes the common set of temporal queries:

1. Time-slice queries. Find all versions valid during the given time interval [Ts, Te]. For a valid time database, the answer is a list of tuples whose valid times fall within the query time interval.
For a transaction time database, the answers are snapshots during the query time interval, and hence the predicate "as of" is used for transaction time. Based on the search operation on the temporal index, time-slice queries can be further classified as:

• Intersection queries. Given a time interval [Ts, Te], retrieve all the versions whose time intervals intersect it. For example, a valid time query to find all tourists who are in the US during the interval [3, 7] would return 9 tuples: t2, t3, t4, t5, t6, t7, t10, t12 and t14.

• Inclusion queries. Given a time interval [Ts, Te], retrieve all the versions whose valid time intervals are included in it. For example, the query "Find all tourists who stay in a city between day 3 and day 7" would return 2 tuples: t5 and t10.
• Containment queries. Given a time interval [Ts, Te], retrieve all the versions whose valid time intervals contain it. For example, the query "Find all tourists who stay in a city from day 3 to day 5" would result in 5 tuples: t3, t4, t7, t10 and t14.

• Point queries. Given a specific time point t (instant), retrieve all the versions whose valid intervals contain the time point. Point queries can be viewed as the special case of intersection or containment queries where the time interval [Ts, Te] is reduced to a single time instant T. For example, the query "Find all tourists who are in the US on day 1" would result in 3 tuples: t1, t3 and t4.

2. Key-range time-slice queries. Find all tuples in a given key range [ks, ke] that are valid during the given time interval [Ts, Te]. This is a conjunction of keys and time. Like the time-slice query, the time-slice part of the query can assume any of the predicates described above. For example, the query to find all tourists who are in New York during the interval [3, 7] is a key-range time-slice query with an intersection predicate. The result of the query is now 2 tuples instead: t3 and t6. As another example, the query "Retrieve all tourists who are in cities with names beginning in the range [D, N] on day 1" would be a point key-range time-slice query that results in 3 tuples: t1, t3 and t4. The key-range time-slice query is an exact-match query if both ranges are reduced to single values; that is, find the versions of the record with key k at time t. An example of this category is "Find all tourists who visited New York on day 1", which results in the tuples t1 and t3.

3. Key queries. Find all the historical versions of the records in the given key range [ks, ke]. Such a query is a pure key-range query over the whole lifespan. For example, the query "Find all tourists who visited New York" is a past-versions query.
This query will return the tuples t1, t3, t6, t9 and t11.

4. Bitemporal time-slice queries. Find all versions that are valid during the given time interval [Ts, Te] as of a given transaction time Tt.

5. Bitemporal key-range time-slice queries. Find all versions in the given key range [ks, ke] that are valid during the given time interval [Ts, Te] as of a given transaction time Tt.

To answer time-slice queries, the index must be able to support retrieval based on time. Key-range time-slice queries require the search to be based on both keys and line segments. To support valid time, an index must support dynamic addition, deletion and update of data on the time dimension, and
support time that is beyond the current time. In other words, retroactive and proactive updates are required. An index that has been designed for valid time can be easily extended to transaction time, even though a transaction database can be thought of as an evolving collection of objects. The major differences are that delete operations are not required for transaction time databases, and that time increases dynamically at one end as it progresses. However, it is much more difficult to extend a transaction time index to index valid time data, since transaction time indexes are designed around the fact that transaction times do not overlap, and this property is quite often built into the index. Further, some transaction time indexes are specifically designed for intervals that are always appended at the current time, and do not support retroactive updates and proactive insertions.

Table 4.7. A tourist relation for running examples.

tuple  pid  city           period     trans_time
t1     p1   New York       [0, 2]     1
t2     p2   Washington     [5, now]   1
t3     p3   New York       [0, 6]     1
t4     p4   Detroit        [0, 7]     2
t5     p5   Washington     [4, 6]     2
t6     p5   New York       [7, now]   3
t7     p6   Seattle        [3, now]   3
t8     p4   Washington     [10, now]  3
t9     p3   New York       [12, now]  3
t10    p1   Los Angeles    [3, 6]     3
t11    p7   New York       [14, now]  4
t12    p1   Detroit        [7, 9]     4
t13    p1   Detroit        [10, 12]   5
t14    p9   Los Angeles    [3, 8]     6
t15    p1   San Francisco  [13, now]  6

4.3 Temporal indexes

Without considering the semantics of time, temporal data can be indexed as line segments based on start time, end time, or the whole interval, together with the time-varying or time-invariant attribute. Indexing structures based on start time or end time are straightforward and structurally similar to
existing indexes such as the B+-tree [Comer, 1979]. Such an index is not efficient for answering queries that involve a time slice, since no information on the data space is captured in the index: to search for the time intervals intersecting a given interval, a large portion of the leaf nodes has to be scanned. To alleviate this problem, temporal data can be duplicated in every data bucket whose data space of time intervals it intersects. However, duplication increases the storage cost and the height of the index, which affects the query cost. Alternatively, temporal data can be indexed directly as line segments, or mapped into point data and indexed using multi-dimensional indexes. As such, most temporal indexes proposed so far are mainly based on the conventional B+-tree and on spatial indexes like the R-tree [Guttman, 1984]. In this section, we review several promising indexes for temporal data: the Time-Split B-tree [Lomet and Salzberg, 1989, Lomet and Salzberg, 1990b, Lomet and Salzberg, 1993], the Time Index [Elmasri et al., 1990], the Append-Only tree [Gunadhi and Segev, 1993], the R-tree [Guttman, 1984], the Time-Polygon tree [Shen et al., 1994], the Interval B-tree [Ang and Tan, 1995], and the B+-tree with Linearized Order [Goh et al., 1996]. Where necessary, we also discuss the extensions that have to be incorporated for such indexes to facilitate retrieval by both the key and time dimensions.

4.3.1 B-tree based indexes

The Time-Split B-tree. The Time-Split B-Tree (TSB-tree) [Lomet and Salzberg, 1989, Lomet and Salzberg, 1990b, Lomet and Salzberg, 1993] is a variant of the Write-Once B-Tree (WOBT) [Easton, 1986]. The TSB-tree is one of the first temporal indexes to support search based on a key attribute and transaction time.
An internal node contains entries of the form <att-value, trans-time, Ptr>, where att-value is the time-invariant attribute value of a record, trans-time is the timestamp of the record, and Ptr is a pointer to a child node [Lomet and Salzberg, 1989]. Searching algorithms are affected by how a node is split and by the information it captures about its data space; therefore, we shall begin by looking at the splitting strategy. In the TSB-tree, two types of node splits are supported: key-value splits and time splits. A key split is similar to a node split in a conventional B+-tree, where a partition is made based on a key value. A TSB-tree after a key split is shown in Figure 4.1. For a time split, an appropriate time is selected to partition a node into two. Unlike a key split, all record entries that persist through the split time are replicated in the new node, which stores the entries with times greater than the split time. Figure 4.2 shows TSB-tree time splitting, in which the record <p1, Detroit, 4> is duplicated in both the historical node and the new node. If the number of different attribute values in a node is more than ⌊M/2⌋ (M is
the maximum number of entries in a node), a key split is performed; otherwise the node is split based on time. If no split time can be used other than the lowest time value among the index items, a key split is executed instead of a time split.

Figure 4.1. A key split of a leaf node in the TSB-tree based on p3.

To search based on key and time, the index keys and times of internal nodes are used, respectively, to guide the search. With data replication, data whose time intersects the data space defined in an index entry is properly contained in its subtree, and this enables fast pruning of the search space. The TSB-tree can only support transaction time, in the sense that the times of the same invariant key must be strictly increasing; in other words, there is no time overlap among the versions of a record. When a record is updated, the existing record becomes a historical record, and a new version of the record is inserted. The TSB-tree can answer all the basic queries on transaction time and time-invariant key. The major problem of the TSB-tree is that data replication can be severe, which may affect its storage requirements and query performance. As noted, the index cannot be used for valid time data.

The Time Index. Elmasri et al. [Elmasri et al., 1990] proposed the time index to provide access to temporal data valid in a given time interval. The technique duplicates the data at some selected time intervals and indexes it using a B+-tree-like structure. Duplications not only incur additional cost
in insertion and deletion, but also degrade the space utilization and query efficiency. In the worst case, where all intervals start at different instants but end at the same instant, the storage cost is of order O(n²). As for querying, reporting all intersections with a long interval likewise requires order O(n²) time, since most of the buckets need to be searched.

Figure 4.2. Time splitting in the TSB-tree: record <p9, Los Angeles, 6> is inserted, with T=5 chosen as the split time.

To reduce the number of duplications, an incremental scheme is adopted in which only the leading buckets keep all their ids, whereas the others maintain only starting or ending instants [Elmasri et al., 1990]. Figure 4.3 depicts the time index constructed using the most current snapshot of the tourist relation in Table 4.7. In the figure, the "+" and "-" signs indicate the starting and ending instants of an interval respectively. The number of duplications is reduced, but there are still many duplications for tuples with long intervals. To search from an instant onward, all the leading id buckets belonging to the same leaf node have to be read and checked. For instance, the query "Find all persons who were in the United States from day 4 to day 6" can be answered by locating indexing point 4 and reconstructing the list of valid tuples from the leading bucket and the subsequent entries up to indexing point 6. To insert or delete a long time interval, the number of leading id buckets to be read and updated can be high, on the order of O(n). The time index is thus likely to be efficient only for short query intervals and short time intervals.
For long data intervals, the amount of duplication can be significant.
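The incremental scheme lends itself to a short sketch. The following is a hypothetical in-memory model (the entries, tuple ids and times are invented for illustration, not taken from the tourist relation): a leading bucket stores the full set of ids valid at its indexing point, while subsequent entries store only the ids that start or end there, and reconstructing a snapshot means replaying increments from the nearest preceding leading bucket.

```python
# Hypothetical model of the time index's incremental scheme.  Each entry
# is (time_point, kind, payload): a "lead" entry carries the full set of
# tuple ids valid at that point; an "incr" entry carries (added, removed)
# sets relative to the previous indexing point.
entries = [
    (0, "lead", {"t1", "t3", "t4"}),
    (3, "incr", ({"t7", "t10"}, set())),
    (5, "incr", ({"t2"}, {"t1"})),
]

def valid_at(entries, t):
    """Reconstruct the set of tuple ids valid at time t."""
    current = set()
    for point, kind, payload in entries:
        if point > t:
            break
        if kind == "lead":
            current = set(payload)          # full snapshot: restart here
        else:
            added, removed = payload
            current = (current | added) - removed
    return current

# Tuples valid on day 4: the leading bucket at point 0 plus the
# increment at point 3.
print(sorted(valid_at(entries, 4)))
```

The cost noted in the text is visible here: inserting a long interval touches every leading bucket it spans, since each such bucket must list all ids valid at its point.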
Figure 4.3. The time index constructed from the tourist relation.

This will affect query efficiency, as the tree becomes taller and the number of leaf nodes increases. In addition, index support is provided for only a single notion of time (in this case, valid time), and it is not clear how the structure can be naturally extended to support temporal queries involving both transaction and valid time.

Elmasri et al. [Elmasri et al., 1990] also suggested that their time index can be appended to regular indexes to facilitate the processing of historical queries involving other, non-temporal search conditions. For example, if queries such as "Find all persons who entered the United States via LA and remained from day 4 to day 6" are expected on a regular basis, they may be supported by attaching a time index structure to each leaf entry of a B+-tree constructed for the attribute city. Answering the above query involves traversing the B+-tree to identify the leaf entry corresponding to the attribute value "LA", followed by an interval search on the time index found there. However, this approach may not be scalable, since the number of time indexes will grow exorbitantly large in any nontrivial database.

The Append-Only tree. The Append-Only tree (AP-tree), introduced by Gunadhi and Segev [Gunadhi and Segev, 1993], is a straightforward extension of the B+-tree for indexing append-only valid time data. In an AP-tree, the leaf nodes contain all the start times of a temporal relation. In a non-leaf node, the pointer associated with each time value points to a child node in which that time value is the smallest value (this rule does not apply to the first child of each index node). The AP-tree is illustrated in Figure 4.4.
Since both the update of an existing record and the insertion of a new version only cause incremental appends to the database, every insertion into the AP-tree
will always be performed directly at the rightmost leaf node. All the subtrees but the rightmost one of the AP-tree are 100% full. When the rightmost leaf node is full, the node is not split; instead, a new rightmost leaf node is created and attached to the most appropriate ancestor node. The AP-tree may therefore not be height-balanced. One such example is shown in Figure 4.5.

Figure 4.4. An AP-tree structure of order 3 (t1, t3 and t4 represent tuples with Ts = 0; t7 and t10 represent tuples with Ts = 3).

The AP-tree structure is simple and small, in the sense that it maintains no additional information about its data space. However, searching for a record can be fairly inefficient. To search for a record whose interval falls within a given time interval, as in a time-slice query, the end time of the search interval is used to locate the leaf node containing the record whose start time is just before the search end time; from that node, the leaf nodes to its left are scanned. To answer queries involving both key and time-slice, a two-level index tree called the nested ST-tree (NST) was proposed. The first level of an NST is a B+-tree that indexes key values, and the second level is an AP-tree that indexes the temporal data of records with the same key value. In the B+-tree, each leaf node entry has two pointers: one points to the current version of the record with this key, and the other points to the root node of the AP subtree. A query involving only a key value can directly access the most recent version of the record through the B+-tree. Figure 4.6 shows the structure of the NST. A similar index structure was also proposed to index a time-varying attribute and time. Since the temporal attribute is not unique, the qualified tuples will have overlapping associated time intervals.
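The time-slice search just described, descending on the query end time and then scanning leaf entries leftwards, can be sketched as follows. The sketch is hypothetical: a sorted Python list stands in for the leaf level, `bisect` stands in for the B+-tree descent, and the end times, which in the real structure come from the tuples themselves (the AP-tree indexes only start times), are kept in a side table.

```python
import bisect

# Hypothetical sketch of an AP-tree time-slice search.  A sorted list of
# (start_time, tuple_id) pairs stands in for the leaf level; the tuple
# ids and times are invented for illustration.
leaves = [(0, "t1"), (0, "t3"), (0, "t4"), (3, "t7"), (3, "t10"), (6, "t9")]
end_times = {"t1": 4, "t3": 6, "t4": 7, "t7": 9, "t10": 6, "t9": 20}

def time_slice(leaves, end_times, q_start, q_end):
    """Return ids of tuples whose interval intersects [q_start, q_end]."""
    # Position just past the last entry with start time <= q_end;
    # in the real structure the B+-tree descent finds this leaf.
    hi = bisect.bisect_right([s for s, _ in leaves], q_end)
    result = []
    # Scan leftwards from that position.  Every candidate starts no
    # later than q_end, so it intersects iff it ends at or after q_start.
    for start, tid in reversed(leaves[:hi]):
        if end_times[tid] >= q_start:
            result.append(tid)
    return result
```

The inefficiency noted in the text shows up in the leftward scan: a long-lived old tuple can qualify, so the scan cannot stop early and may visit many leaves.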
Figure 4.5. Appends in the AP-tree: (a) insertion of start time 12 into a full AP-tree; (b) insertion of start times 13 and 14.

The AP-tree supports only monotonic appends with increasing time values, so the variety of update operations it can handle is limited. The basic AP-tree itself can support queries involving only a time-slice, and even then the search is not especially efficient; a more expensive structure such as the NST has to be used to answer key-time queries. Clearly, for pure time-slice queries it is more efficient to use the AP-tree than the NST, whereas for key-range time-slice queries and past-version queries the NST is superior. We use the term AP-tree to refer to either of them; the context determines which structure is meant.

The Interval B-tree. The Interval B-tree [Ang and Tan, 1995], based on the interval tree [Edelsbrunner, 1983], was proposed for indexing valid time intervals. The underlying structure of the interval B-tree is a B+-tree constructed from the end points of the valid time intervals.

The interval B-tree consists of three structures: a primary structure, a secondary structure and a tertiary structure. The primary structure is a B+-tree which is
used to index the end points of the valid time intervals. Initially, it has one empty leaf node, and new intervals are inserted into this leaf node. When the node overflows, a parent node is created, and the middle value of the points, say m, is promoted into the newly created index node. The valid time intervals that fall to the left of m are placed in the left leaf bucket, and those falling to the right of it are placed in the right leaf bucket. Intervals spanning m are stored in a secondary structure attached to m in the index node.

Figure 4.6. A nested ST-tree structure.

Figure 4.7 shows the interval B-tree after inserting tuples t1, t2, t3 and t4 of Table 4.7, assuming a bucket capacity of 3. When t4 is inserted, the leaf bucket overflows, and 6, the middle value of {0, 0, 5, 6, 7, now}, is chosen as the item for the index node. The tuple t1 is stored in the left child of the new index node, while t2, t3 and t4 are in the secondary structure of index item 6. At this moment, the right leaf bucket is empty because no intervals fall to the right of 6.
Figure 4.7. An interval B-tree after inserting t1, t2, t3 and t4.

After the creation of the first index node, any further interval insertion proceeds from the root node of the primary structure. If an interval spans an index item, it is attached to the secondary structure of that item. A long valid time interval may span several index items; however, it should be attached to only one of them. The rule is as follows. The items in an index node are maintained as a binary search tree called the tertiary structure: the first item that entered the index node is the root of the binary search tree, and subsequent items with smaller (larger) values go into the left (right) subtree. In this binary search tree, the first item found to be spanned by the valid time interval is the one that holds it. Figure 4.8 shows the insertion of the remaining tuples of Table 4.7. After insertion, the root of the binary tree in the tertiary structure is 6. Suppose we have a tuple t16 with time interval [5, 15] to insert. Although the interval covers both 6 and 12 in the index node, the tuple is attached to 6, since 6 is encountered first in the binary tree of the tertiary structure. The efficiency of the index depends heavily on the distribution of the data and the values picked as index items: a poor choice of index values may cause most intervals to be stored in the secondary structures, resulting in a small B+-tree with large secondary structures.
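The attachment rule can be sketched as follows; this is a hypothetical illustration of the tertiary-structure walk (class and function names are ours), not the authors' implementation.

```python
# Hypothetical sketch of the interval B-tree rule for placing an interval
# in an index node.  The node's items form a binary search tree in
# insertion order (the tertiary structure); an interval is attached to
# the first item encountered that it spans.
class Item:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.secondary = []        # intervals attached to this item

def bst_insert(root, value):
    if root is None:
        return Item(value)
    if value < root.value:
        root.left = bst_insert(root.left, value)
    else:
        root.right = bst_insert(root.right, value)
    return root

def attach(root, start, end):
    """Attach [start, end] to the first spanning item found from the root."""
    node = root
    while node is not None:
        if start <= node.value <= end:      # interval spans this item
            node.secondary.append((start, end))
            return node.value
        # Not spanning: the interval lies entirely on one side.
        node = node.left if end < node.value else node.right
    return None                              # no item spanned

# Index items 6 and 12 entered in that order, so 6 is the BST root;
# the interval [5, 15] of the text's t16 spans both but is met by 6 first.
root = bst_insert(bst_insert(None, 6), 12)
attach(root, 5, 15)
```

Running the sketch on the text's example attaches [5, 15] to item 6, matching the behaviour described for t16.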
Figure 4.8. The interval B-tree after insertion of all tuples.

B+-tree with Linear Order. Temporal data can also be linearized so that the B+-tree structure can be employed without any modification. Goh et al. [Goh et al., 1996] adopted this approach, which involves three steps: mapping the temporal data into a two-dimensional space, linearizing the points, and building a B+-tree on the ordered points. In the first step, the temporal data is mapped into points in a triangular two-dimensional space: a time interval [Ts, Te] is transformed into a point (Ts, Te - Ts). Figure 4.9 illustrates this transformation for the tourist relation. The x-axis denotes the discrete time points in the interval [0, now], and the y-axis represents the time duration of a tuple. The points on the line named the time frontier represent tuples with ending time now; the time frontier moves dynamically with the progress of time. In the second step, the points in the two-dimensional space are mapped to a one-dimensional space by defining a linear order on them. Given two points, P1(x1, y1) and P2(x2, y2), the paper proposes three linear orders:

• D(iagonal)-order (<D). P1 <D P2 iff (a) (x1 + y1) < (x2 + y2); or (b) (x1 + y1) = (x2 + y2) and x1 < x2.

• V(ertical)-order (<V). P1 <V P2 iff (a) x2 + y2 = now and x1 < x2; or (b) x1 + y1 ≠ now and x2 + y2 ≠ now and x1 < x2; or (c) x1 + y1 ≠ now and x2 + y2 ≠ now and x1 = x2 and y1 < y2.

• H(orizontal)-order (<H). P1 <H P2 iff (a) x2 + y2 = now and y1 < y2; or (b) x1 + y1 ≠ now and x2 + y2 ≠ now and y1 < y2; or (c) x1 + y1 ≠ now and x2 + y2 ≠ now and y1 = y2 and x1 < x2.

Figure 4.9. Spatial representation of the tourist relation.

Figure 4.10. The three orderings for points in the two-dimensional space: (a) D-order; (b) V-order; (c) H-order.

Figure 4.10 provides a graphic representation of the three linear orders defined above. By linearizing the points using any of these orders, we can construct a B+-tree on the temporal data. For instance, if we order the points of the tourist relation using the D-order, the resulting B+-tree structure is depicted in Figure 4.11. A temporal query can be mapped to a spatial search on the two-dimensional space, which in turn can be translated into a range search on the linear space defined by the ordering relation. For example, consider the query "Find all persons who left the United States on or after day 5." This query can be handled efficiently by traversing the D-order B+-tree and retrieving all points in the interval [(0, 5), (14, 0)]. However, not all temporal queries can be handled efficiently using the D-order. For example, for the query "List all persons who entered the United States on or before day 5", the D-order performs poorly while the V-order is superior. The paper therefore suggests that different indexes, constructed using different ordering relations, be used to support the various types of queries.

Figure 4.11. Organizing the spatial representation of the tourist relation using a B+-tree and linearizing using the D-order.

The main advantage of this method is the ease with which the indexing scheme can be implemented using existing DBMSs. The performance analysis shows that it is more efficient than the time index in terms of both storage utilization and query efficiency. However, the index is more suitable for valid times, which are mostly closed intervals; for data with open intervals, expensive reorganization is necessary.

4.3.2 Spatial index based indexing methods

The R-tree. Unlike spatial applications, where non-spatial data are usually stored and indexed separately from spatial data, temporal attribute data such as a time-invariant key or time-varying key are indexed together with the temporal data. The time dimension can be viewed as one of the dimensions in a multi-dimensional space and indexed using existing methods [Rotem and Segev, 1987]. In this section, we discuss how the R-tree [Guttman, 1984] can be used to index temporal data. The R-tree is a multi-dimensional generalization of the B-tree that preserves the height-balance property; a detailed description of the R-tree can be found in Chapter 2.

For temporal applications, to index temporal data and its key, the R-tree can be implemented as a two-dimensional R-tree (2-D R-tree) or a three-dimensional R-tree (3-D R-tree). To use a 2-D R-tree, time intervals [Ts, Te] are treated as line segments in a two-dimensional space, with keys on the other dimension. To index temporal data using a 3-D R-tree, the time intervals and keys are mapped into points (key, Ts, Te) in a three-dimensional space. Figure 4.12 shows examples of data partitioning for the tourist relation (see Table 4.7). Both implementations can handle the pure time, key-time and pure key queries of the query set. For the 2-D R-tree, all searches are performed as intersection searches. For the 3-D R-tree, search intervals must be mapped into search regions in the triangular space. Figure 4.13 shows the query regions on the time dimension for the four search operations. As an example, consider the intersection search with query time interval [QTs, QTe]. For an interval in the database to intersect the query interval, it must not end before QTs nor start after QTe; thus no record with end time less than QTs, and no record with start time after QTe, needs to be examined. This gives the query region indicated by the shaded portion of the figure. It is important to note that the R-tree cannot directly handle intervals with open end-time. An entry in an internal node of the R-tree contains an MBR that describes the data space of its child node.
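The four time-dimension searches of Figure 4.13 reduce to simple predicates on a stored interval [ts, te]; the shaded query regions are just the point sets satisfying each predicate. A minimal sketch (the function names are ours, not from the original):

```python
# Predicates behind the four query regions of Figure 4.13, for a data
# interval [ts, te] and a query interval [qts, qte] or instant qt.
def intersects(ts, te, qts, qte):
    # Some instant is shared: the interval neither ends before the
    # query start nor starts after the query end.
    return ts <= qte and te >= qts

def included(ts, te, qts, qte):
    # The data interval lies wholly inside the query interval.
    return qts <= ts and te <= qte

def contains(ts, te, qts, qte):
    # The data interval wholly covers the query interval.
    return ts <= qts and qte <= te

def point(ts, te, qt):
    # The data interval covers the query instant.
    return ts <= qt <= te
```

Note that intersection cannot be decided from the end points alone being inside the query interval: an interval that strictly contains the query has neither end point inside it, which is why the predicate is phrased as two bound checks.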
When data intervals are not closed, the MBR cannot be defined properly, and this affects the splitting algorithm, which uses space coverage to distribute the data into two groups. It is possible to use the current time, or the largest time resulting from proactive insertion, as an estimate during node splitting and data insertion.

One characteristic of temporal databases is that historical data is stored for a long time and no deletion of past data is allowed. The size of the database grows as time progresses, and so do its indexes. Kolovson and Stonebraker proposed variants of the R-tree to index historical data [Kolovson, 1993, Kolovson and Stonebraker, 1991]. The R-tree is used to index time intervals on one dimension and a non-temporal attribute on the other. Three variants that store some of the nodes on optical disk were proposed. The first variant (MD-RT) maintains the whole R-tree-based index structure on magnetic disk; there is no migration from magnetic disk to optical disk.

Figure 4.12. Space partitioning in the R-tree: (a) tuples represented as lines in two-dimensional space; (b) tuples represented as points in three-dimensional space.

Figure 4.13. Query regions for the R-tree on the time dimension: (a) intersection search; (b) inclusion search; (c) containment search; (d) point search.

The second variant (MD/OD-RT-1) keeps the R-tree and its root node on magnetic disk, and moves the left-most leaf nodes to optical disk when the size of the R-tree index reaches a pre-defined size. All internal nodes, except the root node, whose child nodes are entirely on optical disk are recursively vacuumed to the optical disk.

The third variant (MD/OD-RT-2) maintains two R-trees, both rooted on magnetic disk. The first resides entirely on magnetic disk, whereas the second stores its root node on magnetic disk and its lower-level nodes on optical disk. When the size of the first R-tree reaches the expected size, all the nodes below its root node are moved to optical disk. Meanwhile, the references of the first R-tree's root node are inserted into the proper position of
the second R-tree. New records are inserted into the first R-tree, while search operations are performed on both R-trees.

The data is stored in the leaf nodes, and nodes overlap in their data space for long intervals. When the interval data have a non-uniform length distribution, the overlap between bounding rectangles can be quite severe because of a few long intervals. To handle this shortcoming, the Segment R-tree (SR-tree) [Kolovson and Stonebraker, 1991, Kolovson, 1993] was proposed. The SR-tree stores interval records in both non-leaf and leaf nodes. An interval I is stored in the highest-level node N of the tree such that I spans at least one of the intervals represented by N's child nodes. If an interval spans the region covered by a node but extends beyond the boundary of its parent node, it is cut into a spanning portion and one or more remnant portions, which are stored in separate parts of the index structure. Figure 4.14 shows a case in point, in which line segment P spans C and extends beyond A's boundary.

Figure 4.14. An SR-tree with spanning portion and remnant portion.

An improved version of the SR-tree, called the Skeleton SR-tree, was proposed to pre-partition the entire domain of the interval data into several sub-regions, based on an estimate of the number of data records and an approximation of the distribution of intervals. The overlap between the data spaces of leaf nodes is thereby reduced. Such an estimate may be easy to derive for applications (for example, video rental) that have little variation in version lifespan; for applications with a wide variance of interval lifespans, the pre-partitioning is not effective.

The Time-Polygon index. The Time-Polygon index (TP-index) was proposed to index valid time databases [Shen et al., 1994]. Like the B+-tree with linear order, the TP-index maps a time interval [Ts, Te] into a point [Ts, Te - Ts] in a triangular two-dimensional space. However, the triangular temporal space is partitioned into groups such that each group is a cluster of data points suited to a certain search pattern. Partitioning along the x- and y-dimensions, and parallel to the time frontier, produces the five polygonal shapes shown in Figure 4.15. The polygons used in the TP-index are not minimum bounding polygons; they are derived through recursive partitioning, and can easily be merged when the tree collapses. The structure of the TP-index is like that of an R-tree. Figure 4.16 shows the partition of the temporal space and the TP-tree structure for the tourist relation. To support proactive additions of records (for example, Tn in Figure 4.16(a)), a virtual time frontier that assumes the largest Te (Tmax) has to be introduced, and partitions adjacent to the time frontier have to be extended outward.

Figure 4.15. The five polygon shapes in the TP-tree (A-shape, B-shape, C-shape, D-shape, E-shape).

The TP-index was designed solely to index valid time and handle time-slice queries. To enable the TP-index to support a time-invariant key, it was extended to index data in a three-dimensional space [Jiang et al., 1996]. In this data space, the x-axis and y-axis keep the same definitions as before, while the z-axis denotes the key values of the data points (see Figure 4.17).
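The interval-to-point mapping shared by the linear-order B+-tree and the TP-index is easy to sketch; under it, the intersection condition becomes a pair of half-plane constraints in the triangular space. A hypothetical sketch (the numbers are invented for illustration):

```python
def to_point(ts, te):
    """Map interval [Ts, Te] to the point (x, y) = (Ts, Te - Ts)."""
    return ts, te - ts

def intersects_region(x, y, qts, qte):
    """True if the mapped interval intersects the query interval.

    In the (x, y) space, x is the start time and x + y the end time,
    so the intersection query region is x <= QTe and x + y >= QTs.
    """
    return x <= qte and x + y >= qts

# Interval [3, 8] becomes the point (3, 5); it intersects query [4, 6].
x, y = to_point(3, 8)
print(intersects_region(x, y, 4, 6))   # True
```

Because x + y is the end time, open intervals (end time now) sit exactly on the time frontier, which is why both structures must treat that line specially as time advances.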
Initially, data points are bounded in the three-dimensional temporal space. When overflow occurs, the data points are partitioned into groups such that each group can be stored in one data page, and the partitions must cluster the data points in a way suited to temporal search patterns. There are three partitions for the TP-tree: the y-partition introduces a plane parallel to the x-z plane (called the y-plane); the time-partition introduces a plane parallel to the time frontier (called the time-plane); and the key-partition introduces a plane parallel to the x-y plane (called the key-plane).

Figure 4.16. A TP-tree for the tourist relation: (a) partitioning of the temporal space; (b) the TP-tree structure.

Figure 4.17. A three-dimensional spatial rendition of the TP-tree.

The y-partition and time-partition for the different bounding polygons are similar to those described in [Shen et al., 1994]. Note that after a key-partition, the shapes of the resulting bounding polygons are the same as before the partitioning. Searching based on time is similar to that proposed in [Shen et al., 1994], where the search time intervals must be mapped into appropriate query regions. The query regions for the various search operations on the time dimension are shown in Figure 4.18. For example, consider the query interval [QTs, QTe] for an inclusion search. Since all matching intervals must start no earlier than QTs, intervals that start before QTs are excluded; similarly, since the query interval ends at QTe, all intervals that end after QTe are excluded. The resulting query region is the shaded region shown in the figure.

4.3.3 Methods for bi-temporal databases

Until recently, most research on temporal indexing has addressed the indexing problem along only one of the two time dimensions. Kumar, Tsotras and Faloutsos [Kumar et al., 1995] proposed two access methods, the Bitemporal
  • 148. 140 INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS y (a) Intersection search QTs QTs QTe TmaxX Y QTe - QTs QTs QTe Tmax X (b) Inclusion search y QTe QTs QTe Tmax X y QTs QTs Tmaxx (c) Containment search (d) Point search Figure 4.18. Query regions for the TP-tree. Interval Tree and Dual R-trees, for indexing both transaction and valid time dimensions. The Bitemporal Interval Tree makes use ofInterval Tree [Edelsbrunner, 1983] to index a finite set U that contains V valid time points. An interval tree consists of a full binary tree and a number of doublely-linked lists. The V time points are in the leaf, and each internal node contains the middle value of its two immediate children. If the starting point of an interval falls in the left subtree of an internal node and the ending point falls in the right subtree, the interval is stored in the doublely-linked lists associated to this internal node. The left and right lists contain the starting and ending points respectively. In the Bitemporal Interval Tree, the lists are transformed into "conceptual" lists of pages to facilitate the splitting policies of the MVBT [Becker et aI.,
  • 149. TEMPORAL DATABASES 141 1993] so as to answer bitemporal pure-time-slice (BPT) query. By elaborately pagenating the whole indexing structure, the index can answer BPT query in o(10gb V +10gb n +a) I/O operations. The authors also proposed a method that employs two R-trees (2-R) to divide bitemporal records on transaction time. This method aims to eliminate the large overlapping of the mix of rectangles with known ending transaction time and those extending to now. A front R-tree indexes the records whose transaction time is up to now, whereas a back R-tree indexes the records whose transaction time lifespan is closed. 7 6 5 4 3 2 I ,--- t2 I tl 13 12345678 T transaction time (a) Original representation of the time dimensions T'--''--''--1'--1---'---'--1.--1. _ " Vg .", ~ > 7 6 5 4 3 2 t3 I 12345678 "g :=! ..> 7 6 5 4 3 2 T I transaction time 12345678 12 transaction time (b) lhe back R-tree (c) the fronl R-tree Figure 4.19. The two R-tree method. In Figure 4.19(a), there are three records in the bitemporal space. Records tl and t2 have open transaction time lifespan, and the transaction time of t3 is closed at time 3. Note that the three records overlap along the transaction time axis. To avoid this kind of overlapping so as to improve the performance of the R-tree, the dual R-tree method keeps records with closed transaction time range, that is t3, in the back R-tree (Figure 4.19(b)) and records with open transaction time range, that is il and t2, in the front R-tree (Figure 4.19(c)).
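The 2-R partitioning and its two-part search can be sketched as follows. This is a hypothetical stand-in: flat lists take the place of the two R-trees, and the record layout is invented, since the point here is the routing of records and queries, not the tree mechanics.

```python
# Hypothetical sketch of the two-R-tree (2-R) organisation.
# Each record: (id, tx_start, tx_end_or_None, v_start, v_end), where a
# tx_end of None means the transaction-time lifespan is still open.
back = []    # closed transaction-time lifespan: rectangles
front = []   # open transaction-time lifespan (extends to "now")

def insert(rec):
    (back if rec[2] is not None else front).append(rec)

def bitemporal_query(tx, vt, now):
    """Ids of records current at transaction time tx and valid time vt."""
    hits = [r for r in back
            if r[1] <= tx <= r[2] and r[3] <= vt <= r[4]]
    # Front-tree records extend to "now" along transaction time, so
    # only the start bound and the valid-time interval are checked.
    hits += [r for r in front
             if r[1] <= tx <= now and r[3] <= vt <= r[4]]
    return [r[0] for r in hits]

insert(("t1", 1, None, 2, 5))   # still current: routed to the front tree
insert(("t3", 2, 3, 1, 7))      # closed at transaction time 3: back tree
```

The search over the front list shows why the front R-tree needs the slightly more expensive algorithm mentioned below: its records are intervals with one transaction-time bound supplied by the query itself.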
In the front R-tree, a bitemporal record can be represented as an interval line parallel to the valid time axis; as a result, the overlap is reduced. A bitemporal query is answered by two searches, one for rectangles in the back R-tree and the other for intervals in the front R-tree. The front R-tree needs a slightly more expensive search algorithm because of the open intervals.

While it is difficult to extend index structures such as the AP-tree and TSB-tree to bitemporal indexing, the R-tree and the TP-tree can be extended with additional dimensions. For example, a 5-D R-tree or TP-tree could be used to index a time-invariant key, transaction time intervals and valid time intervals. However, such an extension entails redesigning more complex node splitting and query retrieval algorithms, and as the number of dimensions increases, spatial indexes may not perform as well.

4.4 Experimental study

Indexes are data structures that quickly identify the locations at which indexed data items are stored; they are therefore used to speed up query evaluation algorithms. The properties desired of an index include efficient storage utilization and efficient query retrieval: the use of disk space should be economical, which indirectly determines the query efficiency of the index, and the index must be able to answer the basic queries efficiently. In addition, index construction and update costs should not be too high, although they are often treated as less important selection factors.

Various performance studies have been conducted. The TP-index was shown to be superior to the Time Index for valid time databases [Shen et al., 1994]. This result is expected, as replication in the Time Index can be severe and results in a much bigger tree. The Interval B-tree was shown to be more efficient than the Time Index and the R-tree [Ang and Tan, 1995].
It is argued that the query efficiency of the interval tree is of the order O(log n + F), where F is the time for reporting intersections.

4.4.1 Implementation of index and buffer management

Four indexes, the TSB-tree, AP-tree, 2-D R-tree and TP-tree, were implemented in C on a SUN SPARC workstation. In this section, we restrict ourselves to the study of indexes built on a time-invariant key and transaction time. For a large collection of temporal data (such as one million versions), the index size can become fairly large, and it is unlikely that the entire index fits in memory. Instead, some index pages will be paged out as the tree is traversed and have to be re-fetched later when they are re-referenced. To reduce page re-fetching, a priority-based buffer replacement strategy [Chan et al., 1992] is used. The strategy employs the least useful (LUF) policy and is designed around the way an index is traversed. For a fair comparison, the replacement algorithm was extended for the two-level NST index structure. Under the strategy, priorities are assigned to index pages: an index page is useful if it will be referenced again in a traversal of the index structure, and useless in the current traversal otherwise. Useful pages have higher priorities than useless pages. As the main concern of the work is to minimize the effect of page re-fetching on the performance comparison, the buffer size was fixed at 32 pages, which is sufficient for traversing trees of up to 5 levels.

4.4.2 Data and query sets

The data sets employed in the study were generated using an extended version of the Time-Integrated Testbed of the Department of Computer Science, University of Arizona. The temporal relations were generated using Poisson distributions with different mean values for arrival time (the start time of an interval) and version lifespan. Each database contains 1,000,000 versions. The time-invariant attribute is uniformly distributed over [1, 10000], and the number of versions per key is randomly determined. For each version, the time-varying attribute value is uniformly distributed over [1, 100000]. For each set of mean arrival and duration times, the data is generated with a constraint that simulates transaction time: the data is generated in one go and pre-sorted on start time, and each tuple is then inserted into the index. By doing so, we did not have to modify the existing R-tree splitting algorithm. This is not ideal, as the latest versions of transaction time data give rise to open rather than closed intervals; however, apart from the R-tree, the presence or absence of open intervals does not affect the other three indexes.
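The generation procedure can be sketched roughly as follows. This is a hypothetical stand-in for the extended Time-Integrated Testbed, whose exact parameters are not given here: a Poisson arrival process is simulated via its exponential inter-arrival times, lifespans are drawn exponentially with the stated mean, and the attribute domains follow the text.

```python
import random

# Hypothetical sketch of version generation: Poisson-process arrivals
# (exponential inter-arrival times) and exponentially distributed
# lifespans; all parameter names and the seed are ours.
def generate_versions(n, mean_arrival, mean_duration, seed=42):
    random.seed(seed)
    versions, clock = [], 0.0
    for _ in range(n):
        clock += random.expovariate(1.0 / mean_arrival)   # next start time
        start = clock
        end = start + random.expovariate(1.0 / mean_duration)
        key = random.randint(1, 10000)          # time-invariant attribute
        attr = random.randint(1, 100000)        # time-varying attribute
        versions.append((key, attr, start, end))
    # Transaction-time behaviour requires insertion in start-time order,
    # which the monotonically advancing clock already guarantees.
    return versions

data = generate_versions(1000, mean_arrival=5, mean_duration=200)
```

With mean inter-arrival 5 and mean duration 200, each interval overlaps many of its neighbours on average, which is the clustering effect the experiments below attribute to short inter-arrival times.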
Among the basic queries, we shall look at just two: time-slice intersection queries and key-range time-slice intersection queries. Being more general, an intersection query is expected to yield more results than the inclusion, containment and point queries. Each set of queries contains 100 queries with different keys and time ranges. The keys are randomly picked from the key domain (that is, [1, 10000]). Where there is a key-range search, a predetermined fixed range is used to determine the end of the range. The starting time of each time range is generated using the Poisson distribution, together with a fixed range length. Should the ending time exceed the current time, the ending time is set to the current time.
4.4.3 Some experimental results on indexing invariant keys and transaction time We report on some experimental results on the performance of the indexes built on the time-invariant key and transaction time. For time-slice intersection queries, the mean inter-arrival time is fixed at λ = 5, and the mean duration times are fixed at μ = 200, 500, 1000. For key-range time-slice intersection queries, the key range is fixed at 1000 (10% of the domain) when the effect of time range is studied, and the time range is fixed at 15000 when the effect of key range is studied. On time-slice intersection query. Figures 4.20a and b show the performance of the TSB-tree, AP-tree, TP-tree and (2-D) R-tree for time-slice intersection search queries under mean inter-arrival times of 2 and 5, and a fixed mean duration time of 200. Figure 4.21 shows the effect of longer lifespan on the four indexes. The performance of all four indexes is affected by the search time range used in the query: the longer the search range, the worse the performance. Comparison of the results summarized in Figures 4.20 and 4.21 reveals that while the mean duration time has little effect on a few indexes, the inter-arrival time has a significant effect on the performance of most indexes. Longer mean inter-arrival time means less overlap in time intervals. For indexes such as the TSB-tree and TP-tree, shorter inter-arrival times mean that the time intervals of different keys are clustered closely, so the same search range intersects more intervals and hence more pages are accessed. The performance of the 2-D R-tree and the AP-tree is affected by the duration of time intervals. For the R-tree, which indexes time intervals as line segments, the degraded performance is due to the fact that the minimum bounding rectangles (MBRs) in the internal nodes overlap more for longer line segments. For the AP-tree, the opposite effect is observed.
Two factors contribute to this. First, recall that the data set is non-overlapping for each key value. Second, a longer duration essentially "stretches" the lifespan of the relation. As a result, the number of nodes to be scanned by the AP-tree is smaller for longer durations for the same query range. It is clear that the TSB-tree performs the best. This can be attributed to the fact that the TSB-tree has a high degree of data clustering in both the key and time dimensions. On the contrary, the AP-tree is inferior to all the other techniques; its page accesses exceed 2500 pages! This is because, to search for the intervals intersecting the query interval [Ts, Te] in the AP-tree, a leaf node is first determined using Te. All leaf nodes on its right, which contain intervals whose start time is larger than Te, are ignored. Leaf nodes on its left must be searched.
Figure 4.20. Effect of arrival rate on time-slice intersection query: pages accessed versus query time interval for the TSB-tree, R-tree, TP-tree and AP-tree. (a) (λ, μ) = (2, 200); (b) (λ, μ) = (5, 200).
Figure 4.21. Effect of longer lifespan on time-slice intersection query, (λ, μ) = (5, 500). On key-range time-slice intersection query. The results for the key-range time-slice intersection queries are very similar to those for the time-slice queries. Here, we present the results when (λ, μ) has the value (5, 200). In order to see the effect of key range, the query time range is kept constant; similarly, to see the effect of the query time range, the key range is fixed. Figure 4.22a shows the result when the key range is fixed at 1000, while Figure 4.22b looks at the effect of varying the key range when the time range is fixed at 15000 time units. As in the time-slice query results, it can be observed that the AP-tree is more expensive than the others due to its two-level structure. With such a structure, each AP-tree in the second level of the nested structure is small, and many such small trees must be searched. It can also be seen that the AP-tree is more sensitive to the key range than to the time range (see Figure 4.22b). This is logical, since the first level of the nested structure is the B+-tree for keys, and the key range determines the number of AP-trees in the second level that need to be searched. As the key range increases, the performance deteriorates, whereas for a fixed time range, the average number of leaf nodes that need to be searched does not differ greatly. The TSB-tree retains its good performance in the key-range time-slice query because of its high degree of data clustering in both key and time dimensions.
Figure 4.22. Performance of intersection search in key-range time-slice query, (λ, μ) = (5, 200): pages accessed for the TSB-tree, AP-tree, R-tree and TP-tree. (a) Key range = 1000, varying time interval; (b) Time range = 15000, varying key range.
To answer past versions queries efficiently, it is important to cluster data by the time-invariant key in an indexing structure. By linking all the past versions of a given key together, the best performance for this query can be expected. However, although the TSB-tree, AP-tree, R-trees and TP-tree do have some data clustering by key, none of them provides an explicit method to link the historical versions of a given key. Hence, a search based on the key is required. Among these four indexes, the AP-tree is likely to be the most efficient for the past versions query: for each key that satisfies the search condition, the whole second-level AP-tree is retrieved for all the versions. 4.5 Summary In this chapter, we have surveyed a number of promising temporal indexes. Many of these indexes were proposed either for valid time or transaction time databases. Researchers have only recently started to work on indexing in bitemporal databases. For transaction time databases, the TSB-tree approach is very efficient, as it manages to keep the volume of I/O accesses low and uses tight bounding intervals to support fast search. However, it cannot handle disjoint intervals (or overlapping intervals) that may be present in valid time databases. Direct application of B-trees, such as the AP-tree indexing on a single time point (starting or ending), is efficient in terms of storage space but is not efficient for any search that involves intervals. Its inefficiency is due to the fact that no information on the actual data space in the child nodes is captured for pruning the search space. Hence, a simple time-slice search requires scanning a large proportion of the leaf nodes. Spatial indexes such as the R-tree can be used for indexing both transaction times and valid times.
To index open intervals that move with the current time NOW, splitting algorithms that split nodes based on the area of the data space must be re-designed to handle the situation where one side of the MBR moves with time. The R-tree can be used to index temporal data as line segments or points. As indicated by the experiments, the performance of the R-tree indexing lines is not as good as that of the TP-tree. However, should the lines be mapped into points, its efficiency should become comparable to that of the TP-tree. As in other applications, data distribution affects the performance of temporal indexes. For bitemporal databases, different distributions may exist for the time-invariant keys, the time-varying keys, the number of versions per key, the arrival of new time-invariant keys and, for each key, the arrival of the next transaction-time and valid-time versions, as well as for the relationship between the two times, such as whether they are strongly bound [Jensen and Snodgrass, 1994]. Generally, the distribution of the time-invariant key is likely to be dependent on the application, where keys can be mapped into some sequential order. Likewise, the distribution of the time-varying key is fairly application-dependent; some may be in increasing order (for example, salary) while others are likely to be more random. The arrival of new keys and the arrival of new versions tend to follow a Poisson distribution.
5 TEXT DATABASES Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate document management. Perhaps the key feature distinguishing text databases from other kinds of database is the way in which they are accessed. Queries to conventional databases are exact logical expressions used to satisfy information needs such as "how many accounts have a negative balance" or "which students are enrolled in computer science". In contrast, queries to text databases are used to satisfy inexact information needs such as "what is the economic impact of recycling" or "what factors led to George Bush's loss in the 1992 presidential election". This inexactness is not because users are unable to express their needs precisely; it is because the needs deal with imprecise real-world concepts that cannot be described in a formal system. That is, it is usually not possible to translate such information needs into a logical query expression that will fetch only the documents that are answers; an information need and its answers are not mathematically related. E. Bertino et al., Indexing Techniques for Advanced Database Systems © Kluwer Academic Publishers 1997
Thus there is no exact mechanism for determining whether a document is an answer; instead, queries to text databases are used to identify documents that are likely to be pertinent to the query, that is, likely to be relevant. These documents may even contradict each other; commentators may disagree as to why Bush lost the election, for example. Thus document databases must be designed to answer informal queries and produce the most likely answers. The study of techniques for identifying documents that are relevant to an information need is known as information retrieval. Since answers have only a loose, informal correspondence to queries, it follows that the performance of query evaluation techniques is not just a consequence of how fast they are or how economical they are with system resources. It is also necessary to consider how good they are at identifying relevant documents, that is, their effectiveness. The effectiveness of query evaluation techniques can be formally measured by the proportion of retrieved documents that are relevant and by the proportion of the relevant documents that are retrieved; determination of relevance must be made by a human assessor. (It follows that experiments in information retrieval are expensive, and tend to rely on standard document collections and query sets for which relevance judgments have been made.) Text databases can also be used for more traditional forms of access to data. For example, in a database of newspaper articles, each document will include the article's text, but will also include information such as authorship, date of creation, and so on. A possible entry in a database of correspondence is shown in Figure 5.1. Fields such as date could be queried in conventional ways and do not require exotic query evaluation methods. It is the use of informal querying that makes information retrieval systems different from other kinds of database.
In this chapter we describe the ways in which text databases might be accessed, kinds of queries, index structures to support these queries, and query evaluation techniques. 5.1 Querying text databases Simple text engines are familiar to anyone who uses the document repositories available via the Web. These engines can be used to find information about, say, some individual (to find their home page, perhaps) or to search for research papers on a given topic. Typical queries are a list of keywords that the user guesses will identify the desired information; the system responds with a list of hits, some of which are relevant and some of which are (in the context of the query) obviously junk. Based on information retrieval theory, the better systems use query evaluation techniques that return relatively few irrelevant documents.
From: Albert Einstein Sender address: Old Grove Rd, Nassau Point, Peconic, Long Island To: F.D. Roosevelt, President of the United States Recipient address: White House, Washington D.C. Date: 2nd August 1939 Sir: Some recent work by E. Fermi and L. Szilard, which has been communicated to me in manuscript, leads me to expect that the element uranium may be turned into a new and important source of energy in the immediate future. Certain aspects of the situation seem to call for watchfulness and, if necessary, quick action on the part of the administration. I believe, therefore, that it is my duty to bring to your attention the following facts and recommendations. In the course of the last four months it has been made probable-through the work of Joliot in France as well as Fermi and Szilard in America-that it may become possible to set up nuclear chain reactions in a large mass of uranium, by which vast amounts of power and large quantities of new radium-like elements would be generated. Now it appears almost certain that this could be achieved in the immediate future. This new phenomenon would also lead to the construction of bombs, and it is conceivable-though much less certain-that extremely powerful bombs of a new type may thus be constructed ... Figure 5.1. Example entry in a correspondence database. At the most abstract level, text databases are like conventional databases: given a query, each entry in the database is compared to the query to determine whether it is an answer. To allow this process to be efficient, a data structure known as an index is used. Central to effective information retrieval is the ability to use all the terms (that is, words) in a document to compare it to a query. That is, it is necessary to index every term in every document.
It is possible to automatically select a subset of the words in a document to represent its content and to index these words only, or to manually assign descriptive words or subject categories. However, automatic selection of keywords is in general not successful; and, perhaps surprisingly, automatic indexing of all words gives more effective retrieval than does manual indexing [Salton, 1989]. Moreover, the cost of manual indexing for a realistically-sized database
is prohibitive. Thus searches on document databases use content (the full text of each document) rather than descriptors of some kind. 5.1.1 Boolean queries There are two principal approaches to querying text databases: Boolean and ranked. Boolean query languages were for many years the choice for commercial information retrieval systems. The basic concept is straightforward: queries are Boolean expressions in which the atoms are words, combined with Boolean operators. For example the query uranium AND ( (nuclear AND energy) OR (atomic AND bomb) ) could be used to retrieve the example document in Figure 5.1. Such queries are effectively equivalent to conventional database queries (and, as we discuss below, are evaluated in a similar way) but it is not easy for a typical user to translate an information need into a Boolean query. Making good use of Boolean information retrieval systems requires professional information providers who are experts at interpreting user requests and translating them into formal queries. There are several ways in which Boolean query languages for text retrieval can be extended to give the potential for better effectiveness. One extension of particular value to English text is stemming or suffixing. In its simplest form, suffixing allows partial match on strings, so that for example the query term bomb* would match any word starting with the string bomb. This allows users to match variant forms of the same word, such as bomb, bombs, bombing, bombardier, and so on. Alternatively, automatic stemmers can be used; these are algorithms that recognize the standard suffixes used in English (such as -ed, -es, -ation, and -ness) and remove them prior to indexing [Harman, 1991, Lovins, 1968, Porter, 1980]. Stemming is a form of word normalization; another, basic form is case conversion. Another language extension is to allow querying on word proximity, and in particular adjacency.
In the query above, there was no requirement that nuclear and energy be near each other in the text. If it is specified in the query that they must be proximate or adjacent, then it is more likely that retrieved documents will contain these words as a phrase. The Boolean query languages used in commercial text databases, such as the ISO standard 8777 or Common Command Language, allow the user to require that two words be located within any fixed number of word positions of each other. Well-designed interfaces can also help to improve effectiveness, for example by providing access to an online thesaurus that can be used to expand
the query. Such extensions, however, have no impact on the underlying query evaluation mechanism. 5.1.2 Ranked queries The other principal approach to text retrieval is ranking, in which a query is an expression in natural language or a list of keywords; each document is compared to the query and assigned a numerical similarity; and the documents with the highest similarity values are retrieved for presentation to the user. In contrast to Boolean queries, there is no precise delineation between answers and non-answers; potentially every document in the database has a non-zero similarity, but only the first few documents presented for viewing (or, in the case of information filtering [Belkin and Croft, 1992], those above a chosen threshold) are seen by the user. There is a probabilistic assumption that the highest-ranked documents are those most likely to be relevant; thus as the user moves through the list of ranked documents the density of relevant documents should diminish. In many contexts ranked queries are simply lists of keywords, but in others they may be substantial blocks of text. For example, the abstract of a paper, or even a whole paper, could be used as a query to find other papers with a similar topic; experiments with ranking have shown that longer queries are better at identifying relevant documents. Thus a typical query might be a list of keywords such as nuclear atomic energy power or a natural language description such as Relevant documents will discuss the use of nuclear or atomic energy as a power source. The functions used to score documents with respect to queries are known as similarity measures. Many years of information retrieval experiments, with both small document collections and databases of gigabytes of text, have identified several families of effective similarity measures.
(These experiments have also shown that ranking is typically more effective than Boolean retrieval, even for queries formulated by an expert.) We do not survey similarity measures in this chapter, but instead focus illustratively on one: the cosine measure. This measure is one of the most effective and has proven successful across a wide range of databases, and it is interesting because it makes use of at least as much index information as other effective similarity measures. Discussion of the cosine measure thus allows us to explain what information an index must store. Intuitively, we would like a document and query to be regarded as similar if: most of the query terms occur in the document; they are frequent in the
document; the density of these words in the document is high; and some allowance is made for the "importance" of words, where one would usually regard a word such as uranium as more discriminating (and therefore more important) than a word such as the. Mathematically these concepts can be captured as follows. The cosine similarity of a document d and query q can be computed as

C(q, d) = ( Σ_{t ∈ q ∧ d} w_{q,t} · w_{d,t} ) / (W_q · W_d)

where w_{x,t} is the importance of word t in x and W_x is the length of x. In this formulation of the cosine measure it can be seen that the numerator is high if important words (that is, high w_{x,t} words) are in both query and document, and that division by length ensures that C(q, d) is high only if the document is dense with query terms. Thus, given two documents containing the same query terms with the same frequencies, the shorter of the two will have the higher similarity. Word importance is an abstract concept, but in practical ranking it is effectively captured by the formulations

w_{q,t} = (log f_{q,t} + 1) · (log(N/f_t) + 1)   and   w_{d,t} = log f_{d,t} + 1

Here f_{x,t} is the frequency of occurrence of t in x, that is, the number of times term t occurs in document or query x, and there are N documents in the database of which f_t contain t. Thus a word that is rare in the collection (that is, has a high inverse document frequency) or frequent in either query or document attracts a high weight. The lengths are usually computed as

W_x = sqrt( Σ_t w_{x,t}² )

so that length is essentially a function of the number of distinct words. Note that for a given query W_q is a constant and thus has no impact on the ranking and is not calculated. In principle, then, query evaluation for a query q consists of computing the similarity C(q, d) for every document d in the database, then returning to the user the documents with highest similarity.
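As a concrete illustration, the ranking process just described can be sketched as a toy in-memory ranker. Whitespace tokenization, natural logarithms, and all names here are illustrative assumptions, not part of the original formulation.

```python
import math
from collections import Counter

def weights(freqs, idf=None):
    """w = log f + 1, multiplied by the inverse document frequency
    factor (log(N/f_t) + 1) when ranking a query."""
    w = {}
    for t, f in freqs.items():
        w[t] = math.log(f) + 1
        if idf is not None:
            w[t] *= idf.get(t, 0.0)   # terms absent from the collection score 0
    return w

def cosine_rank(query, docs):
    """Score every document against the query with the cosine measure
    and return (similarity, docno) pairs, highest similarity first."""
    N = len(docs)
    doc_freqs = [Counter(d.split()) for d in docs]
    df = Counter(t for fr in doc_freqs for t in fr)     # f_t per term
    idf = {t: math.log(N / df[t]) + 1 for t in df}
    wq = weights(Counter(query.split()), idf)
    scores = []
    for i, fr in enumerate(doc_freqs):
        wd = weights(fr)
        Wd = math.sqrt(sum(v * v for v in wd.values()))  # document length W_d
        num = sum(wq[t] * wd[t] for t in wq if t in wd)
        scores.append((num / Wd if Wd else 0.0, i))      # W_q omitted: constant
    return sorted(scores, reverse=True)

docs = ["uranium nuclear energy uranium", "the cat sat", "atomic bomb energy"]
ranking = cosine_rank("nuclear atomic energy power", docs)
```

Note that, exactly as the text observes, W_q is dropped from the denominator: it rescales every score identically and cannot change the ordering.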
As with queries to traditional databases, it is valuable to try to improve a ranked query before evaluating it, by removing noise and transforming it into a better description of the information need. In particular, stopwords are usually removed; these are frequent, non-discriminating words such as the, and closed-class or function words such as however that carry no meaning. Elimination
of stopwords has little impact on effectiveness but is important for efficiency, because these words are so common. After stopping, the query above might be transformed to Relevant documents discuss nuclear atomic energy power source Stemming is as valuable for ranking as it is for Boolean queries, for example yielding relev document discus nuclear atom energ power source for the query above. Elementary natural language techniques can also prove valuable; such techniques include recognition and deletion of key phrases, such as "we discuss" or "in this paper", and recognition of proper names and aliases, so that for example "USA" and "United States" are indexed together. However, while such techniques change the set of terms available for indexing, they do not change the methods used to construct an index or to retrieve documents. For further information on ranking and information retrieval, there are several good textbooks [Frakes and Baeza-Yates, 1992, Salton, 1989, Salton and McGill, 1983, van Rijsbergen, 1979, Witten et al., 1994]. Recent research developments in the area are presented in special issues of Communications of the ACM [Fox, 1995] and Information Processing and Management [Harman, 1995a]. 5.1.3 Indexing needs The needs of querying determine the kinds of information that must be held in an index. For both Boolean and ranked queries, the index must store every distinct word occurring in the database and, for each word, the documents the word occurs in. To support proximity queries, the index must store the positions at which each word occurs in each document; ordinal word numbers are more useful than byte positions. To support ranked queries, the index must store the frequency of each word in each document. As we discuss later, richer kinds of queries may require information about document structure.
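The stopping and stemming transformations shown above can be sketched as follows. The stoplist and suffix list here are small illustrative stand-ins (a real system would use a larger stoplist and a proper stemmer such as Porter's algorithm), chosen so that the example query reduces exactly as in the text.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "will", "is", "as", "or", "be", "use"}
SUFFIXES = ["ation", "ness", "ing", "ant", "ed", "es", "ic", "s", "y"]

def stop_and_stem(text):
    """Case conversion, stopword removal, then crude longest-first
    suffix stripping, keeping stems of at least four letters."""
    words = re.findall(r"[a-z]+", text.lower())
    out = []
    for w in words:
        if w in STOPWORDS:
            continue
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) - len(suf) >= 4:
                w = w[: -len(suf)]
                break
        out.append(w)
    return out

q = stop_and_stem("Relevant documents will discuss the use of nuclear "
                  "or atomic energy as a power source")
# q reproduces the stemmed query from the text:
# relev document discus nuclear atom energ power source
```

The minimum stem length is a common safeguard in suffix strippers: without it, words such as "as" or "source" would be mangled by the short suffixes.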
In the following sections we describe index structures that have proved successful for text databases, then explain query evaluation techniques that use these structures. 5.2 Indexing 5.2.1 Inverted indexes An index is a data structure for supporting a query evaluation technique. The most commonly used structures for indexing text databases are inverted indexes,
a family of structures that can be readily adapted to each of the kinds of querying discussed above. Figure 5.2. Arrangement of a simple inverted file (lexicon, inverted lists, mapping table, documents). Inverted indexes are well-established (they have been used in commercial text retrieval systems since before 1970) and in recent years refinements to inverted indexing have dramatically improved performance. In outline an inverted index is extremely simple, consisting of a lexicon of the distinct words to be indexed and, for each word, an inverted list of information about that word. The lexicon must be organized to allow fast search for a given word, and each list should allow rapid processing to identify matching documents. Thus in the most basic case the lexicon could be stored as an array of words and each list as an array of ordinal document numbers. A mapping table, also stored as an array, can then be used to map from document numbers to matching documents. This arrangement is illustrated in Figure 5.2. For example, each of the three query terms nuclear, energy, and uranium has an entry in the lexicon (found, say, by binary search in the array) and a corresponding pointer to its inverted list. Each list contains the document
number 12; the twelfth position in the mapping table thus points to a document containing all of the query terms. 5.2.2 Search structures For conventional databases, design of the search structure is crucial to performance. For text databases, the major bottleneck is usually the fetching and processing of the inverted lists, and any structure that allows reasonably fast access to the distinct words of the database is likely to be satisfactory. A typical arrangement would be to use a B-tree in which internal nodes contain words and pointers to children, and external leaves contain words, pointers to inverted lists, and, for each word, the number of documents in the database containing the word. For many text databases such a B-tree could easily be held in memory, but the arrangement is also effective if space considerations force B-tree nodes out to disk. Use of a B-tree means that the words can be accessed in lexicographic order, allowing users to scan the lexicon and placing words with the same root but variant suffixes together. If the lexicon is not too large it is feasible to scan it for the strings that match a given pattern. Other search structures have been proposed for lexicons, but none offers any clear advantage, while the logarithmic worst-case performance and good space utilization of B-trees make them a desirable choice. As a concrete example, consider the database consisting of the 3 Gb of text used in the first three years of the ongoing TREC information retrieval experiment [Harman, 1992, Harman, 1995b]. This database contains just over 1,000,000 documents and, coincidentally, just over 1,000,000 distinct words at an average of about 9 characters each. There are around 480 × 10⁶ word occurrences in total or, discounting repetitions of words within a document, 220 × 10⁶ word-document pairs.
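A toy version of the arrangement of Figure 5.2, with the lexicon as a sorted array searched by binary search, inverted lists of ordinal document numbers, and a mapping table, might look like this (all names are illustrative):

```python
from bisect import bisect_left

class InvertedIndex:
    """Minimal inverted file: sorted lexicon, per-word inverted lists
    of ordinal document numbers, and a mapping table to documents."""
    def __init__(self, docs):
        self.mapping = list(docs)                     # mapping table
        postings = {}
        for docno, text in enumerate(docs, start=1):  # ordinal doc numbers
            for word in set(text.lower().split()):
                postings.setdefault(word, []).append(docno)
        self.lexicon = sorted(postings)               # array of words
        self.lists = [postings[w] for w in self.lexicon]

    def lookup(self, word):
        i = bisect_left(self.lexicon, word)           # binary search
        if i < len(self.lexicon) and self.lexicon[i] == word:
            return self.lists[i]
        return []

    def conjunctive(self, words):
        """Document numbers containing all query terms (Boolean AND)."""
        sets = [set(self.lookup(w)) for w in words]
        return sorted(set.intersection(*sets)) if sets else []

idx = InvertedIndex([
    "the element uranium as an energy source",
    "nuclear energy and uranium bombs",
])
hits = idx.conjunctive(["uranium", "energy"])
```

A Boolean conjunction thus reduces to intersecting the inverted lists of the query terms, exactly the evaluation strategy sketched for the nuclear/energy/uranium example above.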
(Note that figures of this kind are to a certain extent dependent on how words are defined: whether punctuation such as apostrophes are part of words or delimit them, for example, or whether words are distinguished by case.) Thus the complete TREC lexicon can be stored in the leaves of a B-tree of around 20 to 24 megabytes, given 9 bytes for each word, 4 bytes each for a count and a pointer, and making an allowance for space wastage. Assuming a block size of 8 kilobytes, and therefore a branching factor of 2⁸ to 2⁹, the total space for all internal nodes of the B-tree would occupy no more than 128 kilobytes, and thus even in the worst case only the leaves need be held on disk. In a basic representation the inverted lists would contain 220 × 10⁶ document identifiers of four bytes each, or a little under 1 gigabyte in total. This high ratio of inverted list size to lexicon size is typical of text databases, and is the reason that, in
contrast to other database applications, inverted lists are not stored directly in the B-tree: their size would prohibit scanning of the lexicon. 5.2.3 Inverted lists A basic inverted list consists of a series of document identifiers, as illustrated in Figure 5.2. But such a list does not support the kinds of queries discussed above; ranking requires word frequencies and proximity requires word positions. Addition of frequency information to a list is straightforward: each document identifier is followed by a frequency count for that word in that document. Addition of word positions is only a little more difficult, but can add considerably to index size: each document identifier is followed by a frequency count f, then by f ordinal word positions. Thus the inverted list for uranium might be 3:1(61), 10:2(14,106), 12:1(9), 29:4(22,36,98,202), ... representing that the word uranium occurs in document 3 once, at position 61; in document 10 twice, at positions 14 and 106; in document 12 once, at position 9; and so on. The punctuation is of course only for the benefit of the reader; the list is stored as the sequence 3 1 61 10 2 14 106 12 1 9 29 4 22 36 98 202 ... For the 3 gigabytes of TREC data discussed above, the index would contain 220 × 10⁶ document identifiers, 220 × 10⁶ frequencies, and 480 × 10⁶ positions. Query processing (explained in detail below) involves retrieving the inverted list corresponding to each term in the query, then processing the list to extract document numbers and, if necessary, frequencies and positions.
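The flat sequence layout just described can be captured by a pair of encode/decode routines (a sketch; the function names are illustrative):

```python
def encode_postings(postings):
    """Flatten [(doc, [positions]), ...] into the sequence
    doc f pos1 ... posf doc f ... described in the text."""
    out = []
    for doc, positions in postings:
        out.append(doc)
        out.append(len(positions))   # frequency count f
        out.extend(positions)        # f ordinal word positions
    return out

def decode_postings(seq):
    """Inverse of encode_postings: the count f after each document
    identifier says how many positions to consume next."""
    postings, i = [], 0
    while i < len(seq):
        doc, f = seq[i], seq[i + 1]
        postings.append((doc, seq[i + 2 : i + 2 + f]))
        i += 2 + f
    return postings

uranium = [(3, [61]), (10, [14, 106]), (12, [9]), (29, [22, 36, 98, 202])]
flat = encode_postings(uranium)
# flat == [3, 1, 61, 10, 2, 14, 106, 12, 1, 9, 29, 4, 22, 36, 98, 202]
```

The embedded frequency counts are what make the punctuation-free sequence unambiguous: a reader always knows whether the next integer is a document number, a count, or a position.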
A typical query term occurs in up to 1% of the stored documents, and may occur in many more, so in a larger collection the typical retrieved inverted list will contain thousands or tens of thousands of document identifiers. Fetching and processing of these lists is the major bottleneck in query evaluation, and any improvement can yield big reductions in query evaluation time. The first issue to address is the physical layout of the inverted lists on disk. The two costs of accessing data from disk are the head-positioning time (seek and latency) and the per-bit transfer cost. A programmer cannot directly improve transfer costs, which on current desktop machines allow transmission of approximately 10 megabytes per second. But repositioning of the disk head can be largely avoided by storing each inverted list contiguously, or as close to contiguously as the operating system will allow. A contiguous file can be fetched around ten times faster than a file of 8 kilobyte blocks randomly scattered on a
disk, so dramatic gains can result from storing each inverted list so that it can be fetched with a single read operation. Experimental results have shown that, despite "interference" by the underlying file system, such as organizing files into randomly-placed blocks and employing header blocks to locate the parts of the file, the various optimizations used by operating systems allow large files to be fetched at close to the maximum dictated by the transfer rate.

In some early implementations of inverted files, each list was stored as a linked list with one node per document, resulting in both appalling performance, allowing only a few kilobytes to be fetched each second, and large inverted files, because of the additional requirement for pointers. It was implementations such as these that gave inverted files a reputation for inefficiency; a related problem was that use of linked lists discouraged programmers from maintaining inverted lists in sorted order, thus adding further to query evaluation costs. However, the strategy of storing inverted lists contiguously does present problems for update. These issues are considered further below.

Even with inverted lists stored contiguously they have significant space requirements: in a simple implementation, 4 bytes for each word occurrence (for the in-document position) and a further 8 bytes (for the document number and frequency) for each word-document pair, giving approximately 4 gigabytes for the 3 gigabyte collection described above. It is clearly desirable that this space be reduced, not only to conserve disk usage but because reduction in size cuts transfer costs and thus, potentially at least, reduces query evaluation times. As a simple first step to reducing size we could question our assumptions: why, for example, have 4 bytes for the document number?
For around 1,000,000 documents, 20 bits is adequate, increasing the complexity of processing the inverted list but reducing size significantly. Similarly, 4 bytes is excessive for a frequency or a word position. Space can also be saved by applying a stoplist, that is, not indexing the common words that contribute most to index size. Such ad hoc approaches, however, will at best halve the size of the index, to perhaps 70% of the size of the indexed data.

Much greater reductions in size, that is, compression, result from more principled methods for efficient representation of integers [Bell et al., 1993, Bookstein et al., 1992, Choueka et al., 1988, Moffat and Zobel, 1996, Witten et al., 1994]. We assume in the following discussion that the numbers to be compressed are positive integers only, but it is straightforward to adapt these coding schemes to embrace zero and negative numbers.

One simple family of representations is the Elias codes [Elias, 1975]. The Elias codes represent integers in a variable number of bits, and contiguous sequences of Elias codes are uniquely decodable. The basic code is unary, in which each number x is represented by a string of x bits (x - 1 one-bits followed by a zero). For example, below are some numbers in decimal and their equivalent in unary.
x        unary
1        0
2        10
3        110
20       11111111111111111110
7, 3, 6  1111110110111110

In the last line is shown a sequence of numbers; although no punctuation is given the sequence can be separated into its constituent numbers, that is, the sequence is uniquely decodable, an essential property for any such compression scheme. Unary is not particularly efficient for large numbers ("large" in this context means "about 4"), but it provides the first step in the Elias family. The next step is the gamma code, in which each number x is factored as 2^(p-1) + d. For example, 1 = 2^(1-1) + 0 and 20 = 2^(5-1) + 4. Storing p in unary, using p bits, and d in binary, using p - 1 bits, gives another uniquely decodable representation. (In all but the last line of the following table a comma is used to separate the unary and binary parts of each gamma code, but no such separator is required in practice.)

x        gamma
1        0,
2        10,0
3        10,1
20       11110,0100
7, 3, 6  1101110111010

The gamma code for a natural number x requires 2⌊log2 x⌋ + 1 bits, so that (decimal) 1,000,000 requires 39 bits. The next Elias code is delta, in which x is factored as for gamma but p is represented using gamma rather than unary.

x        delta
1        0,
2        100,0
3        100,1
20       11001,0100
7, 3, 6  10111100110110

Using delta, 1,000,000 is represented in 29 bits; as we discuss below this saving can, in conjunction with other manipulations, yield excellent compression. Another family of representations is the Golomb codes [Golomb, 1966, Gallager and Van Voorhis, 1975]. These codes are of particular interest because, as
we discuss below, for this application they yield optimal whole-bit compression.4 In the Golomb codes a single integer parameter b is used to model the distribution of values to be represented; this value can be approximated as b ≈ 0.69 x (average x). Given b, the number x is factored as 1 + (k - 1) x b + d where 0 ≤ d < b. The value k is represented in unary and d in binary; but since b may not be a power of 2 the number of bits used to represent d can vary between ⌊log2 b⌋ and ⌈log2 b⌉. Computing r = ⌈log2 b⌉ and g = 2^r - b, the value d is encoded in r - 1 bits if d < g and as d + g in r bits otherwise. For example, suppose b is 11, so that r is 4 and g is 5. Then the numbers 1 to 5 are represented by the sequence of codes 0,000 to 0,100 (where the range of suffixes is 0 to 4, represented in 3 bits each) and 6 to 11 are represented by 0,1010 to 0,1111 (for suffixes 5 to 10, in 4 bits each). The codes are uniquely decodable and, as for all such codes, every sequence of bits is a valid code.

Variable-bit coding is a necessary tool for compression of inverted lists. However, applying variable-bit codes to inverted lists in their raw form does not yield particularly good compression; for example, the average document number only requires one or two bits fewer than the maximum number, and as the examples above show the coding schemes do not directly result in significant reductions in size.

A simple property of inverted lists provides the basis for much greater compression. Most of the numbers stored in inverted lists, the document numbers and the positions, are strictly increasing, so by taking the difference between adjacent numbers of the same kind the values to be stored become much smaller. Our example inverted list can be written as

3:1(61), 10-3:2(14,106-14), 12-10:1(9), 29-12:4(22,36-22,98-36,202-98), ...

that is,

3:1(61), 7:2(14,92), 2:1(9), 17:4(22,14,62,104), ...
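The gamma and Golomb codes, and their application to the d-gaps of a document-number list, can be sketched as follows. This is a sketch of ours: codes are built as strings of '0'/'1' characters for readability rather than packed bits, and the function names are illustrative.

```python
def gamma(x):
    """Elias gamma: factor x as 2^(p-1) + d; store p in unary, d in p-1 bits."""
    p = x.bit_length()
    d = x - (1 << (p - 1))
    unary = "1" * (p - 1) + "0"
    return unary + (format(d, "0{}b".format(p - 1)) if p > 1 else "")

def golomb(x, b):
    """Golomb code: x = 1 + (k-1)*b + d; k in unary, d in truncated binary."""
    k = (x - 1) // b + 1
    d = (x - 1) % b
    r = max(1, (b - 1).bit_length())      # r = ceil(log2 b)
    g = (1 << r) - b
    unary = "1" * (k - 1) + "0"
    if d < g:                             # short suffix: r - 1 bits
        return unary + format(d, "0{}b".format(r - 1))
    return unary + format(d + g, "0{}b".format(r))   # long suffix: r bits

# The document numbers of the example list, stored as d-gaps and Golomb-coded:
docs = [3, 10, 12, 29]
gaps = [docs[0]] + [y - x for x, y in zip(docs, docs[1:])]   # [3, 7, 2, 17]
bitstring = "".join(golomb(gap, 11) for gap in gaps)
```

With b = 11 this reproduces the worked example in the text: r is 4, g is 5, and `golomb(6, 11)` yields the code written there as 0,1010.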
Considering for the moment just the document numbers, the sequence resulting from taking differences can be modeled as the outcome of a Bernoulli process, for which the Golomb codes are an optimal representation [Bell et al., 1993]. An inverted index consisting of a lexicon and, for each indexed word, an inverted list of Golomb-coded
document numbers occupies under 10% of the size of the indexed data. For the 3 gigabyte database discussed above such an inverted index requires about 190 megabytes. Delta codes can also be used, at a small loss of compression efficiency. Using gamma codes for frequencies and delta codes for word positions, an inverted file typically occupies about 22% of the size of the indexed data, or under 700 megabytes in our practical example: one sixth of the space required for the uncompressed index.

This space saving does come at a cost: the processing effort required to decode inverted lists. However, on current desktop machines the time spent in decompression is more than offset by the time saved in data transfer [Moffat and Zobel, 1996], and in new architectures the gap between processor speed and disk transfer rates is continuing to widen, favoring the use of compression. Thus inverted file compression saves both space and time. Further refinements to the representation of inverted files are discussed in Section 5.3. Although the successful application of compression to inverted files is fairly recent, compression is already used in several commercial text database systems and some of the Internet search engines. The public-domain MG text database system was developed to demonstrate the application of compression to this domain [Bell et al., 1995, Witten et al., 1994].

5.2.4 Index construction

There are several possible approaches to index construction for text databases, which can be broadly classified as either one-pass or two-pass, that is, according to the number of times the text is inspected during index construction. We first outline the possibilities, then describe two of the more efficient methods in detail. The concept of indexing has often been described as "inversion": provision of access to records according to content.
Inversion is often implemented as a sorting process, and indeed a common algorithm given in textbooks for generating an inverted file is as follows:

1. For each document d in the collection and each word t in d, write a pair (t, d) to a file.
2. Sort the file with t as the primary sort key and d as the secondary sort key.

This algorithm is, however, almost absurdly wasteful: the document numbers are already sorted, but sorting algorithms will gain little advantage from this partial ordering. Moreover, the volume of index information dictates an expensive external sort. Better solutions use a dynamic structure containing the distinct words in the database, where each node in the structure points to a dynamic list of
the document numbers containing that word. Initially the word structure is empty; as documents are processed new words are added, and for existing words new document numbers are added to the words' lists of occurrences (together with the positions of the word in each document). However, in a naive implementation the costs will still be high because of the difficulty of maintaining structures of words and lists without frequent disk accesses. Minimizing the use of disk is the key to fast index construction.

There are two fast index construction methods, both of which use a dedicated in-memory buffer as a temporary store. In the first method, shown in Figure 5.3, the buffer is used to store complete partial indexes and the database is processed in a single pass.

1. While the internal buffer is not full, get documents; for each document d, extract the distinct words and, for each word t,
   (a) If t has already occurred in a previous document, add d to t's document list.
   (b) Otherwise add t to the structure of distinct words and create a document list for t containing d.
2. When the internal buffer is full, write it to disk to give a partial index, with the inverted lists stored according to word order. Clear the buffer and return to step 1.
3. Merge the partial indexes to give the final inverted file.

Figure 5.3. Single-pass index construction algorithm using temporary files.

Note that compression is as useful during indexing as it is in the finished index: if the partial indexes are constructed and stored compressed, more documents can be indexed before the internal buffer is filled, and less temporary space is required for the partial indexes. The main disadvantage of this method in practice is the use of temporary space for the partial indexes, which will exceed the size of the final index because the indexed words must be repeated between files; and further space is required for merging.
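The single-pass scheme of Figure 5.3 might be sketched as follows. This is a simplified illustration of ours: the buffer limit is counted in postings rather than in (compressed) bytes, the partial indexes are kept in memory rather than written to temporary files, and positions and frequencies are omitted.

```python
import heapq
from collections import defaultdict

def build_index(documents, buffer_limit):
    """Invert documents in one pass, flushing the buffer as sorted runs."""
    runs, buffer, pending = [], defaultdict(list), 0
    for doc_id, text in enumerate(documents, start=1):
        for word in set(text.split()):
            buffer[word].append(doc_id)      # add d to t's document list
            pending += 1
        if pending >= buffer_limit:          # "buffer full": emit a partial index
            runs.append(sorted(buffer.items()))
            buffer, pending = defaultdict(list), 0
    if buffer:
        runs.append(sorted(buffer.items()))
    # Merge the partial indexes: concatenating the lists for a word shared
    # between runs preserves document order, since later runs hold larger ids.
    index = {}
    for word, docs in heapq.merge(*runs):
        index.setdefault(word, []).extend(docs)
    return index
```

With `build_index(["a b", "b c", "a c"], buffer_limit=4)` the buffer is flushed once mid-collection and the merge combines the two runs into one list per word.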
Note that given a fixed-size internal buffer the asymptotic cost of the merging grows more quickly than does the volume of data to be indexed. This is not usually a problem in practice because, at least historically, growth in database size has been matched by improvements in technology, but the single-pass algorithm is not suitable for "huge" databases. The alternative efficient method, however, has neither of these problems. This method is outlined in Figure 5.4. Given memory for a complete lexicon
and for a fixed buffer to be used as a temporary store, a text database can be rapidly indexed in two passes using no temporary disk space at all [Witten et al., 1994]. In this method, the first pass is used to construct the lexicon and a skeleton for the complete index. The skeleton is progressively filled in during the second pass, by writing the contents of the buffer when it becomes full; note that each writing of the buffer requires only a single pass through the disk, thus minimizing disk head movement.

1. Extract the distinct words from each document, and for each word count the number of documents in which it appears. (Additional statistics are required if word positions are to be stored.)
2. Use the complete lexicon and occurrence counts to create an empty, template inverted index, to be progressively filled in during the second pass. The template index contains each distinct word and, for each word, contiguous space for the word's document list.
3. Initialize the second pass by creating, in the internal buffer, an empty document list for each term in the lexicon.
4. While the internal buffer is not full, get documents; for each document d, extract the distinct words and, for each word t, add d to t's document list.
5. When the buffer is full, write the partial index into the appropriate parts of the template index, clear the document lists, and go to step 4.

Figure 5.4. Two-pass index construction algorithm.

Both methods are highly efficient in practice, indexing about half a gigabyte of text per hour on a large desktop machine. Indeed the principal costs tend not to be the indexing itself but the auxiliary processes such as the parser for extraction of words from each document.

5.2.5 Index update

Compared to records in conventional databases, each record in a text database contains a large number of items to be indexed, usually hundreds and often thousands or more.
Index update is therefore expensive: insertion of a single record involves changing the inverted list of every word occurring in that record. These changes can increase the length of the inverted lists, so that (if stored contiguously) they may no longer fit at their current location on disk,
and update therefore also involves moving lists to allow for such increases. The cost of update is the most significant technical difficulty faced in the implementation of a text database system. In this section we describe approaches to the update of indexes for text databases, principally considering record insertions, as these are by far the most common update operation on text databases: in contrast to conventional databases, in which every record in a table may be modified daily by operations such as "add interest to every account balance", there are no bulk updates, and a great many text databases are used to store streams of incoming data such as newspaper articles, court transcripts, and completed documents of one kind or another.

There is no single clever strategy that dramatically reduces update costs (which, for similar reasons, are also a problem for the alternative technology of signature files). There are however several strategies for ameliorating update costs: by using temporary space, by trading update time against query evaluation time, and by deferring the availability of new documents. We now outline some of these strategies.

Updating the index as each record is inserted is costly, but the per-record cost rapidly diminishes if insertions are batched, say into groups of R records, and all of the corresponding index updates handled at once. Such aggregation of updates is effective because records share many words (in particular the common words, whose inverted lists are the most expensive to access and update), and because the changes to the inverted lists can be handled in order of appearance on disk, minimizing head movement: net seek time will be almost unchanged compared to updating the inverted file for a single record. Varying R trades the per-record cost of update against the delay until the record becomes available.
In some environments, for example, it may be quite reasonable to process all insertions overnight, in which case the amortized update cost is negligible but the database will be unavailable while the index is modified. In other environments, the downtime and the delay in availability of new records are unacceptable. However, simple variants of the batching strategy can still be used. For example, if the new records are not indexed immediately that does not mean that they are unavailable; they can be held in a pool that is exhaustively searched during evaluation of each query.5 If this pool is large enough that exhaustive searching is an unreasonable expense, the pool can be treated as a mini-database and indexed accordingly.

Once we grant the existence of a pool index, further cost ameliorations are possible. In particular, the main index can be updated on the fly, with each inverted list updated as the opportunity arises: when that inverted list is fetched as part of query evaluation, for example, or when a moment of inactivity allows the machine to schedule the update.
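The pool strategy can be sketched as follows. The class and method names are illustrative, not from any described system; this simple version assumes newly inserted document identifiers are larger than those already indexed, so appending during the batched merge preserves sorted lists.

```python
class PooledIndex:
    """Main inverted index plus a pool of unindexed new records."""

    def __init__(self):
        self.index = {}      # word -> sorted list of doc ids (the main index)
        self.pool = {}       # doc id -> raw text, not yet indexed

    def insert(self, doc_id, text):
        # New records go into the pool; the index is untouched.
        self.pool[doc_id] = text

    def merge_pool(self):
        # The batched update: index the whole pool at once.
        for doc_id, text in sorted(self.pool.items()):
            for word in set(text.split()):
                self.index.setdefault(word, []).append(doc_id)
        self.pool.clear()

    def lookup(self, word):
        # Index lookup, plus an exhaustive scan of the (small) pool.
        matches = list(self.index.get(word, []))
        matches += [d for d, text in self.pool.items() if word in text.split()]
        return sorted(matches)
```

Records are thus queryable immediately after insertion, while the expensive inverted-list changes are deferred until `merge_pool` runs.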
A further amelioration is to consider the organization of each inverted list on disk. Contiguous storage is clearly preferable for fast query evaluation, but does not allow the fastest update, for the reasons discussed above. However, it does allow reasonable update. A simple free-list of available space can be used to maintain the index, for example, typically resulting in space utilization of around 67%: an unfortunate increase in index size, but not a disaster given the small initial size. An alternative is to carve each list into blocks in some way. Here again there is a trade-off, since long blocks are highly wasteful of space (the average inverted list is kilobytes long but the median is only tens of bytes) while short blocks are in effect a linked list. One approach that has been suggested is to use a linked list of blocks, each one twice the length of its predecessor [Faloutsos and Jagadish, 1992]. However, if applied to all the lists this solution does not reduce storage costs and increases query evaluation costs. To see why, consider how the individual blocks must be allocated. Either each block size must be stored in a separate file or blocks must be managed within a single file via a scheme such as the buddy system; in either case significant head movements are required to fetch a single inverted list. Moreover, in either case the trailing block in each list will be only partially used, giving average space utilization of 75%. In the presence of update some of the blocks of each size will be unused, further reducing space utilization. Thus the scheme uses only slightly less space than contiguous storage but adversely impacts query evaluation.
The volume of data read and written during update is reduced (in both cases the whole list must be read; in the contiguous case, if there is no room for expansion the whole list must be written elsewhere, whereas in the blocked case only the end of the list must be written), but more separate disk accesses are required for the blocked lists. A practical compromise is to partition only the longest lists into fixed- or variable-length blocks, and use conventional space management strategies for the rest so that these lists are stored contiguously. A block size that reflects the organization of the underlying file system is likely to give good performance. Note that maintaining the contents of a contiguous list in sorted order is not a significant overhead: even if updates (as opposed to insertions) are frequent, the cost of inserting a number into an array in memory is dwarfed by the cost of reading or writing the array to disk, and maintaining sorted order significantly reduces the cost of query evaluation.

5.2.6 Signature files

Our presentation of inverted files has been rather clear-cut, specifying exactly how text should be indexed, with only limited options for variations that might
improve performance. We are able to present the material in this way because, currently at least, the technology is fairly settled. There is no competing methodology for indexing text that efficiently supports evaluation of query types such as ranking and proximity. Inverted files have not always held such a position, however. An alternative technology for more limited applications is signature files.

In signature files, each record is represented by a fixed-length bitstring, or signature [Pfaltz et al., 1980]. The words in the record are hashed to decide which bits are set to 1; a record is probabilistically likely to contain a given word if all the bits in its signature that correspond to that word are set. As in all hash-based methods an explicit vocabulary is not required. Naive query evaluation requires inspection of all the signatures. However, only those bit positions corresponding to the query terms need to be inspected, so, by transposing the array of signatures into an array of bitslices, rapid evaluation of conjunctive queries is possible [Roberts, 1979]. Further improvements can be obtained by organizing the slices into a multi-level structure [Kent et al., 1990, Sacks-Davis et al., 1987]. Once likely matches are identified these records must be retrieved and post-processed to verify whether they contain the query terms.

Signature files are well-suited to many of the older text database applications, which featured fixed-length documents such as abstracts; machines with small memories and large numbers of users; and simple Boolean and adjacency queries. Compared to the traditional linked-list inverted files, signature files are rather smaller and give significantly better evaluation times.
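A bitsliced signature file along these lines might look as follows. This is a toy sketch: the signature width, the number of bits set per word, and the use of Python's built-in `hash` are all illustrative choices, and a real system would verify candidates against the records themselves.

```python
W = 64          # signature width in bits (illustrative)
K = 3           # bits set per word (illustrative)

def word_bits(word):
    """Bit positions a word sets in a signature."""
    return {hash((word, i)) % W for i in range(K)}

def build_slices(documents):
    """Transpose document signatures into W bitslices, one int per slice."""
    slices = [0] * W                     # bit d of slice j: doc d sets bit j
    for doc_id, text in enumerate(documents):
        for word in set(text.split()):
            for j in word_bits(word):
                slices[j] |= 1 << doc_id
    return slices

def candidates(slices, query_words, n_docs):
    """AND together only the slices selected by the query terms."""
    mask = (1 << n_docs) - 1             # start with every document
    for word in query_words:
        for j in word_bits(word):
            mask &= slices[j]
    # Probable matches only: false matches are possible, false misses are not.
    return [d for d in range(n_docs) if mask >> d & 1]
```

Only K slices per query term are touched, which is what makes conjunctive queries cheap; the price is the post-processing step to eliminate false matches.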
However, signatures are not effective for current text applications, partly because they are poor at indexing databases whose records vary dramatically in length, and partly because they do not provide efficient evaluation mechanisms for the rich query paradigms that users now expect of text databases, including not only ranked and proximity queries but the structure-based querying discussed below. Moreover, they are not as compact as current inverted file implementations, which radically improve on the implementations of only a few years ago [Zobel et al., 1992, Zobel et al., 1995a].

5.3 Query evaluation

5.3.1 Boolean queries

Boolean query evaluation is, conceptually, a straightforward application of elementary algorithms. Assuming the inverted lists are stored in sorted order (and neglecting for the moment queries involving phrases or proximity), each operation is a simple linear merge of two sorted lists, with intersection for AND and union for OR. The temporary space required to represent the result of the merge is at most one slot for each document in the database.
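The two merges are standard linear list merges and can be sketched directly (function names are ours):

```python
def and_merge(xs, ys):
    """Intersection of two sorted document-number lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i]); i += 1; j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out

def or_merge(xs, ys):
    """Union of two sorted document-number lists."""
    out, i, j = [], 0, 0
    while i < len(xs) or j < len(ys):
        if j == len(ys) or (i < len(xs) and xs[i] < ys[j]):
            out.append(xs[i]); i += 1
        elif i == len(xs) or ys[j] < xs[i]:
            out.append(ys[j]); j += 1
        else:                      # equal: emit once, advance both
            out.append(xs[i]); i += 1; j += 1
    return out
```

Both run in time linear in the combined list lengths, which is why keeping inverted lists in sorted order matters so much for Boolean evaluation.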
Evaluation is only made slightly more complex by the introduction of proximity queries. An intersecting merge is used to find the documents containing the words that must be proximate; then a comparison of positions is used to check that the words are appropriately close within the documents. Note that the word positions should be represented as ordinal word occurrences rather than byte positions, or it is not possible to reliably identify whether two words are actually adjacent.

5.3.2 Ranked queries

The principle of ranking was sketched out above: a similarity measure such as the cosine is used to allocate a numerical score to each document in the collection with respect to the query, then the documents with the highest scores are retrieved for presentation to the user. In this section we explain how an index can be used to rapidly compute the scores for the highest-ranked documents. Reformulating the cosine measure as

C(q,d) = ( Σ over t in q and d of Sq,d,t ) / (Wq x Wd)

where Sq,d,t = wq,t x wd,t, it can be seen that, for any document d, the value Sq,d,t is non-zero only if t occurs in q, that is, if t is a query term. The numerator Σ Sq,d,t can be computed considering only query terms; thus all the information required to compute the numerators is available in an inverted file. (For the remainder of this discussion we assume that each inverted list consists of (d, fd,t) document-number and frequency pairs, and that position information is either not stored or is ignored by the ranking process.) The query length Wq is unnecessary, but the document lengths Wd must be precomputed and stored in a separate structure; with efficient representations these lengths can be stored in a few bits each [Moffat et al., 1994]. Using the inverted file, the cosine similarity of a document d and query q can be computed as in the elementary ranking algorithm in Figure 5.5.
An array of accumulators is used to store, for each document in the database, the running total of the partial sum Σ Sq,d,t. For a typical database and query, once index processing is complete a reasonable fraction of the accumulators will be non-zero. These accumulators are then normalized by the document lengths, and a partial sort such as a heapsort is used to identify the k documents with the highest cosine values.

1. Create an array A of accumulators, one for each document d in the database, and for each d initialize Ad ← 0.
2. For each term t in the query,
   (a) Compute the term weight wq,t.
   (b) Retrieve the inverted list for t from disk.
   (c) For each entry (d, fd,t) in the inverted list, compute wd,t and set Ad ← Ad + Sq,d,t.
3. Divide each non-zero accumulator Ad by the document length Wd.
4. Identify the k highest accumulator values (where k is the number of documents to be presented to the user) and retrieve the corresponding documents.

Figure 5.5. Elementary ranking algorithm using an array of accumulators.

The elementary ranking algorithm provides reasonable performance, and indeed has been employed in many practical information retrieval systems. However, it has significant costs that in many environments are unacceptable, particularly for larger document collections. First, ranked queries are often expressed in natural language, and therefore contain a large number of query terms; from the point of view of effectiveness this is beneficial, because increasing the number of query terms can significantly improve the likelihood that the query will locate relevant documents. Second, some of the query terms may occur in a good fraction of the records in the database. The inverted lists for these query terms must be retrieved and processed in full, and some of them may be long. Third, the array of accumulators, which contains a floating point value for each document in the database, is accessed frequently and randomly and hence must be stored in memory; and a separate array is required for each simultaneous query. Fourth, the array of document lengths must be either held in memory or fetched in full for each query.6

In combination, there is substantial use of disk traffic, for inverted list retrieval; memory, for accumulators and document lengths; and processor time, for decompression, accumulator update, and accumulator normalization. We need to consider ways to reduce all these costs. An observation that allows savings in all of these resources is that a total ranking is unnecessary: in response to a given query users are only interested in a tiny subset of the document collection.
Thus it is not necessary to compute the similarity of every document. Using simple heuristics, several of which are discussed below, it is straightforward to drastically prune the number of accumulators required without degrading retrieval effectiveness. (However, note that two methods can highly rank completely different documents. That
is, maintenance of effectiveness does not imply that the same documents are fetched, but only that the same proportion of fetched documents are relevant.) Once the number of accumulators is reduced, index reorganizations can be used to reduce the other resource requirements.

A straightforward approach to reducing the number of accumulators is to restrict their number to some fixed value Amax where Amax << N, the number of documents. In simple versions of such algorithms [Moffat and Zobel, 1996], query terms are processed in order of decreasing importance as measured by their inverse document frequency; each (d, fd,t) pair is decoded, and d, if not previously encountered, is only allocated an accumulator if the limit Amax has not yet been met. Thereafter only existing accumulators can be updated, and (d, fd,t) pairs referring to other documents are ignored. Thus only documents containing rare (high inverse document frequency) terms are allocated accumulators, on the heuristic assumption that documents without such terms are unlikely to be relevant.

Experimentally there was no impact on effectiveness with Amax set so that only around 2% of the documents have an accumulator, reducing memory requirements by about a factor of 15 (although there is only one-fiftieth of the number of accumulators, each accumulator now requires a document number and is stored in a sparse data structure), and eliminating some of the computational requirement for accumulator update. Since most of the (d, fd,t) pairs in each inverted list are no longer used, particularly in the long inverted lists of common terms, the decompression of these pairs is wasted effort. Most of the decompression can be avoided by introducing a small amount of internal structure into each inverted list to allow the unused (d, fd,t) pairs to be skipped, slightly increasing disk traffic but halving processing costs.
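The accumulator-limiting strategy can be sketched as follows. This is an illustration of ours, with a deliberately simplified weight Sq,d,t = fd,t x idf(t), document frequency standing in for inverse document frequency ordering, and no document-length normalization.

```python
import heapq
import math

def ranked_query(query_terms, index, n_docs, a_max, k):
    """index: term -> list of (doc, f_dt) pairs; returns top-k (score, doc)."""
    # Process terms rarest first, i.e. in decreasing inverse document frequency.
    terms = sorted(query_terms, key=lambda t: len(index.get(t, [])))
    acc = {}                              # sparse accumulator structure
    for t in terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))   # simplified weight
        for doc, f in postings:
            if doc in acc:                # existing accumulators may be updated
                acc[doc] += f * idf
            elif len(acc) < a_max:        # only rare terms create accumulators
                acc[doc] = f * idf
    return heapq.nlargest(k, ((s, d) for d, s in acc.items()))
```

With a_max well below the collection size, documents matching only common terms never obtain an accumulator and their postings are simply skipped.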
This internal structure can also be used to accelerate Boolean query processing. With these improvements the remaining important bottleneck in processing is the disk traffic.

An alternative method further reduces processing costs and also reduces disk traffic [Persin et al., 1996]. The basic idea is that, by only allowing sufficiently large Sq,d,t values to create an accumulator, the number of accumulators will be reduced. The principle underlying "sufficiently large" is that, because accumulator values grow as inverted lists are processed and because Sq,d,t values tend to diminish if inverted lists are processed in decreasing order of inverse document frequency, the effect of adding further Sq,d,t terms to the accumulators is increasingly marginal: the additions not only are unlikely to bring new documents into the top k but cannot even significantly perturb the ranking. By comparing each Sq,d,t value to two current thresholds (one to check whether the value should be considered at all and one to check whether it warrants a new accumulator), small Sq,d,t values can be filtered and the number of accumulators restricted.
The thresholds are increased as inverted lists are processed. This method, like the skipping method, drastically reduces memory requirements without degrading retrieval effectiveness, but it requires two parameters to control the degree of filtering.

If the inverted files are designed appropriately, disk traffic can also be dramatically reduced. The principle of the index design is that inverted lists are sorted by within-document frequency rather than by document number. For example, consider the inverted list

(5,3)(9,2)(12,2)(16,5)(21,1)(25,2)(32,4),

representing that the term being indexed occurs three times in document 5, twice in document 9, and so on. If the list is ordered first by decreasing within-document frequency, with a secondary sort by document number, then it becomes

(16,5)(32,4)(5,3)(9,2)(12,2)(25,2)(21,1).

With this ordering, all of the sufficiently large Sq,d,t values in each inverted list are at the start; once a small Sq,d,t value is reached, fetching and processing of that inverted list can terminate. In the experiments of Persin et al. this allowed a five-fold reduction in disk traffic and processing time.

A potential drawback of this reorganization of inverted lists is that the document numbers are no longer sorted, so that the compression strategy described above is not strictly applicable. However, a straightforward modification of it yields equally good compression. First, the frequencies are stored in decreasing order, so the duplicate frequencies are redundant and can be omitted. Second, in practice most of the frequencies are either 1 or 2, and compressing the sorted document numbers of a given frequency yields good space saving. Overall, frequency sorting slightly reduces index size.

Another alternative, also based on frequency-sorted inverted lists, is to interleave the processing of the inverted lists rather than process them sequentially [Persin, 1996].
In the query evaluation methods described above, each inverted list is processed sequentially from the beginning until either the list is exhausted or the frequencies are judged to be sufficiently small that they will not affect the ranking; once processing of an inverted list is complete, it is not revisited. But consider two terms t and t' occurring in documents d and d' respectively. Even if t is rarer than t' and has higher inverse document frequency, so that t's inverted list is processed first, it may well be that Sq,d,t is less than Sq,d',t' if t is much less frequent in d than t' is in d'. It follows that, if we are to observe the principle that high Sq,d,t values should be processed first, it is inappropriate to process the whole of the inverted list for t before commencing the list for t'.
1. Create an empty set of accumulators.
2. For each term t in the query, identify the highest within-document frequency fd,t for that term and compute the partial similarity Sq,d,t.
3. While the largest unprocessed Sq,d,t value is sufficiently large,
(a) Find the query term t with the largest unprocessed Sq,d,t value.
(b) If there is an accumulator Ad present in the set of accumulators, set Ad ← Ad + Sq,d,t.
(c) Otherwise, if the number of accumulators is less than Amax, create a new accumulator Ad and set Ad ← Sq,d,t.
(d) Compute the next highest Sq,d,t value for t.
4. Divide each accumulator Ad by the document length Wd.
5. Identify the k highest accumulator values and retrieve the corresponding documents.

Figure 5.6. Interleaved ranking algorithm using limited accumulators

In interleaved ranking, processing consists of considering the partial similarity values Sq,d,t in order of strictly non-increasing magnitude, independent of the inverted lists in which they occur. Efficiency gains result from two heuristics: limiting the number of accumulators so that only the larger Sq,d,t values can create an accumulator; and stopping when the next greatest Sq,d,t value is sufficiently small and is unlikely to affect the relative order of the highest ranked documents. Whether an Sq,d,t value is "sufficiently small" can be heuristically determined by examining the current accumulator values. An alternative approach is to explicitly bound the time required to evaluate a query, and terminate processing when the time bound is reached. Such processing is supported by frequency-sorted indexes, in which the highest frequencies in each list (and thus the highest Sq,d,t values in each list) are at the start, and (d, fd,t) values can be retrieved from each list in decreasing order. Interleaved query evaluation is shown in Figure 5.6.
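The steps of Figure 5.6 can be sketched with a heap that always yields the largest unprocessed partial similarity. The frequency-sorted lists, the similarity form f_dt * idf[t], and the fixed stopping threshold s_min are simplifying assumptions for illustration.

```python
# A sketch of the interleaved ranking algorithm of Figure 5.6, using a heap to
# find the largest unprocessed S_q,d,t value across all lists. It assumes a
# frequency-sorted index (each list ordered by decreasing f_dt) and the
# similarity form f_dt * idf[t]; both are simplifications for illustration.
import heapq

def interleaved_rank(query_terms, lists, idf, doc_len, a_max, s_min, k):
    heap = []                               # entries: (-S_q,d,t, term, position)
    for t in query_terms:
        if lists.get(t):
            d, f = lists[t][0]
            heapq.heappush(heap, (-f * idf[t], t, 0))
    acc = {}                                # limited set of accumulators
    while heap:
        neg_s, t, i = heapq.heappop(heap)
        s = -neg_s
        if s < s_min:                       # largest remaining value too small: stop
            break
        d, f = lists[t][i]
        if d in acc:
            acc[d] += s
        elif len(acc) < a_max:              # only while accumulators remain
            acc[d] = s
        if i + 1 < len(lists[t]):           # next highest S_q,d,t value for t
            d2, f2 = lists[t][i + 1]
            heapq.heappush(heap, (-f2 * idf[t], t, i + 1))
    ranked = sorted(((a / doc_len[d], d) for d, a in acc.items()), reverse=True)
    return [d for _, d in ranked[:k]]
```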
The main potential disadvantage of interleaved ranking is that inverted lists are fetched on demand, piecemeal, rather than with a single read. Fetching the whole list at once incurs the overhead of retrieving unnecessary data, while fetching the list as needed can incur the overhead of unnecessary disk activity. In practice, however, the problem does not appear to be significant: in most cases
all of the required (d, fd,t) pairs are in the first few kilobytes of each inverted list, so fetching a single disk block from the start of each list is sufficient [Brown, 1995]. Moreover, in some cases not even the first block is required; if the maximum fd,t value for each term is held with the term in the lexicon, it is possible to identify that, for some terms, no Sq,d,t value will be sufficiently large. These are not the only possible approaches for improving the basic ranking algorithm. Elimination of stopwords can be used to reduce computation costs. However, it is sometimes difficult to determine the correct set of stopwords for a particular document collection. For example, in a database of articles from the Wall Street Journal within the TREC collection, the word "text" (not a particularly common word in English) is encountered in every document in the collection. Other proposals have been based on dynamic stopping conditions. One is that the number of accumulators be limited by considering only documents that contain a term with a sufficiently high inverse document frequency [Harman and Candela, 1990]. Another possible stopping condition is to reduce the number of (d, fd,t) pairs by computing an upper bound for the similarity of the current document being considered, and ignoring Sq,d,t if the computed upper bound is smaller than the weight of the least important document in the set of answers [Lucarella, 1988]. The efficiency of the basic ranking algorithm can also be improved using the assumption that only the k top-ranked documents are to be retrieved [Buckley and Lewit, 1985]. In this method, query processing is terminated when the upper bound of the similarity of the (k+1)th document becomes less than the similarity of the kth document. However, these schemes do not provide the dramatic improvements given by the methods discussed above.
5.4 Refinements to text databases

5.4.1 Structure and fields

Traditional text retrieval systems regard each document as an unstructured sequence or bag of words. However, documents consist of fields such as titles, sections, and paragraphs. These components often conform to a hierarchical structure that can be represented by a formal schema such as an SGML document type definition [Goldfarb, 1990]. Compared to traditional database applications, text objects conforming to the same schema can vary widely in both structure and size. Consider, for example, a collection of documents relating to the technical details of the products of a manufacturing company. These documents might include memoranda, engineering reports, and surveys of technical literature, all written to conform to the company's official proforma. They might also include other memoranda written by office staff without reference to the official forms, letters that have little structure in common with either of the other classes of memoranda, documents from external sources, and so on. Yet all these documents must be searched as a single collection. The lack of uniformity among the documents in a single collection makes indexing and retrieval more complex than if the documents had uniform structure and size. We illustrate structure by considering a collection of documents in which markup (such as SGML tags) is included in the text to represent the structural information. Consider for example the document in Figure 5.7, which is a letter consisting of a head and a body. The head consists of three fields, from, to, and date, and the body consists of a number of sentences. Each structural unit is delimited by a start tag and an end tag. For example, a sentence starts with a <sentence> tag and ends with a </sentence> tag. The document forms a simple tree, in which the text is in the leaves and each structural unit is a node. Structured documents can be queried in the traditional way, as if they were no more than a sequence of words, but query languages can take advantage of the structure to provide more effective retrieval.

<letter>
<head><from>Mark Twain</from>
<to>W. D. Howells</to>
<date>15 June 1872</date>
</head>
<body><sentence> Friend Howells </sentence>
<sentence> Could you tell me how I could get a copy of your portrait as published in Hearth & Home? </sentence>
<sentence> I hear so much talk about it as being among the finest works of art which have yet appeared in that journal, that I feel a strong desire to see it. </sentence>
<sentence> Is it suitable for framing? </sentence>
...
</body>
</letter>

Figure 5.7. SGML document illustrating hierarchical structure.
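The tree structure described above can be illustrated with a toy parser. This assumes well-formed, properly nested tags only; a real SGML system would use a full parser driven by the document type definition.

```python
# A toy illustration of how markup such as that in Figure 5.7 forms a tree,
# with structural units as internal nodes and text at the leaves. Assumes
# well-formed, properly nested <tag>...</tag> pairs, which real SGML does
# not guarantee without a DTD-driven parser.
import re

def parse(marked_up_text):
    root = {"tag": "root", "children": []}
    stack = [root]
    for token in re.split(r"(<[^>]+>)", marked_up_text):
        if not token.strip():
            continue
        if token.startswith("</"):
            stack.pop()                      # end tag: close the current unit
        elif token.startswith("<"):
            node = {"tag": token[1:-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)               # start tag: open a new unit
        else:
            stack[-1]["children"].append(token.strip())   # text leaf
    return root

letter = ("<letter><head><from>Mark Twain</from></head>"
          "<body><sentence>Friend Howells</sentence></body></letter>")
tree = parse(letter)
```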
A simple example of a query involving structure is

find documents with a chapter whose title contains the phrase "metal fatigue"

If such queries are to be evaluated efficiently, they require support from indexing mechanisms. One possibility is to use conventional relational or object-oriented database technology to store and index the leaf elements of the hierarchical
structure, and maintain the relationships between these leaf elements and the higher-level elements of the document structure in other relations (or object classes). Join operations can then be used to reconstruct the original documents or document components. The problem with using such technology is that a large number of database objects may be required to store the information from a single document, so that it is expensive both to search across the document and to retrieve it for presentation. For these reasons specialized indexing techniques for structured documents have been developed. Perhaps the simplest method for supporting structure is to index the documents and process queries as for unstructured documents, so that the result of query resolution is a set of documents that potentially match the query; these documents can then be filtered to remove false matches. As a general principle it is always possible to trade the size and complexity of indexes against post-retrieval processing on fetched documents: there is a tradeoff between the amount of information in the index and the number of false matches that must be filtered out at query time, and indeed for just about any class of data and index type it is possible to conceive of queries that cannot be completely resolved using the index. It is often the case, however, that the addition of a relatively small amount of information to an index can greatly reduce the number of false matches to process; consider how adding positional information eliminates the need to check whether query terms are adjacent in retrieved documents. Moreover, the cost of query evaluation via inverted lists of known length is usually much more predictable than the cost of processing an (unknown) number of false matches. We therefore consider query evaluation techniques that involve increased index complexity and reduced post-retrieval processing.
One approach is to encode document structure in the index. For each document containing a given word, rather than storing the document number and the ordinal positions at which the word occurs, it is possible to store, say, the document number; the chapter number within the document; the paragraph within the chapter; and finally the position within the paragraph. Indexes for hierarchically structured documents require that considerably more information be stored for each word occurrence, but the magnitudes of the numbers involved are rather smaller, the "take difference and encode" compression strategies can be applied, and there is plenty of scope to remove redundancy: if a word occurs twice in a document, the document number is only stored once; if it occurs twice in a chapter, the chapter number is only stored once; and so on. Experiments have shown that, compressed, the size of such an index roughly doubles compared to storing ordinal word positions, from about 22% of the data size to 44% of the data size [Thom et al., 1995]. The resulting indexes allow much more powerful queries to be evaluated directly, without recourse to false-match checking.
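The redundancy removal just described can be sketched by grouping occurrence paths on shared prefixes. The four-level document/chapter/paragraph/word hierarchy is one assumed layout, not a fixed scheme.

```python
# A sketch of the redundancy removal described above: occurrences of one term
# are stored as (document, chapter, paragraph, word) paths, and grouping on
# shared prefixes means a repeated document or chapter number is kept only
# once. The four-level hierarchy is an assumed layout for illustration.

def group_positions(paths):
    """paths: iterable of (doc, chapter, paragraph, word) tuples.
    Returns nested dicts: doc -> chapter -> paragraph -> [word positions]."""
    index = {}
    for doc, chap, para, word in sorted(paths):
        index.setdefault(doc, {}) \
             .setdefault(chap, {}) \
             .setdefault(para, []) \
             .append(word)
    return index

occurrences = [(7, 1, 2, 14), (7, 1, 2, 30), (7, 3, 1, 6), (9, 2, 5, 11)]
index = group_positions(occurrences)
# document 7 and chapter 1 are each stored once though they cover several
# occurrences; "take difference and encode" can then compress each level
```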
Rather than encode the structural information within the inverted indexes, another approach is to maintain simple word-position indexes for each term in the database and record the structural information in separate indexes. In order to represent the positions of the words and the markup symbols, the words in each document are given consecutive integer numbers and the markup symbols are given intermediate rational numbers. Thus, for example, a certain word might occur at position 66, the start tag for a paragraph at position 53.5, and the end tag at position 69.1, from which it can be deduced that the word occurs in the paragraph. The positions between a start tag and the corresponding end tag constitute an interval. Evaluating Boolean queries with conventional text indexes involves merging the inverted lists of the query terms. In contrast, the processing of structural queries involves merging inverted lists of word positions and inverted lists of intervals. For example, processing the query find sentences containing "fatigue" involves merging the inverted list of word positions for the term "fatigue" and the inverted list of intervals for the tag sentence to identify the set of intervals containing the word. An approach to querying on structure based on text intervals was formalized as the GCL (Generalized Concordance Lists) model [Clarke et al., 1995]. The GCL model includes an algebra that incorporates operators to eliminate intervals that wholly contain (or are wholly contained in) other intervals. These operators are important for efficient query processing. GCL evolved from two earlier structured text retrieval languages developed at the University of Waterloo [Burkowski, 1992, Gonnet and Tompa, 1987], one of which, the Pat text searching system, was developed for use with the New Oxford English Dictionary. Dao et al.
[Dao et al., 1996] extended the GCL model to manage recursive structures (such as lists within lists). Compared to the approach of incorporating document structure within the inverted indexes, the GCL model and its variants have two important advantages: queries on structure only (such as "find documents containing lists") can be evaluated efficiently using the interval index; and the GCL model does not require that the document structure be hierarchical. On the other hand, it is expensive to create and manipulate inverted lists of commonly occurring tags (such as section or paragraph) that are contained in every document, so that, for hierarchical document collections, incorporating document structure within the inverted index is likely to have performance advantages. For example, a simple query to find sentences containing two given terms only requires, with a hierarchical index, that the inverted lists for the query terms be retrieved and processed; with the interval approach it is also necessary to fetch and process the inverted list of sentence tags.
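The core merge of word positions against tag intervals can be sketched as follows. Both inputs are assumed sorted, and integer positions stand in for the word/tag numbering described above.

```python
# A minimal interval merge in the spirit of the GCL model: given the intervals
# of a tag (here, sentences) and the positions of a term, report the intervals
# containing at least one occurrence. Inputs are assumed sorted; integer
# positions stand in for the word/tag numbering described above.
import bisect

def containing_intervals(intervals, positions):
    hits = []
    for start, end in intervals:
        i = bisect.bisect_left(positions, start)   # first position >= start
        if i < len(positions) and positions[i] <= end:
            hits.append((start, end))
    return hits

sentences = [(1, 10), (11, 25), (26, 40)]       # intervals for <sentence> tags
fatigue = [7, 30]                               # word positions of "fatigue"
print(containing_intervals(sentences, fatigue))
# → [(1, 10), (26, 40)]
```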
5.4.2 Pattern matching

Standard query languages for text databases include pattern matching constructs such as wildcard characters and other forms of partial specification of query terms. In particular, in both ranking and Boolean queries users often use query terms such as comput* to match all words starting with the letters comput, and more general patterns may also be used. A common approach is to scan the lexicon to find all terms that satisfy the pattern matching construct and then retrieve all the corresponding inverted lists. Since the lexicon is ordered, prefix queries, where patterns are of the form X*, can be evaluated efficiently since, with a lexicon structure such as a B-tree, all possible matching terms are stored contiguously. However, other pattern queries can require a linear scan of the whole lexicon. The problem, in a large lexicon, is to rapidly find all terms matching the specified pattern. A standard solution is to use a trie or a suffix tree [Morrison, 1968, Gonnet and Baeza-Yates, 1991], which indexes every substring in the lexicon. Tries provide extremely fast access to substrings but have a serious drawback in this application: the need for random access means that they must be held in main memory, and at typically eight to ten times the size of the indexed lexicon this means that, for TREC, up to 100 megabytes of memory is required. Unless speed is the only constraint, smaller structures are preferable. One alternative is to use a permuted dictionary [Bratley and Choueka, 1982, Gonnet and Baeza-Yates, 1991] containing all possible rotations of each word in the lexicon, so that, for example, the word range would contribute the original form |range and the rotations range|, ange|r, nge|ra, ge|ran, and e|rang, where | indicates the beginning of a word. The resulting set of strings is then sorted lexicographically.
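The rotation construction can be sketched directly from the book's example; each pattern class then reduces to a prefix search on the sorted rotations.

```python
# Building the permuted lexicon described above. As in the book's example,
# '|' marks the beginning of a word; every rotation of '|' + word is kept and
# the whole set is sorted, so each pattern class reduces to a prefix search:
# X* searches for rotations starting with "|X", *X for "X|", *X* for "X",
# and X*Y for "Y|X".

def permuted_lexicon(words):
    entries = []
    for w in words:
        s = "|" + w
        for i in range(len(s)):
            entries.append((s[i:] + s[:i], w))   # one rotation per character
    return sorted(entries)

lex = permuted_lexicon(["range", "ranger", "orange"])
# evaluate *nge* : binary search for rotations with prefix "nge"
matches = sorted({w for rotation, w in lex if rotation.startswith("nge")})
print(matches)
# → ['orange', 'range', 'ranger']
```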
Using this mechanism, all patterns of the form X*, *X, *X*, and X*Y can be rapidly processed by binary search on the permuted lexicon. The permuted lexicon can be implemented as an array of pointers, one to each character of the original lexicon, or about four times the size of the indexed data. Update of the structure is fairly slow. Another approach is to index the lexicon with compressed inverted files [Zobel et al., 1993]. The lexicon is treated as a database that can be accessed using an index of fixed-length substrings of length n, or n-grams. To retrieve strings that match a pattern, all of the n-grams in the pattern are extracted; the words in the lexicon that contain these substrings are identified via the index; and these words are checked against the pattern to eliminate false matches. This approach provides general pattern matching at a smaller overhead, with indexes of around the same size as the indexed data; matching is significantly slower than with the methods discussed above but still much faster than exhaustive
search. A related approach is to index n-grams with signature files [Owolabi and McGregor, 1988], which can have similar performance for short strings.

5.4.3 Phonetic matching

Pattern matching is not the only kind of string matching of value for text databases. Another kind of matching is by similarity of sound: to identify strings that, if voiced, may have the same pronunciation. Such matching is of particular value for databases of names; consider for example a telephone directory enquiry line. To provide such matching it is necessary to have a mechanism for determining whether two strings may sound alike (that is, a similarity measure) and, if matching is to be fast, an indexing technique. Thus phonetic matching is a form of ranking. Many phonetic similarity measures have been proposed. The best known (and oldest) is the Soundex algorithm [Hall and Dowling, 1980, Kukich, 1992] and its derivatives, in which strings are reduced to simple codes and are deemed to sound alike if they have the same encoding. Despite the popularity of Soundex, however, it is not an effective phonetic matching method. Far better matching is given by lexicographic methods such as n-gram similarities, which use the number of n-grams in common between two strings; edit distances, which use the number of changes required to transform one string into another; and phonetically-based edit distances, which make allowance for the similarity of pronunciation of the characters involved [Zobel and Dart, 1995, Zobel and Dart, 1996]. An n-gram index can be used to accelerate matching, by selecting the strings that have short sequences of characters in common with the query string, to be subsequently checked directly by the similarity measure. The speed-up available from such indexes is limited, however, because typically 10% of the strings are selected by the index as candidates.
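The edit distance named above, the basis of the phonetically-weighted variants, can be sketched with the standard dynamic program:

```python
# Edit (Levenshtein) distance, one of the lexicographic measures named above:
# the number of single-character insertions, deletions, and substitutions
# needed to turn one string into another. Phonetically-based variants weight
# the substitution cost by similarity of pronunciation.

def edit_distance(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]

print(edit_distance("meyer", "meier"))   # → 1
```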
5.4.4 Passage retrieval

Documents in text databases can be extremely large; one of the documents in the TREC collection, for example, is considerably longer than Tolstoy's War and Peace. Retrieval of smaller units of information than whole documents has several advantages: it reduces disk traffic; small units are more likely to be useful to the user; and they may represent blocks of relevant material from otherwise irrelevant text. Such smaller units, or passages, could be logical units such as sections or series of paragraphs, or might simply be any contiguous sequence of words. Passages can be used to determine the most relevant documents in a collection, on the principle that it is better to identify as relevant a document that
contains at least one short passage of text with a high number of query terms rather than a document with the query terms spread thinly across its whole length. Experiments with the TREC collection and other databases show that use of passages can significantly improve effectiveness [Callan, 1994, Hearst and Plaunt, 1993, Kaszkiel and Zobel, 1997, Knaus et al., 1995, Mittendorf and Schauble, 1994, Salton et al., 1993, Wilkinson, 1994, Zobel et al., 1995b]. Use of passages does increase the cost of ranking, because more distinct items must be ranked, but the various techniques described earlier for reducing the cost of ranking are as applicable to passages as they are to whole documents.

5.4.5 Query expansion and combination of evidence

Improvement of effectiveness, that is, finding similarity measures that are better at identifying relevant documents, is a principal goal of research in information retrieval. Passage retrieval is one approach to improving effectiveness. Two other approaches of importance are query expansion and combination of evidence. The longer a query, the more likely it is to be effective. It follows that it can be helpful to introduce further query terms, that is, to expand the query. One such approach is thesaural expansion, in which either users are encouraged to add new query terms drawn from a thesaurus or such terms are added automatically. Another approach is relevance feedback: after some documents have been returned as matches, the user can indicate which of these are relevant; the system can then automatically extract likely additional query terms from these documents and use them to identify further matches. A recent innovation is automatic query expansion, in which, based on the statistical observation that the most highly-ranked documents have a reasonable likelihood of relevance, these documents are assumed to be relevant and used as sources of further query terms.
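Automatic expansion can be sketched as follows; selecting terms by raw frequency is a simplification of the weighted term-selection rules used in practice.

```python
# A sketch of automatic (pseudo-relevance) query expansion: assume the top
# ranked documents are relevant and add their most frequent non-query terms
# to the query. Raw-frequency selection is a simplification of the weighted
# selection used in practice.
from collections import Counter

def expand_query(query, top_documents, extra_terms=3):
    """top_documents: list of documents, each represented as a list of terms."""
    counts = Counter(t for doc in top_documents for t in doc if t not in query)
    return list(query) + [t for t, _ in counts.most_common(extra_terms)]

top_docs = [["metal", "fatigue", "crack", "stress", "crack"],
            ["fatigue", "stress", "fracture", "crack"]]
print(expand_query(["metal", "fatigue"], top_docs))
# → ['metal', 'fatigue', 'crack', 'stress', 'fracture']
```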
All of these methods can improve performance, with relevance feedback in particular proving successful [Salton, 1989]. A curious feature of document retrieval is that different approaches to measuring similarity can give very different rankings, and yet be equally effective. That is, different measures identify different documents, because they use different forms of evidence to construe relevance. This property can be exploited by explicitly combining the similarities from different measures, which frequently leads to improved effectiveness [Fox and Shaw, 1993].

5.5 Summary

We have reviewed querying and indexing for text databases. Since queries to text databases are inherently approximate, text querying paradigms must be judged by their effectiveness, that is, whether they allow users to readily locate
relevant documents. Research in information retrieval has identified statistical ranking techniques, based on similarity measures, that can be used for effective querying. The task of text query evaluation is to compute these measures efficiently, or to efficiently compute heuristic approximations to these measures that allow faster response without compromising effectiveness. The last decade has seen vast improvements in text query evaluation and text indexes. First, compression has been successfully applied to inverted files, reducing the space requirements of an index with full positional information to less than 25% of that of the indexed data, or less than 10% for an index with only the document-level information required for ranking. This compares very favorably with the space required for traditional inverted file or signature file implementations. Use of compression has no impact on overall query evaluation time, since the additional processing costs are offset by savings in disk traffic. Also, compression makes possible new efficient index construction techniques. Second, improved algorithms have led to further dramatic reductions in the costs of text query evaluation, and in particular of ranking, giving savings in memory requirements, processing costs, and disk traffic. Currently, however, the needs of document database systems are rapidly changing, driven by the rapid expansion of the Web and the growing use of intranets and corporate databases. We have described some of the new requirements for text databases, including the need to index and retrieve documents according to structure and the need to identify relevant passages within text collections. Improved retrieval methodologies are being proposed and consequently there is a need to support new evaluation modes such as query expansion and combination of evidence. These improvements are not yet well understood, and before they can be used in practice new indexing and query evaluation techniques are required. Future research in text database indexing will have to meet the demands of these advanced kinds of querying.

Notes

1. The ongoing TREC text retrieval experiment, involving participants from around the world, is an NIST-funded initiative that provides queries, large test collections, and blind evaluation of ranking techniques. Prior to TREC the cost of relevance judgments had restricted ranking experiments to toy collections of a few thousand documents.

2. Some of the online search engines, such as AltaVista, report the number of occurrences of each query term. Currently (the start of 1997) these numbers often run up to a million or so, against a database of around ten million records, showing that meaningful query terms can indeed occur in a large fraction of the database.

3. Note, however, that text databases are free of some of the costs of traditional databases. Although text database index processing can seem exorbitantly expensive in comparison to the cost of processing a query against, say, a file of bank account records, there is no equivalent in the text domain to the concept of join. All queries are to the same table and query evaluation has linear asymptotic complexity.
4. Fractional-bit codes such as those produced by arithmetic coding require less space, but are not appropriate for this application because they give relatively slow decompression.

5. The effectiveness of solutions of this kind depends on the overall design of the database system. Most current text database systems are implemented as some form of client-server architecture, with the data and server resident on one machine and, to simplify locking, with a single server process handling all queries and updates (perhaps via multiple threads) and communicating with multiple clients.

6. The array of document lengths is not strictly necessary. Instead of storing each document frequency as fd,t and storing the Wd values separately, it would be possible to store normalized frequencies fd,t/Wd in the inverted lists and dispense with the Wd array. However, such normalization is incompatible with compression and on balance degrades overall query evaluation time because of the increased disk traffic. Note that the array of Wd values can be compacted to a few bits per entry without loss of effectiveness [Moffat et al., 1994].
6 EMERGING APPLICATIONS

E. Bertino et al., Indexing Techniques for Advanced Database Systems. © Kluwer Academic Publishers 1997.

Because performance is a crucial issue in database systems, indexing techniques have always been an area of intense research and development. Advances in indexing techniques are primarily driven by the need to support different data models, such as the object-oriented data model, and different data types, such as image and text data. However, advances in computer architectures may also require significant extensions to traditional indexing techniques. Such extensions are required to fully exploit the performance potential of new architectures, as in the case of parallel architectures, or to cope with limited computing resources, as in the case of mobile computing systems. New application areas also play an important role in dictating extensions to indexing techniques and in offering wider contexts in which traditional techniques can be used. In this chapter we cover a number of additional topics, some of which are in an early stage of research. We first discuss extensions to index organizations required by advances in computer system architectures. In particular, in Section 6.1 we discuss indexing techniques for parallel and distributed database systems. We outline the main issues and present two techniques, based on B-trees and hashing, respectively. In Section 6.2 we discuss indexing techniques
for databases on mobile computing systems. In this section, we first briefly describe a reference architecture for mobile computing systems and then discuss two indexing approaches. Following those two sections, we focus on extensions required by new application areas. In particular, Section 6.3 and Section 6.4 discuss indexing issues for data warehousing systems and for the Web, respectively. Data warehousing and the Web are currently "hot" areas in the database field and have interesting requirements with respect to indexing organizations. We then conclude this chapter by discussing in Section 6.5 indexing techniques for constraint databases. Constraint databases are able to store and manipulate infinite relations and are therefore particularly suited to applications such as spatial and temporal applications.

6.1 Indexing techniques for parallel and distributed databases

Parallel and distributed systems represent a relevant architectural approach to efficiently supporting mission-critical applications requiring fast processing of very large amounts of data. The availability of fast networks, like 10 Mb/sec Ethernet or 100 Mb/sec to 1 Gb/sec Ultranet [Litwin et al., 1993a], makes it possible to process large volumes of data in parallel without any communication bottleneck. In a distributed or parallel database system, a set-oriented database object such as a relation may be horizontally partitioned and each partition stored at a database node. Such a node is called a store node for the data object [Choy and Mohan, 1996] and the number of nodes storing partitions of the data object is called the partitioning degree. Data are accessed from application programs and users residing on client nodes. A client node may or may not reside on the same physical node as a store node.
A query addressed to a given data object can be executed in parallel over the partitions into which the data object has been decomposed, thus achieving substantial performance improvements. In practice, however, efficient parallel query processing entails many issues, such as parallel join execution techniques, optimal processor allocation, and suitable indexing techniques. In particular, if indexing techniques are not designed properly, they may undermine the performance gains of parallel processing. Data structures for distributed and parallel database systems should satisfy several requirements [Litwin et al., 1993a]. Data structures should gracefully scale up with the partitioning degree. The addition of a new store node to a data object should not require extensive reorganization of the data structure. There should be no central node through which searches and updates to the data structure must go. Therefore, no central directories or similar notions should exist. Finally, maintenance operations on the data structure, like insertions or deletions, should not require updates to the client nodes.
In the remainder of this section, we present two data structures. The first is based on organizing the access structure on two levels. Given a query, the topmost global level is used to detect the nodes where data relevant to the query are stored; the lowest local level of the access structure is used to retrieve the actual data satisfying the query. There is one local level of the data structure for each partition node of the indexed data object. The second data structure is a distributed extension of the well-known linear hashing technique [Litwin, 1980]. This data structure does not require any global component. A query is sent by the client issuing the query to the store node that, according to the information the client has, contains the required data. If the data are not found at that store node, the query is forwarded by that node to the appropriate store node.

6.1.1 Two-tier indexing technique

Two simple approaches to indexing data in a distributed database can be devised based, respectively, on the notions of local index and global index [Choy and Mohan, 1996]. Under the first approach, a separate local index is maintained at each store node of a given data object. Therefore, each local index is maintained for the respective partition like a conventional index on a non-partitioned object. This approach requires a number of local indexes equal to the number of partitions. A key lookup requires sending the key value to all the local indexes to perform local searches. This approach is therefore convenient when qualifying records are found in most partitions. If, however, qualifying records are only found in a small fraction of partitions, this approach is very inefficient and in particular does not scale up for large numbers of partitions.
The main advantages of this approach are that no centralized structure exists, and updates are efficient because an update to a record in a partition only involves modifications to the local index associated with the partition. Under the global index approach, a single, centralized index exists that indexes all records in all partitions. This approach requires globally unique record identifiers (RIDs) to be stored in the index entries. Indeed, two different records in two different partitions may happen to have the same (local) RID and therefore, at a global level, a mechanism to uniquely identify such records must be in place. A simple approach is to concatenate each local RID with the partition identifier [Choy and Mohan, 1996]. The global index can be stored at any node and may be partitioned. The global approach allows the direct identification, without requiring useless local searches, of the records having a given key value. However, it has several disadvantages. First, remote updates are required whenever a partition is modified. Remote updates are expensive because of the two-phase commit protocols that must be applied whenever distributed transactions are performed. Second, a remote shared lock must be acquired on the index, whenever a partition is read, to ensure serializability. Third, the global index approach is not efficient for complex queries requiring the intersection or union of lists of RIDs returned by searches on different global indexes, if these global indexes are located at different sites. In such a case, long lists of RIDs must be exchanged among sites. Storing all the global indexes at the same site would not be a viable solution: the site storing all the global indexes would become a hot spot, thus reducing parallelism. An alternative approach, called the two-tier index, has been proposed [Choy and Mohan, 1996] to combine the advantages of the above two approaches. Under the two-tier index approach, a local index is maintained for each partition. An additional coarse global index is superimposed on the local indexes. Such a global index keeps, for each key value, the identifiers of the partitions storing records with that key value. The coarse global index is, however, optional. Its allocation may or may not be required by the database administrator, depending on the query patterns. The coarse global index may be located at any site and may be partitioned. An important requirement is that the overall index structure should be maintained consistent with respect to the indexed objects. Therefore, updates to any of the local indexes have to be propagated, if needed, to the coarse global index. However, compared to the global index approach, the two-tier index approach is much more efficient with respect to updates. Whenever a record having a key value v is removed from a partition, the coarse global index needs to be modified only if the removed record is the last one in its partition having v as key value.
By contrast, if other records with key value v are stored in the partition, the coarse global index need not be modified. Of course, the local index needs to be modified in both cases. Insertions are handled according to the same principle. Whenever a new record is inserted into a partition, the coarse global index needs to be modified only if the newly inserted record has a key value which is not already in the local index. Algorithms for efficient maintenance operations and locking protocols have also been proposed [Choy and Mohan, 1996]. With respect to query performance, the two-tier index approach has the same advantage as the global index approach. The coarse global index allows the direct identification of the partitions containing records with the searched key value. Then, the search is routed to the identified partitions, where the local indexes are searched to determine the records containing the key value. However, unlike the global index approach, the two-tier approach maximizes the opportunity for parallelism. Once the partitions are identified from the coarse global index, the search can be performed in parallel on the local indexes of
the identified partitions. In addition, the two-tier approach provides more opportunities for optimization. For example, if a search condition is not very selective with respect to the number of partitions, the coarse global index can be bypassed and the search request simply broadcasted to all the local indexes (as in the local index approach). It has been shown that the two-tier index represents a versatile and scalable indexing technique for use in distributed database systems [Choy and Mohan, 1996]. Many issues are still open to investigation. In particular, the two-tier index structure can be extended to a multi-tier index structure, where the index organization consists of more than two levels. Query optimization strategies and cost models need to be developed and analyzed.

6.1.2 Distributed linear hashing

The distributed linear hashing technique, also called LH*, has been proposed in a precise architectural framework. Basically, the availability of very fast networks makes it more efficient to retrieve data from the RAM of another processor than from a local disk [Litwin et al., 1993a]. A system consisting of hundreds, or even thousands, of processors interconnected by a fast network would be able to provide a large, distributed RAM store adequate for large amounts of data. By exploiting parallelism in query execution, such a system would be much more efficient than systems based on more traditional architectures. Such an architecture may be highly dynamic, with new nodes added as more storage is required. Therefore, there is a need for access structures for use in systems with a very large number of nodes, hundreds or thousands, able to scale gracefully. A given file, in such a system, may be shared by several clients. Clients may issue both retrieval and update operations. Distributed linear hashing has been proposed with the goal of addressing the above requirements.
An important feature of this organization is that it does not require any centralized directory and is rather efficient. It has been proved [Litwin et al., 1993a] that retrieval of a data item given its key value usually requires two messages, and four in the worst case. In the remainder of this section, we first briefly review the linear hashing technique and then discuss distributed linear hashing in more detail.

Linear hashing. Linear hashing organizes a file into a collection of buckets. The number of buckets increases linearly as the number of data items in the file grows. In particular, whenever a bucket b overflows, an additional bucket is allocated. Because of the dynamic bucket allocation, the hash function must be dynamically modified to be able to address the newly allocated buckets as well. Therefore, as in other hashing techniques, different hashing functions need to be
used because more bits of the hashed value are used as the address space grows. In particular, linear hashing uses the two functions h_i and h_{i+1}, i = 0, 1, 2, .... Function h_i generates addresses in the range (0, N × 2^i − 1), where N is the number of buckets that are initially allocated (N can also be equal to 1). A commonly used function [Litwin et al., 1993a] is:

h_i(C) = C mod (N × 2^i)

where C is the key value. Each bucket has a parameter called the bucket level, denoting which hash function, between h_i and h_{i+1}, must be used to address the bucket. Whenever a bucket overflows, a new bucket is added and a split operation is performed. However, the bucket which is split is usually not the bucket which generated the overflow. Rather, another bucket is split. The bucket to split is determined by a special parameter n, called the split pointer. Once the split is performed, the split pointer is properly modified. It always denotes the leftmost bucket which uses function h_i. Once a bucket is split, the bucket level of the two buckets involved in the split is incremented by one, thus replacing function h_i with h_{i+1} for these two buckets. Consider the example in Figure 6.1(a), adapted from [Litwin et al., 1993a]. In the example, we assume that N = 1. Suppose that the key value 145 is added. The insertion of such a key results in an overflow for the second bucket and in the addition of a third bucket. However, the bucket which is split is not the second one; it is the first one. Figure 6.1(b) illustrates the structure after the insertion and splitting. Note that a special overflow bucket is added to the second bucket to store the record with key value 145. Because n is equal to 0, the first bucket is split; the hash function to use for the first and third buckets (the newly allocated one) is h_2. Figure 6.1(c) illustrates the organization after the insertion of records with key values 6, 12, 360, and 18.
Those insertions do not cause any overflow. Suppose now that a record with key value 7 is inserted. Such an insertion results in an overflow for bucket 1. Because n is equal to 1, bucket 1 is split. Figure 6.1(d) illustrates the resulting organization. Note that the hash functions to use for the second and fourth buckets are now h_2. Because all buckets have the same local level, that is, 2, the split pointer is assigned 0. Retrieval of a record, given its key, is very efficient. It is performed according to the following simple algorithm (A1). Let C be the key to be searched; then

a ← h_i(C);
if a < n then a ← h_{i+1}(C).    (A1)
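The addressing rule (A1) and the split-pointer mechanics can be sketched as follows. This is an illustrative Python rendering with N = 1 and a hypothetical fixed bucket capacity; for simplicity, overflow records are kept in the same list rather than in separate overflow buckets.

```python
class LinearHashFile:
    """Sketch of linear hashing with N = 1 initial bucket and a
    hypothetical fixed bucket capacity."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # records per bucket before an overflow
        self.i = 0                # file level: h_i and h_{i+1} are in use
        self.n = 0                # split pointer: leftmost bucket using h_i
        self.buckets = [[]]       # N = 1 initial bucket

    def h(self, level, key):
        # h_i(C) = C mod (N * 2^i), with N = 1
        return key % (1 << level)

    def address(self, key):
        # algorithm (A1)
        a = self.h(self.i, key)
        if a < self.n:            # bucket a has already been split
            a = self.h(self.i + 1, key)
        return a

    def insert(self, key):
        b = self.address(key)
        self.buckets[b].append(key)   # overflow records stay in the same list
        if len(self.buckets[b]) > self.capacity:
            self.split()

    def split(self):
        # split the bucket denoted by n, not necessarily the one that
        # overflowed; the new bucket gets address n + 2^i
        old, self.buckets[self.n] = self.buckets[self.n], []
        self.buckets.append([])
        for k in old:
            self.buckets[self.h(self.i + 1, k)].append(k)
        self.n += 1
        if self.n >= (1 << self.i):   # every bucket is now at level i + 1
            self.n, self.i = 0, self.i + 1
```

Note that the bucket appended by split() receives index n + 2^i, exactly the address that h_{i+1} assigns to the records moved out of bucket n.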
Figure 6.1. Organization of a file under linear hashing.

Basically, the second step checks whether the bucket, obtained by applying function h_i to the key, has already been split. If so, the function h_{i+1} is to be used. The index i or i + 1 used for a bucket is the bucket level, whereas i + 1 is the file level.

LH*. In the distributed version of linear hashing, each bucket of the distributed file is actually the RAM of a node in the system. Therefore, the hash function returns identifiers of store nodes. Note that LH* could also be used if the data were stored on the disks of the various nodes rather than in RAM. However, LH* is particularly suited for systems with a very large number of nodes, as is the case when using RAM for storing a (large) database. Data stored at the various nodes are directly manipulated by clients. A client can perform searches or updates. Whenever a client issues an operation, for example a search, the first step to perform is the address calculation to determine the store node affected by the operation. Calculating such addresses requires, according to algorithm (A1), that the client be aware of the up-to-date values of n and i. Satisfying such constraints in an environment where there is a large number of clients and store nodes is quite difficult. Propagating those values, whenever they change, is not feasible given the large number of clients. Therefore, LH* does not require that clients have a consistent view of i and n. Rather, each client may have its own view of such parameters, and therefore each client may have an image of the file that may differ from the actual file.
Also, the image of a file that a client has may differ from the images other clients have. We denote by i' and n' the view that a client has of the file parameters i and n. The basic principle of LH* is to let a client use its own local parameters for computing the identifier of the node affected by the operation the client wishes to perform on the file. Therefore, the address calculation is performed
using algorithm (A1), with the difference that the client's local parameters are used. That is, the address is computed in terms of parameters i' and n' instead of i and n. The request is then forwarded to the store node whose address is returned by the address calculation step. Because a client may not have correct values for the file parameters, the store node may not be the correct one. An addressing error thus arises. In order to handle such an error, another basic principle is that each store node performs its own address calculation; this step is called the server address calculation. Note that each store node knows the level of the bucket it stores; however, it does not know the current value of n. The server address calculation is thus performed according to the following algorithm (A2). Let C be the key to be searched, let a be the address of store node s, and let j be the level of the bucket stored at s; then

a' ← h_j(C);
if a' ≠ a then
    a'' ← h_{j−1}(C);
    if a'' > a and a'' < a' then a' ← a''.    (A2)

The address a' returned by the above algorithm is the address of the store node to which the request should be forwarded if an addressing error has occurred. Therefore, whenever a store node receives a request, it performs its own address calculation. If the calculated address is its own address, the address calculated by the client is the correct one (therefore, the client has an up-to-date image of the file). If not, the server forwards the request to the store node whose address has been returned by the server address calculation, according to the above algorithm. The recipient of the forwarded operation checks the address again, by performing the server address calculation once more, and may perhaps forward the request to a third store node. It has, however, been formally proved [Litwin et al., 1993a] that the third recipient is the final one.
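Algorithms (A1) and (A2) can be rendered as two small functions. The sketch below is illustrative Python; it assumes N = 1, so that h_j(C) = C mod 2^j, and follows the text's notation (a is the receiving node's address, j the level of the bucket it stores).

```python
def h(level, key):
    # h_i(C) = C mod (N * 2^i), with N = 1
    return key % (1 << level)

def client_address(key, i_prime, n_prime):
    """Client-side address calculation (A1), using the client's possibly
    out-of-date image (i', n') of the file parameters."""
    a = h(i_prime, key)
    if a < n_prime:
        a = h(i_prime + 1, key)
    return a

def server_address(key, a, j):
    """Server-side address calculation (A2). Returns the node to which the
    request should be forwarded (a itself if no addressing error occurred)."""
    a1 = h(j, key)
    if a1 != a:
        a2 = h(j - 1, key)
        # avoid forwarding beyond the currently allocated store nodes
        if a < a2 < a1:
            a1 = a2
    return a1
```

Replaying the scenario of Figure 6.2(b), a client with i' = n' = 0 computes address 0; server_address(7, 0, 2) forwards the request to node 1, and server_address(7, 1, 2) forwards it again to node 3, which is the final recipient.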
Therefore, delivering the request to the correct store node requires forwarding the request at most twice. As a final step, a client image adjustment is performed by the store node first contacted by the client, if an addressing error occurred. The store node simply returns to the client its own values for i and n, so that the client image becomes closer to the actual image. To illustrate, consider the example in Figure 6.2(a). The example includes a client having 0 as the value for both n' and i'. Suppose that the client wishes to insert a new record with key value 7. The client address calculation returns 0 as the store node. The request is then sent to store node 0. Such store node
performs the address calculation according to algorithm (A2). The first step of the calculation returns 3 (as can be easily verified by computing 7 mod 4). Note, however, that sending the request to store node 3 would result in an error because there is no such store node. The check performed by the other steps of the algorithm prevents such a situation by generating the address of store node 1 (by applying function h_{j−1}). The request is then forwarded to store node 1. Store node 1 again performs the calculation. The calculation returns 1 and the record can therefore be inserted at store node 1.

Figure 6.2. Message exchanges in distributed linear hashing when performing insertion of a new key.

To illustrate a situation where two forwards are performed, consider the example in Figure 6.2(b), where four store nodes are allocated and each store node has a local level equal to 2. As in the above case, the request is forwarded from store node 0 to store node 1. Store node 1 performs the address calculation, which returns 3. The request is then forwarded again to store node 3, where the key is finally stored. Whenever an overflow occurs at one store node, a split operation must be performed. As for linear hashing, the store node to split is not necessarily the one where the overflow occurs. To determine the store node to split, the values of n and i must be known. One of the proposed approaches to splitting [Litwin
et al., 1993a] is based on maintaining such information at a fixed store node called the split coordinator. Whenever an overflow occurs at a store node, that node notifies the coordinator, which then starts the splitting of the proper node and calculates the new values for n and i as follows:

n ← n + 1;
if n ≥ 2^i then n ← 0, i ← i + 1.

Retrieval in LH* is extremely efficient. It takes a minimum of two messages (one for sending the request and the other for receiving the reply) and a maximum of four. The worst case, with a cost of four messages, arises when two forward messages are required. Extensive simulation experiments have shown, however, that the average performance is very close to the optimal performance. Other indexing techniques have also been proposed, as variations of the same principles of LH*, to support order-preserving indexing [Litwin et al., 1994] and multi-attribute indexing [Litwin and Neimat, 1996].

6.2 Indexing issues in mobile computing

Cellular communications, wireless LANs, radio links, and satellite services are rapidly expanding technologies. Such technologies will make it possible for mobile users to access information independently of their actual locations. Mobile computing refers to this new emerging technology extending computer networks to deal with mobile hosts, which retain their network connections even while moving. This kind of computation is expected to be very useful for mail-enabled applications, by which, using personal communicators, users will be able to receive and send electronic mail from any location, as well as be alerted about certain predefined conditions (such as a train being late or traffic conditions on a given route), irrespective of time and location [Imielinski and Badrinath, 1994]. The typical architecture of a mobile network (see Figure 6.3) consists of two distinct sets of entities: mobile hosts (MHs) and fixed hosts (FHs).
Some of the fixed hosts, called Mobile Support Stations (MSSs), are equipped with a wireless interface. By using such a wireless interface, a MSS is able to communicate with the MHs residing in the same cell. A cell is the area in which the signal sent by a MSS can be received by MHs. The diameter of a cell, as well as the available bandwidth, may vary according to the specific wireless technology. For example, the diameter of a cell spans from a few meters for infrared technology to 1 or 2 miles for radio or satellite networks. With respect to the bandwidth, LANs using infrared technology have transfer rates of the order of 1-2 Mb/sec, whereas WANs have poorer performance [Lee, 1989, Salomone, 1995]. The message sent by a MSS is broadcasted within a cell. The MHs filter the messages according to their destination address. On the other hand, MHs
located in the same cell can communicate only by sending messages to the MSS associated with that cell. MSSs are connected to the other FHs through a fixed network, used to support communication among cells. The fixed network is static, whereas the wireless network is mobile, since MHs may change their position (and therefore the cell in which they reside) over time.

Figure 6.3. Reference architecture of a mobile network.

MSSs provide commonly used application software, so that a mobile user can download the software from the closest MSS and run it on the palmtop, or execute it remotely on the MSS. Each MH is associated with a specific MSS, called the Home MSS. A Home MSS for a MH maintains specific information about the MH itself, such as the user profile, logic files, access rights, and user private files. The association between a MH and a MSS is replicated through the network. Additionally, a user may register as a visitor under some other MSSs. Thus, a MSS is responsible for keeping track of the addresses of the users who are currently residing in the cell supervised by the MSS itself. MHs can be classified into dumb terminals or walkstations [Imielinski and Badrinath, 1994]. In the first case, they are diskless hosts (such as, for instance,
palmtops) with reduced memory and computing capabilities. Walkstations are comparable to classical workstations, and can both receive and send messages on the wireless network. In any case, MHs are not usually connected to any direct power source; they run on small batteries and communicate on narrow-bandwidth wireless channels. The communication channel between a MSS and MHs consists of a downlink, by which information flows from the MSS to the MHs, and an uplink, by which information flows from the MHs to the MSS. In general, information can be acquired by a MH under two different modes:

• Interactive/On-demand: The client requests a piece of data on the uplink channel and the MSS responds by sending these data to the client on the downlink channel.

• Data broadcasting: Periodic broadcasting of data is performed by the MSS on the downlink channel. This type of communication is unidirectional. The MHs do not send any specific data requests to the MSS. Rather, they filter data coming from the downlink channel, according to user-specified filters.

In general, combined solutions are used. However, the most frequently demanded items will be periodically broadcasted, creating a sort of storage on the air [Imielinski et al., 1994a]. The main advantage of data broadcasting is that it scales well when the number of MHs grows, as its cost is independent of the number of MHs. The on-demand mode should be used for data items that are seldom required. The main problem of broadcasting is related to energy consumption. Indeed, MHs are in general powered by a battery. The lifetime of a battery is very short and is expected to increase only 20% over the next 10 years [Sheng et al., 1992]. When a MH is listening to the channel, the CPU must be in active mode to examine data packets. This operation is very expensive from an energy point of view, because often only a few data packets are of interest to a particular MH.
It is therefore important for the MH to run under two different modes:

• Doze mode: The MH is not disconnected from the network, but it is not active.

• Active mode: The MH performs its usual activities; when the MH is listening to the channel, it must be in active mode.

Clearly, an important issue is how to switch from doze mode to active mode in a clever way, so that energy dissipation is reduced without incurring a loss of information. Indeed, if a MH is in doze mode when the information of interest is being broadcasted, that information is lost by the MH.
Figure 6.4. MH and MSS interaction.

Approaches to reduce energy dissipation are therefore important for several reasons. First of all, they make it possible to use smaller and less powerful batteries to run the same applications for the same time. Moreover, the same batteries can also run for a longer time, resulting in a monetary saving. In order to develop such efficient solutions, allowing MHs to switch from doze mode to active mode and vice versa in a timely manner, indexing approaches have been proposed. In the next subsection, the general issues related to the development of an index structure for data broadcasting are described, whereas Subsection 6.2.2 illustrates some specific indexing data structures. The discussion follows the approaches presented in [Imielinski et al., 1994a].

6.2.1 A general index structure for broadcasted data

We assume, without loss of generality, that broadcasted data consist of a number of records identified by a key. Each MSS periodically broadcasts the file containing such data on the downlink channel (also called the broadcast channel). Clients receive the broadcasted data and filter them. Filtering is performed by a simple pattern-matching operation against the key value. Thus, clients remain in doze mode most of the time and tune in periodically to the broadcast channel, to download the required data (see Figure 6.4). To provide selective tuning, the server must broadcast, together with the data, a directory that indicates the points in time on the broadcast channel at which particular records are broadcasted. The first issue to address is how MHs access the directory. Two solutions are possible:

1. MHs cache a copy of the directory. This solution has several disadvantages. First of all, when MHs change the cell where they reside, the cached directory may no longer be valid and the cache must be refreshed.
This problem, together with the fact that broadcasted data can change between successive broadcasts, with a consequent change of the directory, may generate excessive traffic between clients and the server. Moreover, if many different files are broadcasted on different channels, the storage occupancy at the clients may become too high, and storage in MHs is usually a scarce resource.
2. The directory is broadcasted in the form of an index on the broadcast channel. This solution has several advantages. When the index is not used, the client, in order to filter the required data records, has to tune into the channel, on the average, for half the time it takes to broadcast the file. This is not acceptable, because the MH, in order to tune into the channel, must be in active mode, thus consuming scarce battery resources. Broadcasting the directory together with the data allows the MH to selectively tune into the channel, becoming active only when data of interest are being broadcasted.

Figure 6.5. A general organization for broadcasted data.

Because of the above reasons, broadcasting the directory together with the data is the preferred solution. It is usually assumed that only one channel exists. Multiple channels always correspond to a single channel with capacity equivalent to the combined capacity of the corresponding channels. Figure 6.5 shows a general organization for broadcasted data (including the directory). Each broadcasted version of the file, together with all the interleaved index information, is called a bcast. A bcast consists of a certain number of buckets, each representing the smallest unit that can be read by a MH (thus, a bucket is equivalent to the notion of a block for disk organizations). Pointers to specific buckets are specified as an offset from the bucket containing the pointer to the bucket to which the pointer points. The time to get the data pointed to by an offset s is given by (s − 1) × T, where T is the time to broadcast a bucket. Figure 6.6 shows the general protocol for retrieving broadcasted data:

1. The MH tunes into the channel and looks for the offset pointing to the next index bucket. During this operation, the MH must be in active mode.
A common assumption is that each bucket contains the offset to the next index bucket. Thus, this step requires only one bucket access. Let n be the determined offset.
2. The MH switches to doze mode until time (n − 1) × T. At that time, the MH tunes into the channel (thus, it is again in active mode) and, following a chain of pointers, determines the offset m, corresponding to the first bucket containing data of interest (with respect to the considered key value).

3. The MH switches to doze mode until time (m − 1) × T. At that time, the MH tunes into the channel (thus, it is again in active mode) and retrieves the data of interest.

Figure 6.6. The general protocol for retrieving broadcasted data.

In general, no new indexing structures are required to implement the previous protocol. Rather, existing data structures can be extended to efficiently support the new data organization. The main issues are therefore how to define efficient data organizations, that is, how data and index buckets must be interleaved, and which parameters to use in order to compare different data organizations. The considered parameters are the following:

• Access time: It is the average duration from the instant in which a client wants to access records with a specific key value to the instant when all the required records have been downloaded by the client. The access time is based on the following two parameters:

  Probe time: The duration from the instant in which a client wants to access records with a specific key value to the instant when the nearest index information related to the relevant data is obtained by the client.

  Bcast wait: The duration from the point when the index information related to the relevant data is encountered to the point when all the required records have been downloaded.
Note that if one parameter is reduced, the other increases.

• Tuning time: It is the time spent by a client listening to the channel. It thus measures the time during which the client is in active mode and therefore determines the power consumed by the client to retrieve the relevant data.

The use of a directory reduces the tuning time, increasing at the same time the access time. It is therefore important to determine a good bucket interleaving in order to obtain a good trade-off between access time (thus reducing the time the client has to wait for relevant data) and tuning time (thus reducing battery consumption). With respect to disk organizations, the tuning time corresponds to the access time, in terms of block accesses. However, the tuning time is fixed for each bucket, whereas the disk access time depends on the position of the head. There is no disk parameter corresponding to the access time. Finally, we recall that other indexing techniques, based on hash functions, have also been proposed [Imielinski et al., 1994b]. However, in the remainder of this chapter we do not consider such techniques.

6.2.2 Specific solutions to indexing broadcasted data

With respect to the general data organization proposed in Subsection 6.2.1, several specific indexing approaches have been proposed. In the following, we survey some of these approaches [Imielinski et al., 1994a, Imielinski et al., 1994b]. With respect to how parameters are chosen, index organizations can be classified into configurable indexes and non-configurable indexes. In the latter case, parameter values are fixed. In the former case, the organizations are parameterized: by changing the parameter values, the trade-off between the costs changes. This makes it possible to use the same organization to satisfy different user requirements. Index organizations can also be classified into clustered and non-clustered organizations.
In the first case, all records with the same value for the key attribute are stored consecutively in the file. Non-clustered organizations are often obtained from clustered organizations, by decomposing the file into clustered subcomponents. For this reason, in the following, we do not consider organizations for non-clustered files.

Non-configurable indexing. Non-configurable index organizations can be classified according to their behavior with respect to access and tuning time. An optimal strategy with respect to the access time can be obtained simply by not broadcasting the directory. On the other hand, an optimal strategy
with respect to the tuning time is obtained by broadcasting the complete index at the beginning of the bcast. Since in practice both access and tuning time are of interest, the above algorithms have only theoretical significance. Several intermediate solutions have therefore been devised. The (1,m) indexing [Imielinski et al., 1994a] is an index allocation method in which the complete index is broadcasted m times during a bcast (see Figure 6.7). All buckets have an offset to the beginning of the next index segment. The first bucket of each index segment has a tuple containing, in the first field, the attribute value of the record that was broadcasted last and, in the second field, an offset pointing to the beginning of the next bcast.

Figure 6.7. Bcast organization in the (1,m) indexing method.

The main problem of the (1,m) index organization is related to the replication of the index buckets. Distributed indexing [Imielinski et al., 1994a] is a technique in which the index is partially replicated (see Figure 6.8). Indeed, there is no need to replicate the complete index between successive data blocks. Rather, it is sufficient to make available only the portion of the index related to the data buckets which follow it. Thus, the distributed index, with respect to the (1,m) index, interleaves data buckets with the relevant index buckets only. Several distributed indices can be defined by changing the degree of replication [Imielinski et al., 1994a].

Figure 6.8. Bcast organization in the distributed indexing method.

The distributed index guarantees performance comparable to that of the optimal algorithms, with respect to both the access time and the tuning time.
Figure 6.9. Bcast organization in the flexible indexing method.

The (1,m) index has a good tuning time. However, due to the index replication, the access time is high.

Configurable indexing. Configurable index organizations are parameterized in such a way that, depending on the values of the parameters, the ratio between the access and tuning time can be modified. The first configurable index that has been proposed is called flexible indexing [Imielinski et al., 1994b]. In such an organization, data records are assumed to be sorted in ascending (or descending) order and the data file is divided into p data segments. It is assumed that each bucket contains the offset to the beginning of the next data segment. Depending on the chosen value for p, the trade-off between access time and tuning time changes. The first bucket of each data segment contains a control part, consisting of the control index, as well as some data records (see Figure 6.9). The control index is a binary index which helps locate the data buckets containing records with a given key value. Each index entry is a pair, consisting of a key value and an offset to a data bucket.

The control index is divided into two parts, the binary control index and the local index. The binary control index supports searches for keys preceding the ones stored in the current data segment and in the following ones. It contains ⌈log2 i⌉ tuples, where i is the number of data segments following the one under consideration. The first tuple of the binary control index consists of
the key of the first data record in the current data bucket and an offset to the beginning of the next bcast. The k-th tuple (k ≥ 2) consists of the key of the first data record of the (⌊i/2^(k-1)⌋+1)-th data segment, followed by the offset to the first data bucket of that data segment.

The local index supports searches inside the data segment in which it is contained. It consists of m tuples, where m is a parameter which depends on several factors, including the number of tuples a bucket can hold. The local index partitions the data segment into m+1 subsegments. Each tuple contains the key of the first data record of a subsegment and the offset to the first data bucket of that subsegment.

The access protocol is the following:

1. First, the offset of the next data segment is retrieved and the MH switches to doze mode.

2. The MH tunes in again at the beginning of the designated next data segment and performs the following steps:

• If the search key k is lower than the value contained in the first field of the first tuple of the binary control index, the MH switches to doze mode, waiting for the offset specified by the tuple, and again executes step (2).

• If the previous condition is not satisfied, the MH scans the other tuples of the binary control index, from top to bottom, until it reaches a tuple whose key value is lower than k. If such a tuple is reached, the MH switches to doze mode, waiting for the offset specified by the tuple, and again executes step (2).

• If the previous condition is not satisfied, the MH scans the local index to determine whether records with key value k are contained in the current data segment. If this search succeeds, the offset is used to determine the bucket in the current data subsegment from which the retrieval of the data records starts. The retrieval terminates when the last bucket of the searched subsegment is reached.
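The halving behavior of the binary control index can be sketched as follows (a simplified, hypothetical helper: it assumes the first key of every data segment is known and ignores the bucket layout; `next_hop` returning the current segment means the local index takes over):

```python
import bisect

def next_hop(key, seg_first_keys, current):
    """Decide which data segment the MH dozes to next.  seg_first_keys
    holds the first key of each of the p segments in broadcast order
    (ascending); returning `current` means: search the local index."""
    target = bisect.bisect_right(seg_first_keys, key) - 1
    if target < 0:
        target = 0              # key precedes everything: retry at the
                                # first segment of the next bcast
    if target == current:
        return current
    # only ceil(log2 i) forward pointers exist, so one hop at most
    # halves the number of remaining segments
    p = len(seg_first_keys)
    dist = (target - current) % p
    hop = 1
    while hop * 2 <= dist:
        hop *= 2
    return (current + hop) % p

# reaching segment 6 (key 65) from segment 0 takes at most log2(8) hops:
keys = [0, 10, 20, 30, 40, 50, 60, 70]
pos, hops = 0, 0
while (nxt := next_hop(65, keys, pos)) != pos:
    pos, hops = nxt, hops + 1
```

Each doze-and-tune iteration of step (2) corresponds to one call of `next_hop`, so the number of probes grows logarithmically in the number of segments skipped.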
6.3 Indexing techniques for data warehousing systems

Recent years have witnessed an increasing interest in database systems able to support efficient on-line analytical processing (OLAP). OLAP is a crucial element of decision support systems, in that essential decisions are often taken on the basis of information extracted from very large amounts of data. In most cases, such data are stored in different, possibly heterogeneous, databases. Examples of typical queries are [Chauduri and Dayal, 1996]:
• What are the sales volumes by region and product category for the last year?

• How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?

Because the requirements of OLAP applications are quite different from those of traditional, transaction-oriented applications, specialized systems, known as data warehousing systems, have been developed to effectively support these applications. A data warehouse is a large, special-purpose database containing data integrated from a number of independent sources and supporting users in analyzing the data for patterns and anomalies [O'Neil and Quass, 1997]. With respect to traditional database systems, historical data, and not only current data values, must be stored in a data warehouse. Moreover, data are updated off-line and therefore no transactional issues are relevant here. By contrast, typical OLAP queries are rather complex, often involving several joins and aggregation operations. OLAP queries are in most cases "ad-hoc" queries, as opposed to the repetitive transactions typical of traditional applications. It is therefore important to develop sophisticated, complex indexing techniques to provide adequate performance, also exploiting the fact that the update cost of indexing structures is not a crucial problem.

A possible approach to efficiently processing OLAP queries is to use materialization techniques to precompute queries. This approach has the main inconvenience that precomputing all possible queries along all possible dimensions is not feasible, especially if there is a very large number of dynamically varying selection predicates. Therefore, even though more frequent queries may be precalculated, techniques are required to efficiently execute non-precalculated queries.
In the remainder of this section, we first briefly review logical data organizations in data warehousing systems and exemplify typical OLAP queries. We then discuss a number of techniques supporting efficient query execution in data warehousing systems. Some of those techniques, namely the join index and the domain index, were initially developed for traditional DBMSs. They have, however, recently found a relevant application scope in data warehousing systems. Other techniques, namely bitmap and projection indexes, have been specifically developed for data warehousing systems. Some of them have been incorporated in commercial systems [Edelstein, 1995, French, 1995]. Another relevant technique, which we do not discuss here, is the bit-sliced index, whose aim is the efficient computation of aggregate functions. We refer the reader to [O'Neil and Quass, 1997] for a description of that technique.
6.3.1 Logical data organization

In a data warehouse, data are often organized according to a star schema approach. Under this approach, for each group of related data there exists a central fact table, also called the detail table, and several dimension tables. The fact table is usually very large, whereas each dimension table is usually smaller. Every tuple (fact) in the fact table references a tuple in each of the dimension tables, and may have additional attributes. References from the fact table to the dimension tables are modeled through the usual mechanism of external keys (foreign keys). Therefore, each tuple in the fact table is related to one tuple from each of the dimension tables. Vice versa, each tuple from a dimension table may be related to more than one tuple in the fact table. Dimension tables may, in turn, be organized into several levels. A data warehouse may contain additional summary tables containing pre-computed aggregate information.

As an example, consider a (classical) example of data concerning product sales [O'Neil and Quass, 1997]. Such data are organized around a central fact table, called Sales, and the following dimension tables: Time, containing information about the dates of the sales; Product, containing information on the products sold; and finally, Customer, containing information about the customers involved in the sales. The schema is graphically represented in Figure 6.10. Alternative schema organization approaches exist, including the snowflake schema and the fact constellation schema [Chauduri and Dayal, 1996]. The following discussion is, however, quite independent of the specific schema approach adopted.

Many typical OLAP queries are based on placing restrictions on the dimension tables that result in restrictions on the tuples of the fact table. As an example, consider the query asking for all sales of products with price higher than $50,000, from customers residing in California, during July 1996.
This type of query is often referred to as a star-join query because it involves the join of a central fact table with several dimension tables. Another important characteristic of OLAP queries is that aggregates must often be computed on the results of a star-join query, and aggregate functions may also be involved in selecting relevant groups of tuples. An example of a query including aggregate calculation is the query asking for the total dollar sales that were made for a brand of products during the past 4 weeks to customers residing in New England [O'Neil and Quass, 1997].

6.3.2 Join index and domain index

The join index technique [Valduriez, 1987] aims at optimizing relational joins by precalculating them. This technique is optimal when the update frequency
is low. Because in OLAP applications joins are very frequent and the update frequency is low, the join index technique can be profitably used here.

Figure 6.10. An example of a star-schema database with a central fact table (SALES) and several dimension tables.

There are several variations of the join index. The basic one is the binary join index, which is formally defined as follows: given two tables R and S, and attributes A and B, respectively from R and S, a binary equijoin index is

BJI = {(ri, sk) | ri.A = sk.B}

where ri (sk) denotes the row identifier (RID) of a tuple of R (S), and ri.A (sk.B) denotes the value of attribute A (B) of the tuple whose RID is ri (sk). Note that comparison operators different from equality can be used in a join index. However, because most joins in OLAP queries are equijoins on external keys, we restrict our discussion to the binary join index. Moreover, in some variants of the join index technique, the primary key values of tuples in one table can be used instead of the RIDs of these tuples. A BJI can be implemented as a binary relation, and two copies may be kept, one clustered on RIDs of R and the other clustered on RIDs of S. A BJI may also include the actual values of the join columns, thus resulting in a set of triples {(ri.A, ri, sk) | ri.A = sk.B}. This alternative is useful when, given a value of the join column, the tuples from R and from S that join with that value must be determined.
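A BJI and its use can be sketched as follows (hypothetical in-memory layout: tables are dicts from RIDs to attribute dicts, standing in for the two disk-clustered copies a real system would keep):

```python
def binary_join_index(R, S, a, b):
    """Precompute BJI = sorted {(ri, sk) | R[ri][a] == S[sk][b]}."""
    by_value = {}
    for sk, srow in S.items():
        by_value.setdefault(srow[b], []).append(sk)
    return sorted((ri, sk)
                  for ri, rrow in R.items()
                  for sk in by_value.get(rrow[a], []))

def join_via_index(R, S, bji):
    """With the BJI in place, the equijoin degenerates to RID lookups."""
    return [(R[ri], S[sk]) for ri, sk in bji]
```

At query time no join computation is performed: the precomputed RID pairs are simply dereferenced, which is why the technique pays off when joins are frequent and updates are rare.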
Join indexes are particularly suited to relating a tuple from a given dimension table to all the tuples in the fact table. For example, suppose that a join index is allocated on relations Sales and Customer for the join predicate Customer.customer_id = Sales.customer_id. Such a join index would list, for each tuple of relation Customer (that is, for each customer), the RIDs of the tuples of Sales verifying the join predicate (that is, the sales of the customer). Join indexes may also be extended to support precomputed joins along several dimensions [Chauduri and Dayal, 1996].

Another relevant generalization of the join index notion is represented by the domain index. A domain index is defined on a domain (for example, the zip code) and it may index tuples from several tables. It associates with a value of the domain the RIDs of the tuples, from all the indexed tables, having this value in the indexed column. Therefore, a domain index may support equality joins among any number of tables in the set of indexed tables.

6.3.3 Bitmap index

In a traditional index, each key value is associated with the list of RIDs of the tuples having this value for the indexed column. RID lists can be quite long. Moreover, when using multiple indexes for the same table, intersection, union or complement operations must be performed on such lists. Therefore, alternative, more efficient implementations of RID lists are relevant. The notion of a bitmap index has been proposed as an efficient implementation of RID lists. Basically, the idea is to represent the list of RIDs associated with a key value through a vector of bits. Such a vector, usually referred to as a bitmap, has a number of elements equal to the number of tuples in the indexed table. Each tuple in the indexed table is assigned a distinct, unique bit position in the bitmap; this position is called the ordinal number of the tuple in the relation.
Different tuples have different bit positions, that is, different ordinal numbers. The ith element of the bitmap associated with a key value is equal to 1 if the tuple whose ordinal number is i has this value for the indexed column; it is equal to 0 otherwise. Figure 6.11 presents an example of a bitmap index entry for an index allocated on the column package_type of relation Product. Because the Product relation has 150 tuples, the bitmap consists of 150 bits. Consider the entry related to key value A: the bitmap contains 1 in position 1 to denote that the tuple whose ordinal number is 001 has this value for the indexed column. By contrast, the bitmap contains 0 in position 2 to denote that the tuple whose ordinal number is 002 does not have this value for the indexed column.

The bitmap representation is very efficient when the number of key values in the indexed column is low (as an example, consider a column sex of a table
Person having only two values: Female and Male) [O'Neil and Quass, 1997].

Figure 6.11. An example of a bitmap index entry.

In such a case, the number of 0's in each bitmap is not high. By contrast, when the number of values in the indexed column is very high, the number of 1's in each bitmap is quite low, thus resulting in sparsely populated bitmaps. Compression techniques must then be used. The main advantage of bitmaps is that they yield a significant improvement in processing time, because operations such as intersection, union and complement of RID lists can be performed very efficiently using bit arithmetic. Operations required to compute aggregate functions, typically counting the number of RIDs in a list, are also performed very efficiently on bitmaps. Another important advantage of bitmaps is that they are suitable for parallel implementation [O'Neil and Quass, 1997].

Note that the bitmap representation can be combined with the join index technique, thus resulting in a bitmap join index [O'Neil and Graefe, 1995]. An entry in a bitmap join index, allocated on a fact table and a dimension table, will associate the RID of a tuple t from the dimension table with the bitmap of
the tuples in the fact table that join with t. Figure 6.12 presents an example of a bitmap join index.

Figure 6.12. An example of a bitmap join index entry.

6.3.4 Projection index

The projection index is an access structure whose aim is to reduce the cost of projections. The basic idea of this technique is as follows. Consider a column C of a table T. A projection index on C consists of a vector having a number of elements equal to the cardinality of T. The ith element of the vector contains the value of C for the ith tuple of T. This technique is thus based, as is the bitmap representation, on assigning ordinal numbers to tuples in tables. Determining the value of column C for a tuple, given the ordinal number of
this tuple, is very efficient. It only requires accessing the ith entry of the vector. When the values have a fixed length, the secondary storage page containing the relevant vector entry is determined by a simple offset calculation. Such a calculation is a function of the number of vector entries that can be stored per page and the ordinal number of the tuple. When the values have varying lengths, alternative approaches are possible. A maximum length can be fixed for the values. Alternatively, a B-tree can be used, having as key values the ordinal numbers of tuples and associating with each ordinal number the corresponding value of column C. Figure 6.13 presents an example of a projection index.

Figure 6.13. An example of a projection index.

Projection indexes are very useful when very few columns of the fact table must be returned by the query and the tuples of the fact table are very large or not well clustered. For typical OLAP queries, projection indexes are best used in combination with bitmap join indexes. Recall that a typical query restricts the tuples in the fact table through selections on the dimension tables. The ordinal numbers of the fact tuples satisfying the restrictions on the dimension tables are retrieved from the bitmap join indexes. By using these ordinal numbers, projection indexes can then be accessed to perform the actual projection. Note that the actual tuples of the fact table need not be accessed at all.
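The combination just described can be sketched end-to-end (hypothetical table and function names; a Python integer stands in for each bit vector):

```python
def bitmap_index(rows, column):
    """key value -> bitmap; bit i is set iff the tuple whose ordinal
    number is i has that value in the indexed column."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

def projection_index(rows, column):
    """ordinal number -> value of `column` (a plain vector)."""
    return [row[column] for row in rows]

sales = [                         # fact table, in ordinal order
    {"product_id": 120, "unit_sales": 50},
    {"product_id": 122, "unit_sales": 20},
    {"product_id": 120, "unit_sales": 30},
    {"product_id": 130, "unit_sales": 70},
]
by_product = bitmap_index(sales, "product_id")
units = projection_index(sales, "unit_sales")

# select tuples for product 120 OR 122 with bit arithmetic only:
hits = by_product[120] | by_product[122]
ordinals = [i for i in range(len(sales)) if hits >> i & 1]
selected = [units[i] for i in ordinals]   # projection, no fact access
```

The union of the two RID lists is a single bitwise OR, and the projection step reads only the compact vector of `unit_sales` values, so the wide fact tuples themselves are never touched.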
6.4 Indexing techniques for the Web

In the past five years, the World Wide Web has completely reshaped the world of communication, computing and information exchange. By introducing graphical user interfaces and an intuitively simple concept of navigation,
the Web facilitated access to the Internet, which for about ten years had been restricted to a few universities and research laboratories. The appearance of advanced navigation tools like Netscape and Microsoft Explorer made it easy for everyone on the Internet to roam, browse and contribute to the Web information space.

With the rapid explosion of the amount of data available through the Internet, locating and retrieving relevant information becomes more difficult. To facilitate the retrieval of information, many Internet providers (for example, stock markets, private companies, universities) offer users the possibility of using so-called search engines, which facilitate the search process. Search engines offer a simple interface for query formulation and refinement, and a wide range of search options and result reporting.

Moreover, with the growth of data on the Web, a number of special services have appeared on the Internet whose major goal is searching through many different information sources. Even the raw information they return to users becomes the starting point for the retrieval of relevant information (for example, e-mail addresses, phone numbers, Frequently Asked Questions files). Popular general-purpose searching tools, such as Altavista (http://www.altavista.com/), Webcrawler (http://www.webcrawler.com), InfoSeek (http://www.infoseek.com/) and Excite (http://www.excite.com/), have become indispensable in the toolkit of everybody working with Internet information sources. Internet technology poses some specific requirements on these tools, both in terms of time and space. Some indexing techniques used in standard text databases were adapted to meet those requirements.
Also, several new approaches were developed to overcome some limitations of the standard techniques. In the remainder of this section we present a short overview and classification of the indexing methods used in some Internet information systems, such as WAIS, Gopher and Archie, which became popular in the late 80s and early 90s. Then we discuss some problems related to search engines on the Web. We conclude the section with a brief overview of the main ideas underlying Internet spiders, which combine indexing and navigation techniques on the Web.

6.4.1 WAIS, Gopher, Archie, Whois++

The importance of searching the information available through the Internet was realized by the Internet community from the very first years. Searching and retrieval tools grew in both quantity and quality together with the growth of the Internet itself. Such popular tools as Archie, Gopher, Whois and WAIS [Bowman et al., 1994, Cheong, 1996] represented a good starting point for a new generation of Internet searching tools. Archie is a tool which searches for relevant information in a distributed collection of FTP sites.2 Gopher is a distributed information system which makes available hierarchical campus-
wide data collections and provides a simple text search interface. Whois (and its advanced version Whois++) is a popular tool to query Internet sources about people and other entities (for example, domains, networks, and hosts). WAIS (Wide Area Information Server) is a distributed service with a simple natural-language interface for looking up information in Internet databases.

The indexing techniques used in those tools are quite different. In particular, the various tools can be classified into three groups [Bowman et al., 1994] depending on the amount of information which is included in the indexes.

The first group includes tools which have very space-efficient indexes, but only represent the names of the files or menus they index. For example, Archie and Veronica index the file and menu names of FTP and Gopher servers. Because these indexes are very compact, a single index is able to support advanced forms of search. Yet, the range of queries that can be supported by these systems is limited to file names only, and content-based searches are possible only when the names happen to reflect some of the contents.

The second group includes systems providing full-text indexing of data located at individual sites. For example, a WAIS index records every keyword in a set of documents located at a single site. Similar indexes are available for individual Gopher and WWW servers.

The third group includes systems adopting solutions which are a compromise between the approaches adopted by the systems in the other two groups. Systems in the third group represent some of the contents of the objects they index, based on selection procedures for including important keywords or excluding less important keywords. For example, Whois++ indexes templates that are manually constructed by site administrators wishing to describe the resources at their sites.
6.4.2 Search engines

The two main types of search against text files are based on sequential searching and inverted indexes. Sequential search works well only when the search is limited to a small area. Most pattern-based search tools, like Unix's grep, use sequential search. Inverted indexes (see Chapter 5 for an extensive presentation) are a common tool in information retrieval systems [Frakes and Baeza-Yates, 1992]. An inverted index stores in a table all word occurrences in the set of indexed documents and indexes the table using a hash method or a B-tree structure. Inverted indexes are very efficient with respect to query evaluation but have a storage occupancy which, in the worst case, may equal the size of the original text. To reduce the size of the table storing the word occurrences, advanced inverted indexes use the trie indexing method [Mehlhorn and Tsakalidis, 1990], which stores together words with common
initial characters (like "call" and "capture"). Moreover, the use of various compression methods allows the index size to be reduced to 10%-30% of the text size (see Chapter 5).

Another drawback of standard inverted indexes is that their basic data structure requires the exact spelling of the words in the query. Any misspelling (for example, when typing "Bhattacharya" or "Clemençon") would result in an empty result set. To find the correct spelling, users must try different possibilities by hand, which is frustrating and time consuming. An example of a search engine which allows word misspelling is Glimpse [Manber and Wu, 1994]. Glimpse is based on the agrep search program [Wu and Manber, 1992], which is similar in use to Unix's grep. Essentially, Glimpse is a hybrid between the sequential search and inverted index techniques. It is index-based but uses sequential search (the agrep program) for approximate matching when the search area is small. To handle possible word misspellings, it allows a specified number of errors, which can be insertions, deletions or substitutions of characters in a word. It also supports wild cards, regular expressions and Boolean queries like OR and AND. In most cases, Glimpse requires a very small index, 2%-4% of the original text. However, the cost of the combination of indexing and sequential search is a longer response time. For most queries, the search in Glimpse takes 3-15 seconds. Such a response time is unacceptable for classical database applications but is quite tolerable in most personal applications, like navigation through the Web.

Intensive development of different techniques for indexing Web documents has resulted in the appearance of a number of advanced search engines. They offer a wide list of features for query formulation and provide a small index size along with fast response time.
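A minimal sketch of this hybrid (hypothetical function names; the sequential scan over the vocabulary stands in for agrep's approximate matching):

```python
def build_inverted_index(docs):
    """word -> sorted list of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def edit_distance(a, b):
    """Insertions, deletions and substitutions (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lookup(index, word, errors=0):
    """Exact lookup is a hash probe; with errors > 0 fall back to a
    Glimpse-style sequential scan of the (small) vocabulary."""
    if errors == 0:
        return index.get(word, [])
    hits = set()
    for w, ids in index.items():
        if edit_distance(w, word) <= errors:
            hits.update(ids)
    return sorted(hits)
```

The exact path costs one hash probe; the approximate path is linear in the vocabulary size, which mirrors Glimpse's trade of a tiny index for a longer response time.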
However, building metasearchers which provide unified query interfaces to multiple search engines is still a hard task. This is because most search engines are largely incompatible. They propose different query languages and use secret algorithms for ranking documents, which makes it hard to merge data from different sources. Moreover, they do not export enough information about a source's contents, which could be helpful for better query evaluation. All these problems have led to the Stanford protocol proposal for Internet retrieval and search (STARTS) [Gravano et al., 1997]. This proposal is a group effort involving 11 companies and organizations. The protocol addresses and analyzes metasearch requirements and describes the facilities that a source needs to provide in order to help a metasearcher. If implemented, STARTS can significantly streamline the implementation of metasearchers, as well as enhance the functionality they can offer.
6.4.3 Internet spiders

Users usually navigate through the Web to find information and resources by following hypertext links. As the Web continues to grow, users may need to traverse more and more links to locate what they are looking for. Indexing tools like search engines only help when searching a single site or a predefined set of sites. Therefore, a new family of programs, often called Web robots or spiders, has been developed with the aim of providing more powerful search facilities. Web spiders combine browsing and indexing [Cheong, 1996]. They traverse the Web space by following hypertext links and retrieve and index new Web documents. The most well-known Internet spiders are the WWW Worm, WebCrawler and Harvest.

The World Wide Web Worm (http://wwww.cs.colorado.com/wwww/) was the first widely used Internet spider. It navigates through Web pages and builds an index of the titles and hypertext links of over 100,000 Web documents. It provides users with a search interface. Similarly to the systems in the first group in our classification, the WWW Worm does not index the content of documents.

WebCrawler (http://www.webcrawler.com/) is a resource discovery tool which is able to speedily search for resources on the Web. It is able to build indexes on Web documents and to automatically navigate on demand. WebCrawler uses an incomplete breadth-first traversal to create an index (on both titles and data content) and relies on an automatic navigation mechanism to find the rest of the information.

The Harvest project [Bowman et al., 1995] addresses the problem of how to make effective use of Web information in the face of rapid growth in data volume, user base and data diversity. One of the Harvest goals is to coordinate retrieval of information among a number of agents.
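The traverse-and-index loop shared by these spiders can be sketched as follows (an in-memory graph stands in for HTTP fetching; names are hypothetical):

```python
from collections import deque

def crawl(pages, start, limit=100):
    """Breadth-first spider over an in-memory 'web': pages maps
    url -> (title, [linked urls]).  Builds a WWW-Worm-style index of
    titles only; a real spider would fetch each page over HTTP and
    could index full content, WebCrawler-style."""
    index, seen, frontier = {}, {start}, deque([start])
    while frontier and len(index) < limit:
        url = frontier.popleft()
        title, links = pages[url]
        index[url] = title                 # index the retrieved page
        for link in links:                 # follow hypertext links
            if link in pages and link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```

Stopping at `limit` pages gives the "incomplete breadth-first traversal" behavior: the index covers a prefix of the reachable Web, and navigation finds the rest on demand.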
Harvest provides a very efficient means of gathering and distributing index information and supports the construction of very different types of indexes customized to each particular information collection. In addition, Harvest provides caching and replication support and uses Glimpse as a search engine.

6.5 Indexing techniques for constraint databases

The main idea of constraint languages is to state a set of relations (constraints) among a set of objects in a given domain. It is the task of the constraint satisfaction system (or constraint solver) to find a solution satisfying these relations. An example of a constraint is F = 1.8C + 32, where C and F are respectively the Celsius and Fahrenheit temperatures. The constraint defines the relation existing between F and C. Constraints have been used for different purposes; for example, they have been successfully integrated with logic programming
[Jaffar and Lassez, 1987]. The constraint programming paradigm is fully declarative, since it specifies computations by specifying how these computations are constrained. Moreover, it is very attractive, as constraints often represent the communication language of several high-level applications.

Even if constraints have been used in several fields, only recently has this paradigm been used in databases. Traditionally, constraints have been used to express conditions on the semantic correctness of data. Those constraints are usually referred to as semantic integrity constraints. Integrity constraints have no computational implications. Indeed, they are not used to execute queries (even if they can be used to improve execution performance); they are only used to check the database validity.

Constraints intended in a broader sense have lately been used in database systems. Constraints can be added to relational database systems at different levels [Kanellakis et al., 1995]. At the data level, they finitely represent infinite relational tuples. Different logical theories can be used to model different information. For example, the constraint X < 2 ∧ Y > 3, where X and Y are integer variables, represents the infinite set of tuples having the X attribute lower than 2 and the Y attribute greater than 3. A quantifier-free conjunction of constraints is called a generalized tuple, and the possibly infinite set of relational tuples it represents is called the extension of the generalized tuple. A finite set of generalized tuples is called a generalized relation. Thus, a generalized relation represents a possibly infinite set of relational tuples, obtained as the union of the extensions of the generalized tuples contained in the relation. A generalized database is a set of generalized relations.
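A generalized tuple and its extension test can be sketched as follows (hypothetical representation: a conjunction is a list of (variable, operator, constant) constraints):

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">": operator.gt, ">=": operator.ge}

def satisfies(point, gen_tuple):
    """Is the relational tuple `point` (a dict of variable bindings)
    in the extension of the generalized tuple, i.e. does it satisfy
    every constraint of the conjunction?"""
    return all(OPS[op](point[var], c) for var, op, c in gen_tuple)

def in_relation(point, gen_relation):
    """A generalized relation denotes the union of the extensions of
    its generalized tuples."""
    return any(satisfies(point, t) for t in gen_relation)

# the generalized tuple X < 2 AND Y > 3 from the text:
t = [("X", "<", 2), ("Y", ">", 3)]
```

The finite list `t` stands for the infinite set of integer tuples it denotes; membership is decided by evaluating the conjunction, never by enumerating the extension.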
When constraints are used to retrieve data, they restrict the search space of the computation, increasing the expressive power of simple relational languages by allowing arithmetic computations. Constraints are a powerful mechanism for modeling spatial [Paredaens, 1995, Paredaens et al., 1994] and temporal concepts [Kabanza et al., 1990, Koubarakis, 1994], where infinite information must often be represented. Consider for example a spatial database consisting of a set of rectangles in the plane. A possible representation of this database in the relational model is to have a relation R containing a tuple of the form (n, a, b, c, d) for each rectangle, where n is the name of the rectangle with corners (a, b), (a, d), (c, b) and (c, d). In the generalized relational model, rectangles can be represented by generalized tuples of the form (Z = n) ∧ (a ≤ X ≤ c) ∧ (b ≤ Y ≤ d), where X and Y are real variables. The latter representation is more suitable for a larger class of operations. Figure 6.14 shows the rectangles representing the extensions of the generalized tuples contained in a generalized relation r1 (white) and in a generalized relation r2 (shaded). r1 contains the following generalized tuples:
Figure 6.14. Relation r1 (white) and r2 (shaded).

r1,1 : 1 ≤ X ≤ 4 ∧ 1 ≤ Y ≤ 2
r1,2 : 2 ≤ X ≤ 7 ∧ 2 ≤ Y ≤ 3
r1,3 : 3 ≤ X ≤ 6 ∧ −1 ≤ Y ≤ 1.5.

r2 contains the following tuples:

r2,1 : −3 ≤ X ≤ −1 ∧ 1 ≤ Y ≤ 3
r2,2 : 5 ≤ X ≤ 6 ∧ −3 ≤ Y ≤ 0.

Usually, spatial data are represented using the linear constraint theory. Linear constraints have the form p(X1, ..., Xn) θ 0, where p is a linear polynomial with real coefficients in the variables X1, ..., Xn and θ ∈ {=, ≠, ≤, <, ≥, >}. This class of constraints is of particular interest. Indeed, a wide range of applications use linear polynomials. Moreover, linear polynomials have been investigated in various fields (linear programming, computational geometry) and therefore several techniques have been developed to deal with them [Lassez, 1990]. From a temporal perspective, constraints are very useful for representing situations that repeat infinitely in time; for example, a train leaving each day at the same time. In such cases, dense-order constraints are often used. Dense-order constraints are all the formulas of the form X θ Y or X θ c, where X, Y are variables, c is a constant and θ ∈ {=, ≠, ≤, <, ≥, >}. The domain D is a countably infinite set (for example, the rational numbers) with a binary relation which is a dense linear order.

It has been recognized [Kanellakis et al., 1995] that the integration of constraints in traditional databases must not compromise the efficiency of the system. In particular, constraint query languages should preserve all the good features of relational languages. For example, they should be closed and bottom-up evaluable. Constraint databases should also preserve the efficiency of relational databases. Thus, data structures for querying and updating constraint databases must be developed, with time and space complexities comparable to those of data structures for relational databases. The complexity of the various operations is expressed in terms of input-output (I/O) operations. An I/O operation reads or writes one block of data from or to disk. Other parameters are: B, the number of items (generalized tuples) that can be stored in one page; n, the number of pages needed to store N generalized tuples (thus, n = N/B); and t, the number of pages needed to store the T generalized tuples in the result of a query evaluation (thus, t = T/B). At least two constraint language features should be supported by index structures:

• ALL selection. It retrieves all generalized tuples contained in a specified generalized relation whose extension is contained in the extension of a given generalized tuple specified in the query (called the query generalized tuple). From a spatial point of view, such a selection corresponds to a range query.

• EXIST selection. It retrieves all generalized tuples contained in a specified generalized relation whose extension has a non-empty intersection with the extension of a query generalized tuple. Equivalently, it finds a generalized relation that represents all relational tuples, implicitly represented by the input generalized relation, that satisfy the query generalized tuple. From a spatial point of view, such a selection corresponds to an intersection query.

Consider for example the generalized tuples representing the objects presented in Figure 6.14.
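For rectangle-shaped generalized tuples and a half-plane query such as Y ≤ X − 1, the two selections can be sketched directly (the rectangle coordinates come from relation r1 above; the corner-testing logic is our illustration, not the book's index structure):

```python
# EXIST and ALL selections of boxes against the half-plane Y <= a*X + b.
# For a box [xl,xh] x [yl,yh] and an upward-sloping line (a > 0), the
# most favorable box corner is (xh, yl) and the least favorable is (xl, yh).

def exist_select(boxes, a, b):
    """Names of boxes whose extension intersects the half-plane Y <= a*X + b."""
    return [n for n, (xl, xh, yl, yh) in boxes.items() if yl <= a * xh + b]

def all_select(boxes, a, b):
    """Names of boxes whose extension is entirely inside Y <= a*X + b."""
    return [n for n, (xl, xh, yl, yh) in boxes.items() if yh <= a * xl + b]

# Relation r1 from Figure 6.14: (xl, xh, yl, yh) per tuple.
r1 = {
    "r1,1": (1, 4, 1, 2),
    "r1,2": (2, 7, 2, 3),
    "r1,3": (3, 6, -1, 1.5),
}
print(exist_select(r1, 1, -1))  # ['r1,1', 'r1,2', 'r1,3']
print(all_select(r1, 1, -1))    # ['r1,3']
```

The output matches the selections discussed in the text: every tuple of r1 intersects the half-plane, but only r1,3 is contained in it.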
The EXIST selection with respect to the query generalized tuple Y ≤ X − 1 and relation r1 returns all three generalized tuples r1,1, r1,2 and r1,3. The ALL selection with respect to the query generalized tuple Y ≤ X − 1 and relation r1 returns only the generalized tuple r1,3. As constraints support the representation of infinite information, data structures defined to index relations (such as B-trees and B+-trees [Bayer and McCreight, 1972, Comer, 1979]) cannot be used directly in constraint databases, since they rely on the assumption that the number of tuples is finite. For this reason, specific classes of constraints for which efficient indexing data structures can be provided must be determined. Due to the analogies between constraint databases and spatial databases, efficient indexing techniques developed for spatial databases can often be applied to (linear) constraint databases. Efficient data structures are usually
required to process queries in O(log_B n + t) I/O operations, use O(n) blocks of secondary storage, and perform insertions and deletions in O(log_B n) I/O operations (this is the case for B-trees and B+-trees). Note that all complexities are worst-case. For spatial problems, by contrast, data structures with optimal worst-case complexity have been proposed only for some specific problems, in general dealing with 1- or 2-dimensional spatial objects. Nevertheless, several data structures proposed for the management of spatial data behave quite well on average for different source data. Examples of such data structures are grid files [Nievergelt et al., 1984], various quad-trees [Samet, 1989], z-orders [Orenstein, 1986], hB-trees [Lomet and Salzberg, 1990a], cell-trees [Gunther, 1989], and various R-trees [Guttman, 1984, Sellis et al., 1987] (see Chapter 2). Symmetrically, in the context of constraint databases two different classes of techniques have been proposed, the first consisting of techniques with optimal worst-case complexity, and the second consisting of techniques with good average bounds. Techniques belonging to the first class apply to (linear) generalized tuples representing 1- or 2-dimensional spatial objects and often optimize only the EXIST selection. Techniques belonging to the second class can index more general generalized tuples by applying some approximation. In the following, both approaches are surveyed.

6.5.1 Generalized 1-dimensional indexing

In relational databases, the 1-dimensional searching problem on a relational attribute X is defined as follows: Find all tuples such that their X attribute satisfies the condition a1 ≤ X ≤ a2.
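The relational version of the problem is the one a B+-tree solves in O(log_B n + t) I/O operations. Its in-memory analogue can be sketched with a sorted array and binary search (our illustration; `bisect` plays the role of the tree descent, the slice the role of the leaf scan):

```python
# 1-dimensional range search a1 <= X <= a2 over a sorted key set:
# locate the boundaries in O(log n), then report the t answers.
import bisect

def range_search(sorted_keys, a1, a2):
    lo = bisect.bisect_left(sorted_keys, a1)    # first key >= a1
    hi = bisect.bisect_right(sorted_keys, a2)   # first key > a2
    return sorted_keys[lo:hi]

xs = sorted([7, 3, 1, 9, 4, 6])
print(range_search(xs, 3, 7))   # [3, 4, 6, 7]
```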
The problem of 1-dimensional searching on a relational attribute X can be reformulated in constraint databases as the problem of 1-dimensional searching on the generalized relational attribute X: Find a generalized relation that represents all tuples of the input generalized relation such that their X attribute satisfies the condition a1 ≤ X ≤ a2. A first trivial, but inefficient, solution to the generalized 1-dimensional searching problem is to add the query range condition to each generalized tuple. In this case, the new generalized tuples represent all the relational tuples whose X attribute is between a1 and a2. This approach introduces a high level of redundancy in the constraint representation. Moreover, several inconsistent generalized tuples (with empty extension) can be generated. A better solution can be defined for convex theories. A theory Φ is convex if the projection on each variable X of any generalized tuple defined using Φ is a single interval b1 ≤ X ≤ b2. This is true when the extension of the generalized tuple is a convex set. The dense-order theory and the real polynomial inequality constraint theory are examples of convex theories. The solution is
based on the definition of a generalized 1-dimensional index on X as a set of intervals, where each interval is associated with a set of generalized tuples and represents the value of the search key for those tuples. Thus, each interval in the index is the projection on the attribute X of a generalized tuple. By using this index, the determination of a generalized relation representing all tuples of the input generalized relation whose X attribute satisfies a given range condition a1 ≤ X ≤ a2 can be performed by adding the condition only to those generalized tuples whose associated interval has a non-empty intersection with a1 ≤ X ≤ a2. Insertion (deletion) of a given generalized tuple is performed by computing its projection and inserting (deleting) the obtained interval into (from) the set of intervals. From the previous discussion it follows that the generalized 1-dimensional indexing problem reduces to the dynamic interval management problem on secondary storage. Dynamic interval management is a well-known problem in computational geometry, with many optimal solutions in internal memory [Chiang and Tamassia, 1992]. Secondary storage solutions for the same problem are, however, non-trivial, even in the static case. In the following, we survey some of the proposed solutions for secondary storage.

Reduction to stabbing queries. A first class of proposals is based on the reduction of the interval intersection problem to the stabbing query problem [Chiang and Tamassia, 1992]. Given a set of 1-dimensional intervals, to answer a stabbing query with respect to a point x, all intervals that contain x must be reported. The main idea of the reduction is the following [Kanellakis and Ramaswamy, 1996]. Intervals that intersect a query interval fall into four categories (see Figure 6.15).
Categories (1) and (2) can easily be located by sorting all the intervals with respect to their left endpoint and using a B+-tree to locate all intervals whose first endpoint lies in the query interval. Categories (3) and (4) can be located by finding all data intervals which contain the first endpoint of the query interval. This search is a stabbing query. By regarding an interval [x1, x2] as the point (x1, x2) in the plane, a stabbing query reduces to a special case of the 2-dimensional range searching problem. Indeed, all points (x1, x2) corresponding to intervals lie above the line X = Y. An interval [x1, x2] belongs to a stabbing query with respect to a point x if and only if the corresponding point (x1, x2) is contained in the region of the plane represented by the constraint X ≤ x ∧ Y ≥ x. Such 2-sided queries have their corner on the line X = Y; for this reason, they are called diagonal corner queries (see Figure 6.16).
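The equivalence between the two formulations can be checked mechanically (a naive sketch of ours; the actual structures discussed below organize the points so the corner query takes O(log_B n + t) I/Os rather than a linear scan):

```python
# Each data interval [x1, x2] becomes the point (x1, x2) above the line
# X = Y; a stabbing query at x is then the diagonal corner query
# X <= x AND Y >= x over those points.

def stab_naive(intervals, x):
    """Intervals containing x, by direct test."""
    return [iv for iv in intervals if iv[0] <= x <= iv[1]]

def stab_as_corner_query(points, x):
    """Same answer via the 2-dimensional reduction: X <= x and Y >= x."""
    return [(x1, x2) for (x1, x2) in points if x1 <= x and x2 >= x]

intervals = [(1, 4), (2, 7), (3, 6), (-3, -1), (5, 6)]
points = list(intervals)        # the interval-to-point mapping is the identity
assert stab_naive(intervals, 3) == stab_as_corner_query(points, 3)
print(stab_as_corner_query(points, 3))  # [(1, 4), (2, 7), (3, 6)]
```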
Figure 6.15. Categories of possible intersections of a query interval with a database of intervals.

Figure 6.16. Reduction of the interval intersection problem to a diagonal-corner searching problem with respect to x.

The first data structure proposed to solve diagonal corner queries is the meta-block tree; it does not support deletions (it is semi-dynamic) [Kanellakis and Ramaswamy, 1996]. The meta-block tree is fairly complicated; it has optimal worst-case space O(n) and optimal I/O query time O(log_B n + t). Moreover, it has O(log_B n + (log_B^2 n)/B) amortized insert I/O time. A dynamic (thus, also supporting deletions) optimal solution to the stabbing query problem [Arge and Vitter, 1996] is based on the definition of an external memory version of the internal memory interval tree. The interval tree for internal memory is a data structure that answers stabbing queries and stores and updates a set of intervals in optimal time [Chiang and Tamassia, 1992]. It consists of a binary tree over the interval endpoints. Intervals are stored in secondary structures associated with internal nodes of the binary tree. The extension of this data structure to secondary storage entails two issues. First, the fan-out of nodes must be increased. The fan-out that has been chosen is √B [Arge and Vitter, 1996]. This fan-out makes it possible to store all the needed information in internal nodes, while increasing the height of the tree only by a factor of two. If the interval endpoints belong to a fixed set E, the binary tree is replaced by a balanced tree with branching factor √B over the endpoints in E. Each leaf represents B consecutive points of E. Segments are associated with nodes, generalizing the idea of the internal memory data structure.
However, since a node now contains more endpoints, more than two secondary structures are required to store the segments associated with a node. The main problem of the previous structure is that it requires the interval endpoints to belong to a fixed set. In order to remove this assumption, the weight-balanced B-tree has been
introduced [Arge and Vitter, 1996]. The main difference between a B-tree and a weight-balanced B-tree is that in the former, for each internal node, the number of children is fixed, while in the latter only the weight, that is, the number of items stored under each node, is fixed. The weight-balanced B-tree removes the assumption on the interval endpoints while still retaining optimal worst-case bounds for stabbing queries.

Revisiting Chazelle's algorithm. The solutions described above for stabbing queries in secondary storage are fairly complex and rely on reducing the interval intersection problem to special cases of the 2-dimensional range searching problem. A different and much simpler approach to the static (thus, not supporting insertions and deletions) generalized 1-dimensional searching problem [Ramaswamy, 1997] is based on an algorithm developed by Chazelle [Chazelle, 1986] for interval intersection in main memory; it uses only B+-trees, achieving optimal time and using linear space. The proposed technique relies on the following consideration. A straightforward method to solve a stabbing query consists of identifying the set of unique endpoints of the set of input intervals. Each endpoint is associated with the set of intervals that contain it. These sets can then be indexed using a B+-tree, taking endpoints as key values. To answer a stabbing query it is sufficient to look for the endpoint nearest to the query point, on the right, and examine the intervals associated with it, reporting those intervals that contain the query point. This method answers stabbing queries in O(log_B n) I/O operations. However, it requires O(n^2) space. It has been shown [Ramaswamy, 1997] that the space complexity can be reduced to O(n) by appropriately choosing the considered endpoints. More precisely, let e1, e2, ..., e2n be the ordered list of all endpoints. A set of windows W1, ..., Wp is constructed over boundary points w1 = e1, ..., wp+1 = e2n, such that Wj = [wj, wj+1], j = 1, ..., p. Thus, the windows partition the interval between e1 and e2n into p contiguous intervals. Each window Wj is associated with the list of intervals that intersect Wj. Window-lists can be stored in a B+-tree, using their starting points as key values. A stabbing query at a point x can be answered by searching for the query point and retrieving the window-list associated with the window it falls into. Each interval contained in this list is then examined, and only the intervals containing the query point are reported. Algorithms have been proposed [Ramaswamy, 1997] to construct the windows appropriately, so that queries can be answered by the previous algorithm in O(log_B n), using only O(n) pages.
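A simplified sketch of this windowing scheme follows (our code; the window boundaries here are supplied by hand rather than chosen by the space-optimizing algorithms of [Ramaswamy, 1997], and a sorted list stands in for the B+-tree over window starting points):

```python
# Windows partition [e1, e2n]; each window stores the intervals that
# intersect it. A stabbing query locates its window by binary search,
# then filters the window-list against the query point.
import bisect

def build_windows(intervals, boundaries):
    """boundaries: sorted points w1 < ... < w_{p+1}; window Wj = [wj, w_{j+1}]."""
    lists = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        lists.append([iv for iv in intervals if iv[0] <= hi and iv[1] >= lo])
    return lists

def stab(boundaries, lists, x):
    j = bisect.bisect_right(boundaries, x) - 1          # window containing x
    j = min(max(j, 0), len(lists) - 1)
    return [iv for iv in lists[j] if iv[0] <= x <= iv[1]]  # report exact hits

intervals = [(1, 4), (2, 7), (3, 6)]
boundaries = [1, 3, 7]              # two windows: [1,3] and [3,7]
lists = build_windows(intervals, boundaries)
print(stab(boundaries, lists, 2))   # [(1, 4), (2, 7)]
```

With well-chosen windows, each window-list stays small enough that the total storage is O(n) pages while the search touches a single list.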
6.5.2 Indexing 2-dimensional linear constraints

The approaches briefly illustrated in Subsection 6.5.1 rely on the assumption that index values are represented by intervals. Thus, they can index generalized tuples using information about only one variable. Less work has been done on techniques for 2-dimensional generalized tuples with optimal worst-case complexity. One such technique [Bertino et al., 1997] deals with index values represented by generalized tuples with two variables, say X and Y, having the form C1 ∧ ... ∧ Cn, where each Ci, i = 1, ..., n, has the form Ci ≡ Y θ aiX + bi, θ ∈ {≤, ≥}. Besides the application to different types of generalized tuples, the main difference of this technique with respect to the ones presented in Subsection 6.5.1 is that it solves not only the EXIST selection but also the ALL selection. In both cases, the query generalized tuple must represent a half-plane. The main novelty of the approach is the reduction of both the EXIST and ALL selection problems, under the above assumptions, to a point location problem from computational geometry [Preparata and Shamos, 1985]. The proof of this reduction is based on the transformation of the extension of generalized tuples from a primal plane to a dual plane. In particular, each generalized tuple is transformed into a pair of non-intersecting, but possibly touching, open polygons³ in the plane, whereas a half-plane Y θ aX + b, θ ∈ {≤, ≥}, is translated into the point (a, b). This translation satisfies an interesting property: the EXIST and ALL selection problems with respect to a half-plane query Y θ aX + b reduce to the point location problem of the point (a, b) with respect to the constructed open polygons.
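The primal-plane side of this duality is easy to test directly for a convex figure: the line Y = aX + b misses the interior of a convex polygon exactly when all its vertices lie weakly on one side of the line. The sketch below (ours, with an assumed square example; it is not the dual-plane structure of [Bertino et al., 1997]) illustrates that test:

```python
# Primal-plane test behind the duality: does the line Y = a*X + b
# avoid the interior of a convex polygon? Compute the signed vertical
# distance y - (a*x + b) at each vertex; the line misses the interior
# iff all signs are weakly non-negative or weakly non-positive.

def line_misses_interior(vertices, a, b):
    signs = [y - (a * x + b) for (x, y) in vertices]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]    # a convex extension
print(line_misses_interior(square, 1, 5))    # True: Y = X + 5 passes above it
print(line_misses_interior(square, 1, -1))   # False: Y = X - 1 cuts through it
```

In the dual plane, the same answer is obtained by locating the point (a, b) with respect to the pair of open polygons built for the tuple.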
In particular, it can be shown that the point (a, b) belongs to one of the open polygons constructed for a generalized tuple t iff the line Y = aX + b does not intersect the interior of the figure representing the extension of t (see Figure 6.17). Using this property, point location algorithms in the dual plane, equivalent to the EXIST and ALL selections in the Euclidean plane, have been proposed. The same open polygons have then been used to show that an optimal dynamic solution to the ALL and EXIST selection problems exists, using simple data structures such as B+-trees, if the angular coefficient of the line associated with the half-plane query belongs to a predefined set.

6.5.3 Filtering

To facilitate the definition of indexing structures for arbitrary objects in spatial databases, a filtering approach is often used. The same approach can be used in constraint databases to index generalized tuples with complex extensions.
Figure 6.17. (a) A polygon p representing the extension of a linear generalized tuple; (b) a pair of open polygons representing p in the dual plane, together with the points representing lines q1, q2, q3, q4 in the dual plane.

Under the filtering approach, an object is approximated by some other object having a simpler shape. The approximating objects are then used as index objects. The evaluation of a query under this approach consists of two steps, filtering and refinement. In the filtering step, an index is used to retrieve only objects that are potentially relevant to a given query; for this purpose, the approximating figures are used instead of the objects themselves. During the refinement step, the set of objects retrieved by the filtering step is tested directly against the query, to determine the exact result. Here, the main topic is the definition of "good" approximating objects, ensuring a specific degree of filtering. The use of the minimum bounding box (MBB) to filter objects is common in spatial databases. In 2-dimensional space, the MBB of a given object is the smallest rectangle that encloses the object and whose edges are perpendicular to the standard coordinate axes. This definition generalizes to higher dimensions in a straightforward manner. The filtering method based on MBBs is simple and has a number of advantages over index methods working directly on objects:

• It has a low storage cost, because only a small number of intervals is maintained in addition to each object.
• There is a clear separation between the complexity of the object geometry and the complexity of the search. Index structures for (multidimensional) intervals have better worst-case performance than index techniques working on arbitrary objects. Indeed, several index structures with close-to-optimal worst-case bounds for managing (multidimensional) intervals have been proposed (see Chapter 2), whereas similar structures have not yet been defined for arbitrary objects.

The filtering approach based on MBBs, even if appealing, has some drawbacks. In particular, it may be ineffective if the set of objects returned by the filtering step is too large, which means that there are too many intersecting MBBs. Moreover, it does not scale well to large dimensions. The issue of handling objects in spaces of large dimension is less crucial for spatial databases, where we can generally rely on a dimension of 3 or less, but it is critical for constraint databases. In order to improve the selectivity of filtering, an approach based on the notion of minimum bounding polybox has been proposed [Brodsky et al., 1996]. A minimum bounding polybox for an object O is the minimum convex polyhedron that encloses O and whose facets are normal to preselected axes. These axes are not necessarily the standard coordinate axes and, furthermore, their number is not determined by the dimension of the space. Algorithms for computing optimal axes (according to specific optimality criteria with respect to storage overhead or filtering rate) in d dimensions have also been proposed [Brodsky et al., 1996].

Notes

1. We assume that buckets are numbered starting from 0.
2. FTP is the Internet standard high-level protocol for file transfer.
3. An open polygon is a finite chain of line segments with the first and last segments approaching ∞.
An open polygon is upward (downward) open if both segments approach +∞ (−∞).
References

Abel, D. J. and Smith, J. L. (1983). A data structure and algorithm based on a linear key for a rectangle retrieval problem. International Journal of Computer Vision, Graphics and Image Processing, 24(1):1-13.
Abel, D. J. and Smith, J. L. (1984). A data structure and query algorithm for a database of areal entities. Australian Computing Journal, 16(4):147-154.
Achyutuni, K. J., Omiecinski, E., and Navathe, S. (1996). Two techniques for on-line index modification in shared-nothing parallel systems. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, pages 125-136.
Ang, C. and Tan, K. (1995). The interval B-tree. Information Processing Letters, 53(2):85-89.
Arge, L. and Vitter, J. (1996). Optimal dynamic interval management in external memory. In Proc. 37th Symposium on Foundations of Computer Science, pages 560-569.
Aslandogan, Y. A., Yu, C., Liu, C., and Nair, K. R. (1995). Design, implementation and evaluation of SCORE. In Proc. 11th International Conference on Data Engineering, pages 280-287.
Bancilhon, F. and Ferran, G. (1994). ODMG-93: The object database standard. IEEE Bulletin on Data Engineering, 17(4):3-14.
Banerjee, J. and Kim, W. (1986). Supporting VLSI geometry operations in a database system. In Proc. 3rd International Conference on Data Engineering, pages 409-415.
Bartels, D. (1996). ODMG-93 - The emerging object database standard. In Proc. 12th International Conference on Data Engineering, pages 674-676.
Bayer, R. and McCreight, E. (1972). Organization and maintenance of large ordered indices. Acta Informatica, 1(3):173-189.
Bayer, R. and Schkolnick, M. (1977). Concurrency of operations on B-trees. Acta Informatica, 9:1-21.
Beck, J. (1967). Perceptual grouping produced by line figures. Perception and Psychophysics, 2:491-495.
Becker, B., Gschwind, S., Ohler, T., Seeger, B., and Widmayer, P. (1993). On optimal multiversion access structures. In Proc.
3rd International Symposium on Large Spatial Databases, pages 123-141.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985a). Empirical comparison of associative file structures. In Proc. International Conference on Foundations of Data Organization, pages 315-319.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985b). An experiment with balanced and unbalanced k-d trees for associative retrieval. In Proc.
9th International Conference on Computer Software and Applications, pages 256-262.
Beckley, D. A., Evens, M. W., and Raman, V. K. (1985c). Multikey retrieval from k-d trees and quad trees. In Proc. 1985 ACM SIGMOD International Conference on Management of Data, pages 291-301.
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-tree: An efficient and robust access method for points and rectangles. In Proc. 1990 ACM SIGMOD International Conference on Management of Data, pages 322-331.
Belkin, N. and Croft, W. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38.
Bell, T., Moffat, A., Nevill-Manning, C., Witten, I., and Zobel, J. (1993). Data compression in full-text retrieval systems. Journal of the American Society for Information Science, 44(9):508-531.
Bell, T., Moffat, A., Witten, I., and Zobel, J. (1995). The MG retrieval system: Compressing for space and speed. Communications of the ACM, 38(4):41-42.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517.
Bentley, J. L. (1979a). Decomposable searching problems. Information Processing Letters, 8(5):244-251.
Bentley, J. L. (1979b). Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, 5(4):333-340.
Bentley, J. L. and Friedman, J. H. (1979). Data structures for range searching. ACM Computing Surveys, 11(4):397-409.
Berchtold, S., Keim, D., and Kriegel, H. (1996). The X-tree: An index structure for high-dimensional data. In Proc. 22nd International Conference on Very Large Data Bases, pages 28-39.
Bertino, E. (1990). Query optimization using nested indices. In Proc. 2nd International Conference on Extending Database Technology, pages 44-59.
Bertino, E. (1991a). An indexing technique for object-oriented databases. In Proc.
7th International Conference on Data Engineering, pages 160-170.
Bertino, E. (1991b). Method precomputation in object-oriented databases. In Proc. ACM-SIGOIS and IEEE-TC-OA International Conference on Organizational Computing Systems, pages 199-212.
Bertino, E. (1994). On indexing configuration in object-oriented databases. VLDB Journal, 3(3):355-399.
Bertino, E., Catania, B., and Shidlovsky, B. (1997). Towards optimal two-dimensional indexing for constraint databases. Technical Report TR-196-97, Dipartimento di Scienze dell'Informazione, University of Milano, Italy.
Bertino, E. and Foscoli, P. (1995). Index organizations for object-oriented database systems. IEEE Transactions on Knowledge and Data Engineering, 7(2):193-209.
Bertino, E. and Guglielmina, C. (1991). Optimization of object-oriented queries using path indices. In Proc. International IEEE Workshop on Research Issues on Data Engineering: Transaction and Query Processing, pages 140-149.
Bertino, E. and Guglielmina, C. (1993). Path-index: An approach to the efficient execution of object-oriented queries. Data and Knowledge Engineering, 6(1):239-256.
Bertino, E. and Kim, W. (1989). Indexing techniques for queries on nested objects. IEEE Transactions on Knowledge and Data Engineering, 1(2):196-214.
Bertino, E. and Martino, L. (1993). Object-Oriented Database Systems - Concepts and Architectures. Addison-Wesley.
Bertino, E. and Quarati, A. (1991). An approach to support method invocations in object-oriented queries. In Proc. International IEEE Workshop on Research Issues on Data Engineering: Transaction and Query Processing, pages 163-169.
Blanken, H., Ijbema, A., Meek, P., and Akker, B. (1990). The generalized grid file: Description and performance aspects. In Proc. 6th International Conference on Data Engineering, pages 380-388.
Bookstein, A., Klein, S., and Raita, T. (1992). Model based concordance compression. In Proc. IEEE Data Compression Conference, pages 82-91.
Bowman, C., Danzig, P., Hardy, D., Manber, U., and Schwartz, M. (1995). The Harvest information discovery and access system. Computer Networks and ISDN Systems, 28(1-2):119-125.
Bowman, C., Danzig, P., Manber, U., and Schwartz, M. (1994). Scalable internet discovery: Research problems and approaches. Communications of the ACM, 37(8):98-107.
Bratley, P. and Choueka, Y. (1982). Processing truncated terms in document retrieval systems. Information Processing & Management, 18(5):257-266.
Bretl, R., Maier, D., Otis, A., Penney, D., Schuchardt, B., Stein, J., Williams, E., and Williams, M. (1989). The GemStone data management system. In Object-Oriented Concepts, Databases, and Applications, pages 283-308. Addison-Wesley.
Brinkhoff, T., Kriegel, H.-P., Schneider, R., and Seeger, B. (1994). Multi-step processing of spatial joins. In Proc. 1994 ACM SIGMOD International Conference on Management of Data, pages 197-208.
Brodsky, A., Lassez, C., Lassez, J., and Maher, M. (1996). Separability of polyhedra and a new approach to spatial storage. In Proc. 14th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, pages 54-65.
Brown, E. (1995). Fast evaluation of structured queries for information retrieval. In Proc. 18th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 30-38.
Buckley, C. and Lewit, A. (1985). Optimization of inverted vector searches. In Proc. 8th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 97-110.
Burkowski, F. (1992). An algebra for hierarchically organized text-dominated databases. Information Processing & Management, 28(3):333-348.
Callan, J. (1994). Passage-level evidence in document retrieval. In Proc. 17th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 302-309.
Cattell, R. (1993). The Object Database Standard: ODMG-93 Release 1.2. Morgan Kaufmann Publishers.
Cesarini, F. and Soda, G. (1982). Binary trees paging. Information Systems, 7(4):337-344.
Chan, C., Goh, C., and Ooi, B. C. (1997). Indexing OODB instances based on access proximity. In Proc. 13th International Conference on Data Engineering, pages 14-21.
Chan, C. Y., Ooi, B. C., and Lu, H. (1992). Extensible buffer management of indexes. In Proc. 18th International Conference on Very Large Data Bases, pages 444-454.
Chang, J. M. and Fu, K. S. (1979). Extended k-d tree database organization: A dynamic multi-attribute clustering method. In Proc. 3rd International Conference on Computer Software and Applications, pages 39-43.
Chang, S. K. and Fu, K. S., editors (1980). Pictorial Information Systems. Springer-Verlag.
Chang, S. K. and Hsu, A. (1992). Image information systems: Where do we go from here? IEEE Transactions on Knowledge and Data Engineering, 4(5):431-442.
Chang, S. K., Jungert, E., and Li, Y. (1989). Representation and retrieval of symbolic pictures using generalized 2D strings. In Proc.
Visual Communications and Image Processing Conference, pages 1360-1372.
Chang, S. K., Shi, Q. Y., and Yan, C. W. (1987). Iconic indexing by 2-D strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(3):413-428.
Chang, S. K., Yan, C. W., Dimitroff, D. C., and Arndt, T. (1988). An intelligent image database system. IEEE Transactions on Software Engineering, 15(5):681-688.
Chaudhuri, S. and Dayal, U. (1996). Decision support, data warehousing, and OLAP (tutorial notes). In Proc. 22nd International Conference on Very Large Data Bases.
Chazelle, B. (1986). Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15(3):703-724.
Cheong, C. (1996). Internet agents. New Riders - Macmillan Publishing.
Chiang, Y. and Tamassia, R. (1992). Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412-1434.
Chiu, D. K. Y. and Kolodziejczak, T. (1986). Synthesizing knowledge: A cluster analysis approach using event-covering. IEEE Transactions on Systems, Man and Cybernetics, 16(2):462-467.
Choenni, S., Bertino, E., Blanken, H., and Chang, T. (1994). On the selection of optimal index configuration in OO databases. In Proc. 10th International Conference on Data Engineering, pages 526-537.
Choueka, Y., Fraenkel, A., and Klein, S. (1988). Compression of concordances in full-text retrieval systems. In Proc. 11th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 597-612.
Choy, D. and Mohan, C. (1996). Locking protocols for two-tier indexing of partitioned data. In Proc. International Workshop on Advanced Transaction Models and Architectures, pages 198-215.
Chua, T. S., Lim, S. K., and Pung, H. K. (1994). Content-based retrieval of segmented images. In Proc. 2nd ACM Multimedia Conference, pages 211-218.
Chua, T. S., Tan, K. L., and Ooi, B. C. (1997). Fast signature-based color-spatial image retrieval. In Proc. 4th International Conference on Multimedia Computing and Systems.
Chua, T. S., Teo, K. C., Ooi, B. C., and Tan, K. L. (1996). Using domain knowledge in querying image database. In Proc. 3rd Multimedia Modeling Conference, pages 339-354.
Clarke, C., Cormack, G., and Burkowski, F. (1995). An algebra for structured text search and a framework for its implementation. Computer Journal, 38(1):43-56.
Cluet, S., Delobel, C., Lecluse, C., and Richard, P. (1989). Reloop, an algebra based query language for an object-oriented database system. In Proc. 1st International Conference on Deductive and Object Oriented Databases, pages 313-332.
Comer, D. (1979). The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137.
Costagliola, G., Tucci, M., and Chang, S. K. (1992). Representing and retrieving symbolic pictures by spatial relations. In Visual Database Systems II, pages 49-59.
Dao, T., Sacks-Davis, R., and Thom, J. (1996). Indexing structured text for queries on containment relationships. In Proc. 7th Australasian Database Conference, pages 82-91.
Deux, O. (1990). The story of O2. IEEE Transactions on Knowledge and Data Engineering, 2(1):91-108.
Eastman, C. M. and Zemankova, M. (1982). Partially specified nearest neighbor using kd trees. Information Processing Letters, 15(2):53-56.
Easton, M. (1986). Key-sequence data sets in indelible storage. IBM Journal of Research and Development, 30(12).
Edelsbrunner, H. (1983). A new approach to rectangular intersection. International Journal of Computational Mathematics, 13:209-219.
Edelstein, H. (1995). Faster data warehouses. In Information Week, pages 77-88.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194-203.
Elmasri, R., Wuu, G. T., and Kouramajian, V. (1990). The Time Index: An access structure for temporal data. In Proc. 16th International Conference on Very Large Data Bases, pages 1-12.
Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979). Extendible hashing - A fast access method for dynamic files. ACM Transactions on Database Systems, 4(3):315-344.
Faloutsos, C. (1988). Gray-codes for partial match and range queries. IEEE Transactions on Software Engineering, 14(10):1381-1393.
Faloutsos, C., Equitz, W., Flickner, M., Niblack, W., Petkovic, D., and Barber, R. (1994). Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3):231-262.
Faloutsos, C. and Jagadish, H. (1992). On B-tree indices for skewed distributions. In Proc. 18th International Conference on Very Large Databases, pages 363-374.
Faloutsos, C. and Roseman, S. (1989). Fractals for secondary key retrieval. In Proc. 1989 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 247-252.
Finkel, R. A.
and Bentley, J. L. (1974). Quad trees: A data structure for retrieval on composite keys. Acta Informatica, 4:1-9.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9):23-32.
Fox, E., editor (1995). Communications of the ACM, volume 38(4). Special issue on Digital Libraries.
Fox, E. and Shaw, J. (1993). Combination of multiple searches. In Proc. Text Retrieval Conference (TREC), pages 35-44. National Institute of Standards and Technology Special Publication 500-215.
Frakes, W. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and Algorithms. Prentice-Hall.
Francos, J. M., Meiri, A. Z., and Porat, B. (1993). A unified texture model based on a 2-D Wold-like decomposition. IEEE Transactions on Signal Processing, pages 2665-2678.
Freeston, M. (1987). The BANG file: A new kind of grid file. In Proc. 1987 ACM SIGMOD International Conference on Management of Data, pages 260-269.
Freeston, M. (1995). A general solution of the n-dimensional B-tree problem. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 80-91.
French, C. (1995). One size fits all. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 449-450.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209-226.
Gallager, R. and Van Voorhis, D. (1975). Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, IT-21(2):228-230.
Gargantini, I. (1982). An effective way to represent quadtrees. Communications of the ACM, 25(12):905-910.
Goh, C. H., Lu, H., Ooi, B. C., and Tan, K. L. (1996). Indexing temporal data using B+-tree. Data and Knowledge Engineering, 18:147-165.
Goldfarb, C. (1990). The SGML Handbook. Oxford University Press.
Golomb, S. (1966). Run-length encodings. IEEE Transactions on Information Theory, IT-12(3):399-401.
Gong, Y., Chua, H. C., and Guo, X. (1995). Image indexing and retrieval based on color histograms. In Proc. 2nd Multimedia Modeling Conference, pages 115-126.
Gonnet, G. and Baeza-Yates, R. (1991). Handbook of Data Structures and Algorithms. Addison-Wesley, second edition.
Gonnet, G.
and Tompa, F. (1987). Mind your grammar: A new approach to modeling text. In Proc. 13th International Conference on Very Large Databases, pages 339-346.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170.
Gravano, L., Chang, C., Garcia-Molina, H., and Paepcke, A. (1997). STARTS: Stanford proposal for internet meta-searching. In Proc. 1997 ACM SIGMOD International Conference on Management of Data.
Greene, D. (1989). An implementation and performance analysis of spatial data access methods. In Proc. 5th International Conference on Data Engineering, pages 606-615.
Gudivada, V. and Raghavan, R. (1995). Design and evaluation of algorithms for image retrieval by spatial similarity. ACM Transactions on Information Systems, 13(1):115-144.
Gunadhi, H. and Segev, A. (1993). Efficient indexing methods for temporal relation. IEEE Transactions on Knowledge and Data Engineering, 5(3):496-509.
Gunther, O. (1988). Efficient Structures for Geometric Data Management. Springer-Verlag.
Gunther, O. (1989). The design of the cell tree: An object-oriented index structure for geometric databases. In Proc. 5th International Conference on Data Engineering, pages 598-605.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proc. 1984 ACM SIGMOD International Conference on Management of Data, pages 47-57.
Hall, P. and Dowling, G. (1980). Approximate string matching. Computing Surveys, 12(4):381-402.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Harman, D., editor (1992). Proc. TREC Text Retrieval Conference. National Institute of Standards Special Publication 500-207.
Harman, D., editor (1995a). Information Processing & Management, volume 31(3). Special Issue: The Second Text Retrieval Conference (TREC-2).
Harman, D. (1995b). Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271-289.
Harman, D. and Candela, G. (1990). Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41(8):581-589.
Hearst, M. and Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proc. 16th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 59-68.
Henrich, A., Six, H.-W., and Widmayer, P. (1989a). The LSD tree: Spatial access to multidimensional point and non-point objects. In Proc. 15th International Conference on Very Large Data Bases, pages 45-53.
Henrich, A., Six, H.-W., and Widmayer, P. (1989b). Paging binary trees with external balancing. In Proc. International Workshop on Graphtheoretic Concepts in Computer Science.
Hinrichs, K. (1985). Implementation of the grid file: Design concepts and experience. BIT, 25:569-592.
Hinrichs, K. and Nievergelt, J. (1983). The grid file: A data structure designed to support proximity queries on spatial objects. In Proc. International Workshop on Graphtheoretic Concepts in Computer Science, pages 100-113.
Hirata, K., Hara, Y., Takano, H., and Kawasaki, S. (1996). Content-oriented integration in hypermedia systems. In Proc. 1996 ACM Conference on Hypertext, pages 11-21.
Hoel, E. and Samet, H. (1992). A qualitative comparison study of data structures for large line segment databases. In Proc. 1992 ACM SIGMOD International Conference on Management of Data, pages 205-214.
Hsu, W., Chua, T. S., and Pung, H. K. (1995). An integrated color-spatial approach to content-based image retrieval. In Proc. 3rd ACM Multimedia Conference, pages 305-313.
Hutflesz, A., Six, H.-W., and Widmayer, P. (1990). The R-file: An efficient access structure for proximity queries. In Proc. 6th International Conference on Data Engineering, pages 372-379.
Iannizzotto, G., Vita, L., and Puliafito, A. (1996). A new shape distance for content-based image retrieval. In Proc. 3rd Multimedia Modeling Conference, pages 371-386.
Imielinski, T. and Badrinath, B. (1994). Mobile wireless computing: Solutions and challenges in data management. Communications of the ACM, 37(10):18-28.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994a). Energy efficient indexing on air. In Proc. 1994 ACM SIGMOD International Conference on Management of Data, pages 25-36.
Imielinski, T., Viswanathan, S., and Badrinath, B. (1994b). Power efficient filtering of data on air. In Proc. 4th International Conference on Extending Database Technology, pages 245-258.
Ioka, M. (1989).
A method of defining the similarity of images on the basis of color information. Technical Report RT-0030, IBM Tokyo Research Lab.
Jaffar, J. and Lassez, J. (1987). Constraint logic programming. In Proc. 14th Annual ACM Symposium on Principles of Programming Languages, pages 111-119.
Jagadish, H. V. (1991). A retrieval technique for similar shape. In Proc. 1991 ACM SIGMOD International Conference on Management of Data, pages 208-217.
Jea, K. F. and Lee, Y. C. (1990). Building efficient and flexible feature-based indexes. Information Systems, 16(6):653-662.
Jenq, P., Woelk, D., Kim, W., and Lee, W. (1990). Query processing in distributed ORION. In Proc. 2nd International Conference on Extending Database Technology, pages 169-187.
Jensen, C. S., editor (1994). A consensus glossary of temporal database concepts.
Jensen, C. S., Mark, L., and Roussopoulos, N. (1991). Incremental implementation model for relational databases with transaction time. IEEE Transactions on Knowledge and Data Engineering, 3(4):461-473.
Jensen, C. S. and Snodgrass, R. (1994). Temporal specialization and generalization. IEEE Transactions on Knowledge and Data Engineering, 6(6):954-974.
Jhingran, A. (1991). Precomputation in a complex object environment. In Proc. 7th IEEE International Conference on Data Engineering, pages 652-659.
Jiang, P., Ooi, B. C., and Tan, K. L. (1996). An experimental study of temporal indexing structures. Unpublished manuscript, available at http://www.iscs.nus.sg/ooibc/tp.ps.
Kabanza, F., Stevenne, J., and Wolper, P. (1990). Handling infinite temporal data. In Proc. 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 392-403.
Kanellakis, P., Kuper, G., and Revesz, P. (1995). Constraint query languages. Journal of Computer and System Sciences, 51(1):26-52.
Kanellakis, P. and Ramaswamy, S. (1996). Indexing for data models with constraints and classes. Journal of Computer and System Sciences, 52(3):589-612.
Kaszkiel, M. and Zobel, J. (1997). Passage retrieval revisited. In Proc. 20th ACM-SIGIR International Conference on Research and Development in Information Retrieval.
Kemper, A., Kilger, C., and Moerkotte, G. (1994). Function materialization in object bases: Design, realization and evaluation.
IEEE Transactions on Knowledge and Data Engineering, 6(4):587-608.
Kemper, A. and Kossmann, D. (1995). Adaptable pointer swizzling strategies in object bases: Design, realization, and quantitative analysis. VLDB Journal, 4(3):519-566.
Kemper, A. and Moerkotte, G. (1992). Access support relations: An indexing method for object bases. Information Systems, 17(2):117-145.
Kent, A., Sacks-Davis, R., and Ramamohanarao, K. (1990). A signature file scheme based on multiple organizations for indexing very large text databases. Journal of the American Society for Information Science, 41(7):508-534.
Kilger, C. and Moerkotte, G. (1994). Indexing multiple sets. In Proc. 20th International Conference on Very Large Data Bases, pages 180-191.
Kim, K., Kim, W., Woelk, D., and Dale, A. (1988). Acyclic query processing in object-oriented databases. In Proc. 7th International Conference on Entity-Relationship Approach, pages 329-346.
Kim, W. (1989). A model of queries for object-oriented databases. In Proc. 15th International Conference on Very Large Data Bases, pages 423-432.
Kim, W., Kim, K., and Dale, A. (1989). Indexing techniques for object-oriented databases. In Object-Oriented Concepts, Databases, and Applications, pages 371-394. Addison-Wesley.
Knaus, D., Mittendorf, E., Schauble, P., and Sheridan, P. (1995). Highlighting relevant passages for users of the interactive SPIDER retrieval system. In Proc. 4th Text Retrieval Conference (TREC), pages 233-243.
Knuth, D. E. (1973). Fundamental Algorithms: The Art of Computer Programming, Volume 1. Addison-Wesley.
Knuth, D. E. and Wegner, L. M., editors (1992). Proc. IFIP TC2/WG2.6 2nd Working Conference on Visual Database Systems. North-Holland.
Kolovson, C. (1993). Indexing techniques for historical databases. In Temporal Databases: Theory, Design and Implementation, Chapter 17, pages 418-432. Benjamin/Cummings.
Kolovson, C. and Stonebraker, M. (1991). Segment indexes: Dynamic indexing techniques for multi-dimensional interval data. In Proc. 1991 ACM SIGMOD International Conference on Management of Data, pages 138-147.
Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., and Protopapas, Z. (1996). Fast nearest neighbor search in medical image databases. In Proc. 22nd International Conference on Very Large Data Bases, pages 215-226.
Koubarakis, M. (1994). Database models for infinite and indefinite temporal information. Information Systems, 19(2):141-173.
Kriegel, H. (1984). Performance comparison of index structures for multi-key retrieval. In Proc. 1984 ACM SIGMOD International Conference on Management of Data, pages 186-196.
Kriegel, H. and Seeger, B. (1986).
Multidimensional order preserving linear hashing with partial expansion. In Proc. 1st International Conference on Database Theory, pages 203-220.
Kriegel, H. and Seeger, B. (1988). PLOP-Hashing: A grid file without directory. In Proc. 4th International Conference on Data Engineering, pages 369-376.
Kroll, B. and Widmayer, P. (1994). Distributing a search tree among a growing number of processors. In Proc. 1994 ACM SIGMOD International Conference on Management of Data, pages 265-276.
Kukich, K. (1992). Techniques for automatically correcting words in text. Computing Surveys, 24(4):377-440.
Kumar, A., Tsotras, V. J., and Faloutsos, C. (1995). Access methods for bitemporal databases. In Proc. International Workshop on Temporal Databases, pages 235-254.
Kunii, T., editor (1989). Proc. IFIP TC2/WG2.6 1st Working Conference on Visual Database Systems. North-Holland.
Larson, P. (1978). Dynamic hashing. BIT, 18:184-201.
Lassez, J. (1990). Querying constraints. In Proc. 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 288-298.
Lee, D. T. and Wong, C. K. (1977). Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, 9(1):23-29.
Lee, S. Y. and Hsu, F. J. (1990). 2D C-String: A new spatial knowledge representation for image database system. Pattern Recognition, 23(10):1077-1087.
Lee, S. Y. and Leng, C. (1989). Partitioned signature files: Design issues and performance evaluation. ACM Transactions on Office Information Systems, 7(2):158-180.
Lee, S. Y., Yang, M. C., and Chen, J. W. (1992). Signature file as a spatial filter for iconic image database. Journal of Visual Languages and Computing, 3(4):373-397.
Lee, W. (1989). Mobile cellular telecommunication systems. McGraw-Hill.
Lin, K., Jagadish, H., and Faloutsos, C. (1995). The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517-542.
Litwin, W. (1980). Linear hashing: A new tool for file and table addressing. In Proc. 6th International Conference on Very Large Data Bases, pages 212-223.
Litwin, W. and Neimat, M. (1996). k-RP*S: A scalable distributed data structure for high-performance multi-attribute access. In Proc. 4th Conference on Parallel and Distributed Information Systems, pages 35-46.
Litwin, W., Neimat, M., and Schneider, D. (1993a). LH* - Linear hashing for distributed files. In Proc. 1993 ACM SIGMOD International Conference on Management of Data, pages 327-336.
Litwin, W., Neimat, M., and Schneider, D. (1994). RP*: A family of order-preserving scalable data structures. In Proc. 20th International Conference on Very Large Data Bases, pages 342-353.
Litwin, W., Neimat, N. A., and Schneider, D. A. (1993b). LH* - Linear hashing for distributed files. In Proc. 1993 ACM SIGMOD International Conference on Management of Data, pages 327-336.
Lomet, D. (1992). A review of recent work on multi-attribute access methods. ACM SIGMOD Record, 21(3):56-63.
Lomet, D. and Salzberg, B. (1989). Access methods for multiversion data. In Proc. 1989 ACM SIGMOD International Conference on Management of Data, pages 315-324.
Lomet, D. and Salzberg, B. (1990a). The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Transactions on Database Systems, 15(4):625-658.
Lomet, D. and Salzberg, B. (1990b). The performance of a multiversion access method. In Proc. 1990 ACM SIGMOD International Conference on Management of Data, pages 353-363.
Lomet, D. and Salzberg, B. (1993). Transaction time databases. In Temporal Databases: Theory, Design and Implementation, Chapter 16, pages 388-417. Benjamin/Cummings.
Lovins, J. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1-2):22-31.
Low, C. C., Ooi, B. C., and Lu, H. (1992). H-trees: A dynamic associative search index for OODB. In Proc. 1992 ACM SIGMOD International Conference on Management of Data, pages 134-143.
Lu, H. and Ooi, B. C. (1993). Spatial indexing: Past and future. IEEE Bulletin on Data Engineering, 16(3):16-21.
Lu, H., Ooi, B. C., and Tan, K. L. (1994). Efficient image retrieval by color contents. In Proc. 1994 International Conference on Applications of Databases, pages 95-108.
Lu, W. and Han, J. (1992). Distance-associated join indices for spatial range search. In Proc. 8th International Conference on Data Engineering, pages 284-292.
Lucarella, D. (1988). A document retrieval system based upon nearest neighbor searching. Journal of Information Science, 14:25-33.
Maier, D. and Stein, J. (1986). Indexing in an object-oriented database. In Proc. IEEE Workshop on Object-Oriented DBMSs, pages 171-182.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):2091-2110.
Manber, U. and Wu, S. (1994). GLIMPSE: A tool to search through entire file systems. In Proc.
1994 Winter USENIX Technical Conference, pages 23-32.
Maragos, P. (1989). Pattern spectrum and multiscale shape representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):701-716.
Maragos, P. and Schafer, R. W. (1986). Morphological skeleton representation and coding of binary images. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34:1228-1244.
Matsuyama, T., Hao, L., and Nagao, M. (1984). A file organization for geographic information systems based on spatial proximity. International Journal on Computer Vision, Graphics, and Image Processing, 26(3):303-318.
Mehlhorn, K. and Tsakalidis, A. (1990). Data structures. In Handbook of Theoretical Computer Science, Volume A, pages 301-341. Elsevier.
Mehrotra, R. and Gary, J. E. (1993). Feature-based retrieval of similar shapes. In Proc. 9th International Conference on Data Engineering, pages 108-115.
Melton, J. (1996). An SQL3 snapshot. In Proc. 12th International Conference on Data Engineering, pages 666-672.
Mittendorf, E. and Schauble, P. (1994). Document and passage retrieval based on hidden Markov models. In Proc. 17th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 318-327.
Miyahara, M. and Yoshida, Y. (1989). Mathematical transform of (R,G,B) color data to Munsell (H,V,C) color data. Journal of the Institute of Television Engineers, 43(10):1129-1136.
Moffat, A. and Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349-379.
Moffat, A., Zobel, J., and Sacks-Davis, R. (1994). Memory efficient ranking. Information Processing & Management, 30(6):733-744.
Morrison, D. (1968). PATRICIA - Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534.
Morton, G. (1966). A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa.
Moss, J. (1992). Working with persistent objects: To swizzle or not to swizzle. IEEE Transactions on Software Engineering, 18(8):657-673.
Nabil, M., Ngu, A. H. H., and Shepherd, J. (1996). Picture similarity retrieval using the 2D projection interval representation. IEEE Transactions on Knowledge and Data Engineering, 8(4):533-539.
Nagy, G. (1985). Image databases.
Image and Vision Computing, 3(3):111-117.
Nascimento, M. A. (1996). Efficient Indexing of Temporal Database via B+-trees. PhD thesis, School of Engineering and Applied Science, Southern Methodist University.
Nelson, R. and Samet, H. (1987). A population analysis for hierarchical data structures. In Proc. 1987 ACM SIGMOD International Conference on Management of Data, pages 270-277.
Ng, V. and Kameda, T. (1993). Concurrent accesses to R-trees. In Proc. 3rd International Symposium on Advances in Spatial Databases, pages 142-161.
Niblack, W., Barber, R., Equitz, W., Flickner, M., Glasman, E., Petkovic, D., Yanker, P., and Faloutsos, C. (1993). The QBIC project: Query images by content using color, texture and shape. In Storage and Retrieval for Image and Video Databases, Volume 1908, pages 173-187.
Nievergelt, J. and Hinrichs, K. (1985). Storage and access structures for geometric data bases. In Proc. International Conference on Foundations of Data Organization, pages 335-345.
Nievergelt, J., Hinterberger, H., and Sevcik, K. C. (1984). The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38-71.
Nievergelt, J. and Widmayer, P. (1997). Spatial data structures: Concepts and design choices. In Algorithmic Foundations of GIS, pages 1-61. Springer-Verlag.
Nori, A. (1996). Object relational database management systems (tutorial notes). In Proc. 22nd International Conference on Very Large Data Bases.
ObjectStore (1995). ObjectStore C++ - User Guide Release 4.0.
Ogle, V. E. and Stonebraker, M. (1995). Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9):40-48.
Ohsawa, Y. and Sakauchi, M. (1983). The BD-tree: A new n-dimensional data structure with highly efficient dynamic characteristics. In Proc. IFIP Congress, pages 539-544.
Ohsawa, Y. and Sakauchi, M. (1990). A new tree type data structure with homogeneous nodes suitable for a very large spatial database. In Proc. 6th International Conference on Data Engineering, pages 296-303.
O'Neil, P. and Graefe, G. (1995). Multi-table joins through bitmapped join indices. ACM SIGMOD Record, 24(3):8-11.
O'Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In Proc. 1997 ACM SIGMOD International Conference on Management of Data.
Ooi, B. C. (1990). Efficient Query Processing in Geographical Information Systems. Springer-Verlag.
Ooi, B. C., McDonell, K. J., and Sacks-Davis, R. (1987). Spatial kd-tree: An indexing mechanism for spatial databases. In Proc. 11th International Conference on Computer Software and Applications.
Ooi, B. C., Sacks-Davis, R., and Han, J. (1993).
Spatial indexing structures. Unpublished manuscript, available at http://www.iscs.nus.edu.sg/ooibc/.
Ooi, B. C., Sacks-Davis, R., and McDonell, K. J. (1991). Spatial indexing by binary decomposition and spatial bounding. Information Systems, 16(2):211-237.
Ooi, B. C., Tan, K. L., and Chua, T. S. (1997). Fast image retrieval using color-spatial information. Technical report, Department of Information Systems and Computer Science, NUS, Singapore.
Orenstein, J. A. (1982). Multidimensional tries for associative searching. Information Processing Letters, 14(4):150-157.
Orenstein, J. A. (1986). Spatial query processing in an object-oriented database system. In Proc. 1986 ACM SIGMOD International Conference on Management of Data, pages 326-336.
Orenstein, J. A. (1990). A comparison of spatial query processing techniques for native and parameter spaces. In Proc. 1990 ACM SIGMOD International Conference on Management of Data, pages 343-352.
Orenstein, J. A. and Merrett, T. H. (1984). A class of data structures for associative searching. In Proc. 1984 ACM-SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 181-190.
Ouksel, M. and Scheuermann, P. (1981). Multidimensional B-trees: Analysis of dynamic behavior. BIT, 21:401-418.
Overmars, M. H. and Leeuwen, J. V. (1982). Dynamic multi-dimensional data structures based on Quad- and KD-trees. Acta Informatica, 17:267-285.
Owolabi, O. and McGregor, D. (1988). Fast approximate string matching. Software - Practice and Experience, 18:387-393.
Papadias, D., Theodoridis, Y., Sellis, T., and Egenhofer, M. J. (1995). Topological relations in the world of minimum bounding rectangles: A study with R-trees. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 92-103.
Paredaens, J. (1995). Spatial databases, the final frontier. In Proc. 5th International Conference on Database Theory, pages 14-31.
Paredaens, J., Van den Bussche, J., and Van Gucht, D. (1994). Towards a theory of spatial database queries. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 279-288.
Persin, M. (1996). Efficient implementation of text retrieval techniques. Master's thesis, Department of Computer Science, RMIT, Melbourne, Australia.
Persin, M., Zobel, J., and Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749-764.
Pfaltz, J., Berman, W., and Cagley, E. (1980).
Partial-match retrieval using indexed descriptor files. Communications of the ACM, 23(9):522-528.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
Preparata, F. and Shamos, M. (1985). Computational Geometry: An Introduction. Springer-Verlag.
Rabitti, F. and Savino, P. (1991). Image query processing based on multi-level signatures. In Proc. 14th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 305-314.
Rabitti, F. and Stanchev, P. (1989). GRIM-DBMS: A graphical image database management system. In Proc. IFIP TC2/WG2.6 1st Working Conference on Visual Database Systems, pages 415-430.
Ramaswamy, S. (1997). Efficient indexing for constraints and temporal databases. In Proc. 6th International Conference on Database Theory, pages 419-431.
Ramaswamy, S. and Kanellakis, P. (1995). OODB indexing by class-division. In Proc. 1995 ACM SIGMOD International Conference on Management of Data, pages 139-150.
Roberts, C. (1979). Partial-match retrieval via the method of superimposed codes. Proceedings of the IEEE, 67(12):1624-1642.
Robinson, J. T. (1981). The k-d-b-tree: A search structure for large multi-dimensional dynamic indexes. In Proc. 1981 ACM SIGMOD International Conference on Management of Data, pages 10-18.
Rosenberg, J. B. (1985). Geographical data structures compared: A study of data structures supporting region queries. IEEE Transactions on Computer Aided Design, 4(1):53-67.
Rotem, D. (1991). Spatial join indices. In Proc. 7th International Conference on Data Engineering, pages 500-509.
Rotem, D. and Segev, A. (1987). Physical organization of temporal data. In Proc. 3rd International Conference on Data Engineering, pages 547-553.
Sacks-Davis, R., Kent, A., and Ramamohanarao, K. (1987). Multi-key access methods based on superimposed coding techniques. ACM Transactions on Database Systems, 12(4):655-696.
Sagiv, Y. (1986). Concurrent operations on B*-trees with overtaking. Journal of Computer and System Sciences, 33(2):275-296.
Salomone, S. (1995). Radio days. In Byte, Special Issue on Mobile Computing, page 107.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley.
Salton, G., Allan, J., and Buckley, C. (1993). Approaches to passage retrieval in full text information systems. In Proc. 16th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 49-58.
Salton, G. and McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Salzberg, B. (1994). On indexing spatial and temporal data.
Information Sys- tems, 19(6):447-465. Samet, H. (1989). The design and analysis of spatial data structures. Addison- Wesley. Scheuermann, P. and Ouksel, M. (1982). Multidimensional B-trees for associa- tive searching in database systems. Information Systems, 7(2):123-137.
Seeger, B. and Kriegel, H. (1988). Techniques for design and implementation of efficient spatial access methods. In Proc. 14th International Conference on Very Large Data Bases, pages 360-371.

Sellis, T., Roussopoulos, N., and Faloutsos, C. (1987). The R+-tree: A dynamic index for multi-dimensional objects. In Proc. 13th International Conference on Very Large Data Bases, pages 507-518.

Serra, J. (1988). Image Analysis and Mathematical Morphology, Volume 2: Theoretical Advances. Academic Press.

Shamos, M. I. and Bentley, J. L. (1978). Optimal algorithm for structuring geographic data. In Proc. 1st International Advanced Study Symposium on Topological Data Structure for Geographic Information Systems.

Sharma, K. D. and Rani, R. (1985). Choosing optimal branching factors for k-d-B trees. Information Systems, 10(1):127-134.

Shaw, G. and Zdonik, S. (1989). An object-oriented query algebra. In Proc. 2nd International Workshop on Database Programming Languages, pages 103-112.

Shen, H., Ooi, B. C., and Lu, H. (1994). The TP-index: A dynamic and efficient indexing mechanism for temporal databases. In Proc. 10th International Conference on Data Engineering, pages 274-281.

Sheng, S., Chandrasekaran, A., and Broderson, R. (1992). A portable multimedia terminal for personal communications. In IEEE Communications Magazine, pages 64-75.

Shidlovsky, B. and Bertino, E. (1996). A graph-theoretic approach to indexing in object-oriented databases. In Proc. 12th International Conference on Data Engineering, pages 230-237.

Snodgrass, R. (1987). The temporal query language TQuel. ACM Transactions on Database Systems, 12(2):247-298.

Sreenath, B. and Seshadri, S. (1994). The hcC-tree: An efficient index structure for object oriented databases. In Proc. 20th International Conference on Very Large Data Bases, pages 203-213.

Straube, D. and Ozsu, M. T. (1995). Query optimization and execution plan generation in object-oriented data management systems. IEEE Transactions on Knowledge and Data Engineering, 7(2):210-227.

Swain, M. J. (1993). Interactive indexing into image databases. In Storage and Retrieval for Image and Video Databases, Volume 1908, pages 95-103.

Tamminen, M. (1982). Efficient spatial access to a data base. In Proc. 1982 ACM SIGMOD International Conference on Management of Data, pages 200-206.

Tamura, H., Mori, S., and Yamawaki, T. (1978). Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6):460-472.
Tamura, H. and Yokoya, N. (1984). Image database systems: A survey. Pattern Recognition, 17(1):29-43.

Thom, J., Zobel, J., and Grima, B. (1995). Design of indexes for structured document databases. Technical Report TR-95-8, Collaborative Information Technology Research Institute, RMIT and The University of Melbourne.

Treisman, A. and Paterson, R. (1980). A feature integration theory of attention. Cognitive Psychology, 12:97-136.

Tsay, J. J. and Li, H. C. (1994). Lock-free concurrent tree structures for multiprocessor systems. In Proc. 1994 International Conference on Parallel and Distributed Systems, pages 544-549.

Valduriez, P. (1986). Optimization of complex database queries using join indices. IEEE Bulletin on Data Engineering, 9(4):10-16.

Valduriez, P. (1987). Join indices. ACM Transactions on Database Systems, 12(2):218-246.

van Rijsbergen, C. (1979). Information Retrieval. Butterworths, second edition.

Whang, K. and Krishnamurthy, R. (1985). Multilevel grid files. Technical Report RC-11516, IBM Thomas J. Watson Research Center.

Wilkinson, R. (1994). Effective retrieval of structured documents. In Proc. 17th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 311-317.

Witten, I., Moffat, A., and Bell, T. (1994). Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold.

Wu, S. and Manber, U. (1992). Agrep - A fast approximate pattern-matching tool. In Proc. 1992 Winter USENIX Technical Conference, pages 153-162.

Xie, Z. and Han, J. (1994). Join index hierarchy for supporting efficient navigation in object-oriented databases. In Proc. 20th International Conference on Very Large Data Bases, pages 522-533.

Zdonik, S. and Maier, D. (1989). Fundamentals of object-oriented databases. In Readings in Object-Oriented Database Management Systems.

Zhou, Z. and Venetsanopoulos, A. N. (1988). Morphological skeleton representation and shape recognition. In Proc. IEEE 2nd International Conference on ASSP, pages 948-951.

Zobel, J. and Dart, P. (1995). Finding approximate matches in large lexicons. Software - Practice and Experience, 25(3):331-345.

Zobel, J. and Dart, P. (1996). Phonetic string matching: Lessons from information retrieval. In Proc. 19th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 166-173.

Zobel, J., Moffat, A., and Ramamohanarao, K. (1995a). Inverted files versus signature files for text indexing. Technical Report TR-95-5, Collaborative Information Technology Research Institute, RMIT and The University of Melbourne.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1996). Guidelines for presentation and comparison of indexing techniques. ACM SIGMOD Record, 25(3):10-15.

Zobel, J., Moffat, A., and Sacks-Davis, R. (1992). An efficient indexing technique for full-text database systems. In Proc. 18th International Conference on Very Large Databases, pages 352-362.

Zobel, J., Moffat, A., and Sacks-Davis, R. (1993). Searching large lexicons for partially specified terms using compressed inverted files. In Proc. 19th International Conference on Very Large Databases, pages 290-301.

Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995b). Efficient retrieval of partial documents. Information Processing & Management, 31(3):361-377.
About the Authors

Elisa Bertino is full professor of computer science in the Department of Computer Science of the University of Milan. She has also been on the faculty in the Department of Computer and Information Science of the University of Genova, Italy. She has been a visiting researcher at the IBM Research Laboratory (now Almaden) in San Jose, and at the Microelectronics and Computer Technology Corporation in Austin, Texas. She is or has been on the editorial board of the following scientific journals: IEEE Transactions on Knowledge and Data Engineering, Theory and Practice of Object Systems Journal, Journal of Computer Security, Very Large Database Systems Journal, Parallel and Distributed Database, and the International Journal of Information Technology. She is currently serving as program co-chair of the 1998 International Conference on Data Engineering.

Beng Chin Ooi received his B.Sc. and Ph.D. in computer science from Monash University, Australia, in 1985 and 1989 respectively. He was with the Institute of Systems Science, Singapore, from 1989 to 1991 before joining the Department of Information Systems and Computer Science at the National University of Singapore. His research interests include database performance issues, database user interfaces, multimedia databases and applications, and GIS. He is the author of the monograph "Efficient Query Processing in Geographic Information Systems" (Springer-Verlag, 1990). He has published many conference and journal papers and serves as a PC member for a number of international conferences. He is currently on the editorial board of the following scientific journals: International Journal of Geographical Information Systems, Journal on Universal Computer Science, Geoinformatica, and International Journal of Information Technology.

Ron Sacks-Davis obtained his Ph.D. from the University of Melbourne in 1977. He currently holds the position of Professor and Institute Fellow at RMIT. He has published widely in the areas of database management and information retrieval and is an editor-in-chief of the International Journal on Very Large Databases (VLDB) and a member of the VLDB Endowment Board.

Kian-Lee Tan received his Ph.D. in computer science from the National University of Singapore in 1994. He is currently a lecturer in the Department of Information Systems and Computer Science, National University of Singapore. He has published numerous papers in the areas of multimedia information retrieval, wireless computing, and query processing and optimization in multiprocessor and distributed systems.

Justin Zobel obtained his Ph.D. in computer science from the University of Melbourne, where he was a member of staff from 1984 to 1990. He then joined the Department of Computer Science at RMIT, where he is now a senior lecturer. He has published widely in the areas of information retrieval, text databases, indexing, compression, string matching, and genomic databases.

Boris Shidlovsky received his M.Sc. in applied mathematics and Ph.D. in computer science from the University of Kiev, Ukraine, in 1984 and 1990 respectively. He was an assistant professor in the Department of Computer Science at the University of Kiev. From 1993 to 1996, he was with the Department of Computer Engineering at the University of Salerno, Italy, and is currently a member of the scientific staff at the Rank Xerox Research Centre, Grenoble, France. His research interests include design and analysis of algorithms, indexing and query optimization in advanced database systems, and processing of semistructured data on the Web.

Barbara Catania has been enrolled in the Ph.D. program in computer science at the University of Milano, Italy, since November 1993. She received the Laurea degree with honours in computer science from the University of Genova, Italy, in 1993. She has also been a visiting researcher at the European Computer-Industry Research Centre, Munich, Germany, where she participated in the ESPRIT project IDEA, sponsored by the European Economic Community. Her main research interests include constraint databases, deductive databases, and indexing techniques for constraint and object-oriented databases.
Index

O2, 4
X-tree, 25
(1, m) index, 201
1-dimensional generalized tuple, 218
2-dimensional generalized tuple, 218, 222
access support relation, 16, 19
access time, 199, 200, 202
active mode, 196
address calculation, 191
adjacency
  querying on, 154
aggregation, 7, 29
aggregation graph, 3
agrep, 213
ALL selection, 217, 222
Altavista, 211
AP-tree, 125-127
Archie, 211
B+-tree, 9, 20, 30
  of color-spatial index, 91
  with linear order, 129-132
B-tree, 2
  for lexicons, 159
battery, 196, 198, 200
bcast wait, 199
BD-tree, 54-55
binary join index, 10, 206
bitemporal database, 114
bitemporal interval tree, 140
bitemporal relation, 118
bitmap, 207
bitmap join index, 209
bitslices, 169
Boolean queries
  for text, 154-155
Boolean query evaluation
  for text, 169-170
bounding rectangle, 40
bounding structure, 41
broadcast channel, 197
broadcasted data, 196
bucket, 198
BV-tree, 63-64
caching, 36
CG-tree, 24
CH-tree, 21
color, 90
  CIE L*u*v, 108
  color histogram, 90
  Munsell HVC, 92
color index
  of color-spatial index, 94
color-spatial index
  for image, 91
compression
  of inverted lists, 161-164
configurable index, 200, 202
constraint, 214
constraint programming, 214
constraint theory, 216, 218
content-based index
  for image, 80
content-based retrieval
  for image, 78
convex theory, 218
cosine measure, 155-156
data warehouse, 204
decision support system, 203
delta code, 162
detail table, 205
diagonal corner query, 219
dimension table, 205
distributed index, 201
distributed RAM, 189
doze mode, 196
dual plane, 222
dual R-tree, 140
dumb terminal, 195
dynamic interval management, 219
effectiveness
  of ranking, 152
Elias codes, 161-162
emerging applications, 185-224
Excite, 211
EXIST selection, 217, 218
extension, 215
fact constellation schema, 205
fact table, 205
feature
  color, 90
  color-spatial, 91
  semantic object, 87
  shape, 84
  spatial relationship, 88
  texture, 89
feature extraction, 78
feature-based indexing, 78
file image, 191
file image adjustment, 192
filtering, 222
  for ranking, 172
fixed host, 194
flexible indexing, 202
gamma code, 162
GBD-tree, 54-55
GemStone, 4
generalized 1-dimensional indexing, 218
generalized concordance lists
  for text, 178
generalized database, 215
generalized relation, 215
generalized relational model, 215
generalized tuple, 215
Glimpse, 213
global index, 187
Golomb codes, 162-163
Gopher, 211
grid file, 64-67
H-tree, 23
Harvest, 214
hashing, 2
hB-tree, 49-51
hcC-tree, 24
image database, 77-112
image database system, 78
  architecture, 79
index construction
  for text, 164-166
index update
  for text, 166-168
indexing
  of documents, 153
indexing graph, 9
information retrieval, 152, 155-157
InfoSeek, 211
infrared technology, 194
inheritance, 5, 20, 29
inheritance graph, 4
inheritance hierarchy, 20
interleaving
  for ranking, 173
interval B-tree, 127-129
interval tree, 220
inverse document frequency, 156
inverted file
  for image, 83
inverted index, 212
  for text, 157-168
inverted lists
  for text, 158, 160-164
join
  explicit, 5
  implicit, 5
join index, 10
join index hierarchy, 19
K-D-B-tree, 48-49
kd-tree, 46-48
  non-homogeneous, 47
lexicons, 158-160
limiting accumulators
  for ranking, 172
linear hashing, 189
local index, 187
locational keys, 70-71
LSD-tree, 55-56
mapping table, 158
materialization technique, 204
meta-block tree, 220
metasearcher, 213
method invocation, 3, 36
minimum bounding polybox, 224
minimum bounding rectangle, 41, 223
mobile host, 194
mobile network, 194
multi-index, 9, 17
navigational access, 2
nested attribute, 3
nested index, 14, 17
nested predicate, 5, 10, 29
nested-inherited index, 29
non-configurable index, 200
NST-tree, 126
object identifier, 3
object query language, 2, 5
object-oriented data model, 1, 3
object-oriented database, 1-38
object-relational database, 1
ObjectStore, 4
OLAP, 203
OQL, 2
ordinal number, 207
palmtop, 195
partition, 186
partitioning degree, 186
passage retrieval, 180-181
path, 7
path index, 15, 17
path instantiation, 7, 15
path splitting, 18
path-expression, 5
pattern matching
  for text, 179-180
perceptually similar color, 108
phonetic matching
  for text, 180
PLOP-hashing, 68-69
point location, 222
pointer swizzling, 2, 36
precomputed join, 207
probe time, 199
projection, 16
proximity
  querying on, 154
query expansion
  for text, 181
query graph, 6
query precomputation, 204
R+-tree, 25, 60-63
R*-tree, 59-60
R-file, 67-68
R-tree, 25, 56-59, 132-137
  2-D R-tree, 133
  3-D R-tree, 133
ranked query evaluation
  for text, 170-175
ranking, 155-157
relevance
  judgments, 152
  of documents, 152
satellite network, 194
SC-index, 21
search engine, 211
semantic object, 87
sequential search, 212
set-oriented access, 2
SGML, 175
shape, 84
signature file
  for image, 84
  for text, 168-169
  of color-spatial index, 105
similarity, 155, 156
  measures, 79, 82, 155
    approximate match, 82
    Euclidean distance, 83
    exact match, 82
    signature-based, 107
    signature-based (weighted), 109
skd-tree, 51-54
SMAT
  of color-spatial index, 96
snowflake schema, 205
spatial access method
  for image, 83
spatial database, 39-75, 215
spatial index
  taxonomy, 42
    non-overlapping, 43
    overlapping, 44
    transformation approach, 43
spatial operators, 39
  adjacency, 40
  containment, 40
  intersection, 39, 41
spatial query processing, 40
  approximation, 40
  multi-step strategy, 42
spatial relationship, 88
SQL, 1
SQL-3, 2
stabbing query, 219
star schema, 205
stemming
  of words, 154
stopwords, 156, 175
storage on the air, 196
structured documents, 175-178
  indexing of, 177-178
suffixing
  of words, 154
summary table, 205
temporal database, 113-149, 215
temporal index, 121-142
  B+-tree with linear order, 129
temporal query, 119-121
  bitemporal key-range time-slice, 120
  bitemporal time-slice, 120
  key, 120
  key-range time-slice, 120
  time-slice, 119
    inclusion, 119
    intersection, 119
    point, 120
time-slice query
  containment, 120
text database, 151-182
text indexing, 157-169
text passage retrieval, 180-181
texture, 89
time
  lifespan, 115
  span, 115
  transaction time, 114
  valid time, 114
time index, 123-125
TP-index, 137-139
transaction time, 114-116
traversal strategy, 6
TREC, 159
TSB-tree, 122-123
tuning time, 200, 202
unary code, 161-162
valid time, 114, 116-117
variable-bit codes, 161-163
WAIS, 211
walkstation, 195
Web Crawler, 214
Web navigation, 210
Web robot, 214
WebCrawler, 211
weight, 221
weight-balanced B-tree, 220
Whois, 211
Whois++, 211
wireless interface, 194
WWW Worm, 214