Query Language Issues in a Distributed Indexing Environment
Peter Valkenburg <peter.valkenburg@surfnet.nl>
Dan Brickley <daniel.brickley@bristol.ac.uk>
November 1998
Abstract
This contribution to the W3C's QL'98 workshop looks at some of the issues
of querying in the distributed indexing and searching environment available
on the Web (and Internet as a whole) today. We briefly discuss three
issues with respect to distributed querying of resources: schema discovery,
query language translation, and query routing. Drawing on experience,
we formulate some tentative requirements that we think need to be fulfilled
for the wide application of a new query language in a heterogeneous Web
metadata indexing environment.
Introduction
The authors have been involved in the development of several distributed
indexing, cataloguing and searching projects, notably DESIRE [1],
CHIC-Pilot [2,3], and ROADS [4].
The technologies involved were varied, since the goals included tying
together widely deployed services and protocols; they include
Z39.50/GILS, WHOIS++, LDAP, CIP, RDF and others.
Three issues have been persistent in these and other projects, when
trying to build services which `cross-search' multiple datasets, query
protocols, and search services:
-
Schema Discovery and Translation
- how to find out what attribute(-sets) are supported by a search service
and how to map them
-
Query Language Translation
- translating an individual query to one in another language, and vice-versa
-
Query Routing and Forward Knowledge
- mechanisms to identify relevant search services for answering particular
queries
It is likely that not all of these issues can be solved within the framework
of a query language alone; for instance, query routing may presuppose a registry
of search services that falls outside the domain of any particular query
language. However, a query language is very important as a mechanism
to locate distributed search services.
We will touch upon each of the above issues in the following sections.
Schema Discovery and Translation
Schema discovery is obviously an important element of distributed search
systems. Clients need to be able to find out what attribute sets
are supported by the various services that they search. In a distributed
context one often cannot rely on all search services to support the same
syntax and semantics of attributes as those that a client would want to
search on.
Some important requirements related to this are:
-
Flexible support for both `home-grown' and standardised attribute sets
-
The query language should allow for schema definition using public standards
such as the Dublin Core, but also for derived schemas and wholly independent
ones. A mechanism for defining schemas and their attributes (or properties)
is part of the "Resource Description Framework (RDF) Schema Specification"
[5]. An important element of this framework is the
universal identification of schemas, in this case through URIs. Through
assigning every attribute (property) a URI, some of the present ambiguity
in the distributed search environment may be removed.
-
Discovery of schemas
-
The discovery of schemas used by a particular service must be possible.
A client querying a service needs this to find out what appropriate attributes/properties
are supported by the service.
-
Schema mapping
-
In many cases search services offer derived or overlapping sets of searchable
properties. To effectively cross-search these services, it should
be possible to build and query schema mapping services. As
an example, cross-searching document metadata with different sets of `subject
category' properties may be done using a service that maps a subject
category property in one schema to one or more in another schema. Any web
query language framework should allow communities of expertise to define
their own resource description vocabularies (such as the Z39.50 attribute
sets), and to describe in a machine-processable manner how those
vocabularies relate to others.
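As a rough illustration of such a mapping service, the following Python
sketch identifies every property by a URI and records which properties in
one schema correspond to which in another. The schema URIs, the property
names and the SchemaMapper class are hypothetical, invented for this
example; they are not taken from RDF Schema, the Dublin Core or any of the
projects mentioned.

```python
from dataclasses import dataclass, field

# Hypothetical schema URIs, invented for this example.
SCHEMA_A = "http://example.org/schema-a#"
SCHEMA_B = "http://example.org/schema-b#"


@dataclass
class SchemaMapper:
    """Records which property URIs in one schema correspond to which in another."""

    mappings: dict = field(default_factory=dict)

    def add(self, source: str, target: str) -> None:
        """Declare that the source property maps to the target property."""
        self.mappings.setdefault(source, set()).add(target)

    def translate(self, prop: str, target_schema: str) -> list:
        """Return the properties of target_schema that prop maps to, sorted."""
        targets = self.mappings.get(prop, set())
        return sorted(t for t in targets if t.startswith(target_schema))


mapper = SchemaMapper()
# One `subject category' property in schema A maps to two in schema B.
mapper.add(SCHEMA_A + "subject", SCHEMA_B + "topic")
mapper.add(SCHEMA_A + "subject", SCHEMA_B + "keyword")

print(mapper.translate(SCHEMA_A + "subject", SCHEMA_B))
```

A client that wants to cross-search would rewrite a query on the schema A
property into a query on each of the returned schema B properties. A
property with no recorded mapping yields an empty list, which the client
can treat as "not searchable at this service".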
Query Language Translation
The importance of query language translation lies in the observation that
the Web and the Internet are very heterogeneous environments. No
single indexing or searching technology can be expected to cover all indexing
and searching applications. Consequently, having to translate queries
and their results from one query language into another will remain
a fact of life.
A query language that has to fulfill the role of a universal front-end
to a variety of other query languages and protocols typically faces the
following issues:
-
Finding a common core of distributed search functionality
-
End-users expect some types of search functionality to be available; in
most applications this encompasses at least unnested basic boolean operators
(AND, OR, AND-NOT), lstring (or prefix, or partial) matching and multiple
attribute searches (`search for any field'). On the other hand, advanced
applications may require SQL-type complex queries with variables etc.
A query language that can be applied in both cases and that also acts as
a front-end to multiple services should be amenable to defining profiles
of subsets of its functionality. Experience with profiles such as
WHOIS++/CHIC-Pilot [6], Z39.50/GILS [7]
and STARTS [8] indicates that it is quite important to
be able to define a reasonably universal, yet easily implementable
profile for distributed searching.
-
Discovery of search functionality
-
Related to the above, it should be possible to discover the supported search
functionality of a search service, in order to find out what queries can
be meaningfully thrown at it. In case a particular type of search
functionality is not available, a service may offer a feature that produces
a result which is nearly as good for the purpose of the client, so it can
choose to use that feature instead.
-
As an example of this, consider prefix (or lstring, or right-truncation)
versus substring searching. If a service offers only substring searching,
and the client wants to do a prefix search, then the client can rewrite
the query as a substring search and filter out of the search results those
matches in which the term does not occur as a prefix. Another example is
case-sensitive versus case-insensitive searching.
-
Simple search result syntax
-
In order to allow easy processing of search results by various clients,
a simple default result presentation format should be available. Other
representations may be available through format-negotiation, but it is
essential to be able to create simple clients which do not need to cope
with a variety of data formats when processing search results.
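The prefix-versus-substring example given under "Discovery of search
functionality" can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: SubstringOnlyService, its
`supported` set and the sample titles are hypothetical stand-ins for a
remote service and the functionality it advertises, and matching is assumed
to be case-insensitive and applied to the whole attribute value.

```python
class SubstringOnlyService:
    """Hypothetical stand-in for a remote service that offers only substring matching."""

    supported = {"substring"}  # the match types the service advertises to clients

    def __init__(self, values):
        self.values = list(values)

    def search(self, match_type, term):
        if match_type not in self.supported:
            raise ValueError("unsupported match type: " + match_type)
        return [v for v in self.values if term.lower() in v.lower()]


def prefix_search(service, prefix):
    """Run a prefix search, falling back to substring search plus client-side filtering."""
    if "prefix" in service.supported:
        return service.search("prefix", prefix)
    # Fallback: a substring search returns a superset of the prefix matches,
    # so drop every value that does not start with the prefix.
    candidates = service.search("substring", prefix)
    return [v for v in candidates if v.lower().startswith(prefix.lower())]


service = SubstringOnlyService(["Indexing the Web", "Web Indexing", "Reindexing", "Routing"])
print(prefix_search(service, "index"))  # prints ['Indexing the Web']
```

The fallback costs extra network traffic, since the service returns more
results than the client keeps, but it lets a simple client offer prefix
searching against services that do not support it directly.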
Query Routing and Forward Knowledge
The distributed nature of the Web presents a challenge for building usable
and intuitive resource discovery services. Deployment experience with
large-scale search services (e.g. [9]) suggests that new
mechanisms are required for managing distributed searches more effectively. If we want to
construct systems in which a user enters a single
search expression and has that request satisfied by a number of
searchable databases, it is essential to have "forward knowledge" about
the contents of those databases. Simply broadcasting all queries
to multiple databases will not scale.
There are several types of "forward knowledge" which may contribute
to a more scalable architecture for distributed searching. This data can
be used in a number of scenarios; a common approach is likely to be the
"referral" mechanism as used in the WHOIS++ and LDAP protocols. A
"referral" is an additional component of a search result which informs
the search client about alternative databases that could yield relevant
results. An alternative scenario involves a central index server or broker
that gathers forward knowledge for multiple databases, redirecting search
clients to the most appropriate target(s).
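The referral scenario can be sketched as follows. This is a minimal
in-memory illustration, not the WHOIS++ or LDAP wire protocol: the SERVERS
table, the server names and the query_server function are hypothetical, and
each server here returns a static referral list, whereas a real server
would normally base its referrals on forward knowledge about the query.

```python
from collections import deque

# Hypothetical in-memory "servers": each holds some records and knows of
# other servers it can refer clients to.
SERVERS = {
    "server-a": {"records": ["RDF schema primer"], "referrals": ["server-b", "server-c"]},
    "server-b": {"records": ["Z39.50 profile", "RDF query notes"], "referrals": ["server-a"]},
    "server-c": {"records": ["LDAP directory guide"], "referrals": []},
}


def query_server(name, term):
    """Return (hits, referrals) for one server; a real client would do this over the network."""
    server = SERVERS[name]
    hits = [r for r in server["records"] if term.lower() in r.lower()]
    return hits, server["referrals"]


def search_with_referrals(start, term, max_servers=10):
    """Query the start server, then follow referrals breadth-first, visiting each server once."""
    seen, hits = set(), []
    queue = deque([start])
    while queue and len(seen) < max_servers:
        name = queue.popleft()
        if name in seen:
            continue  # avoid referral loops
        seen.add(name)
        found, referrals = query_server(name, term)
        hits.extend((name, record) for record in found)
        queue.extend(referrals)
    return hits


print(search_with_referrals("server-a", "rdf"))
```

The `seen` set and the `max_servers` limit are design choices of this
sketch: referrals can point back to servers already visited, and a client
needs some bound on how far it is willing to follow them.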
Forward knowledge requirements for effective query routing include the
following issues. These are largely independent of the choice of query
language, but nevertheless form a crucial component of any distributed
search system:
- Bulk metadata (e.g. all the words in the database)
-
It is sometimes useful to extract summaries of the textual contents of a
database for use by search clients. This approach is used in
the WHOIS++ directory protocol, where a centroid for a database
contains in effect a list of all the unique terms in all the fields of
the database. By making use of such data, search clients can
perform simple checks (e.g. word occurrence) to avoid sending
unnecessary queries which are doomed to failure [10].
This principle has more recently been proposed in a more
generalised framework as the 'Common Indexing Protocol' (CIP) [11], which
allows for a variety of data formats to be used when characterising
the content of networked databases.
- Collection-level description
-
To give a search client enough information to decide whether a
query should be routed to a given database, or to construct an
appropriate interface to allow users to decide this, it is necessary to
also have some high-level information about that database. The
vocabularies used to do so should not be a fixed, built-in
component of the search infrastructure. Different communities will need
the flexibility to describe these resources using their own descriptive
schemas. In this respect, database characterisation is just a special
application of resource description.
- Description Service Location
- A PICS Rating Service [12]
offers an important facility. Given a URI, the service
will provide a machine-processable label, description or annotation for
the resource specified. RDF services will build upon the capability of
PICS; there are likely to be similar services using RDF which will
describe a resource given its URI. For these "description services" to
reach their full potential, a mechanism is needed that allows us to
discover, for any given URI, which services can offer descriptions, and
on what terms. For example, there is an increasing number of catalogues
which offer library-like descriptions or reviews of Web resources [13].
No single catalogue can offer complete coverage
of the Web, so there is a need for a 'forward knowledge' mechanism by
which search agents might discover services that offer third-party
descriptions and metadata annotations for some specified Web resource.
Since the URI can be regarded as simply another field in the database,
this problem can be seen as a special case of the "bulk metadata" issue.
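The "bulk metadata" approach can be illustrated with a small Python sketch
of centroid-style query routing. It is a simplification under stated
assumptions, not the WHOIS++ centroid format or CIP: the function names,
the whitespace tokenisation and the two sample databases are hypothetical.
The URI is stored as just another field, as suggested above for
description services.

```python
def build_centroid(records):
    """Collect, per field, the set of unique lower-cased terms found in any record."""
    centroid = {}
    for record in records:
        for field_name, value in record.items():
            centroid.setdefault(field_name, set()).update(value.lower().split())
    return centroid


def may_match(centroid, query):
    """True if every (field, term) pair in the query occurs in the centroid.

    A False result means the query is certain to fail for this database; a True
    result only means it is worth sending, since terms may sit in different records.
    """
    return all(term.lower() in centroid.get(field_name, set()) for field_name, term in query)


def route(query, centroids):
    """Return the names of the databases that the query should be sent to."""
    return [name for name, centroid in centroids.items() if may_match(centroid, query)]


# Hypothetical databases; the URI is treated as just another field.
databases = {
    "gateway-1": [{"title": "Query routing explained", "uri": "http://example.org/routing"}],
    "gateway-2": [{"title": "Metadata for libraries", "uri": "http://example.org/metadata"}],
}
centroids = {name: build_centroid(records) for name, records in databases.items()}

print(route([("title", "routing")], centroids))  # prints ['gateway-1']
print(route([("uri", "http://example.org/metadata")], centroids))  # prints ['gateway-2']
```

A query for which `route` returns an empty list need not be sent anywhere,
which is the saving that forward knowledge is meant to provide.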
Conclusions
Three important issues with respect to distributed querying of resources
are: schema discovery, query language translation, and forward knowledge.
We believe it is worthwhile to investigate whether a framework for a query
language can be developed that deals with the issues raised and can serve
as a common ground in distributed searching of Web-based metadata.
References
[1] Development of a European Service for Information on Research and Education
(DESIRE), http://www.desire.org/.
[2] TERENA CHIC-Pilot project, http://www.terena.nl/projects/chic-pilot/.
[3] P. Valkenburg, D. Beckett, M. Hamilton, S. Wilkinson, Standards
in the CHIC-Pilot Distributed Indexing Architecture, in: Computer Networks
and ISDN Systems special issue "Proceedings of the TERENA Networking Conference
1998", http://www.terena.nl/libr/tech/chic-fr.html.
[4] Resource Organisation and Discovery in Subject-Based Services (ROADS),
http://www.ilrt.bris.ac.uk/roads/ (project),
http://www.roads.lut.ac.uk/ (software).
[5] D. Brickley, R.V. Guha, A. Layman, Resource Description Framework
(RDF) Schema Specification, W3C Working Draft 30 October 1998,
http://www.w3.org/TR/WD-rdf-schema/.
[6] TERENA CHIC-Pilot Deliverable D3.1: Search Profile Based on
WHOIS++, http://www.terena.nl/projects/chic-pilot/deliverables/D3.1_draft.html.
[7] Version 2 of Application Profile for GILS, http://www.gils.net/prof_v2.html.
[8] L. Gravano, K. Chang, H. Garcia-Molina, C. Lagoze, A. Paepcke,
Stanford Protocol Proposal for Internet Search and Retrieval, January 1997,
http://www-db.stanford.edu/~gravano/starts.html.
[9] Chris Rusbridge, Towards the Hybrid Library, D-Lib Magazine, July/August
1998, http://www.dlib.org/dlib/july98/rusbridge/07rusbridge.html.
[10] Jon Knight, Dan Brickley, Martin Hamilton, John Kirriemuir, Susan Welsh,
Cross-Searching Subject Gateways: The Query Routing and Forward Knowledge
Approach, D-Lib Magazine, January 1998,
http://www.dlib.org/dlib/january98/01kirriemuir.html.
[11] J. Allen, M. Mealling, The Architecture of the Common Indexing Protocol
(CIP), work in progress of the IETF Find working group,
http://www.ietf.org/ids.by.wg/find.html.
[12] Jim Miller (ed.), Paul Resnick, David Singer,
Rating Services and Rating Systems (and Their Machine Readable
Descriptions) Version 1.1, PICS Working Group, W3C,
http://www.w3.org/TR/REC-PICS-services.
[13] Emma Worsfold, Subject gateways - fulfilling the DESIRE for knowledge,
Computer Networks and ISDN Systems (Vol 30, Numbers 12-18), 30th Sept 1998,
http://www.desire.org/html/research/publications/tnc98gateways/ (preprint URL).