Query Language Issues in a Distributed Indexing Environment

Peter Valkenburg <peter.valkenburg@surfnet.nl> Dan Brickley <daniel.brickley@bristol.ac.uk>
November 1998

Abstract

This contribution to the W3C's QL'98 workshop looks at some of the issues of querying in the distributed indexing and searching environment available on the Web (and the Internet as a whole) today. We briefly discuss three issues with respect to distributed querying of resources: schema discovery, query language translation, and query routing. Drawing on experience, we formulate some tentative requirements that we think need to be fulfilled for the wide application of a new query language in a heterogeneous Web metadata indexing environment.

Introduction

The authors have been involved in the development of several distributed indexing, cataloguing and searching projects, notably DESIRE [1], CHIC-Pilot [2,3], and ROADS [4]. The technologies involved were varied, since the goals included tying together widely deployed services and protocols; they include Z39.50/GILS, WHOIS++, LDAP, CIP, RDF and others.

Three issues have persisted in these and other projects when trying to build services that `cross-search' multiple datasets, query protocols, and search services:

Schema Discovery and Translation

Query Language Translation

Query Routing and Forward Knowledge

It is likely that not all of these issues can be solved within the framework of a query language alone; for instance, query routing may presuppose a registry of search services that falls outside the domain of any particular query language. However, a query language is very important as a mechanism to locate distributed search services.

We will touch upon each of the above issues in the following sections.

Schema Discovery and Translation

Schema discovery is an important element of distributed search systems. Clients need to be able to find out what attribute sets are supported by the various services that they search. In a distributed context one often cannot rely on all search services to support the same syntax and semantics of attributes as those that a client would want to search on.

Some important requirements related to this are:

Flexible support for both `home-grown' and standardised attribute sets: The query language should allow for schema definition using public standards such as the Dublin Core, but also for derived schemas and wholly independent ones. A mechanism for defining schemas and their attributes (or properties) is part of the "Resource Description Framework (RDF) Schema Specification" [5]. An important element of this framework is the universal identification of schemas, in this case through URIs. Through assigning every attribute (property) a URI, some of the present ambiguity in the distributed search environment may be removed.

Discovery of schemas: The discovery of schemas used by a particular service must be possible. A client querying a service needs this to find out what appropriate attributes/properties are supported by the service.

Schema mapping: In many cases search services offer derived or overlapping sets of searchable properties. To effectively cross-search these services, it should be possible to build and query schema mapping services. As an example, cross-searching document metadata with different sets of `subject category' properties may be done using a service that maps a subject category property in one schema to one or more properties in another schema. Any Web query language framework should allow communities of expertise to define their own resource description vocabularies (such as the Z39.50 attribute sets), and to describe in a machine-processable manner how those vocabularies relate to others.
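
As a minimal sketch of the schema mapping idea, the following Python fragment rewrites a query expressed against one schema into terms a target service supports. Everything here is an assumption made for illustration: the mapping table, the `example.org' property URIs, the function name `translate_query', and the choice of Dublin Core namespace URI are not taken from any of the projects discussed, and a real mapping service would be queried over the network rather than held in a dictionary.

```python
# Illustrative property URIs. The example.org URIs are hypothetical.
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
LOCAL_TOPIC = "http://example.org/schema/catalogue#topic"
LOCAL_KEYWORD = "http://example.org/schema/catalogue#keyword"

# Hypothetical mapping table: one source property maps to one or more
# properties in another schema.
SCHEMA_MAP = {
    DC_SUBJECT: [LOCAL_TOPIC, LOCAL_KEYWORD],
}


def translate_query(query, supported, mapping):
    """Rewrite a list of (property URI, value) terms for a target service.

    Returns a list of groups. Each group is a list of alternative
    (property URI, value) terms to be OR-ed together; the groups
    themselves are meant to be AND-ed.
    """
    groups = []
    for prop, value in query:
        if prop in supported:
            # The service understands this property directly.
            groups.append([(prop, value)])
            continue
        alternatives = [
            (target, value)
            for target in mapping.get(prop, [])
            if target in supported
        ]
        if not alternatives:
            raise LookupError(f"no supported mapping for property {prop}")
        groups.append(alternatives)
    return groups


if __name__ == "__main__":
    service_properties = {LOCAL_TOPIC, LOCAL_KEYWORD}
    print(translate_query([(DC_SUBJECT, "chemistry")], service_properties, SCHEMA_MAP))
```

Identifying each property by URI is what makes the lookup unambiguous: two services that both call a field `subject' but mean different things would carry different URIs and so would not be conflated.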

Query Language Translation

The importance of query language translation lies in the observation that the Web and the Internet are very heterogeneous environments. No single indexing or searching technology can be expected to cover all indexing and searching applications. Consequently, having to translate queries and their results from one query language into another will remain a fact of life.

A query language that has to fulfill the role of a universal front-end to a variety of other query languages and protocols typically faces the following issues:

Finding a common core of distributed search functionality: End-users expect some types of search functionality to be available; in most applications this encompasses at least unnested basic boolean operators (AND, OR, AND-NOT), lstring (or prefix, or partial) matching and multiple attribute searches (`search for any field'). On the other hand, advanced applications may require SQL-type complex queries with variables etc. A query language that can be applied in both cases and that also acts as a front-end to multiple services should be amenable to defining profiles of subsets of its functionality. Experience with profiles such as WHOIS++/CHIC-Pilot [6], Z39.50/GILS [7] and STARTS [8] indicates that it is quite important to be able to define a reasonably universal, yet easily implementable profile for distributed searching.

Discovery of search functionality: Related to the above, it should be possible to discover the search functionality that a search service supports, in order to find out which queries can meaningfully be sent to it. If a particular type of search functionality is not available, a service may offer a feature that produces a result which is nearly as good for the purpose of the client, so the client can choose to use that feature instead. As an example of this, consider prefix (or lstring, or right-truncation) versus substring searching. If a service only offers substring searching, and the client wants to do a prefix search, then the client can rewrite the query as a substring search and filter out the results in which the search term does not occur as a prefix. Another example is case-sensitive versus case-insensitive searching.

Simple search result syntax: In order to allow easy processing of search results by various clients, a simple default result presentation format should be available. Other representations may be available through format-negotiation, but it is essential to be able to create simple clients which do not need to cope with a variety of data formats when processing search results.
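
The prefix-versus-substring fallback described above can be sketched as follows. This is an illustration under stated assumptions, not an implementation of any of the protocols mentioned: the `SubstringOnlyService' class, its `capabilities' and `search' methods, and the `prefix_search' helper are all hypothetical names, and the toy service matches case-insensitively.

```python
class SubstringOnlyService:
    """Hypothetical in-memory search service that only offers substring matching."""

    def __init__(self, records):
        self.records = records

    def capabilities(self):
        # A real service would advertise this through its protocol.
        return {"substring"}

    def search(self, attribute, term, match):
        if match not in self.capabilities():
            raise ValueError(f"unsupported match type: {match}")
        term = term.lower()
        return [r for r in self.records if term in r.get(attribute, "").lower()]


def prefix_search(service, attribute, prefix):
    """Run a prefix search, emulating it client-side if the service cannot."""
    supported = service.capabilities()
    if "prefix" in supported:
        return service.search(attribute, prefix, match="prefix")
    if "substring" in supported:
        # Over-fetch with a substring search, then drop every result in
        # which the term does not occur at the start of the attribute.
        candidates = service.search(attribute, prefix, match="substring")
        wanted = prefix.lower()
        return [
            r for r in candidates
            if r.get(attribute, "").lower().startswith(wanted)
        ]
    raise NotImplementedError("service offers neither prefix nor substring matching")


if __name__ == "__main__":
    service = SubstringOnlyService([
        {"title": "Indexing the Web"},
        {"title": "Distributed Indexing"},
    ])
    print(prefix_search(service, "title", "index"))
```

The emulation is only possible because the client can first discover that substring matching is available; the price is that the service returns more results than the client finally keeps.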

Query Routing and Forward Knowledge

The distributed nature of the Web presents a challenge for building usable and intuitive resource discovery services. Deployment experience with large-scale search services (e.g. [9]) suggests that new mechanisms are required for managing distributed searches more effectively. If we want to construct systems in which a user enters a single search expression and has that request satisfied by a number of searchable databases, it is essential to have "forward knowledge" about the contents of those databases. Simply broadcasting all queries to multiple databases will not scale.

There are several types of "forward knowledge" which may contribute to a more scalable architecture for distributed searching. This data can be used in a number of scenarios; a common approach is likely to be the "referral" mechanism as used in the WHOIS++ and LDAP protocols. A "referral" is an additional component of a search result which informs the search client about alternative databases that could yield relevant results. An alternative scenario involves a central index server or broker that gathers forward knowledge for multiple databases, redirecting search clients to the most appropriate target(s).
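
A client that follows referrals might look like the sketch below. The server table, the server names, and the functions `query_server' and `search_with_referrals' are hypothetical and stand in for real protocol exchanges; in particular, this toy server always returns its full referral list, whereas a real server would normally refer only to databases likely to hold relevant results.

```python
from collections import deque

# Hypothetical servers: name -> (local records, names of servers referred to).
SERVERS = {
    "gateway-a": (["Chemistry Web guide"], ["gateway-b"]),
    "gateway-b": (["Organic chemistry index"], ["gateway-a", "gateway-c"]),
    "gateway-c": (["History of printing"], []),
}


def query_server(name, term):
    """Return (hits, referrals) for one server; stands in for a network call."""
    records, referrals = SERVERS[name]
    hits = [r for r in records if term.lower() in r.lower()]
    return hits, referrals


def search_with_referrals(start, term, max_servers=10):
    """Query `start`, then follow referrals breadth-first without revisiting."""
    visited = set()
    pending = deque([start])
    results = []
    while pending and len(visited) < max_servers:
        name = pending.popleft()
        if name in visited:
            continue  # referral loops are common; never query a server twice
        visited.add(name)
        hits, referrals = query_server(name, term)
        results.extend(hits)
        pending.extend(r for r in referrals if r not in visited)
    return results


if __name__ == "__main__":
    print(search_with_referrals("gateway-a", "chemistry"))
```

The `max_servers' bound is a design choice of this sketch: without some limit, a client that chases every referral degenerates into the broadcast approach that does not scale.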

Forward knowledge requirements for effective query routing include the following issues. These are largely independent of the choice of query language, but nevertheless form a crucial component of any distributed search system:

Bulk metadata (e.g. all the words in the database): It is sometimes useful to extract summaries of the textual contents of a database for use by search clients. This approach is used in the WHOIS++ directory protocol, where a centroid for a database contains in effect a list of all the unique terms in all the fields of the database. By making use of such data, search clients can perform simple checks (e.g. word occurrence) to avoid sending unnecessary queries which are doomed to failure [10]. This principle has more recently been proposed in a more generalised framework as the 'Common Indexing Protocol' (CIP) [11], which allows for a variety of data formats to be used when characterising the content of networked databases.

Collection-level description: To give a search client enough information to decide whether a query should be routed to a given database, or to construct an appropriate interface to allow users to decide this, it is necessary to also have some high-level information about that database. The vocabularies used to do so should not be a fixed and built-in component of the search infrastructure. Different communities will need the flexibility to describe these resources using their own descriptive schemas. In this respect, database characterisation is just a special application of resource description.

Description Service Location: A PICS Rating Service [12] offers an important facility. Given a URI, the service will provide a machine-processable label, description or annotation for the resource specified. RDF services will build upon the capability of PICS; there are likely to be similar services using RDF which will describe a resource given its URI. For these "description services" to reach their full potential, a mechanism is needed that allows us to discover, for any given URI, which services can offer descriptions, and on what terms. For example, there is an increasing number of catalogues which offer library-like descriptions or reviews of Web resources [13]. No single catalogue can offer complete coverage of the Web, so there is a need for a 'forward knowledge' mechanism by which search agents might discover services that offer third-party descriptions and metadata annotations for some specified Web resource. Since the URI can be regarded as simply another field in the database, this problem can be seen as a special case of the "bulk metadata" issue.
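
The bulk metadata check can be illustrated with a simplified, centroid-like summary. The sketch below is an assumption-laden approximation rather than the WHOIS++ centroid format or CIP: the functions `build_centroid' and `route', the record layout, and the database names are hypothetical.

```python
import re


def build_centroid(records):
    """Map each field name to the set of unique lower-cased words in that field."""
    centroid = {}
    for record in records:
        for field, text in record.items():
            words = re.findall(r"\w+", text.lower())
            centroid.setdefault(field, set()).update(words)
    return centroid


def route(query_terms, field, centroids):
    """Return the databases whose centroid holds every query term for `field`.

    A database that lacks any of the terms cannot satisfy an AND query,
    so it is skipped without being contacted.
    """
    wanted = {term.lower() for term in query_terms}
    return [
        name
        for name, centroid in centroids.items()
        if wanted <= centroid.get(field, set())
    ]


if __name__ == "__main__":
    centroids = {
        "chemistry-gateway": build_centroid([{"title": "Organic chemistry index"}]),
        "history-gateway": build_centroid([{"title": "History of printing"}]),
    }
    print(route(["chemistry"], "title", centroids))
```

Passing the check is necessary but not sufficient: the terms may occur in different records of the same database, so a routed query can still return nothing. If record URIs are stored as one more field, the same check gives a rough answer to which services might describe a given resource.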

Conclusions

Three important issues with respect to distributed querying of resources are: schema discovery, query language translation, and forward knowledge. We believe it is worthwhile to investigate whether a framework for a query language can be developed that deals with the issues raised and can serve as a common ground in distributed searching of Web based metadata.

References

[1]	Development of a European Service for Information on Research and Education (DESIRE), http://www.desire.org/.
[2]	TERENA CHIC-Pilot project, http://www.terena.nl/projects/chic-pilot/.
[3]	P. Valkenburg, D. Beckett, M. Hamilton, S. Wilkinson, Standards in the CHIC-Pilot Distributed Indexing Architecture, in: Computer Networks and ISDN Systems special issue "Proceedings of the TERENA Networking Conference 1998", http://www.terena.nl/libr/tech/chic-fr.html.
[4]	Resource Organisation and Discovery in Subject-Based Services (ROADS), http://www.ilrt.bris.ac.uk/roads/ (project), http://www.roads.lut.ac.uk/ (software).
[5]	D. Brickley, R.V. Guha, A. Layman, Resource Description Framework (RDF) Schema Specification, W3C Working Draft 30 October 1998, http://www.w3.org/TR/WD-rdf-schema/.
[6]	TERENA CHIC-Pilot Deliverable D3.1: Search Profile Based on WHOIS++, http://www.terena.nl/projects/chic-pilot/deliverables/D3.1_draft.html.
[7]	Version 2 of Application Profile for GILS, http://www.gils.net/prof_v2.html.
[8]	L. Gravano, K. Chang, H. Garcia-Molina, C. Lagoze, A. Paepcke, Stanford Protocol Proposal for Internet Search and Retrieval, January 1997, http://www-db.stanford.edu/~gravano/starts.html.
[9]	Chris Rusbridge, Towards the Hybrid Library, D-Lib Magazine, July/August 1998. http://www.dlib.org/dlib/july98/rusbridge/07rusbridge.html
[10]	Jon Knight, Dan Brickley, Martin Hamilton, John Kirriemuir, Susan Welsh. Cross-Searching Subject Gateways: The Query Routing and Forward Knowledge Approach. D-Lib Magazine, January 1998. http://www.dlib.org/dlib/january98/01kirriemuir.html
[11]	J. Allen, M. Mealling, The Architecture of the Common Indexing Protocol (CIP), work in progress of the IETF Find working group. http://www.ietf.org/ids.by.wg/find.html
[12]	Jim Miller (ed.), Paul Resnick, David Singer, Rating Services and Rating Systems (and Their Machine Readable Descriptions) Version 1.1, PICS Working Group, W3C. http://www.w3.org/TR/REC-PICS-services
[13]	Emma Worsfold, Subject gateways - fulfilling the DESIRE for knowledge, Computer Networks and ISDN Systems (Vol 30, Numbers 12-18), 30th Sept 1998. http://www.desire.org/html/research/publications/tnc98gateways/ (preprint URL)