Text Databases and Information Retrieval
ELLEN RILOFF and LEE HOLLAAR
Department of Computer Science, University of Utah <riloff,hollaar@cs.utah.edu>

The goal of a traditional information
retrieval (IR) system is to search an
information repository, such as a text
database, and retrieve documents that
are potentially relevant to a query.
Since query-based IR systems must operate in real time, they must be able to
search large volumes of text quickly and
efficiently. Other information-retrieval
applications, such as text categorization, text routing, and text filtering, are
also becoming increasingly important.
These applications are generally concerned with long-term information
needs, where a topic is expected to be of
interest for an extended period of time.
Text categorization systems assign predefined category labels to texts. For example, a text categorization system for
computer science might use categories
such as operating systems, programming languages, artificial intelligence,
or information retrieval. Text routing
systems typically accept a set of user
profiles and automatically classify texts
so that relevant texts can be routed to
appropriate users [Harman 1994]. Text
filtering systems accept a list of topics
that are, or are not, of interest and
allow only texts that satisfy the filter to
pass through to the user [Belkin and
Croft 1992]. Text categorization systems
are typically applied to static databases,
while text routing and text filtering systems are usually applied to incoming
data streams.
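As a rough illustration of text filtering over an incoming stream, the sketch below lets a text through only if it mentions a topic of interest and no excluded topic. The function names, the keyword sets, and the sample texts are hypothetical, and keyword overlap is an assumed stand-in for the richer topic profiles a real filtering system would use.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def passes_filter(text, wanted, unwanted):
    """Return True if the text mentions a wanted word and no unwanted word."""
    words = tokenize(text)
    return bool(words & wanted) and not (words & unwanted)

# Hypothetical topic lists for one user.
wanted = {"retrieval", "indexing"}
unwanted = {"advertisement"}

# A small stand-in for an incoming data stream.
stream = [
    "New results on passage retrieval",
    "Advertisement: indexing software on sale",
    "Compiler design notes",
]

delivered = [text for text in stream if passes_filter(text, wanted, unwanted)]
print(delivered)  # only the first text satisfies the filter
```

A routing system could reuse the same test once per user profile and deliver each text to every user whose profile it satisfies.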
Information-retrieval systems must
grapple with all of the ambiguities and
idiosyncrasies inherent in natural language, such as synonymy (e.g., “start”,
“begin”, and “initiate” have essentially
the same meaning) and polysemy (e.g.,
“shot” has many different meanings, including the act of shooting, an injection,
a quantity of liquor, a photograph, pellets, or an attempt). Phrases also require special attention because multiword
expressions often have a composite meaning different from the
individual words. For example, a “hot
dog” does not usually refer to a warm
canine, and an “operating system” does
not usually refer to a system that is
simply operating.
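One simple way to give phrases special attention is to index a known multiword expression as a single term. The sketch below assumes a small hand-built phrase lexicon (`PHRASES`) and whitespace tokenization; both are hypothetical simplifications, not a method taken from the systems discussed here.

```python
# Hypothetical phrase lexicon; a real system would use a much larger one.
PHRASES = {("hot", "dog"), ("operating", "system")}

def phrase_tokens(text):
    """Tokenize on whitespace, merging known two-word phrases into one token."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            # Emit the phrase as one index term and skip both words.
            tokens.append("_".join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(phrase_tokens("The operating system is operating"))
# ['the', 'operating_system', 'is', 'operating']
```

With this representation, a query for the phrase no longer matches a text that merely contains the word "operating".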
Most information-retrieval systems
preprocess a document collection into an
inverted file that allows the system to
determine quickly which documents contain
each word. Stopword lists are commonly used to remove highly frequent
words, such as “the” and “of,” under the
assumption that they don’t contribute
much to the meaning of a text. Stemming
algorithms are sometimes used to reduce
a word to its root form so that different
morphological variations will match
[Frakes and Baeza-Yates 1992]. An alternative text-representation scheme uses
superimposed codewords to produce a
fixed-length vector from the binary representations of words. The fixed-length vector is especially useful for parallel and
hardware systems, but because different words can set the same bits, this method can
sometimes hallucinate words that don’t
actually appear in the original document (so-called false drops).
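The inverted-file preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list is a tiny hypothetical one, `stem` is a crude suffix stripper standing in for a real stemming algorithm, and the three documents are invented.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to"}

def stem(word):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize a text, drop stopwords, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]

def build_inverted_file(docs):
    """Map each index term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "The design of operating systems",
    2: "Indexing and searching text databases",
    3: "Searching the text of a document",
}
inverted = build_inverted_file(docs)
print(inverted["search"])  # [2, 3]
print(inverted["system"])  # [1]
```

Because "searching" is reduced to "search" at indexing time, a query term has to be passed through the same `index_terms` step before it is looked up.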
Copyright © 1996, CRC Press.
ACM Computing Surveys, Vol. 28, No. 1, March 1996

Traditional information-retrieval methods retrieve documents by searching for
relevant words or phrases. Most commercial IR systems allow the user to define a
query using keywords and standard Boolean operators. These systems retrieve
documents that precisely match the query. The vector-space model [Salton
1971] is a well-known method for automatic indexing that views each document
and query as a vector in an N-dimensional space, where N is the number of
relevant terms in the database. The query vector is compared to all of the
document vectors using a similarity metric. Another retrieval model for automatic
indexing uses probability estimates to determine whether a document satisfies a
user’s query. For example, Bayesian inference networks have been used to compute
the belief associated with a query for each document in a database.
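To make the vector-space comparison concrete, the sketch below ranks documents against a query. The text above does not name a particular similarity metric or term weighting, so the cosine measure and raw term counts used here are assumptions chosen because they are common. The sample documents are invented.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a text as raw term counts, one dimension per distinct term."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_id) pairs sorted from most to least similar."""
    q = to_vector(query)
    scores = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scores, reverse=True)

docs = {
    "d1": "text retrieval from large text databases",
    "d2": "operating systems and programming languages",
    "d3": "retrieval of images",
}
for score, doc_id in rank("text retrieval", docs):
    print(doc_id, round(score, 3))
# d1 ranks first, d3 second, and d2 (no shared terms) last with score 0.0
```

Unlike a Boolean match, every document receives a score, so partially matching documents such as d3 can still be returned in ranked order.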
Relevance feedback techniques can
improve performance by asking the user
for feedback about the retrieved texts
[Salton 1989; Van Rijsbergen 1979]. The
user labels a subset of the retrieved
texts as relevant, and this information
is fed back into the system to modify the
original query, usually by adding new
terms or by changing the weights of the
original query terms. Relevance feedback has consistently been shown to
improve the performance of IR systems.
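The query modification step can be sketched with a Rocchio-style update, which is one classic formulation and not necessarily the one used in the works cited above. The function name `feedback_query`, the weights `alpha` and `beta`, and the sample vectors are assumptions for illustration. Only texts marked relevant are used.

```python
from collections import Counter

def feedback_query(query, relevant_docs, alpha=1.0, beta=0.5):
    """Reweight and expand a query using texts the user marked relevant.

    New weight = alpha * original weight + beta * mean weight in relevant texts.
    """
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            # Terms absent from the original query are added here.
            new_query[term] += beta * weight / len(relevant_docs)
    return dict(new_query)

query = {"shot": 1.0}
relevant = [
    Counter({"shot": 1, "vaccine": 2}),
    Counter({"shot": 1, "injection": 1}),
]
updated = feedback_query(query, relevant)
print(updated)  # {'shot': 1.5, 'vaccine': 0.5, 'injection': 0.25}
```

Here the feedback both raises the weight of the original term and adds "vaccine" and "injection", which steers an ambiguous query such as "shot" toward the sense the user wanted.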
Experiments with richer text representations have also been conducted using natural-language processing (NLP)
techniques. Syntactic approaches have
been used to generate more complex
indexing terms consisting of phrases
and head-modifier structures. Knowledge-based NLP systems have been
used to generate conceptual meaning
representations of queries and documents. Information extraction techniques [Lehnert and Sundheim 1991]
have also been shown to be effective for
text classification problems, and represent a compromise between word-based
techniques and in-depth natural-language processing.
The future holds great promise for
integrating information-retrieval techniques with natural-language processing systems. The strengths of these
methodologies are largely complementary. IR systems use shallow text representations, which allows them to process large amounts of text quickly and
efficiently. But the accuracy of these
systems often suffers because of a lack
of semantic analysis, especially for complex information requests. Natural-language processing systems, on the other
hand, usually perform conceptual analyses, which allows them to produce
richer meanings and representations.
However, NLP techniques are more
computationally expensive and therefore are more difficult to scale up to
large text collections.
The information-retrieval community is
facing new challenges posed by larger
and more heterogeneous text databases,
which have led to an explosion of new
approaches and methodologies. As longer
texts become available on-line, new approaches are needed to process texts that
discuss multiple topics. A variety of techniques for subtopic identification and passage-based retrieval are actively being explored. Another area of active research is
intelligent information retrieval, which
draws upon techniques from artificial intelligence to generate richer text representations. Natural-language processing
methods (such as information extraction),
case-based reasoning techniques, and machine learning algorithms are all being
applied to information retrieval tasks in
the hopes of building more effective retrieval systems (for example, see ACM
[1995]). Intelligent information retrieval
is an exciting new direction for IR research.
REFERENCES
ACM. 1995. Proceedings of the 18th Annual
International ACM SIGIR Conference on
Research and Development in Information Retrieval. ACM, New York.
BELKIN, N. AND CROFT, W. B. 1992. Information
filtering and information retrieval: Two sides
of the same coin? Commun. ACM 35, 12,
29–38.
FRAKES, W. B. AND BAEZA-YATES, R., EDS.
1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
HARMAN, D., ED. 1994. The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special
Publication 500–215, Gaithersburg, MD.
LEHNERT, W. G. AND SUNDHEIM, B. 1991. A performance
evaluation of text analysis technologies. AI Mag. 12, 3, 81–94.
SALTON, G., ED. 1971. The SMART Retrieval
System: Experiments in Automatic Document
Processing. Prentice-Hall, Englewood Cliffs,
NJ.
SALTON, G. 1989. Automatic Text Processing:
The Transformation, Analysis, and Retrieval
of Information by Computer. Addison-Wesley,
Reading, MA.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval (2nd Ed.). Butterworths, London.

