Text Databases and Information Retrieval
ELLEN RILOFF and LEE HOLLAAR
Department of Computer Science, University of Utah <riloff,hollaar@cs.utah.edu>

The goal of a traditional information
retrieval (IR) system is to search an
information repository, such as a text
database, and retrieve documents that
are potentially relevant to a query.
Since query-based IR systems must operate in real time, they must be able to
search large volumes of text quickly and
efficiently. Other information-retrieval
applications, such as text categorization, text routing, and text filtering, are
also becoming increasingly important.
These applications are generally concerned with long-term information
needs, where a topic is expected to be of
interest for an extended period of time.
Text categorization systems assign predefined category labels to texts. For example, a text categorization system for
computer science might use categories
such as operating systems, programming languages, artificial intelligence,
or information retrieval. Text routing
systems typically accept a set of user
profiles and automatically classify texts
so that relevant texts can be routed to
appropriate users [Harman 1994]. Text
filtering systems accept a list of topics
that are, or are not, of interest and
allow only texts that satisfy the filter to
pass through to the user [Belkin and
Croft 1992]. Text categorization systems
are typically applied to static databases,
while text routing and text filtering systems are usually applied to incoming
data streams.
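The difference between routing and filtering over an incoming stream can be illustrated with a small Python sketch. The profiles, topic lists, and the keyword-overlap matching rule below are illustrative assumptions, not a description of any particular system cited here.

```python
# Toy sketch of text routing and text filtering over an incoming stream.
# Profiles, topics, and the keyword-overlap rule are hypothetical.

def tokenize(text):
    """Lowercase a text and return its set of punctuation-stripped words."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}

def route(text, profiles):
    """Return the users whose profile shares at least one keyword with the text."""
    words = tokenize(text)
    return sorted(user for user, keywords in profiles.items() if words & keywords)

def passes_filter(text, wanted, unwanted):
    """Pass a text only if it mentions a wanted topic and no unwanted topic."""
    words = tokenize(text)
    return bool(words & wanted) and not (words & unwanted)

profiles = {
    "alice": {"compiler", "parser"},
    "bob": {"scheduler", "kernel"},
}
stream = [
    "A new kernel scheduler for multiprocessors.",
    "Parser generators and compiler construction.",
    "Quarterly sales report.",
]
routed = [route(text, profiles) for text in stream]
```

Here routing sends each text to every matching user, whereas the filter makes a single pass-or-block decision for one user. A real system would likely use weighted profiles rather than exact keyword overlap.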
Information-retrieval systems must
grapple with all of the ambiguities and
idiosyncrasies inherent in natural language, such as synonymy (e.g., “start”,
“begin”, and “initiate” have essentially
the same meaning) and polysemy (e.g.,
“shot” has many different meanings, including the act of shooting, an injection,
a quantity of liquor, a photograph, pellets, or an attempt).
Phrases also require special attention because multiword expressions often have a
composite meaning different from the
individual words. For example, a “hot
dog” does not usually refer to a warm
canine, and an “operating system” does
not usually refer to a system that is
simply operating.
Most information-retrieval systems
preprocess a document collection into an
inverted file that allows the system to
determine quickly which documents contain
each word. Stopword lists are commonly used to remove highly frequent
words, such as “the” and “of,” under the
assumption that they don’t contribute
much to the meaning of a text. Stemming
algorithms are sometimes used to reduce
a word to its root form so that different
morphological variations will match
[Frakes and Baeza-Yates 1992]. An alternative text-representation scheme uses
superimposed codewords to produce a
fixed-length vector from the binary representations of words. The fixed-length vector is especially useful for parallel and
hardware systems, but this method can
sometimes produce false matches for words that don’t
actually appear in the original document.
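The two text representations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stopword list is a tiny sample, `stem` is a crude suffix stripper standing in for a real stemming algorithm, and the signature width (64 bits) and the number of hashed positions per word (3) are arbitrary choices. All function names are hypothetical.

```python
import hashlib

STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}  # tiny illustrative list

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Lowercase, strip punctuation, drop stopwords, and stem the remaining words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOPWORDS]

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(terms(text)):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def codeword(term, bits=64, hashes=3):
    """Hash a term to a bit pattern with up to `hashes` bits set."""
    code = 0
    for i in range(hashes):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        code |= 1 << (int.from_bytes(digest[:4], "big") % bits)
    return code

def signature(text, bits=64, hashes=3):
    """Superimpose (bitwise OR) the codewords of all terms into one fixed-length vector."""
    sig = 0
    for term in terms(text):
        sig |= codeword(term, bits, hashes)
    return sig

def may_contain(sig, word, bits=64, hashes=3):
    """True if all of the word's codeword bits are set in the signature."""
    code = codeword(stem(word.lower()), bits, hashes)
    return (sig & code) == code

docs = {
    1: "Operating systems schedule processes.",
    2: "The system searches the index.",
}
index = build_inverted_file(docs)
```

Because the codewords of different words are OR-ed into the same bits, `may_contain` never misses a word that was indexed, but it can return True for a word that never appeared in the document when other words happen to set the same bits. The inverted file has no such ambiguity, at the cost of a variable-length structure.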

Copyright © 1996, CRC Press.
ACM Computing Surveys, Vol. 28, No. 1, March 1996

Traditional information-retrieval methods retrieve documents by searching for
relevant words or phrases. Most commercial IR systems allow the user to define a
query using keywords and standard Boolean operators. These systems retrieve
documents that precisely match the
query. The vector-space model [Salton
1971] is a well-known method for automatic indexing that views each document
and query as a vector in an N-dimensional space, where N is the number of
relevant terms in the database. The
query vector is compared to all of the
document vectors using a similarity metric. Another retrieval model for automatic
indexing uses probability estimates to determine whether a document satisfies a
user’s query. For example, Bayesian inference networks have been used to compute the belief associated with a query for
each document in a database.
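The contrast between exact-match Boolean retrieval and vector-space ranking described above can be sketched as follows. The sketch assumes raw term frequencies as vector weights and cosine as the similarity metric; both are common choices but are assumptions here, and the three sample documents are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval systems search text databases",
    "d2": "operating systems manage memory and processes",
    "d3": "text retrieval and text filtering",
}

def boolean_and(query_terms, docs):
    """Exact-match retrieval: ids of the documents that contain every query term."""
    wanted = set(query_terms)
    return sorted(d for d, text in docs.items() if wanted <= set(text.split()))

def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Compare the query vector to every document vector and sort best first."""
    q = Counter(query.split())
    scores = {d: cosine(q, Counter(text.split())) for d, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

matches = boolean_and(["text", "retrieval"], docs)
ranking = rank("text retrieval", docs)
```

The Boolean query returns an unordered set of documents that satisfy it exactly, while the vector-space query scores every document, so partial matches still receive a nonzero rank and documents that use the query terms more heavily move toward the top.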
Relevance feedback techniques can
improve performance by asking the user
for feedback about the retrieved texts
[Salton 1989; Van Rijsbergen 1979]. The
user labels a subset of the retrieved
texts as relevant, and this information
is fed back into the system to modify the
original query, usually by adding new
terms or by changing the weights of the
original query terms. Relevance feedback has consistently been shown to
improve the performance of IR systems.
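One way to picture the query modification step is a Rocchio-style update, shown below as a minimal Python sketch. Rocchio's formulation is named here as an illustrative choice and is not necessarily the method used in the works cited above; the weights `alpha` and `beta`, the function name, and the sample texts are assumptions, and only positively labeled texts are used.

```python
from collections import Counter

def feedback(query, relevant_docs, alpha=1.0, beta=0.75):
    """Return alpha * query + beta * (average term-frequency vector of the relevant texts)."""
    new_query = Counter({term: alpha * weight for term, weight in query.items()})
    for doc in relevant_docs:
        for term, freq in Counter(doc.split()).items():
            # Terms absent from the original query are added; existing ones are reweighted.
            new_query[term] += beta * freq / len(relevant_docs)
    return dict(new_query)

query = {"retrieval": 1.0}
relevant = ["text retrieval systems", "document retrieval"]  # texts the user marked relevant
expanded = feedback(query, relevant)
```

The updated query both raises the weight of "retrieval", which occurs in every relevant text, and adds new terms such as "text" and "document" with smaller weights, matching the two kinds of modification described above.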
Experiments with richer text representations have also been conducted using natural-language processing (NLP)
techniques. Syntactic approaches have
been used to generate more complex
indexing terms consisting of phrases
and head-modifier structures. Knowledge-based NLP systems have been
used to generate conceptual meaning
representations of queries and documents. Information extraction techniques [Lehnert and Sundheim 1991]
have also been shown to be effective for
text classification problems, and represent a compromise between word-based
techniques and in-depth natural-language processing.
The future holds great promise for
integrating information-retrieval techniques with natural-language processing systems. The strengths of these
methodologies are largely complementary. IR systems use shallow text representations, which allows them to process large amounts of text quickly and
efficiently. But the accuracy of these
systems often suffers because of a lack
of semantic analysis, especially for complex information requests. Natural-language processing systems, on the other
hand, usually perform conceptual analyses, which allows them to produce
richer meanings and representations.
However, NLP techniques are more
computationally expensive and therefore are more difficult to scale up to
large text collections.
The information-retrieval community is
facing new challenges posed by larger
and more heterogeneous text databases,
which have led to an explosion of new
approaches and methodologies. As longer
texts become available on-line, new approaches are needed to process texts that
discuss multiple topics. A variety of techniques for subtopic identification and passage-based retrieval are actively being explored. Another area of active research is
intelligent information retrieval, which
draws upon techniques from artificial intelligence to generate richer text representations. Natural-language processing
methods (such as information extraction),
case-based reasoning techniques, and machine learning algorithms are all being
applied to information retrieval tasks in
the hopes of building more effective retrieval systems (for example, see ACM
[1995]). Intelligent information retrieval
is an exciting new direction for IR research.
REFERENCES
ACM. 1995. Proceedings of the 18th Annual
International ACM SIGIR Conference on
Research and Development in Information Retrieval. ACM, New York.
BELKIN, N. AND CROFT, W. B. 1992. Information
filtering and information retrieval: Two sides
of the same coin? Commun. ACM 35, 12,
29–38.
FRAKES, W. B. AND BAEZA-YATES, R., EDS.
1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
HARMAN, D., ED. 1994. The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special
Publication 500-215, Gaithersburg, MD.
LEHNERT, W. G. AND SUNDHEIM, B. 1991. A performance evaluation of text analysis technologies. AI Mag. 12, 3, 81–94.
SALTON, G., ED. 1971. The SMART Retrieval
System: Experiments in Automatic Document
Processing. Prentice-Hall, Englewood Cliffs,
NJ.
SALTON, G. 1989. Automatic Text Processing:
The Transformation, Analysis, and Retrieval
of Information by Computer. Addison-Wesley,
Reading, MA.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval (2nd Ed.). Butterworths, London.
