Ir 02

Lecture 02
Information Retrieval

[Figure: diagram of the search process, with boxes labeled Information Need, User Task, Query Formulation, Search Engine, Collection, Results, and Query Re-Formulation.]

[Figure: the same search-process diagram, annotated with three failure points: Misconception, Misformulation, and Mis-Reformulation.]

Boolean Retrieval Model
An Example of an IR Problem
 Task:
 Suppose you wanted to determine which plays of Shakespeare
contain the words Brutus AND Caesar AND NOT Calpurnia.
 One way to do this is to read through all the text, noting for each play whether it
contains Brutus and Caesar and excluding it from consideration
if it contains Calpurnia.
 This sort of linear scan through documents is actually the
simplest form of document retrieval for a computer to do.
 This process is commonly referred to as grepping through text,
after the Unix command grep, which performs this process.
 This can be a very efficient process for wildcard pattern matching.
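
The linear scan described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how grep itself is implemented; the function name `linear_scan` and the toy `plays` dictionary are hypothetical placeholders, and the texts are made up rather than taken from Shakespeare.

```python
import re

def linear_scan(plays, required, excluded):
    """Return titles whose text contains every required word and no excluded word."""
    hits = []
    for title, text in plays.items():
        # Naive tokenization: lowercase the text and keep alphabetic runs only.
        words = set(re.findall(r"[a-z]+", text.lower()))
        has_required = all(w in words for w in required)
        has_excluded = any(w in words for w in excluded)
        if has_required and not has_excluded:
            hits.append(title)
    return hits

# Hypothetical toy collection (placeholder text, not real play contents).
plays = {
    "Play A": "Brutus spoke to Caesar",
    "Play B": "Caesar and Calpurnia met Brutus",
    "Play C": "Caesar alone",
}

matches = linear_scan(plays, required=["brutus", "caesar"], excluded=["calpurnia"])
print(matches)
```

Every query rereads every document, which is why this approach stops scaling once the collection grows large.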

 In other scenarios, we need more than the “grepping”
function:
1. To process large document collections quickly. The amount
of online data has grown at least as quickly as the speed of
computers, and we would now like to be able to search
collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it
is impractical to perform the query Romans NEAR countrymen
with grep, where NEAR might be defined as “within 5 words” or
“within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best
answer to an information need among many documents that
contain certain words.

 The idea behind building Indexes:
 To avoid linearly scanning the texts for each query, we build
an INDEX for the documents in advance.
 The result for our initial task would be a binary term-
document incidence matrix.

 To answer the query “Brutus AND Caesar AND NOT
Calpurnia”, we take the vectors for Brutus, Caesar
and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
Answer: the plays in the first and fourth columns of the matrix (the positions of the 1 bits in 100100).
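
The bitwise computation above can be checked with a short Python sketch. The Brutus and Caesar vectors are the ones given in the text; the Calpurnia vector 010000 is inferred here from its stated complement 101111. The names `WIDTH` and `MASK` are illustrative assumptions, needed only because Python integers have no fixed width.

```python
WIDTH = 6                   # one bit per play (six columns in the example)
MASK = (1 << WIDTH) - 1     # 0b111111, keeps the complement to six bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000        # inferred: its 6-bit complement is 101111

# Brutus AND Caesar AND NOT Calpurnia
result = brutus & caesar & (~calpurnia & MASK)

bits = format(result, f"0{WIDTH}b")
# 1-based column positions of the plays that satisfy the query
matching_columns = [i + 1 for i, b in enumerate(bits) if b == "1"]
print(bits, matching_columns)
```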

 The Boolean retrieval model: a model for information
retrieval in which we can pose any query that is a
Boolean expression of terms, that is, one in which terms are
combined with the operators AND, OR, and NOT.
 Among the limitations of this model is that it views each
document as just a set of words.

Further Terminology & Notations
 Documents: whatever units we have decided to build a retrieval system
over.
 Collection: the group of documents over which we perform retrieval.
Sometimes it is also referred to as a corpus.
 In our previous example, the collection is “Shakespeare’s Collected Works” and each play is a document.
 Ad hoc Retrieval: the most standard IR task. In it, a system aims to
provide documents from within the collection that are relevant to an
arbitrary user information need, communicated to the system by means of
a one-off, user-initiated query (a temporary information need). This model is
also known as the Pull Text Access Model.
 Recommender Systems: When the user has a stable information need
(research topic / interest in sports news), the system takes the initiative and
recommends topics that are related to the user’s information need. This
model is a.k.a. Push Text Access Model.
 Information need: the topic about which the user desires to know more.
 Query: what the user conveys to the computer in an attempt to
communicate the information need.

 Relevance: A document is relevant if it is one that the user
perceives as containing information of value with respect to their
personal information need.
 We would like to find relevant documents regardless of whether they
precisely use the words expressed in the query or express the concept
we are looking for with other words.
 Effectiveness: the quality of an IR system’s search results. It
is measured by how relevant the returned results are to a given query.
 To measure effectiveness, two key statistics about the system’s
returned results are used:

 Precision: What fraction of the returned results are relevant to
the information need?
$$P = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}$$
 Recall: What fraction of the relevant documents in the collection
were returned by the system?
$$R = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in collection}}$$
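
As a small worked example of the two definitions, the sketch below computes precision and recall from sets of docIDs. The function name `precision_recall` and the docIDs are made up for illustration; returning 0.0 for an empty set is a convention chosen here, not something the definitions prescribe.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    # Guard against division by zero; 0.0 is an arbitrary convention here.
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: the system returns 4 documents, 5 are relevant overall,
# and 2 documents (docIDs 2 and 4) are in both sets.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```

Here precision is 2/4 and recall is 2/5: half of what was returned is relevant, but most of the relevant documents were missed.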

Building a term-document matrix
 In practice, we cannot build a term-document incidence
matrix for a realistically sized collection.
Let d = 1 million documents, each containing about 1,000
words.
* Assume an average of 6 bytes per word, including spaces and
punctuation; then this is a document collection about 6 GB in
size.
* Assume we have about M = 500,000 distinct terms in these
documents.
A 500K × 1M matrix has half-a-trillion 0’s and 1’s – too many to fit
in a computer’s memory.
But the crucial observation is that the matrix is extremely sparse,
that is, it has few non-zero entries.
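
The arithmetic behind these figures, and behind the sparsity claim, can be spelled out as follows. The variable names are illustrative. The bound on the number of 1s assumes the worst case in which every word occurrence in a document is a distinct term, so each occurrence sets at most one cell.

```python
num_docs = 1_000_000        # d
words_per_doc = 1_000
bytes_per_word = 6          # average, including spaces and punctuation
num_terms = 500_000         # M

collection_bytes = num_docs * words_per_doc * bytes_per_word   # about 6 GB
matrix_cells = num_terms * num_docs                            # half a trillion

# Each word occurrence can set at most one cell to 1, so the number of 1s
# is bounded by the total number of word occurrences in the collection.
max_ones = num_docs * words_per_doc
min_zero_fraction = 1 - max_ones / matrix_cells

print(collection_bytes, matrix_cells, max_ones, min_zero_fraction)
```

Under these assumptions at least 99.8% of the cells are zero, which is what makes storing only the 1 positions attractive.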

Building a term-document matrix
 So, a better representation is to record only the things that
do occur, that is, the 1 positions.
 This leads to a central idea in IR: the inverted index.
 An Inverted Index or Inverted File: an index that always
maps back from terms to the parts of a document where
they occur.
  We keep a dictionary of terms (sometimes also referred
to as a vocabulary or lexicon).
  Then for each term, we have a list that records which
documents the term occurs in.

Inverted Index
 Each item in the list – which records that a term appeared in a
document (and, later, often, the positions in the document) – is
conventionally called a posting.
 The list is then called a postings list (or inverted list), and all
the postings lists taken together are referred to as the postings.

Building an inverted index
 The major steps are:
1. Collect the documents to be indexed:
Doc1 = {Friends, Romans, countrymen}
Doc2 = {So let it be with Caesar}
Doc3 = {. . .}
2. Tokenize the text, turning each document into a list of
tokens:
Friends Romans countrymen So . . .

3. Do linguistic preprocessing, producing a list of normalized
tokens, which are the indexing terms:
friends romans countrymen so . . .
4. Index the documents that each term occurs in by creating
an inverted index, consisting of a dictionary and postings.

 Within a document collection, we assume that each document
has a unique serial number, known as the document identifier
(docID).
 The input to indexing is a list of normalized tokens for each
document, which we can equally think of as a list of pairs of term
and docID.
 The core indexing step is sorting this list so that the terms are
alphabetical.
 The postings are secondarily sorted by docID. This provides the
basis for efficient query processing.
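
The four steps above can be sketched as a sort-based index construction in Python. This is a minimal illustration under simplifying assumptions: tokenization keeps alphabetic runs only, and linguistic preprocessing is reduced to lowercasing. The function name `build_inverted_index` is hypothetical, and only the two documents spelled out in step 1 are used.

```python
import re
from itertools import groupby

def build_inverted_index(docs):
    """docs maps docID -> text; returns term -> docID-sorted postings list."""
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[A-Za-z]+", text):    # step 2: tokenize
            pairs.append((token.lower(), doc_id))        # step 3: normalize
    # Step 4: sort by term, then by docID; set() drops repeated (term, docID) pairs.
    pairs = sorted(set(pairs))
    # Group the sorted pairs into a dictionary of postings lists.
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(pairs, key=lambda pair: pair[0])
    }

docs = {
    1: "Friends, Romans, countrymen",
    2: "So let it be with Caesar",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Because the pairs are sorted before grouping, the dictionary comes out in alphabetical order and every postings list is already ordered by docID.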

Storage Requirements
 In the resulting index, we pay for storage of both the
dictionary and the postings lists.
 The dictionary is commonly kept in memory.
 Postings lists are normally kept on disk.
 So, the size of each is important.

What data structure should be used for a postings list?
 A fixed length array would be wasteful as some words occur
in many documents, and others in very few.
 For an in-memory postings list, two good alternatives:
 Singly linked lists: allow cheap insertion of documents into
postings lists (following updates, such as when recrawling the
web for updated documents). They also naturally extend to more
advanced indexing strategies such as skip lists, which require
additional pointers.
 Variable length arrays: win in space requirements by avoiding
the overhead for pointers and in time requirements because their
use of contiguous memory increases speed on modern
processors with memory caches.
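
To make the two alternatives concrete, here is a rough Python sketch of inserting a docID into a docID-sorted postings list under each representation. All names (`Node`, `linked_insert`, `array_insert`, `to_list`) are hypothetical. A Python `list` only stands in for a variable-length array here: it stores object references, so the space and cache benefits described above would not show up in this form.

```python
from bisect import bisect_left

class Node:
    """One posting in a singly linked postings list."""
    def __init__(self, doc_id, next=None):
        self.doc_id = doc_id
        self.next = next

def linked_insert(head, doc_id):
    """Insert doc_id into a docID-sorted linked list; return the (possibly new) head."""
    if head is None or doc_id < head.doc_id:
        return Node(doc_id, head)
    node = head
    while node.next is not None and node.next.doc_id < doc_id:
        node = node.next
    already_there = node.doc_id == doc_id or (
        node.next is not None and node.next.doc_id == doc_id
    )
    if not already_there:
        node.next = Node(doc_id, node.next)   # splice in: no elements are moved
    return head

def array_insert(postings, doc_id):
    """Insert doc_id into a docID-sorted array, keeping it sorted and duplicate-free."""
    i = bisect_left(postings, doc_id)
    if i == len(postings) or postings[i] != doc_id:
        postings.insert(i, doc_id)            # shifts all later elements

def to_list(head):
    """Collect the docIDs of a linked postings list, for inspection."""
    out = []
    while head is not None:
        out.append(head.doc_id)
        head = head.next
    return out

head, arr = None, []
for d in [4, 1, 3, 1]:
    head = linked_insert(head, d)
    array_insert(arr, d)
print(to_list(head), arr)
```

The linked version pays one pointer per posting but splices new postings in without moving anything; the array version stores postings contiguously but must shift elements on insertion.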

Exercise
 Draw the inverted index that would be built for the
following document collection:
Doc 1 new home sales top forecasts
Doc 2 home sales rise in July
Doc 3 increase in home sales in July
Doc 4 July new home sales rise

Exercise
 Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
A. Draw the term-document incidence matrix for this
document collection
B. What are the returned results for these queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
