1. Boolean and vector-space retrieval models – Term weighting – TF-IDF weighting –
Cosine similarity – Preprocessing – Inverted indices – Efficient processing with sparse
vectors – Language-model-based IR – Probabilistic IR – Latent Semantic Indexing –
Relevance feedback and query expansion.
CO2: Apply the knowledge of data structures and indexing methods in information
retrieval.
UNIT – II
2. Modeling
• Modeling in IR is a complex process aimed at producing a ranking function.
• Ranking function: a function that assigns scores to documents with regard to a given
query.
• This process consists of two main tasks:
• The conception of a logical framework for representing documents and queries
• The definition of a ranking function that quantifies the similarities
between documents and queries
• IR systems usually adopt index terms to index and retrieve documents
3. An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function that associates a real number (a score) with a query qi and a document dj
Taxonomy of IR Models
• Retrieval models are most frequently associated with distinct combinations of a
document logical view and a user task. The user tasks include retrieval and
browsing.
4. i) Ad Hoc Retrieval:
The documents in the collection remain relatively static
while new queries are submitted to the system.
ii) Filtering:
The queries remain relatively static while new
documents come into the system.
Classic IR models:
Each document is described by a set of representative keywords
called index terms. A numerical weight is assigned to each index
term to capture its relevance to the document.
Three classic models: Boolean, vector, probabilistic
5. Boolean Model:
• Model for information retrieval in which we can pose any query which is in the
form of a Boolean expression of terms, that is, in which terms are combined with
the operators AND, OR, and NOT.
• The model views each document as just a set of words. Based on a binary decision
criterion without any notion of a grading scale. Boolean expressions have precise
semantics.
Vector Model
• Assign non-binary weights to index terms in queries and in documents. Compute
the similarity between documents and query. More precise than Boolean model.
6. Probabilistic Model
•The probabilistic model tries to estimate the probability that the user
will find the document dj relevant, via the ratio
•P(dj relevant to q) / P(dj not relevant to q)
•Given a user query q, and the ideal answer set R of the relevant
documents, the problem is to specify the properties for this set.
• Assumption (probabilistic principle): the probability of relevance
depends on the query and document representations only; ideal answer
set R should maximize the overall probability of relevance.
8. Boolean Retrieval Model
• The Boolean retrieval model is a model for information retrieval in which the query
is in the form of a Boolean expression of terms, combined with the operators AND,
OR, and NOT. The model views each document as just a set of words.
• Simple model based on set theory and Boolean algebra
• The Boolean model predicts that each document is either relevant or non-relevant.
• Example :
• A fat book which many people own is Shakespeare's Collected Works.
• Problem : To determine which plays of Shakespeare contain the words Brutus AND
Caesar AND NOT Calpurnia.
9. Method 1: Using grep
• The simplest form of document retrieval is for a computer to do a linear
scan through the documents. This process is commonly referred to as grepping
through text, after the Unix command grep. Grepping through text can be a
very effective process, especially given the speed of modern computers, and
often allows useful possibilities for wildcard pattern matching through the use
of regular expressions.
• Grepping suffices for simple querying of modest collections, but beyond that we need:
• To process large document collections quickly.
• To allow more flexible matching operations.
• For example, it is impractical to perform the query “Romans NEAR
countrymen” with grep , where NEAR might be defined as “within 5 words”
or “within the same sentence”.
• To allow ranked retrieval: in many cases we want the best answer to an
information need among many documents that contain certain words. The
way to avoid linearly scanning the texts for each query is to index documents
in advance.
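As a rough illustration, here is a minimal sketch of the linear-scan ("grep") approach in Python; the document texts and the query pattern are made up for the example:

```python
import re

# A toy "collection": in practice these would be files read from disk.
docs = {
    "antony_and_cleopatra.txt": "Antony spoke of Brutus and Caesar ...",
    "hamlet.txt": "Brutus and Caesar appear; Calpurnia does not ...",
}

def grep_like(pattern, collection):
    """Linearly scan every document and return the names of those matching the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name, text in collection.items() if rx.search(text)]

print(grep_like(r"\bBrutus\b", docs))   # every document mentioning Brutus
```

Each query rescans the whole collection, which is exactly what indexing in advance avoids.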
10. • Method2: Using Boolean Retrieval Model
• The Boolean retrieval model is a model for information retrieval in
which we can pose any query which is in the form of a Boolean
expression of terms, that is, in which terms are combined with the
operators AND, OR, and NOT. The model views each document as just
a set of words.
• Terms are the indexed units. We have a vector for each term, which
shows the documents it appears in, or a vector for each document,
showing the terms that occur in it. The result is a binary term-document
incidence matrix, as shown below.
11. Term-document incidence matrix:

            Anthony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar   Tempest
Anthony          1          1        0        0       0        1
Brutus           1          1        0        1       0        0
Caesar           1          1        0        1       1        1
Calpurnia        0          1        0        0       0        0
Cleopatra        1          0        0        0       0        0
Mercy            1          0        1        1       1        1
Worser           1          0        1        1       1        0

A term-document incidence matrix. Matrix element (t, d) is 1 if the play
in column d contains the word in row t, and is 0 otherwise.
1.To answer the query Brutus AND Caesar AND NOT Calpurnia, we
take the vectors for Brutus, Caesar and Calpurnia, complement the
last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
12. Solution: the plays Antony and Cleopatra and Hamlet
Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
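A minimal sketch of this computation in Python, using integers as bit vectors; the bit patterns are the ones from the incidence matrix above, everything else is illustrative:

```python
# One bit per play, in the order: Antony and Cleopatra, Julius Caesar,
# The Tempest, Hamlet, Othello, Macbeth (leftmost bit = first play).
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
ALL_PLAYS = 0b111111   # mask covering the six plays

# Brutus AND Caesar AND NOT Calpurnia
answer = incidence["Brutus"] & incidence["Caesar"] & (ALL_PLAYS & ~incidence["Calpurnia"])
print(f"{answer:06b}")   # 100100 -> Antony and Cleopatra, Hamlet
```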
Why the incidence matrix does not scale:
Consider N = 10^6 documents, each with about 1000 tokens ⇒ a total of 10^9 tokens.
On average 6 bytes per token, including spaces and punctuation ⇒ the size of the document
collection is about 6 × 10^9 bytes = 6 GB.
Assume there are M = 500,000 distinct terms in the collection.
The incidence matrix then has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s, so it is extremely sparse. What is a better
representation? We only record the 1s (an inverted index).
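A minimal inverted-index sketch that records only the 1s: each term maps to a sorted postings list of document IDs, and AND becomes a merge of two postings lists. The document texts and helper names are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text. Returns term -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists (the AND operation)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

docs = {1: "brutus killed caesar", 2: "caesar praised calpurnia", 3: "brutus and caesar"}
idx = build_inverted_index(docs)
print(intersect(idx["brutus"], idx["caesar"]))   # documents containing both terms: [1, 3]
```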
13. Term weighting
• A search engine should return, in order, the documents most likely to be useful to the
searcher. Ordering documents with respect to a query is called ranking.
• Term-document incidence matrix: a Boolean model only records term presence or absence.
Instead, assign a score, say in [0, 1], to each document that measures how well the
document and the query "match".
• For the one-term query "Brutus", the score is 1 if the term is present in the document and
0 otherwise; documents with more appearances of the term should receive a higher score.
(Term-document incidence matrix for the Shakespeare plays, as shown above.)
14. Term Frequency (tf)
• One common weighting scheme uses term frequency, denoted tft,d, with the
subscripts denoting the term and the document in order.
• Term frequency TF(t, d) of term t in document d = the number of times that t occurs
in d.
• Example: the term-document count matrix (shown further below).
• We would like to give more weight to documents that contain a term several
times than to ones that contain it only once. To do this we need term
frequency information: the number of times a term occurs in a document.
• Assign a score that reflects the number of occurrences.
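A minimal sketch of computing raw term frequencies with a counter; the sample documents are made up:

```python
from collections import Counter

docs = {
    "d1": "to be or not to be",
    "d2": "to do or not to do",
}

# TF(t, d): the number of times term t occurs in document d
tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

print(tf["d1"]["to"])   # 2
print(tf["d2"]["do"])   # 2
```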
16. Bag of Words Model
The exact ordering of the terms in a document is ignored, but the number of
occurrences of each term matters.
Example: "Mary is quicker than John" and "John is quicker than Mary" have identical
bag-of-words representations, so two documents with similar bag-of-words
representations are treated as similar in content.
This is called the bag of words model. In a sense, this is a step back: a positional
index is able to distinguish these two documents.
How do we use tf for query-document match scores?
Raw term frequency is not what we want: a document with 10 occurrences of a term is
more relevant than a document with 1 occurrence, but not 10 times more relevant.
We therefore use log-frequency weighting.
17. Log-Frequency Weighting
The log-frequency weight of term t in document d is
    wt,d = 1 + log10(tft,d)  if tft,d > 0,  and 0 otherwise.
So tft,d → wt,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Document Frequency and Collection Frequency
Document frequency DF(t): the number of documents in the collection that contain the term t.
Collection frequency CF(t): the total number of occurrences of the term t in the collection.
Example (term "do" in documents d1–d4):
TF(do, d1) = 2, TF(do, d2) = 0, TF(do, d3) = 3, TF(do, d4) = 3
CF(do) = 2 + 0 + 3 + 3 = 8
DF(do) = 3 (the term occurs in three of the four documents)
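A minimal, self-contained sketch of the log-frequency weight and of DF/CF; the sample documents are made up:

```python
import math
from collections import Counter

def log_tf_weight(tf_count):
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf_count) if tf_count > 0 else 0.0

# tf -> weight: 0 -> 0.0, 1 -> 1.0, 2 -> 1.3, 10 -> 2.0, 1000 -> 4.0
print([round(log_tf_weight(c), 1) for c in (0, 1, 2, 10, 1000)])

# Document frequency and collection frequency of a term in a tiny collection
tf = {"d1": Counter("to do or not to do".split()),
      "d2": Counter("do it now".split()),
      "d3": Counter("nothing here".split())}
term = "do"
df = sum(1 for counts in tf.values() if counts[term] > 0)   # documents containing the term
cf = sum(counts[term] for counts in tf.values())            # total occurrences in the collection
print(df, cf)   # 2 3
```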
18. Inverse Document Frequency (idf weight)
idf estimates the rarity of a term in the whole document collection: idft is an inverse measure
of the informativeness of t, and 0 <= idft <= log10 N.
dft is the document frequency of t: the number of documents that contain t.
The idf (inverse document frequency) of t is defined as
    idft = log10(N / dft)
log(N/dft) is used instead of N/dft to dampen the effect of idf.
N: the total number of documents in the collection (for example: 806,791
documents)
• IDF(t) is high if t is a rare term
• IDF(t) is low if t is a frequent term
19. idf example
With N = 1,000,000 documents:
    idft = log10(1,000,000 / dft)
For example, a term that occurs in 1,000 documents has idf = log10(1,000,000 / 1,000) = 3,
while a term that occurs in every document has idf = 0.
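A minimal sketch of the idf computation; the document frequencies used here are illustrative, not taken from a real collection:

```python
import math

N = 1_000_000   # total number of documents in the collection

def idf(df_t, n_docs=N):
    """idf_t = log10(N / df_t): rare terms get a high idf, frequent terms a low idf."""
    return math.log10(n_docs / df_t)

for term, df_t in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000), ("the", 1_000_000)]:
    print(term, idf(df_t))   # 6.0, 4.0, 3.0, 0.0
```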
20. TF-IDF Weighting
•The tf-idf weight of a term is the product of its tf weight and its idf weight; it is the
best-known weighting scheme in information retrieval. TF(t, d) measures
the importance of a term t in document d, and IDF(t) measures the
importance of a term t in the whole collection of documents.
•TF-IDF weighting: putting TF and IDF together:
    TFIDF(t, d) = TF(t, d) × IDF(t)
•If log-frequency weighting is used: wt,d = (1 + log10 tft,d) × log10(N / dft)
1. The tf-idf weight is high if t occurs many times in a small number of documents, i.e., it is
highly discriminative for those documents.
2. It is not high if t appears infrequently in a document, or is frequent in many
documents, i.e., it is not discriminative.
3. It is low if t occurs in almost all documents, i.e., it provides no discrimination at all.
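Putting the pieces together, a minimal sketch of tf-idf scoring of documents for a free-text query; the documents, tokenization, and scoring function are illustrative (a real system would apply the preprocessing described later):

```python
import math
from collections import Counter

docs = {
    "d1": "brutus killed caesar",
    "d2": "caesar praised calpurnia and caesar smiled",
    "d3": "the tempest has mercy",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

def tfidf(term, doc_id):
    """(1 + log10 tf) * log10(N / df); zero if the term is absent from the document."""
    count = tf[doc_id][term]
    if count == 0:
        return 0.0
    return (1 + math.log10(count)) * math.log10(N / df[term])

def score(query, doc_id):
    """Score a document by summing the tf-idf weights of the query terms it contains."""
    return sum(tfidf(t, doc_id) for t in query.split())

for d in docs:
    print(d, round(score("caesar mercy", d), 3))
```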
21. Vector Space Retrieval Model
•The representation of a set of documents as vectors in a common vector space is
known as the vector space model and is fundamental to a host of information
retrieval operations, ranging from scoring documents on a query to document
classification and document clustering.
Each document is represented as a binary vector
             Anthony and  Julius   The      Hamlet  Othello  Macbeth  ...
             Cleopatra    Caesar   Tempest
ANTHONY           1          1        0        0       0        1
BRUTUS            1          1        0        1       0        0
CAESAR            1          1        0        1       1        1
CALPURNIA         0          1        0        0       0        0
CLEOPATRA         1          0        0        0       0        0
MERCY             1          0        1        1       1        1
WORSER            1          0        1        1       1        0
...
22. Each document is now represented as a count vector:

             Anthony and  Julius   The      Hamlet  Othello  Macbeth  ...
             Cleopatra    Caesar   Tempest
ANTHONY         157         73        0        0       0        1
BRUTUS            4        157        0        2       0        0
CAESAR          232        227        0        2       1        0
CALPURNIA         0         10        0        0       0        0
CLEOPATRA        57          0        0        0       0        0
MERCY             2          0        3        8       5        8
WORSER            2          0        1        1       1        5
...
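In the vector space model, documents and queries are compared by the cosine of the angle between their vectors; length normalization makes long and short documents comparable. A minimal sketch using the count vectors for the seven terms in the table above:

```python
import math

# Count vectors over the terms (Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser),
# read off the columns of the count matrix above.
antony_and_cleopatra = [157, 4, 232, 0, 57, 2, 2]
julius_caesar        = [73, 157, 227, 10, 0, 0, 0]
hamlet               = [0, 2, 2, 0, 0, 8, 1]

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(antony_and_cleopatra, julius_caesar))
print(cosine(antony_and_cleopatra, hamlet))
```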
24. Preprocessing (from the Manning textbook)
• We need to deal with the format and language of each document. What format is it in? PDF,
Word, Excel, HTML, etc.
• What language is it in?
• What character set is in use? Each of these is a classification problem.
25. Tokenization:
• Task of splitting the document into pieces called tokens.
• Ex:
Hewlett-Packard
State-of-the-art
• Normalization
• Need to “normalize” terms in indexed text as well as query terms into the same form.
• Example: We want to match U.S.A. and USA
• Stop words
• stop words = extremely common words which would appear to be of little value in
helping select documents matching a user need
• Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the,
to, was, were, will, with
• Lemmatization & Stemming
• Reduce inflectional/variant forms to base form
• Example: am, are, is → be
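A minimal sketch of these steps in Python: tokenization, normalization, stop-word removal, and a toy suffix-stripping rule standing in for a real stemmer such as Porter (the stop list is abbreviated and the example sentence is made up):

```python
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he",
              "in", "is", "it", "its", "of", "on", "that", "the", "to", "was", "were",
              "will", "with"}

def tokenize(text):
    """Split the text on whitespace into raw tokens."""
    return text.split()

def normalize(token):
    """Case-fold and drop periods/commas so that 'U.S.A.' and 'USA' match."""
    return token.lower().replace(".", "").replace(",", "")

def stem(token):
    """Toy stemmer: strip a few common suffixes (a real system would use e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = [normalize(t) for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. scanners are scanning old documents"))
```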
26. • Document pre-processing includes 5 stages:
• Lexical analysis
• Stopword elimination
• Stemming
• Index-term selection
• Construction of thesauri
• Lexical analysis
• Objective: Determine the words of the document. Lexical analysis separates the
input alphabet into
• Word characters (e.g., the letters a-z)
• Word separators (e.g., space, newline, tab)
• Stopword Elimination
• Objective: Filter out words that occur in most of the documents.
• Such words have no value for retrieval purposes, so they are removed from the index.
• Stemming
• Objective: Replace all the variants of a word with the single stem of the word. Variants
include plurals, gerund forms (ing-form), third person suffixes, past tense suffixes, etc.
• Index term selection (indexing)
• Objective: Increase efficiency by extracting from the resulting document a selected set of
terms to be used for indexing the document.
• If full text representation is adopted then all words are used for indexing.
• Indexing is a critical process: a user's ability to find documents on a particular subject is
limited by whether the indexing process created index terms for that subject.
28. Language models
IR approaches
• Boolean retrieval - Boolean constraints on term occurrences in documents; no ranking
• Vector space model - queries and documents are represented as vectors in a high-dimensional
space; a notion of similarity (cosine similarity) implies a ranking
• Probabilistic model - rank documents by the probability P(R | d, q), estimated
using relevance feedback techniques
• Language model approach - a document is a good match to a query if the document's
language model is likely to generate the query, i.e., if the document contains the query
words often.
• A language model is a probability distribution over sequences of words. Language models
are useful in many natural language processing applications.
• Examples: part-of-speech tagging, speech recognition, machine translation, and information
retrieval
30. Traditional language model
• The traditional language model uses finite automata and is a generative model.
• (Figure: a simple finite automaton and some of the strings in the language it generates;
→ marks the start state of the automaton and a double circle indicates a (possible)
finishing state.)
31. Types of language models
• Unigram language model:
• The simplest form of language model throws away all conditioning context and
estimates each term independently. Such a model is called a unigram language model:
    Puni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
• A unigram model used in information retrieval can be treated as the combination
of several one-state finite automata.
• Bigram language models:
• There are many more complex kinds of language models, such as bigram language
models, in which the probability of each term is conditioned on the previous term:
    Pbi(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)
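A minimal sketch of query-likelihood retrieval with a unigram language model. To avoid zero probabilities for query terms missing from a document, the document model is mixed with a collection model (Jelinek-Mercer smoothing); the documents and the λ value are illustrative:

```python
from collections import Counter

docs = {
    "d1": "click go the shears boys click click click",
    "d2": "metal shears are sharp tools",
}

doc_counts = {d: Counter(text.split()) for d, text in docs.items()}
doc_lens = {d: sum(c.values()) for d, c in doc_counts.items()}
coll_counts = sum(doc_counts.values(), Counter())
coll_len = sum(coll_counts.values())

LAMBDA = 0.5   # weight of the document model vs. the collection model

def query_likelihood(query, doc_id):
    """P(query | d) under a smoothed unigram model: product over query terms of P(t | d)."""
    p = 1.0
    for t in query.split():
        p_doc = doc_counts[doc_id][t] / doc_lens[doc_id]
        p_coll = coll_counts[t] / coll_len
        p *= LAMBDA * p_doc + (1 - LAMBDA) * p_coll
    return p

for d in docs:
    print(d, query_likelihood("shears click", d))   # rank documents by this probability
```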
32. • LMs vs. vector space model
• LMs have some things in common with vector space models.
• Term frequency appears directly in the model.
• But it is not scaled (e.g., log-weighted) in LMs.
• Probabilities are inherently “length-normalized”.
• Cosine normalization does something similar for vector space.
• Mixing document and collection frequencies has an effect similar to idf.
• Terms rare in the general collection, but common in some documents will have
a greater influence on the ranking.
• LMs vs. vector space model: commonalities
• Term frequency is directly in the model.
• Probabilities are inherently “length-normalized”.
• Mixing document and collection frequencies has an effect similar to idf.
• LMs vs. vector space model: differences
• LMs: based on probability theory
• Vector space: based on similarity, a geometric/ linear algebra notion
• Collection frequency vs. document frequency
• Details of term frequency, length normalization etc.
33. Probabilistic information retrieval
• Given a user query q, and the ideal answer set R of the relevant documents, the problem is
to specify the properties for this set
• Assumption (probabilistic principle): the probability of relevance depends on the query
and document representations only; ideal answer set R should maximize the overall
probability of relevance
• The probabilistic model tries to estimate the probability that the user will find the
document dj relevant, via the ratio
• P(dj relevant to q) / P(dj not relevant to q)
• Probability theory provides a principled foundation for such reasoning under uncertainty.
This model provides how likely a document is relevant to an information need.
34. Latent semantic indexing
• Latent semantic indexing (LSI) is an indexing and retrieval method that uses
a mathematical technique called singular value decomposition (SVD) to
identify patterns in the relationships between the terms and concepts
contained in an unstructured collection of text. LSI is based on the principle
that words that are used in the same contexts tend to have similar meanings.
Why we use LSI in information retrieval
• LSI takes documents that are semantically similar (i.e., talk about the same
topics), but are not similar in the vector space (because they use different
words), and re-represents them in a reduced vector space in which they have
higher similarity.
• LSI: Comparison to other approaches
• Relevance feedback and query expansion are used to increase recall in
information retrieval – if query and documents have (in the extreme case) no
terms in common.
• LSI increases recall and hurts precision.
• Thus, it addresses the same problems as (pseudo) relevance feedback and
query expansion.
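A minimal sketch of the LSI idea on a tiny term-document count matrix, using numpy's singular value decomposition; the counts and the choice of k = 2 latent dimensions are illustrative:

```python
import numpy as np

# Rows = terms, columns = documents (illustrative counts).
C = np.array([
    [1, 0, 1, 0, 0, 0],   # ship
    [0, 1, 0, 0, 0, 0],   # boat
    [1, 1, 0, 0, 0, 0],   # ocean
    [1, 0, 0, 1, 1, 0],   # wood
    [0, 0, 0, 1, 0, 1],   # tree
], dtype=float)

# Singular value decomposition: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                        # keep only the k largest singular values
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the reduced k-dimensional space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Even documents that share few or no terms can end up close to each other
# in the reduced space if they occur in similar contexts.
print(cosine(doc_vectors[0], doc_vectors[1]))
```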
35. Relevance feedback and query expansion
• Interactive relevance feedback: improve initial retrieval results by telling the IR
system which documents are relevant / not relevant.
• Query expansion: improve retrieval results by adding synonyms / related terms to
the query. Sources for related terms: manual thesauri, automatic thesauri, query
logs. Two ways of improving recall: relevance feedback and query expansion.
40. Pseudo relevance feedback / blind relevance feedback
• Pseudo-relevance feedback automates the “manual” part of true relevance feedback.
• Pseudo-relevance algorithm:
• Retrieve a ranked list of hits for the user’s query
• Assume that the top k documents are relevant.
• Do relevance feedback (e.g., Rocchio)
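A minimal sketch of pseudo-relevance feedback with a Rocchio-style update on term-weight vectors: move the query vector toward the centroid of the top-k documents, which are assumed relevant. The parameter values (k, alpha, beta) and the example vectors are illustrative:

```python
from collections import Counter

def rocchio_pseudo_feedback(query_vec, ranked_doc_vecs, k=2, alpha=1.0, beta=0.75):
    """q_new = alpha * q + beta * centroid(top-k documents assumed relevant)."""
    top_k = ranked_doc_vecs[:k]
    centroid = Counter()
    for vec in top_k:
        for term, weight in vec.items():
            centroid[term] += weight / len(top_k)
    new_query = Counter({t: alpha * w for t, w in query_vec.items()})
    for term, weight in centroid.items():
        new_query[term] += beta * weight
    return new_query

# Initial query and the term vectors of the top-ranked documents (made up).
query = Counter({"jaguar": 1.0})
ranked = [Counter({"jaguar": 2, "car": 3, "engine": 1}),
          Counter({"jaguar": 1, "car": 2, "speed": 1}),
          Counter({"jaguar": 1, "cat": 4})]

print(rocchio_pseudo_feedback(query, ranked))   # expanded query now also weights "car", etc.
```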
Types of user feedback
• There are two types of feedback
• Feedback on documents - More common in relevance feedback
• Feedback on words or phrases - More common in query expansion
Types of query expansion
• Manual thesaurus (maintained by editors, e.g., PubMed)
• Automatically derived thesaurus (e.g., based on co-occurrence statistics)
• Query-equivalence based on query log mining (common on the web as in the “palm”
example)
41. Automatic thesaurus generation
•Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents
•Fundamental notion: similarity between two words
•Definition 1: Two words are similar if they co-occur with similar words.
•“car” ≈ “motorcycle” because both occur with “road”, “gas” and “license”, so they must be similar.
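A minimal sketch of the co-occurrence idea behind automatic thesaurus generation: represent each word by the set of words it co-occurs with (within the same document here, for simplicity) and compare those sets with Jaccard similarity. The documents are made up:

```python
from collections import defaultdict
from itertools import combinations

docs = [
    "car needs gas on the road and a license",
    "motorcycle needs gas and a license for the road",
    "banana is a yellow fruit",
]

# For each word, collect the set of words it co-occurs with (same document).
cooccur = defaultdict(set)
for text in docs:
    words = set(text.split())
    for w1, w2 in combinations(words, 2):
        cooccur[w1].add(w2)
        cooccur[w2].add(w1)

def similarity(a, b):
    """Jaccard overlap of the two words' co-occurrence sets."""
    sa, sb = cooccur[a], cooccur[b]
    return len(sa & sb) / len(sa | sb)

print(similarity("car", "motorcycle"))   # relatively high: shared contexts (road, gas, license)
print(similarity("car", "banana"))       # low: almost no shared contexts
```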