Boolean and vector-space retrieval models- Term weighting – TF-IDF
weighting- cosine similarity – Preprocessing – Inverted indices – efficient
processing with sparse vectors – Language Model based IR – Probabilistic
IR –Latent Semantic Indexing – Relevance feedback and query expansion.
CO2: Apply the knowledge of data structures and indexing methods in information
retrieval.
UNIT – II
Modeling
• Modeling in IR is a complex process aimed at producing a ranking function.
• Ranking function: a function that assigns scores to documents with regard to a given
query.
• This process consists of two main tasks:
• The conception of a logical framework for representing documents and queries
• The definition of a ranking function that allows quantifying the similarities
among documents and queries
• IR systems usually adopt index terms to index and retrieve documents
An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function
Taxonomy of IR Models
• Retrieval models are most frequently associated with distinct combinations of a
document logical view and a user task. The user's tasks include retrieval and
browsing.
i) Ad Hoc Retrieval:
The documents in the collection remain relatively static
while new queries are submitted to the system.
ii) Filtering:
The queries remain relatively static while new
documents come into the system.
Classic IR model:
Each document is described by a set of representative keywords
called index terms. A numerical weight is assigned to each index term to
capture its relevance to the document.
Three classic models: Boolean, vector, probabilistic
Boolean Model:
• Model for information retrieval in which we can pose any query which is in the
form of a Boolean expression of terms, that is, in which terms are combined with
the operators AND, OR, and NOT.
• The model views each document as just a set of words. Based on a binary decision
criterion without any notion of a grading scale. Boolean expressions have precise
semantics.
Vector Model
• Assign non-binary weights to index terms in queries and in documents. Compute
the similarity between documents and query. More precise than Boolean model.
Probabilistic Model
•The probabilistic model tries to estimate the probability that the user
will find the document dj relevant, using the ratio
•P(dj relevant to q) / P(dj not relevant to q)
•Given a user query q, and the ideal answer set R of the relevant
documents, the problem is to specify the properties for this set.
• Assumption (probabilistic principle): the probability of relevance
depends on the query and document representations only; ideal answer
set R should maximize the overall probability of relevance.
191CSEH IR UNIT - II for an engineering subject
•Boolean Retrieval Models
• The Boolean retrieval model is a model for information retrieval in which the query
is in the form of a Boolean expression of terms, combined with the operators AND,
OR, and NOT. The model views each document as just a set of words.
• Simple model based on set theory and Boolean algebra
• The Boolean model predicts that each document is either relevant or non-relevant.
• Example :
• A fat book which many people own is Shakespeare's Collected Works.
• Problem : To determine which plays of Shakespeare contain the words Brutus AND
Caesar AND NOT Calpurnia.
• Method1 : Using Grep
• The simplest form of document retrieval is for a computer to do a linear
scan through documents. This process is commonly referred to as grepping
through text, after the Unix command grep. Grepping through text can be a
very effective process, especially given the speed of modern computers, and
often allows useful possibilities for wildcard pattern matching through the use
of regular expressions.
• To perform simple querying of modest collections, we need:
• To process large document collections quickly.
• To allow more flexible matching operations.
• For example, it is impractical to perform the query “Romans NEAR
countrymen” with grep, where NEAR might be defined as “within 5 words”
or “within the same sentence”.
• To allow ranked retrieval: in many cases we want the best answer to an
information need among many documents that contain certain words. The
way to avoid linearly scanning the texts for each query is to index documents
in advance.
• Method2: Using Boolean Retrieval Model
• The Boolean retrieval model is a model for information retrieval in
which we can pose any query which is in the form of a Boolean
expression of terms, that is, in which terms are combined with the
operators AND, OR, and NOT. The model views each document as just
a set of words.
• Terms are the indexed units. We have a vector for each term, which
shows the documents it appears in, or a vector for each document,
showing the terms that occur in it. The result is a binary term-document
incidence matrix, as in the figure.
            Anthony &   Julius   The      Hamlet   Othello   Macbeth
            Cleopatra   Caesar   Tempest
Anthony         1          1        0        0        0         1
Brutus          1          1        0        1        0         0
Caesar          1          1        0        1        1         1
Calpurnia       0          1        0        0        0         0
Cleopatra       1          0        0        0        0         0
Mercy           1          0        1        1        1         1
Worser          1          0        1        1        1         0
A term-document incidence matrix. Matrix element (t, d) is 1 if the play
in column d contains the word in row t, and is 0 otherwise.
1.To answer the query Brutus AND Caesar AND NOT Calpurnia, we
take the vectors for Brutus, Caesar and Calpurnia, complement the
last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
Solution: Antony and Cleopatra and Hamlet
Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
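The complement-and-bitwise-AND step above can be sketched in Python. This is an illustrative toy: the bit vectors are read straight off the incidence matrix, with the most significant bit corresponding to the first column (Anthony and Cleopatra).

```python
# Plays in column order of the incidence matrix.
plays = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors for the three query terms, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia's vector
# (within 6 bits), then bitwise-AND the three vectors together.
mask = (1 << len(plays)) - 1
result = brutus & caesar & (~calpurnia & mask)

# Recover the matching plays (bit i counts from the left-most column).
matches = [plays[i] for i in range(len(plays))
           if result & (1 << (len(plays) - 1 - i))]
print(f"{result:06b}", matches)  # 100100 ['Anthony and Cleopatra', 'Hamlet']
```

The result bit string 100100 matches the worked example above: only Anthony and Cleopatra and Hamlet contain Brutus and Caesar but not Calpurnia.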
Consider N = 10⁶ documents, each with about 1000 tokens ⇒ a total of 10⁹ tokens.
On average 6 bytes per token, including spaces and punctuation ⇒ the size of the document
collection is about 6 × 10⁹ bytes = 6 GB.
Assume there are M = 500,000 distinct terms in the collection.
The incidence matrix then has M × N = 500,000 × 10⁶ = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s: the matrix is extremely sparse. What is a
better representation? We only record the 1s. (Inverted Index)
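Recording only the 1s gives the inverted index: a mapping from each term to a sorted postings list of the documents containing it, with Boolean AND answered by merging postings. A minimal sketch (the four tiny documents are made up for illustration):

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Build the inverted index: term -> sorted postings list of doc IDs.
# Iterating doc IDs in sorted order keeps every postings list sorted.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Classic two-pointer merge of two sorted postings lists (AND)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect(index["home"], index["july"]))  # [2, 3, 4]
```

The merge walks both lists once, so an AND query costs time linear in the combined postings length rather than in the size of the collection.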
Term weighting
• A search engine should return, in order, the documents most likely to be useful to the
searcher. Ordering documents with respect to a query is called ranking.
• Term-Document Incidence Matrix
• A Boolean model only records term presence or absence. Instead, assign a score – say in [0, 1]
– to each document that measures how well document and query “match”
• For the one-term query “BRUTUS”, the score is 1 if the term is present in the document and 0
otherwise; ideally, more appearances of the term in a document should give a higher score
            Anthony &   Julius   The      Hamlet   Othello   Macbeth
            Cleopatra   Caesar   Tempest
Anthony         1          1        0        0        0         1
Brutus          1          1        0        1        0         0
Caesar          1          1        0        1        1         1
Calpurnia       0          1        0        0        0         0
Cleopatra       1          0        0        0        0         0
Mercy           1          0        1        1        1         1
Worser          1          0        1        1        1         0
• Term Frequency tf
• One such weighting scheme is term frequency, denoted tft,d, with the
subscripts denoting the term and the document, in that order.
• Term frequency TF(t, d) of term t in document d = the number of times that t occurs
in d
• Ex: Term-Document Count Matrix
• We would like to give more weight to documents that have a term several
times, as opposed to ones that contain it only once. To do this we need term
frequency information: the number of times a term occurs in a document.
• Assign a score to represent the number of occurrences
Document represented by a count vector ∈ ℕ^V (V = size of the vocabulary)
Bag of Words Model
The exact ordering of the terms in a document is ignored but the number of
occurrences of each term is important.
Example: two documents with similar bag-of-words representations are similar in content:
“Mary is quicker than John” and “John is quicker than Mary” have identical vectors.
This is called the bag of words model. In a sense, this is a step back: the positional
index was able to distinguish these two documents.
How to use tf for query-document match scores?
Raw term frequency is not what we want. A document with 10 occurrences of
the term is more relevant than a document with 1 occurrence of the term. But
not 10 times more relevant. We use Log frequency weighting.
Log-Frequency Weighting
The log-frequency weight of term t in document d is calculated as
w(t,d) = 1 + log10 TF(t, d) if TF(t, d) > 0, and 0 otherwise.
Document Frequency & Collection Frequency
Document frequency DF(t): the number of documents in the collection that contain a term t
Collection frequency CF(t): the total number of occurrences of a term t in the collection
Example for a term t in a collection of four documents d1–d4:
TF(t, d1) = 2, TF(t, d2) = 0, TF(t, d3) = 3, TF(t, d4) = 3
CF(t) = 2 + 0 + 3 + 3 = 8
DF(t) = 3 (t occurs in three of the four documents)
tft,d → wt,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
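The log-frequency mapping above can be written as a one-line Python function; the sample tf values are the ones listed in the slide.

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: w = 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Reproduces the mapping 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4.
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```

The logarithm dampens raw counts, so a document with 10 occurrences of a term scores higher than one with a single occurrence, but not 10 times higher.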
Inverse Document Frequency (idf Weight)
It estimates the rarity of a term in the whole document collection: idft is an inverse measure
of the informativeness of t.
dft is the document frequency of t: the number of documents that contain t, so dft <= N.
idf (inverse document frequency) of t: idft = log10(N/dft)
log(N/dft) is used instead of N/dft to dampen the effect of idf.
N: the total number of documents in the collection (for example: 806,791
documents)
• IDF(t) is high if t is a rare term
• IDF(t) is low if t is a frequent term
• Example: with N = 1,000,000 documents,
idft = log10(1,000,000 / dft)
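The idf computation for N = 1,000,000 can be checked directly; the sample df values below are made up to show the trend from rare to ubiquitous terms.

```python
import math

N = 1_000_000  # total number of documents in the collection

def idf(df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

# Rarer terms get higher idf; a term in every document gets idf 0.
for df in (1, 100, 10_000, 1_000_000):
    print(df, idf(df))  # 6.0, 4.0, 2.0, 0.0
```

A term occurring in a single document (df = 1) gets the maximum idf of 6; a term occurring in every document carries no discriminating information and gets idf 0.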
TF-IDF Weighting
•The tf-idf weight of a term is the product of its tf weight and its idf weight; it is the
best-known weighting scheme in information retrieval. TF(t, d) measures
the importance of a term t in document d, and IDF(t) measures the
importance of a term t in the whole collection of documents.
•TF/IDF weighting: putting TF and IDF together
•TFIDF(t, d) = TF(t, d) × IDF(t)
•If log tf is used: TFIDF(t, d) = (1 + log10 TF(t, d)) × log10(N/DF(t)) for TF(t, d) > 0
1.High if t occurs many times in a small number of documents, i.e., highly
discriminative in those documents
2.Not high if t appears infrequently in a document, or is frequent in many
documents, i.e., not discriminative
3.Low if t occurs in almost all documents, i.e., no discrimination at all
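The three cases above can be demonstrated with a small function; the collection size and the tf/df values are made-up illustrations.

```python
import math

N = 4  # hypothetical collection size

def tf_idf(tf, df, n=N):
    """TF-IDF with log tf: (1 + log10(tf)) * log10(n/df); 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n / df)

# A term appearing 10 times but in only 1 of 4 documents is highly
# discriminative; the same tf in a term present in all 4 documents
# scores 0, because its idf is 0.
print(tf_idf(10, 1))  # 2 * log10(4), about 1.204
print(tf_idf(10, 4))  # 0.0
```

This matches case 1 (frequent in few documents ⇒ high weight) and case 3 (present everywhere ⇒ zero weight).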
•Vector Space Retrieval Model
•The representation of a set of documents as vectors in a common vector space is
known as the vector space model and is fundamental to a host of information
retrieval operations, ranging from scoring documents on a query to document
classification and document clustering.
Each document is represented as a binary vector:

            Anthony &   Julius   The      Hamlet   Othello   Macbeth
            Cleopatra   Caesar   Tempest
ANTHONY         1          1        0        0        0         1
BRUTUS          1          1        0        1        0         0
CAESAR          1          1        0        1        1         1
CALPURNIA       0          1        0        0        0         0
CLEOPATRA       1          0        0        0        0         0
MERCY           1          0        1        1        1         1
WORSER          1          0        1        1        1         0
•Each document is now represented as a count vector:

            Anthony &   Julius   The      Hamlet   Othello   Macbeth
            Cleopatra   Caesar   Tempest
ANTHONY        157         73       0        0        0         1
BRUTUS           4        157       0        2        0         0
CAESAR         232        227       0        2        1         0
CALPURNIA        0         10       0        0        0         0
CLEOPATRA       57          0       0        0        0         0
MERCY            2          0       3        8        5         8
WORSER           2          0       1        1        1         5
Document represented by tf-idf weight vector
Binary -> Count -> Weight Matrix
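Once documents and queries are tf-idf weight vectors, they are compared with cosine similarity, the length-normalized dot product mentioned in the syllabus. A minimal sketch (the toy three-term vectors are made-up weights):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), length-normalizing both."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy 3-term vocabulary; the weights stand in for tf-idf values.
doc = [0.0, 2.3, 1.1]
query = [0.0, 1.0, 1.0]
print(round(cosine(doc, query), 3))
```

Because both vectors are divided by their lengths, a long document is not favored over a short one merely for containing more tokens.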
Preprocessing (from Manning TextBook)
• We need to deal with format and language of each document. What format is it in? pdf,
word, excel, html etc.
• What language is it in?
• What character set is in use? Each of these is a classification problem.
Tokenization:
• Task of splitting the document into pieces called tokens.
• Ex:
Hewlett-Packard
State-of-the-art
• Normalization
• Need to “normalize” terms in indexed text as well as query terms into the same form.
• Example: We want to match U.S.A. and USA
• Stop words
• stop words = extremely common words which would appear to be of little value in
helping select documents matching a user need
• Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the,
to, was, were, will, with
• Lemmatization & Stemming
• Reduce inflectional/variant forms to base form
• Example: am, are, is → be
• Document pre-processing includes 5 stages:
• Lexical analysis
• Stopword elimination
• Stemming
• Index-term selection
• Construction of thesauri
• Lexical analysis
• Objective: Determine the words of the document. Lexical analysis separates the
input alphabet into
• Word characters (e.g., the letters a-z)
• Word separators (e.g., space, newline, tab)
• Stopword Elimination
• Objective: Filter out words that occur in most of the documents.
• Such words have no value for retrieval purposes; these words are removed from the index.
• Stemming
• Objective: Replace all the variants of a word with the single stem of the word. Variants
include plurals, gerund forms (ing-form), third person suffixes, past tense suffixes, etc.
• Index term selection (indexing)
• Objective: Increase efficiency by extracting from the resulting document a selected set of
terms to be used for indexing the document.
• If full text representation is adopted then all words are used for indexing.
• Indexing is a critical process: User's ability to find documents on a particular subject is
limited by the indexing process having created index terms for this subject
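The first three preprocessing stages can be sketched as a small pipeline. This is an illustrative toy: the stopword list is the one given earlier in this unit, and poor_mans_stem is a deliberately crude stand-in for a real stemmer such as Porter's algorithm.

```python
import re

# Stopword list from the examples earlier in this unit.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
             "to", "was", "were", "will", "with"}

def poor_mans_stem(word):
    # Crude suffix stripping for illustration only; a real system would
    # use a proper stemmer (e.g., Porter's algorithm).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. Lexical analysis: lowercase, split on non-word characters.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2. Stopword elimination.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Stemming.
    return [poor_mans_stem(t) for t in tokens]

print(preprocess("Cats are running in the garden"))
# ['cat', 'runn', 'garden']  -- note 'runn': crude stemming over-strips
```

The over-stripped stem 'runn' illustrates why production systems use carefully designed stemmers rather than naive suffix removal.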
Language models
IR approaches
• Boolean retrieval - Boolean constraints on term occurrences in documents; no ranking
• Vector space model - Queries and documents are represented as vectors in a high-dimensional
space; a notion of similarity (cosine similarity) implies a ranking
• Probabilistic model - Rank documents by the probability P(R|d,q), estimating P(R|d,q)
using relevance feedback techniques
• Language model approach - A document is a good match to a query if the document model
is likely to generate the query, i.e., if the document contains the query words often.
• A language model is a probability distribution over sequences of words. These probability
distributions are called language models. It is useful in many natural language processing
applications.
• Ex: part-of-speech tagging, speech recognition, machine translation, and information
retrieval
• Traditional language model
• The traditional language model is a generative model based on finite automata.
• (Figure, not reproduced here: a simple finite automaton and some of the strings in the
language it generates; → marks the start state of the automaton and a double circle
indicates a (possible) finishing state.)
• Types of language models
• Unigram language model:
• The simplest form of language model simply throws away all conditioning
context, and estimates each term independently. Such a model is called a unigram
language model:
• A unigram model used in information retrieval can be treated as the combination
of several one-state finite automata.
• Bigram language models
• There are many more complex kinds of language models, such as bigram language
models, in which the conditioning is on the previous term.
• LMs vs. vector space model
• LMs have some things in common with vector space models.
• Term frequency is directly in the model,
• but it is not scaled in LMs.
• Probabilities are inherently “length-normalized”.
• Cosine normalization does something similar for vector space.
• Mixing document and collection frequencies has an effect similar to idf.
• Terms rare in the general collection, but common in some documents will have
a greater influence on the ranking.
• LMs vs. vector space model: commonalities
• Term frequency is directly in the model.
• Probabilities are inherently “length-normalized”.
• Mixing document and collection frequencies has an effect similar to idf.
• LMs vs. vector space model: differences
• LMs: based on probability theory
• Vector space: based on similarity, a geometric/ linear algebra notion
• Collection frequency vs. document frequency
• Details of term frequency, length normalization etc.
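The query-likelihood idea, including the mixing of document and collection frequencies noted above (Jelinek-Mercer smoothing), can be sketched as follows. The two-document collection and λ = 0.5 are illustrative assumptions.

```python
from collections import Counter

# Toy two-document collection (tokens chosen for illustration).
docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "click metal here".split(),
}
collection = [t for d in docs.values() for t in d]
cf = Counter(collection)          # collection frequencies
LAMBDA = 0.5                      # mixing weight, document vs. collection model

def query_likelihood(query, doc):
    """P(q | d) under a unigram model, smoothed with the collection model:
    P(t | d) = lambda * tf/|d| + (1 - lambda) * cf/|collection|."""
    tf = Counter(doc)
    score = 1.0
    for term in query.split():
        p_doc = tf[term] / len(doc)
        p_col = cf[term] / len(collection)
        score *= LAMBDA * p_doc + (1 - LAMBDA) * p_col
    return score

for name, doc in docs.items():
    print(name, query_likelihood("click shears", doc))
```

Smoothing keeps a document from scoring exactly zero just because one query term is missing from it, while documents containing the query words often still rank higher.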
Probabilistic information retrieval
• Given a user query q, and the ideal answer set R of the relevant documents, the problem is
to specify the properties for this set
• Assumption (probabilistic principle): the probability of relevance depends on the query
and document representations only; ideal answer set R should maximize the overall
probability of relevance
• The probabilistic model tries to estimate the probability that the user will find the
document dj relevant, using the ratio
• P(dj relevant to q) / P(dj not relevant to q)
• Probability theory provides a principled foundation for such reasoning under uncertainty.
This model provides how likely a document is relevant to an information need.
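One standard way to operationalize this relevance odds ratio is the binary independence model, which ranks documents by a sum of per-term log odds ratios. A minimal sketch; the per-term probability estimates here are made up (in practice they would be estimated from relevance feedback):

```python
import math

# Hypothetical estimates: p = P(term present | relevant),
# u = P(term present | non-relevant). Values are illustrative only.
term_stats = {
    "brutus": (0.8, 0.1),
    "caesar": (0.6, 0.3),
}

def rsv(doc_terms):
    """Retrieval status value: sum over query terms present in the document
    of c_t = log[ p(1 - u) / (u(1 - p)) ]."""
    score = 0.0
    for term, (p, u) in term_stats.items():
        if term in doc_terms:
            score += math.log((p * (1 - u)) / (u * (1 - p)))
    return score

print(rsv({"brutus", "caesar"}))  # contains both query terms: higher
print(rsv({"caesar"}))            # contains one query term: lower
```

Ranking by this sum is order-equivalent to ranking by the odds P(relevant)/P(not relevant) under the model's term-independence assumption.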
Latent semantic indexing
• Latent semantic indexing (LSI) is an indexing and retrieval method that uses
a mathematical technique called singular value decomposition (SVD) to
identify patterns in the relationships between the terms and concepts
contained in an unstructured collection of text. LSI is based on the principle
that words that are used in the same contexts tend to have similar meanings.
Why we use LSI in information retrieval
• LSI takes documents that are semantically similar (= talk about the same
topics) but are not similar in the vector space (because they use different
words), and re-represents them in a reduced vector space in which they have
higher similarity.
• LSI: Comparison to other approaches
• Relevance feedback and query expansion are used to increase recall in
information retrieval – if query and documents have (in the extreme case) no
terms in common.
• LSI increases recall but can hurt precision.
• Thus, it addresses the same problems as (pseudo) relevance feedback and
query expansion.
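The SVD-based dimensionality reduction behind LSI can be sketched with numpy. The toy term-document matrix, the term labels, and the choice k = 2 are assumptions for illustration (this sketch requires numpy):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents d1..d4).
# Terms: ship, boat, ocean, tree, wood -- two latent topics by construction.
A = np.array([
    [1, 1, 0, 0],   # ship
    [0, 1, 0, 0],   # boat
    [1, 0, 0, 0],   # ocean
    [0, 0, 1, 1],   # tree
    [0, 0, 1, 0],   # wood
], dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (rank-k approximation).
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# d1 and d2 share the latent "nautical" dimension; d1 and d3 do not.
print(cos(doc_vectors[0], doc_vectors[1]))  # high
print(cos(doc_vectors[0], doc_vectors[2]))  # near zero
```

In the reduced space, documents about the same latent topic end up close together even when their literal word overlap is small, which is exactly the recall gain LSI aims for.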
Relevance feedback and query expansion
• Interactive relevance feedback: improve initial retrieval results by telling the IR
system which docs are relevant / not relevant.
• Query expansion: improve retrieval results by adding synonyms / related terms to
the query. Sources for related terms: Manual thesauri, automatic thesauri, query
logs. Two ways of improving recall: relevance feedback and query expansion
•Example 1: Image search engine
(http://nayana.ece.ucsb.edu/imsearch/imsearch.html): the user runs an initial
query, marks which of the returned images are relevant, and the system returns
improved results after relevance feedback.
Pseudo relevance feedback / blind relevance feedback
• Pseudo-relevance feedback automates the “manual” part of true relevance feedback.
• Pseudo-relevance algorithm:
• Retrieve a ranked list of hits for the user’s query
• Assume that the top k documents are relevant.
• Do relevance feedback (e.g., Rocchio)
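The Rocchio step referenced above moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones: q_new = α·q + β·centroid(relevant) − γ·centroid(non-relevant). A minimal sketch; the α/β/γ values are conventional defaults assumed for illustration, and the toy vectors are made up.

```python
ALPHA, BETA, GAMMA = 1.0, 0.75, 0.15

def rocchio(query, relevant, nonrelevant):
    """Rocchio update; negative components are clipped to 0, as is common."""
    new_q = []
    for i in range(len(query)):
        rel = sum(d[i] for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d[i] for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        new_q.append(max(0.0, ALPHA * query[i] + BETA * rel - GAMMA * non))
    return new_q

q = [1.0, 0.0, 0.0]            # original query mentions only term 0
rel_docs = [[1.0, 1.0, 0.0]]   # judged relevant: also contains term 1
non_docs = [[0.0, 0.0, 1.0]]   # judged non-relevant: contains term 2
print(rocchio(q, rel_docs, non_docs))  # [1.75, 0.75, 0.0]
```

Note how term 1, absent from the original query but present in the relevant document, gains weight: this is how relevance feedback expands the query toward vocabulary the user did not type.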
Types of user feedback
• There are two types of feedback
• Feedback on documents - More common in relevance feedback
• Feedback on words or phrases - More common in query expansion
Types of query expansion
• Manual thesaurus (maintained by editors, e.g., PubMed)
• Automatically derived thesaurus (e.g., based on co-occurrence statistics)
• Query-equivalence based on query log mining (common on the web as in the “palm”
example)
Automatic thesaurus generation
•Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents
•Fundamental notion: similarity between two words
•Definition 1: Two words are similar if they co-occur with similar words.
•“car” ≈ “motorcycle” because both occur with “road”, “gas” and “license”, so they must be similar.
Example for manual thesaurus: PubMed
More Related Content

DOCX
UNIT 3 IRT.docx
thenmozhip8
 
PPTX
IRT Unit_ 2.pptx
thenmozhip8
 
PPT
chapter 5 Information Retrieval Models.ppt
KelemAlebachew
 
PDF
Chapter 4 IR Models.pdf
Habtamu100
 
PPT
4-IR Models_new.ppt
BereketAraya
 
PPT
4-IR Models_new.ppt
BereketAraya
 
PPT
lecture6-tfidf.pptiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
RAtna29
 
PPTX
JM Information Retrieval Techniques Unit II
JeyamohanHAsstProfCS
 
UNIT 3 IRT.docx
thenmozhip8
 
IRT Unit_ 2.pptx
thenmozhip8
 
chapter 5 Information Retrieval Models.ppt
KelemAlebachew
 
Chapter 4 IR Models.pdf
Habtamu100
 
4-IR Models_new.ppt
BereketAraya
 
4-IR Models_new.ppt
BereketAraya
 
lecture6-tfidf.pptiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
RAtna29
 
JM Information Retrieval Techniques Unit II
JeyamohanHAsstProfCS
 

Similar to 191CSEH IR UNIT - II for an engineering subject (20)

PPTX
master prepare seminar for computer science.pptx
mostafaalgendy3
 
PPT
Important topics vector space mathematics lecture9.ppt
kakash77897
 
PPTX
unit -4MODELING AND RETRIEVAL EVALUATION
karthiksmart21
 
PPT
Information Retrieval 02
Jeet Das
 
PDF
lecture1.pdf
KalaivaniManikandan1
 
PPT
lecture-TFIDF information retrieval .ppt
asmaashalma456
 
PDF
Information Retrieval
rchbeir
 
PPT
IR-lec05-scoring-term-weighting-vector-space.ppt
rupanaveen24
 
PPT
lecture1-intro.ppt
WrushabhShirsat3
 
PPT
lecture1-intro.pptbbbbbbbbbbbbbbbbbbbbbbbbbb
RAtna29
 
PPT
lecture1-intro.ppt
IshaXogaha
 
PPTX
Boolean IR and Indexing.pptx
Mahsadelavari
 
PDF
flat studies into.pdf
ssuseree3bdd
 
PPT
introduction into IR
ssusere3b1a2
 
PPTX
The vector space model
pkgosh
 
PPTX
Ir 02
Mohammed Romi
 
PDF
IRS-total ppts.pdf which have the detail abt the
MARasheed3
 
PPT
Ir models
Ambreen Angel
 
PPTX
Jarrar: Introduction to Information Retrieval
Mustafa Jarrar
 
PDF
IMPROVING SEARCH ENGINES BY DEMOTING NON-RELEVANT DOCUMENTS
kevig
 
master prepare seminar for computer science.pptx
mostafaalgendy3
 
Important topics vector space mathematics lecture9.ppt
kakash77897
 
unit -4MODELING AND RETRIEVAL EVALUATION
karthiksmart21
 
Information Retrieval 02
Jeet Das
 
lecture1.pdf
KalaivaniManikandan1
 
lecture-TFIDF information retrieval .ppt
asmaashalma456
 
Information Retrieval
rchbeir
 
IR-lec05-scoring-term-weighting-vector-space.ppt
rupanaveen24
 
lecture1-intro.ppt
WrushabhShirsat3
 
lecture1-intro.pptbbbbbbbbbbbbbbbbbbbbbbbbbb
RAtna29
 
lecture1-intro.ppt
IshaXogaha
 
Boolean IR and Indexing.pptx
Mahsadelavari
 
flat studies into.pdf
ssuseree3bdd
 
introduction into IR
ssusere3b1a2
 
The vector space model
pkgosh
 
IRS-total ppts.pdf which have the detail abt the
MARasheed3
 
Ir models
Ambreen Angel
 
Jarrar: Introduction to Information Retrieval
Mustafa Jarrar
 
IMPROVING SEARCH ENGINES BY DEMOTING NON-RELEVANT DOCUMENTS
kevig
 
Ad

Recently uploaded (20)

PPTX
business incubation centre aaaaaaaaaaaaaa
hodeeesite4
 
PPTX
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
PDF
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
PPTX
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
PPTX
22PCOAM21 Data Quality Session 3 Data Quality.pptx
Guru Nanak Technical Institutions
 
PPT
SCOPE_~1- technology of green house and poyhouse
bala464780
 
PDF
Chad Ayach - A Versatile Aerospace Professional
Chad Ayach
 
PDF
July 2025: Top 10 Read Articles Advanced Information Technology
ijait
 
PDF
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
PDF
Top 10 read articles In Managing Information Technology.pdf
IJMIT JOURNAL
 
PPTX
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
PDF
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
PPTX
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
PDF
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
PPTX
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
PPTX
Introduction of deep learning in cse.pptx
fizarcse
 
PPT
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
PDF
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
PPTX
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
PPTX
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
business incubation centre aaaaaaaaaaaaaa
hodeeesite4
 
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
22PCOAM21 Data Quality Session 3 Data Quality.pptx
Guru Nanak Technical Institutions
 
SCOPE_~1- technology of green house and poyhouse
bala464780
 
Chad Ayach - A Versatile Aerospace Professional
Chad Ayach
 
July 2025: Top 10 Read Articles Advanced Information Technology
ijait
 
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
Top 10 read articles In Managing Information Technology.pdf
IJMIT JOURNAL
 
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
Introduction of deep learning in cse.pptx
fizarcse
 
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
Ad

191CSEH IR UNIT - II for an engineering subject

  • 1. Boolean and vector-space retrieval models- Term weighting – TF-IDF weighting- cosine similarity – Preprocessing – Inverted indices – efficient processing with sparse vectors – Language Model based IR – Probabilistic IR –Latent Semantic Indexing – Relevance feedback and query expansion. CO2: Apply the knowledge of data structures and indexing methods in information retrieval. UNIT – II
  • 2. Modeling • Modeling in IR is a complex process aimed at producing a ranking function. • Ranking function: a function that assigns scores to documents with regard to a given query. • This process consists of two main tasks: • The conception of a logical framework for representing documents and queries • The definition of a ranking function that allows quantifying the similarities among documents and queries • IR systems usually adopt index terms to index and retrieve documents
  • 3. An IR model is a quadruple [D, Q, F, R(qi, dj)] where 1. D is a set of logical views for the documents in the collection 2. Q is a set of logical views for the user queries 3. F is a framework for modeling documents and queries 4. R(qi, dj) is a ranking function Taxonomy of IR Models • Retrieval models most frequently associated with distinct combinations of a document logical view and a user task. The users task includes retrieval and browsing.
  • 4. ii) Filtering The queries remain relatively static while new documents come into the system i) Ad Hoc Retrieval: The documents in the collection remain relatively static while new queries are submitted to the system. Classic IR model: Each document is described by a set of representative keywords called index terms. Assign a numerical weight to distinct relevance between index terms. Three classic models: Boolean, vector, probabilistic
  • 5. Boolean Model: • Model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. • The model views each document as just a set of words. Based on a binary decision criterion without any notion of a grading scale. Boolean expressions have precise semantics. Vector Model • Assign non-binary weights to index terms in queries and in documents. Compute the similarity between documents and query. More precise than Boolean model.
  • 6. Probabilistic Model •The probabilistic model tries to estimate the probability that the user will find the document dj relevant with ratio •P(dj relevant to q)/P(dj no relevant to q) •Given a user query q, and the ideal answer set R of the relevant documents, the problem is to specify the properties for this set. • Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; ideal answer set R should maximize the overall probability of relevance.
  • 8. •Boolean Retrieval Models • The Boolean retrieval model is a model for information retrieval in which the query is in the form of a Boolean expression of terms, combined with the operators AND, OR, and NOT. The model views each document as just a set of words. • Simple model based on set theory and Boolean algebra • The Boolean model predicts that each document is either relevant or non- relevant. • Example : • A fat book which many people own is Shakespeare's Collected Works. • Problem : To determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia.
  • 9. • Method1 : Using Grep • The simplest form of document retrieval is for a computer to do the linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. • To Perform simple querying of modest collections , we need : • To process large document collections quickly. • To allow more flexible matching operations. • For example, it is impractical to perform the query “Romans NEAR countrymen” with grep , where NEAR might be defined as “within 5 words” or “within the same sentence”. • To allow ranked retrieval: in many cases we want the best answer to an information need among many documents that contain certain words. The way to avoid linearly scanning the texts for each query is to index documents in advance.
  • 10. • Method2: Using Boolean Retrieval Model • The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words. • Terms are the indexed units. we have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. The result is a binary term-document incidence matrix, as in Figure.
  • 11. Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello. Macbeth Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 Mercy 1 0 1 1 1 1 Worser 1 0 1 1 1 0 A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise. 1.To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100
  • 12. Solution: Antony and Cleopatra and Hamlet Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia. Consider N = 106 documents, each with about 1000 tokens ⇒ total of 109 tokens On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 ・ 109 = 6 GB Assume there are M = 500,000 distinct terms in the collection M = 500,000 × 106 = half a trillion 0s and 1s. But the matrix has no more than one billion 1s.Matrix is extremely sparse. What is a better representation? We only record the 1s. (Inverted Index)
  • 13. Term weighting • Search Engine should return in order the documents most likely to be useful to the searcher. To achieve this, ordering documents with respect to a query - called Ranking • Term-Document Incidence Matrix • A Boolean model only records term presence or absence, assign a score – say in [0, 1] – to each document, it measures how well document and query “match” • For One-term query “BRUTUS”, score is 1 if it is present in the document, 0 otherwise, more appearances of term in document have higher score Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello. Macbeth Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 Mercy 1 0 1 1 1 1 Worser 1 0 1 1 1 0
  • 14. • Term Frequency tf • One of the weighting scheme is Term Frequency and is denoted tft,d, with the subscripts denoting the term and the document in order. • Term frequency TF (t, d) of term t in document d = number of times that t occurs in d • Ex: Term-Document Count Matrix • but we would like to give more weight to documents that have a term several times as opposed to ones that contain it only once. To do this we need term frequency information the number of times a term occurs in a document. • Assign a score to represent the number of occurrences
• 15. Each document is represented by a count vector ∈ N^|V| (one component per vocabulary term).
• 16. Bag of Words Model
The exact ordering of the terms in a document is ignored, but the number of occurrences of each term matters.
Example: two documents with similar bag-of-words representations are treated as similar in content. "Mary is quicker than John" and "John is quicker than Mary" receive the same representation.
This is called the bag of words model. In a sense it is a step back: a positional index was able to distinguish these two documents.
How to use tf for query-document match scores? Raw term frequency is not what we want: a document with 10 occurrences of a term is more relevant than a document with 1 occurrence, but not 10 times more relevant. We therefore use log-frequency weighting.
• 17. Log-Frequency Weighting
The log-frequency weight of term t in document d is
  w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise.
So tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Document Frequency & Collection Frequency
Document frequency DF(t): the number of documents in the collection that contain term t.
Collection frequency CF(t): the total number of occurrences of term t in the collection.
Example (term "do" in four documents): TF(do, d1) = 2, TF(do, d2) = 0, TF(do, d3) = 3, TF(do, d4) = 3 ⇒ CF(do) = 8, DF(do) = 3.
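The piecewise weight above can be sketched as a short Python function (the name is illustrative); it reproduces the 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4 mapping:

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0
```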
• 18. Inverse Document Frequency (idf Weight)
It estimates the rarity of a term in the whole document collection: idf_t is an inverse measure of the informativeness of t.
df_t is the document frequency of t: the number of documents that contain t (df_t ≤ N).
idf_t = log10(N / df_t); the logarithm is used instead of the raw ratio N/df_t to dampen the effect of idf.
N: the total number of documents in the collection (for example: 806,791 documents).
• IDF(t) is high if t is a rare term
• IDF(t) is low if t is a frequent term
• 19. Inverse Document Frequency (idf Weight): example
With N = 1,000,000 documents: idf_t = log10(1,000,000 / df_t).
E.g., df_t = 1 ⇒ idf_t = 6; df_t = 1,000 ⇒ idf_t = 3; df_t = 1,000,000 ⇒ idf_t = 0.
• 20. TF-IDF Weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight; it is the best-known weighting scheme in information retrieval.
• TF(t, d) measures the importance of term t in document d; IDF(t) measures the importance of t in the whole collection of documents.
• TF/IDF weighting: putting TF and IDF together:
  TFIDF(t, d) = TF(t, d) × IDF(t), or (1 + log10 TF(t, d)) × IDF(t) if log-frequency weighting is used.
1. High if t occurs many times in a small number of documents, i.e., highly discriminative in those documents.
2. Not high if t appears infrequently in a document, or is frequent in many documents, i.e., not very discriminative.
3. Low if t occurs in almost all documents, i.e., no discrimination at all.
• 21. Vector Space Retrieval Model
The representation of a set of documents as vectors in a common vector space is known as the vector space model. It is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering.
First, each document is represented as a binary vector:
Columns: Anthony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
ANTHONY    1 1 0 0 0 1
BRUTUS     1 1 0 1 0 0
CAESAR     1 1 0 1 1 1
CALPURNIA  0 1 0 0 0 0
CLEOPATRA  1 0 0 0 0 0
MERCY      1 0 1 1 1 1
WORSER     1 0 1 1 1 0
• 22. Each document is now represented as a count vector:
Columns: Anthony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
ANTHONY    157  73  0  0  0  1
BRUTUS       4 157  0  2  0  0
CAESAR     232 227  0  2  1  0
CALPURNIA    0  10  0  0  0  0
CLEOPATRA   57   0  0  0  0  0
MERCY        2   0  3  8  5  8
WORSER       2   0  1  1  1  5
• 23. Finally, each document is represented by a tf-idf weight vector: Binary → Count → Weight matrix.
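These weight vectors are typically compared with cosine similarity (listed in the syllabus above); a minimal sketch, with toy vectors chosen for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q  = [1.0, 1.0, 0.0]   # query vector
d1 = [2.0, 2.0, 0.0]   # same direction as q -> similarity ~1
d2 = [0.0, 0.0, 3.0]   # orthogonal to q    -> similarity 0
```

Because the measure is normalized by vector length, a long document is not favored over a short one merely for repeating terms.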
• 24. Preprocessing (from the Manning textbook) • We need to deal with the format and language of each document. What format is it in: pdf, word, excel, html, etc.? • What language is it in? • What character set is in use? Each of these is a classification problem.
• 25. Tokenization: • The task of splitting the document into pieces called tokens. • Ex: Hewlett-Packard, state-of-the-art • Normalization • We need to "normalize" terms in indexed text as well as query terms into the same form. • Example: we want to match U.S.A. and USA • Stop words • Stop words are extremely common words which appear to be of little value in helping select documents matching a user need. • Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with • Lemmatization & Stemming • Reduce inflectional/variant forms to a base form • Example: am, are, is → be
  • 26. • Document pre-processing includes 5 stages: • Lexical analysis • Stopword elimination • Stemming • Index-term selection • Construction of thesauri • Lexical analysis • Objective: Determine the words of the document. Lexical analysis separates the input alphabet into • Word characters (e.g., the letters a-z) • Word separators (e.g., space, newline, tab)
• 27. • Stopword Elimination • Objective: filter out words that occur in most of the documents. • Such words have no value for retrieval purposes, so these words are removed from the index. • Stemming • Objective: replace all the variants of a word with the single stem of the word. Variants include plurals, gerund forms (-ing), third-person suffixes, past-tense suffixes, etc. • Index term selection (indexing) • Objective: increase efficiency by extracting from the resulting document a selected set of terms to be used for indexing the document. • If a full-text representation is adopted, then all words are used for indexing. • Indexing is a critical process: a user's ability to find documents on a particular subject is limited by the indexing process having created index terms for that subject.
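The lexical analysis, stopword elimination, and stemming stages can be sketched as a toy pipeline; the stop list and the suffix-stripping "stemmer" below are drastically simplified stand-ins for a real stemmer such as Porter's:

```python
import re

STOP = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}

def preprocess(text):
    """Lowercase, split on non-alphanumerics (lexical analysis),
    drop stop words, then strip a few common suffixes (toy stemming)."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    tokens = [t for t in tokens if t not in STOP]
    stemmed = []
    for t in tokens:
        for suf in ("ing", "ed", "es", "s"):
            # only strip when a reasonably long stem remains
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed
```

The crude suffix rules over- and under-stem ("stemming" → "stemm"); a production index would use a linguistically informed stemmer or lemmatizer instead.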
• 28. Language models: IR approaches • Boolean retrieval - Boolean constraints on term occurrences in documents; no ranking. • Vector space model - Queries and documents are represented as vectors in a high-dimensional space; a notion of similarity (cosine similarity) implies a ranking. • Probabilistic model - Rank documents by the probability P(R|d, q); estimate P(R|d, q) using relevance feedback techniques. • Language model approach - A document is a good match to a query if the document's model is likely to generate the query, i.e., if the document contains the query words often. • A language model is a probability distribution over sequences of words. These probability distributions are called language models. They are useful in many natural language processing applications. • Ex: part-of-speech tagging, speech recognition, machine translation, and information retrieval
• 30. • Traditional language model • The traditional language model uses finite automata and is a generative model. • The accompanying diagram (not reproduced here) shows a simple finite automaton and some of the strings in the language it generates: → marks the start state of the automaton and a double circle indicates a (possible) finishing state.
• 31. • Types of language models • Unigram language model: • The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model. • A unigram model used in information retrieval can be treated as the combination of several one-state finite automata. • Bigram language models • There are many more complex kinds of language models, such as bigram language models, which condition on the previous term.
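A unigram query-likelihood model needs smoothing so that a single unseen query term does not zero out a document's score. A minimal sketch using Jelinek-Mercer smoothing, which mixes the document model with the collection model (the toy documents and λ value are illustrative):

```python
# P(q|d) = product over query terms t of
#          [ lam * P(t|Md) + (1 - lam) * P(t|Mc) ]
from collections import Counter

docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal here means metal shears".split(),
}
collection = [t for d in docs.values() for t in d]
cmodel = Counter(collection)   # collection model counts
clen = len(collection)

def score(query, doc_tokens, lam=0.5):
    dmodel = Counter(doc_tokens)
    dlen = len(doc_tokens)
    p = 1.0
    for t in query.split():
        p *= lam * dmodel[t] / dlen + (1 - lam) * cmodel[t] / clen
    return p
```

For the query "click shears", d1 generates the query with higher probability, so it ranks first even though d2 also contains "shears".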
• 32. • LMs vs. vector space model: commonalities • Term frequency is directly in the model, but it is not scaled in LMs. • Probabilities are inherently "length-normalized"; cosine normalization does something similar for the vector space model. • Mixing document and collection frequencies has an effect similar to idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking. • LMs vs. vector space model: differences • LMs: based on probability theory. • Vector space: based on similarity, a geometric/linear-algebra notion. • Collection frequency vs. document frequency. • Details of term frequency, length normalization, etc.
• 33. Probabilistic information retrieval • Given a user query q and the ideal answer set R of the relevant documents, the problem is to specify the properties of this set. • Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance. • The probabilistic model tries to estimate the probability that the user will find document dj relevant, via the ratio • P(dj relevant to q) / P(dj not relevant to q) • Probability theory provides a principled foundation for such reasoning under uncertainty. The model estimates how likely a document is to be relevant to an information need.
• 34. Latent semantic indexing • Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. Why we use LSI in information retrieval • LSI takes documents that are semantically similar (= talk about the same topics) but are not similar in the vector space (because they use different words) and re-represents them in a reduced vector space in which they have higher similarity. • LSI: comparison to other approaches • Relevance feedback and query expansion are used to increase recall in information retrieval - useful when query and documents have (in the extreme case) no terms in common. • LSI increases recall, though it may hurt precision. • Thus, it addresses the same problems as (pseudo) relevance feedback and query expansion.
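A minimal LSI sketch, assuming NumPy is available (the toy term-document counts are illustrative): take the SVD of the term-document matrix and keep only the top k singular values to obtain a rank-k re-representation.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
C = np.array([
    [1, 0, 1, 0, 0, 0],   # ship
    [0, 1, 0, 0, 0, 0],   # boat
    [1, 1, 0, 0, 0, 0],   # ocean
    [1, 0, 0, 1, 1, 0],   # voyage
    [0, 0, 0, 1, 0, 1],   # trip
], dtype=float)

# Singular value decomposition: C = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Truncate to k dimensions ("latent concepts").
k = 2
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# Documents are now compared in the k-dimensional space Vt[:k, :],
# where "ship" and "boat" documents move closer together.
```

In this reduced space, documents using different but related vocabulary ("ship" vs. "boat") acquire nonzero similarity even with no terms in common.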
• 35. Relevance feedback and query expansion • Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are relevant / not relevant. • Query expansion: improve retrieval results by adding synonyms / related terms to the query. Sources for related terms: manual thesauri, automatic thesauri, query logs. These are two ways of improving recall: relevance feedback and query expansion.
  • 36. •Example1 : Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
  • 38. User feedback: Select what is relevant
  • 40. Pseudo relevance feedback / blind relevance feedback • Pseudo-relevance feedback automates the “manual” part of true relevance feedback. • Pseudo-relevance algorithm: • Retrieve a ranked list of hits for the user’s query • Assume that the top k documents are relevant. • Do relevance feedback (e.g., Rocchio) Types of user feedback • There are two types of feedback • Feedback on documents - More common in relevance feedback • Feedback on words or phrases - More common in query expansion Types of query expansion • Manual thesaurus (maintained by editors, e.g., PubMed) • Automatically derived thesaurus (e.g., based on co-occurrence statistics) • Query-equivalence based on query log mining (common on the web as in the “palm” example)
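The "do relevance feedback (e.g., Rocchio)" step can be sketched as the Rocchio update, here without the non-relevant-document term (the α, β values and the toy vectors are illustrative):

```python
# Rocchio update for pseudo-relevance feedback:
# q_new = alpha * q + beta * centroid(top-k retrieved docs)
def rocchio(q, rel_docs, alpha=1.0, beta=0.75):
    dims = len(q)
    centroid = [sum(d[i] for d in rel_docs) / len(rel_docs)
                for i in range(dims)]
    return [alpha * q[i] + beta * centroid[i] for i in range(dims)]

q = [1.0, 0.0, 0.0]                          # original query vector
top_docs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]  # assumed-relevant docs
q_new = rocchio(q, top_docs)
# The expanded query now carries weight on terms it never contained.
```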
  • 41. Automatic thesaurus generation •Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents •Fundamental notion: similarity between two words •Definition 1: Two words are similar if they co-occur with similar words. •“car” ≈ “motorcycle” because both occur with “road”, “gas” and “license”, so they must be similar. Example for manual thesaurus: PubMed
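Definition 1 can be sketched by comparing the sets of words that two terms co-occur with; the toy documents echo the car/motorcycle example, and Jaccard overlap stands in for a real similarity measure:

```python
from collections import defaultdict

docs = [
    "car road gas license",
    "motorcycle road gas license",
    "banana fruit yellow",
]

# Build a co-occurrence profile: term -> set of terms it appears with.
cooc = defaultdict(set)
for d in docs:
    toks = d.split()
    for t in toks:
        cooc[t].update(x for x in toks if x != t)

def sim(a, b):
    """Jaccard overlap of the words a and b co-occur with."""
    A, B = cooc[a], cooc[b]
    return len(A & B) / len(A | B) if A | B else 0.0
```

Here "car" and "motorcycle" never co-occur with each other, yet they score as similar because both co-occur with "road", "gas", and "license".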