Lecture 08
Information Retrieval
The Vector Space Model for Scoring
Introduction
 The representation of a set of documents as vectors in a common
vector space is known as the vector space model and is
fundamental to a host of information retrieval operations, ranging
from scoring documents on a query to document classification and
document clustering.
 We first develop the basic ideas underlying vector space scoring; a
pivotal step in this development is the view of queries as vectors
in the same vector space as the document collection.
The Vector Space Model for Scoring
Dot Products
 We denote by V(d) the vector derived from document d, with one
component in the vector for each dictionary term.
 Unless otherwise specified, you may assume that the components are
computed using the tf-idf weighting scheme, although the
particular weighting scheme is immaterial to the discussion that
follows.
 The set of documents in a collection then may be viewed as a set of
vectors in a vector space, in which there is one axis for each term.
 So we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply
this to a web search engine
 These are very sparse vectors – most entries are zero
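As a quick illustration (the tiny document here is invented for the example), such sparse vectors are naturally stored as term-to-weight mappings rather than dense |V|-length arrays:

```python
# Sparse term-frequency vectors: store only the nonzero components,
# since a document touches a tiny fraction of the |V| dimensions.
from collections import Counter

def tf_vector(text):
    """Naive whitespace tokenization followed by term counting."""
    return Counter(text.lower().split())

doc = tf_vector("gossip gossip jealous gossip affection")
print(doc)  # Counter({'gossip': 3, 'jealous': 1, 'affection': 1})
```

Any term absent from the document simply has an implicit weight of zero.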
 How do we quantify the similarity between two documents in this
vector space?
 A first attempt might consider the magnitude of the vector difference
between two document vectors.
 This measure suffers from a drawback: two documents with very
similar content can have a significant vector difference simply
because one is much longer than the other. Thus the relative
distributions of terms may be identical in the two documents, but the
absolute term frequencies of one may be far larger.
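To see the drawback numerically (toy term-frequency vectors, invented for illustration): if d2 is just d1 with every frequency doubled, the relative distribution of terms is identical, yet the Euclidean distance between the vectors is large:

```python
import math

d1 = [27, 3, 0, 14]
d2 = [54, 6, 0, 28]   # same relative distribution, twice the frequencies

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))
print(round(distance, 2))  # 30.56 -- as large as d1's own length
```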
 To compensate for the effect of document length, the standard way
of quantifying the similarity between two documents d1 and d2 is
to compute the cosine similarity of their vector representations:
sim(d1, d2) = V(d1) · V(d2) / ( |V(d1)| |V(d2)| )
 where the numerator represents the dot product (also known as
the inner product) of the vectors V(d1) and V(d2). The dot product
x · y of two vectors is defined as:
x · y = x1y1 + x2y2 + . . . + xMyM
 while the denominator is the product of their Euclidean lengths.
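A minimal sketch of the cosine-similarity computation just described (the vector values are illustrative only):

```python
import math

def dot(x, y):
    """Dot product: sum of componentwise products."""
    return sum(a * b for a, b in zip(x, y))

def euclidean_length(x):
    return math.sqrt(dot(x, x))

def cosine_similarity(x, y):
    """Dot product divided by the product of Euclidean lengths."""
    return dot(x, y) / (euclidean_length(x) * euclidean_length(y))

d1 = [27, 3, 0, 14]
d2 = [54, 6, 0, 28]   # d1 doubled: same direction, longer document
print(round(cosine_similarity(d1, d2), 4))  # 1.0
```

Note that doubling every frequency leaves the cosine unchanged, which is exactly the length-compensation the measure is designed for.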
 Let V(d) denote the document vector for d, with M components
V1(d) . . . VM(d). The Euclidean length of d is defined to be:
|V(d)| = sqrt( V1(d)² + V2(d)² + . . . + VM(d)² )
 The effect of the denominator is thus to length-normalize the
vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)|
and v(d2) = V(d2)/|V(d2)|
 We can then rewrite
sim(d1, d2) = V(d1) · V(d2) / ( |V(d1)| |V(d2)| )
as
sim(d1, d2) = v(d1) · v(d2)
 Example: for Doc1 with term-frequency components (27, 3, 0, 14),
the Euclidean length is
sqrt( 27² + 3² + 0² + 14² ) = sqrt(934) ≈ 30.56
 so the length-normalized vector is
(27/30.56, 3/30.56, 0/30.56, 14/30.56)
The Vector Space Model for Scoring
Cosine Similarity
 Example: [figure: worked examples of the cosine-similarity computation]
The Vector Space Model for Scoring
Queries as Vectors
 There is a far more compelling reason to represent
documents as vectors: we can also view a query as a vector.
 So, we represent queries as vectors in the space.
 Rank documents according to their proximity to the query in this
space.
 Consider the query q = jealous gossip
term Query
affection 0
jealous 1
gossip 1
wuthering 0
 Consider the query q = jealous gossip
 Log-frequency weighting:
term Query
affection 0
jealous 1
gossip 1
wuthering 0
 After length normalization:
term Query
affection 0
jealous 0.70
gossip 0.70
wuthering 0
 The key idea now: assign to each document d a score equal to
the dot product
V(q) · V(d)
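The query-vector numbers above can be reproduced as follows (using the common 1 + log10(tf) log-frequency weighting; the slide rounds 1/sqrt(2) ≈ 0.707 to 0.70):

```python
import math

tf = {"affection": 0, "jealous": 1, "gossip": 1, "wuthering": 0}

# Log-frequency weighting: 1 + log10(tf) when tf > 0, else 0.
w = {t: 1 + math.log10(f) if f > 0 else 0.0 for t, f in tf.items()}

# Length-normalize the query vector to a unit vector.
length = math.sqrt(sum(v * v for v in w.values()))
v_q = {t: v / length for t, v in w.items()}

print({t: round(v, 2) for t, v in v_q.items()})
# {'affection': 0.0, 'jealous': 0.71, 'gossip': 0.71, 'wuthering': 0.0}
```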
 After length normalization:
term Query
affection 0
jealous 0.70
gossip 0.70
wuthering 0
 Recall: we do this because we want to get away from the you're-
either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant
documents
 To summarize, by viewing a query as a “bag of words”, we are able
to treat it as a very short document.
 As a consequence, we can use the cosine similarity between the
query vector and a document vector as a measure of the score of
the document for that query.
 The resulting scores can then be used to select the top-scoring
documents for a query. Thus we have:
score(q, d) = V(q) · V(d) / ( |V(q)| |V(d)| )
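Putting query and document vectors together (toy weights on the axes affection, jealous, gossip, wuthering; all values invented for illustration), documents are ranked by cosine similarity to the query:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

q = [0, 1, 1, 0]                        # query "jealous gossip"
docs = {
    "d1": [10, 7, 11, 0],               # heavy on jealous/gossip
    "d2": [12, 0, 1, 6],
    "d3": [3, 2, 0, 9],
}

ranked = sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd3', 'd2'] -- d1 matches the query best
```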
The Vector Space Model for Scoring
Computing Vector Scores
 In a typical setting we have a collection of documents each
represented by a vector, a free text query represented by a vector,
and a positive integer K.
 We seek the K documents of the collection with the highest vector
space scores on the given query.
 Typically, we want these K top documents ordered by
decreasing score; for instance, many search engines use K = 10 to
retrieve and rank-order the first page of the ten best results.
 The array Length holds the lengths (normalization factors) for each
of the N documents, whereas the array Scores holds the scores for
each of the documents. When the scores are finally computed in
Step 9, all that remains in Step 10 is to pick off the K documents
with the highest scores.
 The outermost loop beginning at Step 3 repeats the updating of Scores,
iterating over each query term t in turn.
 In Step 5 we calculate the weight in the query vector for term t.
 Steps 6-8 update the score of each document by adding in the contribution
from term t.
 This process of adding in contributions one query term at a time is
sometimes known as term-at-a-time scoring or accumulation, and the N
elements of the array Scores are therefore known as accumulators.
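The term-at-a-time procedure can be sketched as follows. The data layout is an assumption for the sketch (postings map each term to (docID, weight) pairs; length holds the normalization factors), not the book's exact pseudocode:

```python
import heapq

def cosine_score(query_weights, postings, length, k):
    """Term-at-a-time scoring with one accumulator per document.

    query_weights: term -> weight of the term in the query vector
    postings:      term -> list of (doc_id, wf_td) pairs
    length:        doc_id -> Euclidean length (normalization factor)
    """
    scores = {}                               # the accumulators
    for t, w_tq in query_weights.items():     # one query term at a time
        for doc_id, wf_td in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + wf_td * w_tq
    for doc_id in scores:                     # length-normalize
        scores[doc_id] /= length[doc_id]
    # Pick off the K documents with the highest scores.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

postings = {"jealous": [(1, 2.0), (3, 1.0)], "gossip": [(1, 3.0), (2, 1.0)]}
length = {1: 5.0, 2: 2.0, 3: 4.0}
print(cosine_score({"jealous": 1.0, "gossip": 1.0}, postings, length, 2))
# [(1, 1.0), (2, 0.5)]
```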
 It would appear necessary to store, with each postings entry, the weight
wf(t,d) of term t in document d (we have thus far used either tf or tf-idf for
this weight, but leave open the possibility of other functions to be
developed in later sections).
 In fact this is wasteful, since storing this weight may require a floating
point number.
 Two ideas help alleviate this space problem:
 First, if we are using inverse document frequency, we need not precompute
idf(t); it suffices to store N/df(t) at the head of the postings for t.
 Second, we store the term frequency tf(t,d) for each postings entry.
 Finally, Step 12 extracts the top K scores – this requires a priority queue
data structure, often implemented using a heap. Such a heap takes no
more than 2N comparisons to construct, following which each of the K top
scores can be extracted in 2 log N steps.
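The heap-based selection can be sketched with Python's heapq (heapify builds the heap in linear time; each pop then costs O(log N)):

```python
import heapq

scores = {0: 0.12, 1: 0.91, 2: 0.33, 3: 0.75, 4: 0.48}

# heapq is a min-heap, so negate the scores to pop the largest first.
heap = [(-s, d) for d, s in scores.items()]
heapq.heapify(heap)                     # linear-time construction
top_k = [heapq.heappop(heap) for _ in range(3)]
print([(d, -s) for s, d in top_k])      # [(1, 0.91), (3, 0.75), (4, 0.48)]
```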