Lecture 02
Information Retrieval
[Figure: the retrieval process, with boxes labeled Information Need, User Task, Query Formulation, Search Engine, Collection, Results, and Query Re-Formulation.]
[Figure: the same diagram, annotated with three failure points: Misconception, Misformulation, and Mis-Reformulation.]
Boolean Retrieval Model
An Example of an IR Problem
• Task:
- Suppose you wanted to determine which plays of Shakespeare
contain the words Brutus AND Caesar AND NOT Calpurnia.
- One way to do that is to read through all the text, noting for each
play whether it contains Brutus and Caesar and excluding it from
consideration if it contains Calpurnia.
- This sort of linear scan through documents is actually the
simplest form of document retrieval for a computer to do.
- This process is commonly referred to as grepping through text,
after the Unix command grep, which performs this process.
- This can be a very efficient process for wildcard pattern
matching.
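The linear scan described above can be sketched in a few lines of Python. This is a minimal illustration and not how grep itself is implemented: the `plays` dictionary, its toy contents, and the function name `linear_scan` are hypothetical, and every text is read in full again for each query.

```python
import re

def linear_scan(plays, required, excluded):
    """Return titles whose text contains every required word and no excluded word."""
    hits = []
    for title, text in plays.items():
        # The whole text is re-tokenized on every query: this is the cost of grepping.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(w in words for w in required) and not any(w in words for w in excluded):
            hits.append(title)
    return hits

# Toy stand-ins for full play texts (hypothetical data).
plays = {
    "play_a": "Brutus spoke and Caesar listened",
    "play_b": "Brutus and Caesar met Calpurnia",
    "play_c": "Caesar alone",
}
matches = linear_scan(plays, required=["brutus", "caesar"], excluded=["calpurnia"])
print(matches)
```

The work per query grows with the total size of the collection, which is what motivates building an index in advance.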
• In other scenarios, we need more than the “grepping”
function:
1. To process large document collections quickly. The amount
of online data has grown at least as quickly as the speed of
computers, and we would now like to be able to search
collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it
is impractical to perform the query Romans NEAR countrymen
with grep, where NEAR might be defined as “within 5 words” or
“within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best
answer to an information need among many documents that
contain certain words.
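To make the second point concrete, here is one way a NEAR operator could be evaluated once word positions are known. It is a sketch under assumptions: `near` and its parameter `k` are hypothetical names, "within k words" is taken to mean a position difference of at most k, and tokenization is a simple lowercase letter match.

```python
import re

def near(text, term_a, term_b, k=5):
    """True if term_a and term_b occur within k word positions of each other."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    # Compare every pair of positions; fine for a sketch, quadratic in the worst case.
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

close = near("Friends, Romans, countrymen, lend me your ears", "romans", "countrymen")
far = near("romans one two three four five six countrymen", "romans", "countrymen")
print(close, far)
```

A plain substring or regular-expression match has no notion of word positions, which is why this kind of query is awkward to express with grep.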
• The idea behind building indexes:
- To avoid linearly scanning the texts for each query, we build
an index for the documents in advance.
- The result for our initial task would be a binary
term-document incidence matrix.
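• To answer the query “Brutus AND Caesar AND NOT
Calpurnia”, we take the vectors for Brutus, Caesar
and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
Answer: the 1 bits in 100100 mark the matching plays, here the first and fourth columns of the incidence matrix.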
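The bitwise computation above can be checked directly with Python integers. This is a small sketch: the Brutus and Caesar vectors are the ones given in the text, the Calpurnia vector 010000 is inferred from its stated complement 101111, and the variable names are illustrative.

```python
WIDTH = 6  # number of plays (columns) in the example

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # inferred: its 6-bit complement is 101111, as in the text

mask = (1 << WIDTH) - 1            # 0b111111, keeps the complement to 6 bits
not_calpurnia = ~calpurnia & mask

result = brutus & caesar & not_calpurnia
answer = format(result, "06b")
# The leftmost bit is column 1, so list the 1-based positions of the 1 bits.
columns = [i + 1 for i, bit in enumerate(answer) if bit == "1"]
print(answer, columns)
```

The mask matters because Python integers are unbounded, so `~calpurnia` on its own is a negative number rather than a 6-bit pattern.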
• The Boolean retrieval model is a model for information
retrieval in which we can pose any query that is in the form of a
Boolean expression of terms, that is, in which terms are
combined with the operators AND, OR, and NOT.
• Among the limitations of this model is that it views each
document as just a set of words.
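Because the model views each document as a set of words, the three operators map naturally onto set operations over document identifiers. The sketch below assumes a tiny hypothetical collection and a hypothetical helper `docs_with`; it is meant only to illustrate the correspondence.

```python
# Hypothetical three-document collection.
docs = {
    1: "brutus killed caesar",
    2: "caesar married calpurnia and trusted brutus",
    3: "caesar ruled rome",
}
# Each document is viewed as just a set of words (order and counts are discarded).
doc_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}
all_ids = set(doc_sets)

def docs_with(term):
    """Set of docIDs whose word set contains the term."""
    return {doc_id for doc_id, words in doc_sets.items() if term in words}

and_result = docs_with("brutus") & docs_with("caesar")    # AND -> intersection
or_result = docs_with("rome") | docs_with("calpurnia")    # OR  -> union
not_result = all_ids - docs_with("calpurnia")             # NOT -> complement
query = docs_with("brutus") & docs_with("caesar") & not_result
print(and_result, or_result, not_result, query)
```

Note that the result is an unordered set: a document either matches or it does not, with no ranking among the matches.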
Further Terminology & Notations
• Documents: whatever units we have decided to build a retrieval system
over.
• Collection: the group of documents over which we perform retrieval.
Sometimes it is also referred to as a corpus.
- In our previous example, the collection is “Shakespeare’s Collected Works” and each play is a document.
• Ad hoc retrieval: the most standard IR task. In it, a system aims to
provide documents from within the collection that are relevant to an
arbitrary user information need, communicated to the system by means of
a one-off, user-initiated query (a temporary information need). This model is
also known as the pull model of text access.
• Recommender systems: when the user has a stable information need
(a research topic, an interest in sports news), the system takes the initiative and
recommends topics that are related to the user’s information need. This
model is also known as the push model of text access.
• Information need: the topic about which the user desires to know more.
• Query: what the user conveys to the computer in an attempt to
communicate the information need.
• Relevance: a document is relevant if it is one that the user
perceives as containing information of value with respect to their
personal information need.
- We would like to find relevant documents regardless of whether they
use the exact words of the query or express the concept
we are looking for in other words.
• Effectiveness: the effectiveness of an IR system is the quality of its search results. It
is measured by how relevant the returned results are to the
information need behind a given query.
• Two key statistics about the system’s returned results are used to
measure effectiveness:
- Precision: What fraction of the returned results are relevant to the
information need?
$$P = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}$$
- Recall: What fraction of the relevant documents in the collection
were returned by the system?
$$R = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in the collection}}$$
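The two formulas can be computed from a set of returned docIDs and a set of relevant docIDs. The function name `precision_recall`, the sample docIDs, and the choice to return 0.0 when a denominator is empty are assumptions made for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: 4 documents returned, 5 relevant in the collection, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```

Here 2 of the 4 returned documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).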
Building a term-document matrix
• In practice, we cannot build a full term-document incidence
matrix for a realistically sized collection.
Let d = 1 million documents, each containing about 1,000
words.
* Assume an average of 6 bytes per word, including spaces and
punctuation; then this is a document collection about 6 GB in
size.
* Assume we have about M = 500,000 distinct terms in these
documents.
A 500K × 1M matrix has half a trillion 0’s and 1’s, too many to fit
in a computer’s memory.
But the crucial observation is that the matrix is extremely sparse,
that is, it has few non-zero entries.
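The sparsity claim follows from the numbers above: a document of about 1,000 words can contribute at most 1,000 ones to its column. The short calculation below works this out; the variable names are illustrative.

```python
num_docs = 1_000_000
words_per_doc = 1_000
bytes_per_word = 6
num_terms = 500_000

collection_bytes = num_docs * words_per_doc * bytes_per_word
matrix_cells = num_terms * num_docs
# Each document has at most words_per_doc distinct terms, so at most that many 1s per column.
max_ones = num_docs * words_per_doc
min_zero_fraction = 1 - max_ones / matrix_cells
print(collection_bytes, matrix_cells, max_ones, min_zero_fraction)
```

So the matrix has 5 × 10^11 cells but at most 10^9 ones, which means at least 99.8% of the entries are zero.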
• So, a better representation is to record only the things that
do occur, that is, the 1 positions.
• This leads to a central idea in IR: the inverted index.
• An inverted index (or inverted file) is an index that always
maps back from terms to the parts of a document where
they occur.
- We keep a dictionary of terms (sometimes also referred
to as a vocabulary or lexicon).
- Then, for each term, we have a list that records which
documents the term occurs in.
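As a rough picture of this structure, the dictionary can be modeled as a Python dict whose keys are terms and whose values are lists of docIDs. The three documents below are hypothetical, and a real system would use more compact structures than Python lists.

```python
from collections import defaultdict

# Hypothetical collection: docID -> list of already-normalized terms.
docs = {
    1: ["brutus", "caesar"],
    2: ["caesar", "calpurnia"],
    3: ["brutus", "caesar", "antony"],
}

index = defaultdict(list)  # dictionary: term -> list of docIDs containing it
for doc_id in sorted(docs):              # ascending docIDs keep each list sorted
    for term in set(docs[doc_id]):       # record each (term, document) pair only once
        index[term].append(doc_id)

print(index["brutus"], index["caesar"], index["calpurnia"], len(index))
```

Only the 1 positions of the incidence matrix are stored: seven entries here, instead of a 4 × 3 grid of mostly zeros.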
Inverted Index
• Each item in the list – which records that a term appeared in a
document (and, later, often, the positions in the document) – is
conventionally called a posting.
• The list is then called a postings list (or inverted list), and all
the postings lists taken together are referred to as the postings.
Building an inverted index
• The major steps are:
1. Collect the documents to be indexed:
Doc1 = {Friends, Romans, countrymen}
Doc2 = {So let it be with Caesar}
Doc3 = {. . .}
2. Tokenize the text, turning each document into a list of
tokens:
Friends Romans countrymen So . . .
3. Do linguistic preprocessing, producing a list of normalized
tokens, which are the indexing terms:
friends romans countrymen so . . .
4. Index the documents that each term occurs in by creating
an inverted index, consisting of a dictionary and postings.
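Steps 2 and 3 can be sketched as follows for the two documents given above. This is a simplification: `tokenize` and `normalize` are hypothetical helpers, tokenization keeps runs of ASCII letters only, and lowercasing stands in for the full linguistic preprocessing a real system would do.

```python
import re

def tokenize(text):
    """Step 2: split raw text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(tokens):
    """Step 3 (simplified): lowercase each token to get indexing terms."""
    return [t.lower() for t in tokens]

doc1 = "Friends, Romans, countrymen"
doc2 = "So let it be with Caesar"
tokens = tokenize(doc1) + tokenize(doc2)
terms = normalize(tokens)
print(tokens[:4], terms[:4])
```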
• Within a document collection, we assume that each document
has a unique serial number, known as the document identifier
(docID).
• The input to indexing is a list of normalized tokens for each
document, which we can equally think of as a list of pairs of term
and docID.
• The core indexing step is sorting this list so that the terms are
alphabetical.
• The postings are secondarily sorted by docID. This provides the
basis for efficient query processing.
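The sort-based construction described above can be sketched as: flatten to (term, docID) pairs, sort, then group by term. The `intersect` function is an added illustration of why docID-sorted postings help at query time; it and the sample lists passed to it are assumptions of this sketch.

```python
from itertools import groupby

docs = {
    1: ["friends", "romans", "countrymen"],
    2: ["so", "let", "it", "be", "with", "caesar"],
}
# Flatten into (term, docID) pairs, then sort by term and secondarily by docID.
pairs = sorted((term, doc_id) for doc_id, terms in docs.items() for term in terms)
# Group runs of equal terms; a set removes repeated docIDs within a group.
index = {
    term: sorted({doc_id for _, doc_id in group})
    for term, group in groupby(pairs, key=lambda pair: pair[0])
}

def intersect(p1, p2):
    """AND of two docID-sorted postings lists in one linear pass."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

first_terms = list(index)[:3]
common = intersect([1, 2, 4, 11, 31], [2, 31, 54])  # hypothetical postings lists
print(first_terms, index["caesar"], common)
```

Because both lists are sorted, the merge never needs to look back, so the cost is proportional to the combined length of the two lists.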
Storage Requirements
• In the resulting index, we pay for storage of both the
dictionary and the postings lists.
• The dictionary is commonly kept in memory.
• Postings lists are normally kept on disk.
• So, the size of each is important.
What data structure should be used for a postings
list?
• A fixed-length array would be wasteful, as some words occur
in many documents and others in very few.
• For an in-memory postings list, there are two good alternatives:
- Singly linked lists allow cheap insertion of documents into
postings lists (following updates, such as when recrawling the
web for updated documents). They also naturally extend to more
advanced indexing strategies such as skip lists, which require
additional pointers.
- Variable-length arrays win in space requirements by avoiding
the overhead for pointers, and in time requirements because their
use of contiguous memory increases speed on modern
processors with memory caches.
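The two alternatives can be contrasted in a short sketch. The class `PostingNode` and the helpers `insert_sorted` and `to_list` are hypothetical names; Python's standard `array` module stands in for a variable-length array of docIDs. Python hides most of the memory-layout effects discussed above, so this shows the shape of each structure rather than its performance.

```python
from array import array

class PostingNode:
    """One posting in a singly linked list: a docID plus a pointer to the next node."""
    def __init__(self, doc_id, next_node=None):
        self.doc_id = doc_id
        self.next = next_node

def insert_sorted(head, doc_id):
    """Insert doc_id in sorted position, ignoring duplicates; returns the head."""
    if head is None or doc_id < head.doc_id:
        return PostingNode(doc_id, head)
    node = head
    while node.next is not None and node.next.doc_id < doc_id:
        node = node.next
    if node.doc_id != doc_id and (node.next is None or node.next.doc_id != doc_id):
        node.next = PostingNode(doc_id, node.next)  # splice in: nothing is shifted
    return head

def to_list(head):
    """Walk the linked list and collect its docIDs."""
    out = []
    while head is not None:
        out.append(head.doc_id)
        head = head.next
    return out

head = None
for doc_id in (4, 1, 3, 1):   # docIDs arriving out of order, with one repeat
    head = insert_sorted(head, doc_id)
linked = to_list(head)

# Variable-length array: contiguous unsigned integers, no per-posting pointer.
postings = array("I", [1, 3])
postings.append(4)            # cheap when docIDs arrive in increasing order
print(linked, list(postings))
```

Inserting into the middle of the array would require shifting later elements, which is the insertion cost the linked list avoids.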
Exercise
• Draw the inverted index that would be built for the
following document collection:
Doc 1 new home sales top forecasts
Doc 2 home sales rise in July
Doc 3 increase in home sales in July
Doc 4 July new home sales rise
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
A. Draw the term-document incidence matrix for this
document collection
B. What are the returned results for these queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
