COURSE CONTENTS
Unit I Introduction to Information Retrieval (06 hrs)
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system block diagram.
Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighting, Probabilistic Indexing.
Clustering Techniques: Single pass algorithm, Single Link algorithm.
Text & Reference Book
Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, Pearson Education, ISBN: 81-297-0274-6
C.J. van Rijsbergen, Information Retrieval, 2nd edition (www.dcs.gla.ac.uk), ISBN: 978- 408709293
CO 1:
Understand the concept of Information retrieval and apply
clustering in information retrieval.
Prepared By : Prof. Datta S. Shingate
Definition of IR
• Retrieval – “Fetch something”
• Data – Raw alphanumeric values.
• Information – Processed data.
• Knowledge – What we know.
• Types of Information
  • Text
  • Images
  • Audio
  • Video
  • Source Code
  • Applications/Web services
  • XML and structured documents
Defining Data, Information, Knowledge & Wisdom
Definition of IR
• Goal
Find the documents most relevant to the user's query.
• Information Retrieval (IR)
Information retrieval (IR) deals with the organization, storage, retrieval and evaluation of
information from document repositories, particularly textual information. An IR system is the
software that carries out these tasks.
Data Retrieval Vs Information Retrieval

| Data Retrieval | Information Retrieval |
|---|---|
| Retrieves data based on the keywords in the query entered by the user. | Retrieves information based on the similarity between the query and the document. |
| There is no room for errors since it results in complete system failure. | Small errors are tolerated and will likely go unnoticed. |
| It has a defined structure with respect to semantics. | It is ambiguous and doesn’t have a defined structure. |
| Provides solutions to the user of the database system. | Does not provide a solution to the user of the database system. |
| Data Retrieval system produces exact results. | Information Retrieval system produces approximate results. |
| Displayed results are not sorted by relevance. | Displayed results are sorted by relevance. |
| E.g. SQL | E.g. Google Search Engine |
Text mining and Information Retrieval (IR)
Text mining is the process of extracting useful information and patterns from
large volumes of text, such as text databases.
IR System Block Diagram
Fig : Typical IR System (Black Box)
Fig : Information Retrieval (IR) Process
Evaluation Criteria
• Recall – the proportion of all relevant documents in the collection that is retrieved.
$$\text{Recall} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of relevant documents in the collection}} \times 100$$
• Precision – the proportion of the retrieved documents that is relevant.
$$\text{Precision} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of documents retrieved}} \times 100$$
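The two measures can be sketched in a few lines of Python. This is only an illustration: the function name `recall_precision` and the document identifiers are hypothetical, and returning 0 for an empty set is a convention assumed here, not something stated on the slide.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as percentages for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumed convention: report 0.0 when a denominator would be zero.
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 4 documents retrieved, 5 relevant in the collection, 3 in common.
retrieved_docs = ["d1", "d2", "d3", "d7"]
relevant_docs = ["d1", "d2", "d3", "d4", "d5"]
print(recall_precision(retrieved_docs, relevant_docs))  # (60.0, 75.0)
```

Here 3 of the 5 relevant documents are retrieved (recall 60%), and 3 of the 4 retrieved documents are relevant (precision 75%).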
Automatic Text Analysis
1. Document Representative
2. Text Summarization
3. Luhn’s Idea
Fig : Document, Document Representative, Predictions from Frequency of Words, Conflation Algorithm
Luhn’s Idea
Luhn’s idea says:
-> Words with very low frequency are not significant.
-> Words with very high frequency are also not significant
(e.g. “is”, “and”); these are the stop words.
-> Removing low-frequency words is easy:
-- Set a minimum frequency threshold.
-> Removing common (high-frequency) words:
-- Set a maximum frequency threshold
(statistically obtained).
-- Compare against a common-word list.
-> Used for summarizing technical documents.
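The frequency cut-offs described above can be sketched as follows. This is a minimal illustration, not Luhn's original procedure: the function name `significant_words`, the tiny `COMMON_WORDS` list, the sample sentence and the threshold values are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical common-word list; a real system would use a much longer one.
COMMON_WORDS = {"is", "and", "the", "a", "of", "in", "to"}


def significant_words(text, min_freq=2, max_freq=None):
    """Keep words that are neither too rare nor too common."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    keep = {}
    for word, freq in counts.items():
        if freq < min_freq:
            continue  # below the minimum frequency threshold
        if max_freq is not None and freq > max_freq:
            continue  # above the maximum frequency threshold
        if word in COMMON_WORDS:
            continue  # found in the common-word list
        keep[word] = freq
    return keep


sample = ("The index is built from the text and the index terms. "
          "Retrieval uses the index and the terms.")
print(significant_words(sample))  # {'index': 3, 'terms': 2}
```

With `min_freq=2`, words that occur once are dropped; "the" and "and" occur often enough but are removed by the common-word list.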
Conflation Algorithm
1. Open and read each input file and create a single index file.
2. Remove high-frequency words (stop words).
3. Remove suffixes/affixes from each word, if present.
4. Detect equivalent stems.
5. Store the stems in the index file.
{Compute, Computer, Computing} → Comput
{Walks, Walking, Walker} → Walk
{develop, developing, development, developments } → develop
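Steps 2 to 5 can be sketched with a toy suffix stripper. This is an assumption-laden illustration, not a standard stemmer such as Porter's: the suffix list, the stop-word list, the minimum stem length of 3 and the names `stem` and `conflate` are all hypothetical, chosen only so that the three example groups above conflate as shown.

```python
# Hypothetical stop-word and suffix lists, for illustration only.
STOP_WORDS = {"is", "and", "the", "a", "of"}
SUFFIXES = ("ments", "ment", "ing", "ers", "er", "es", "s", "e")  # longest first


def stem(word):
    """Strip the first matching suffix, keeping at least 3 letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(words):
    """Map each stem to the set of original words that reduce to it."""
    index = {}
    for word in words:
        if word.lower() in STOP_WORDS:
            continue  # step 2: drop stop words
        # steps 3-5: strip the suffix, group equal stems, store in the index
        index.setdefault(stem(word), set()).add(word.lower())
    return index


print(conflate(["Compute", "Computer", "Computing", "and", "Walks", "Walking"]))
```

A crude rule set like this over-strips and under-strips many English words; it is meant only to show where stemming sits in the conflation pipeline.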
High frequency words
Indexing Subsystem
Clustering in Information Retrieval
Medical Legal Financial
Documents Collection
Clustering in Information Retrieval
Similarity matrix
Objects: {1,2,3,4,5,6}
Threshold: 0.89
Graph Theoretic Approach
C1: {1,4,5,6}
C2: {2}
C3: {3}
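One common reading of the graph theoretic approach is: link two objects whenever their similarity reaches the threshold, then take each connected component as a cluster. The sketch below assumes that reading. The pairwise similarity values are hypothetical (the slide's matrix is not reproduced here) and were picked only so that the output has the same shape as the clusters listed above; the function name `threshold_clusters` is likewise an assumption.

```python
def threshold_clusters(objects, similarity, threshold):
    """Connected components of the graph whose edges have similarity >= threshold."""
    neighbours = {obj: set() for obj in objects}
    for (a, b), s in similarity.items():
        if s >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in objects:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over linked objects
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical similarities for objects 1..6 (upper triangle only).
similarity = {
    (1, 2): 0.30, (1, 3): 0.50, (1, 4): 0.90, (1, 5): 0.60, (1, 6): 0.95,
    (2, 3): 0.40, (2, 4): 0.20, (2, 5): 0.10, (2, 6): 0.30,
    (3, 4): 0.50, (3, 5): 0.70, (3, 6): 0.60,
    (4, 5): 0.92, (4, 6): 0.80, (5, 6): 0.85,
}
print(threshold_clusters([1, 2, 3, 4, 5, 6], similarity, 0.89))
# [[1, 4, 5, 6], [2], [3]]
```

Object 5 joins the first cluster only through its link to object 4, which illustrates the chaining behaviour of this kind of clustering.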
Similarity Measures
Jaccard’s Similarity Example
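Jaccard's coefficient is the size of the intersection of two sets divided by the size of their union. A minimal sketch, assuming documents are represented as sets of index terms; the two term sets are hypothetical, and returning 0.0 for two empty sets is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical documents represented by their index terms.
doc1 = {"information", "retrieval", "system"}
doc2 = {"information", "retrieval", "clustering", "algorithm"}
print(jaccard(doc1, doc2))  # 0.4
```

The two sets share 2 terms out of 5 distinct terms overall, giving 2/5 = 0.4.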
Single Pass Clustering Algorithm
1. Assign the first document D1 as the representative for C1.
2. For the next document Di, calculate the similarity S with the representative of each existing cluster.
3. If the maximum similarity Smax is greater than a threshold value ST, add Di to the corresponding cluster
and recalculate the cluster representative; otherwise, use Di to initiate a new cluster.
4. If any document Di remains to be clustered, return to step 2.
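The four steps can be sketched as below. The slides do not fix the similarity function or the form of the cluster representative, so this sketch assumes a dot product between term-weight vectors and a centroid (mean vector) as representative; the names `single_pass`, `dot` and `centroid`, the document vectors and the threshold of 4 are all hypothetical.

```python
def dot(u, v):
    """Dot product, used here as the (assumed) similarity measure."""
    return sum(x * y for x, y in zip(u, v))


def centroid(vectors):
    """Mean vector, used here as the (assumed) cluster representative."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def single_pass(docs, threshold, similarity=dot):
    clusters = []  # each cluster: {"members": [doc indices], "rep": vector}
    for i, vec in enumerate(docs):
        best, best_sim = None, None
        for cluster in clusters:  # step 2: compare with every representative
            s = similarity(vec, cluster["rep"])
            if best_sim is None or s > best_sim:
                best, best_sim = cluster, s
        if best is not None and best_sim > threshold:  # step 3: join best cluster
            best["members"].append(i)
            best["rep"] = centroid([docs[j] for j in best["members"]])
        else:  # steps 1 and 3: start a new cluster with this document
            clusters.append({"members": [i], "rep": list(vec)})
    return clusters


# Hypothetical term-weight vectors for four documents.
docs = [[1, 2, 0, 0], [3, 1, 0, 0], [0, 0, 2, 3], [0, 1, 3, 1]]
print([c["members"] for c in single_pass(docs, threshold=4)])  # [[0, 1], [2, 3]]
```

Because each document is seen once and assigned immediately, the result depends on the order in which documents arrive.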
Example of Single Pass Clustering Technique
Suppose that we have the following set of documents and terms, and
that we are interested in clustering the terms using the single pass
method. The threshold value is 10.
Single Link Clustering Algorithm
Dissimilarity matrix:
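As a minimal sketch of single-link clustering on a dissimilarity matrix: start with every object in its own cluster and repeatedly merge the two clusters whose closest members are least dissimilar, stopping when that smallest dissimilarity exceeds a cut-off. The matrix values, the cut-off of 4 and the function name `single_link` are hypothetical and are not taken from the slide.

```python
def single_link(dissim, cutoff):
    """Agglomerative single-link clustering; stop when the nearest pair exceeds cutoff."""
    clusters = [{i} for i in range(len(dissim))]

    def distance(c1, c2):
        # Single link: distance between clusters is that of their closest members.
        return min(dissim[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break  # remaining clusters are too far apart to merge
        clusters[a] |= clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical symmetric dissimilarity matrix for five objects (0..4).
dissim = [
    [0, 2, 6, 10, 9],
    [2, 0, 5, 9, 8],
    [6, 5, 0, 4, 5],
    [10, 9, 4, 0, 3],
    [9, 8, 5, 3, 0],
]
print(single_link(dissim, cutoff=4))  # [[0, 1], [2, 3, 4]]
```

The merges happen at dissimilarities 2, 3 and 4; the next candidate merge is at 5, which is above the cut-off, so two clusters remain.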
Thanks

Unit 1 Information Storage and Retrieval
