Retrieval and clustering of documents
Measuring similarity for retrieval
Given a set of documents, a similarity measure for retrieval determines how many documents are relevant to a particular category.

Cosine similarity for retrieval
Cosine similarity measures the similarity between two n-dimensional vectors by finding the cosine of the angle between them; it is often used to compare documents in text mining. Given two attribute vectors A and B with angle θ between them, the cosine similarity is expressed with a dot product and magnitudes as

$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a way of normalizing document length during comparison. In general the similarity ranges from −1 (exactly opposite) to 1 (the same direction, that is, identical term proportions), with 0 indicating orthogonal vectors, usually read as independence, and in-between values indicating intermediate similarity or dissimilarity.

In information retrieval, the cosine similarity of two documents ranges from 0 to 1, because term frequencies (and tf-idf weights) cannot be negative. Equivalently, the angle between two term frequency vectors cannot be greater than 90°.

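The formula can be tried out on small term-frequency vectors. The following is a minimal sketch rather than a reference implementation: the three-term vocabulary and the documents `doc_a`, `doc_b`, and `doc_c` are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)

# Term-frequency vectors over a hypothetical vocabulary ["data", "mining", "text"].
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same term proportions as doc_a, but twice as long
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: length is normalized away
print(cosine_similarity(doc_a, doc_c))  # 0.0: no terms in common
```

Because `doc_b` is just a longer version of `doc_a`, the two score as (numerically) identical, which is the length normalization mentioned above.
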
Web-based document search and link analysis
Link analysis has been used successfully for deciding which web pages to add to the collection of documents and how to order the documents matching a user query (i.e., how to rank pages). It has also been used to categorize web pages, to find pages that are related to given pages, to find duplicated web sites, and to address various other problems in web information retrieval.

Link analysis
Link analysis rests on two assumptions:
- A link from page A to page B is a recommendation of page B by the author of page A.
- If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected.

Applications
- Ranking query results (PageRank)
- Crawling
- Finding related pages
- Computing web page reputations
- Geographic scope
- Prediction
- Categorizing web pages
- Computing statistics of web pages and of search engines

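To make the "link as recommendation" idea concrete, here is a simplified power-iteration sketch in the spirit of PageRank. It is not the production algorithm: the damping factor 0.85, the fixed iteration count, the handling of pages without out-links, and the tiny graph `web` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumed rule: a page without out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked to by A, B and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C collects the most recommendations and ends up ranked first, while D, which nobody links to, keeps only the baseline share.
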
Document matching
Document matching is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Steps involved in document matching
A document matching system has two main tasks:
1. Find the documents relevant to a user query.
2. Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.

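The two tasks can be sketched end to end on a toy collection. Note that this sketch ranks by cosine similarity over tf-idf weights rather than by PageRank; the whitespace tokenizer, the smoothed idf formula, and the three sample records are assumptions chosen to keep the example short.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank_documents(query, documents):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def tfidf(tokens):
        tf = Counter(tokens)
        # Assumed smoothed idf, so unseen query terms do not divide by zero.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = tfidf(tokenize(query))
    scores = [(i, cosine(q_vec, tfidf(tokens)))
              for i, tokens in enumerate(tokenized)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical free-text records.
records = [
    "real estate records for the city",
    "newspaper article about the city election",
    "manual paragraph about engine repair",
]
ranking = rank_documents("city election", records)
for index, score in ranking:
    print(round(score, 3), records[index])
```

The record containing both query terms comes first, the one containing only "city" second, and the unrelated record scores zero.
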
k-means clustering
Given a set of observations $(x_1, x_2, \dots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), $S = \{S_1, S_2, \dots, S_k\}$, so as to minimize the within-cluster sum of squares

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2$$

where $\mu_i$ is the mean of the points in $S_i$.

K-means algorithm
0. Input: D = {d_1, d_2, …, d_n}; k = the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids do not change
7. Output: k clusters of documents

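The steps above can be sketched in a few lines of Python. This sketch uses the common batch variant (assign every point, then recompute all centroids) instead of the one-document-at-a-time update listed in the steps, and it measures closeness with squared Euclidean distance. Seeding the centroids with the first k points and the sample `points` are assumptions for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100):
    """Batch k-means on a list of equal-length numeric tuples."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment, centroids

# Hypothetical 2-D document vectors forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

Changing which points seed the centroids can change the result, which is the sensitivity to the initial clusters that k-means is known for.
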
Pros and cons of k-means
Advantages:
- Linear time complexity (in the number of documents, per iteration)
- Works relatively well in low-dimensional spaces
Drawbacks:
- Distance computation in high-dimensional spaces
- The centroid vector may not summarize the cluster's documents well
- The initial k clusters affect the quality of the final clusters

Hierarchical clustering
0. Input: D = {d_1, d_2, …, d_n}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: dendrogram of clusters

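A naive version of this merge loop is sketched below. It makes two simplifying assumptions: cluster similarity is single-link (the similarity of the closest pair), and similarities are recomputed on every pass rather than kept in an updated SIM matrix, which is slower but shorter. The `similarity` function and the sample `values` are hypothetical.

```python
def agglomerative(items, similarity, target_clusters=1):
    """Naive agglomerative clustering.

    Returns the final clusters (lists of item indices) and the merge
    history, which carries the information a dendrogram would show.
    """
    clusters = [[i] for i in range(len(items))]
    merges = []

    def cluster_similarity(a, b):
        # Single link: similarity of the closest pair across the two clusters.
        return max(similarity(items[i], items[j]) for i in a for j in b)

    while len(clusters) > target_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((clusters[x], clusters[y], s))
        merged = clusters[x] + clusters[y]
        del clusters[y]       # y > x, so position x is unaffected
        clusters[x] = merged
    return clusters, merges

# Hypothetical 1-D data; similarity is the negated distance.
values = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters, merges = agglomerative(values, lambda a, b: -abs(a - b), target_clusters=3)
print(clusters)
for left, right, s in merges:
    print(left, "+", right, "at similarity", round(s, 2))
```

Stopping at `target_clusters=1` instead records the full merge sequence from singletons up to one cluster.
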
Pros and cons of hierarchical clustering
Advantages:
- Produces better-quality clusters
- Works relatively well in low-dimensional spaces
Drawbacks:
- Distance computation in high-dimensional spaces
- Quadratic time complexity

The EM algorithm for clustering
Let the analyzed object be described by two random variables, written here as x (observed) and z (hidden), which are assumed to have a joint probability distribution $p(x, z \mid \theta)$.

The distribution is known up to its parameter(s) $\theta$. It is assumed that we are given a set of samples $\{x_1, \dots, x_n\}$ drawn independently from the distribution.

The Expectation-Maximization (EM) algorithm is an optimization procedure that computes the maximum-likelihood (ML) estimate of the unknown parameter $\theta$ when only incomplete data (z is unknown) are available. In other words, the EM algorithm maximizes the (log-)likelihood function

$$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) = \sum_{i=1}^{n} \log \sum_{z} p(x_i, z \mid \theta)$$

where the inner sum runs over the values of the hidden variable z (for clustering, the cluster label). In practice EM converges to a local maximum of this function.

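As a concrete instance, the sketch below fits a mixture of two one-dimensional Gaussians, where the hidden variable is the component that generated each sample. The choice of model, the initialization (means at the data extremes, shared variance, equal weights), the variance floor, and the sample data are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    # Assumed initialization: means at the extremes, shared variance, equal weights.
    means = [min(xs), max(xs)]
    overall_mean = sum(xs) / len(xs)
    var0 = sum((x - overall_mean) ** 2 for x in xs) / len(xs)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability that each sample came from each component.
        resp = []
        for x in xs:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances

# Hypothetical samples drawn around 1.0 and 5.0.
samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(samples)
print(weights, means, variances)
```

Unlike k-means, each sample is assigned to both clusters with a probability, and a hard clustering can be read off by taking the more probable component.
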
Evaluation of clustering
What is a good clustering?
Internal criterion: a good clustering produces high-quality clusters in which
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.

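One simple way to check this internal criterion is to compare the average pairwise similarity within clusters against the average across clusters. The sketch below does that; the similarity function `closeness`, the sample `values`, and the two labelings are hypothetical.

```python
from itertools import combinations

def intra_inter_similarity(items, labels, similarity):
    """Average pairwise similarity within clusters and across clusters."""
    intra, inter = [], []
    for (a, label_a), (b, label_b) in combinations(zip(items, labels), 2):
        bucket = intra if label_a == label_b else inter
        bucket.append(similarity(a, b))

    def mean(scores):
        return sum(scores) / len(scores) if scores else float("nan")

    return mean(intra), mean(inter)

def closeness(a, b):
    # Assumed similarity for 1-D data: 1 for identical values, falling towards 0.
    return 1.0 / (1.0 + abs(a - b))

values = [1.0, 1.5, 8.0, 8.5]
good_intra, good_inter = intra_inter_similarity(values, [0, 0, 1, 1], closeness)
bad_intra, bad_inter = intra_inter_similarity(values, [0, 1, 0, 1], closeness)
print(good_intra, good_inter)  # intra well above inter
print(bad_intra, bad_inter)    # intra below inter
```

Swapping in a different similarity function or document representation changes both numbers, which is why the measured quality depends on those choices.
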
Conclusion
In this presentation we learned about:
- Measuring similarity for retrieval
- Web-based document search and link analysis
- Document matching
- Clustering by similarity
- Hierarchical clustering
- The EM algorithm for clustering
- Evaluation of clustering

Visit more self-help tutorials
Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
