CSMR: A Scalable Algorithm for 
Text Clustering with Cosine 
Similarity and MapReduce 
Giannakouris – Salalidis Victor - Undergraduate Student 
Plerou Antonia - PhD Candidate 
Sioutas Spyros - Associate Professor
Introduction 
• Big Data: Massive volumes of data produced by a very high 
rate of growth 
• Big Data must be handled in various domains: Business 
Intelligence, Bioinformatics, Social Media Analytics, etc. 
• Text Mining: Classification/Clustering in digital libraries, 
e-mail, Sentiment Analysis on Social Media 
• CSMR: Computes pairwise text similarity by representing text 
data in a vector space and measuring similarity in parallel 
using MapReduce
Background 
• Vector Space Model: An algebraic model for representing 
text documents as vectors 
• Efficient method for text similarity measurement
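As a rough illustration of the vector space model, the sketch below maps two toy documents onto raw term-count vectors over a shared vocabulary. The documents, the whitespace tokenization, and the names `docs`, `vocabulary`, and `vectors` are assumptions made for this example, not part of the slides.

```python
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization is an assumption.
docs = {"d1": "big data needs big clusters", "d2": "text data"}

# Count term occurrences per document.
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# The vocabulary fixes one vector dimension per distinct term.
vocabulary = sorted(set().union(*counts.values()))

# Each document becomes a vector of term counts in vocabulary order.
vectors = {doc_id: [c[t] for t in vocabulary] for doc_id, c in counts.items()}

print(vocabulary)
print(vectors)
```

Every document ends up with the same number of dimensions, which is what makes a vector measure such as cosine similarity applicable.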
TF-IDF 
• Term Frequency – Inverse Document Frequency 
• A numerical statistic that reflects the significance of a 
term in a corpus of documents 
• Usually used in search engines, text mining, text 
similarity in the vector space 
$$TF \times IDF = \frac{n_{i,j}}{|\{t \in d_j\}|} \times \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
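A small worked example may help fix the notation. The numbers are invented for illustration: a term occurring 3 times in a document of 100 terms, in a corpus of 1000 documents of which 10 contain the term. The slide does not state the base of the logarithm; the natural logarithm is assumed here.

```latex
\[
TF = \frac{3}{100} = 0.03, \qquad
IDF = \ln\frac{1000}{10} = \ln 100 \approx 4.605, \qquad
TF \times IDF \approx 0.03 \times 4.605 \approx 0.138
\]
```

A term that appears in every document would get $IDF = \ln 1 = 0$, so it contributes nothing to the document vector.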
Cosine Similarity 
• Cosine Similarity: A measure of similarity between two 
documents represented as vectors 
• Measures the cosine of the angle between the two vectors 
$$\cos(A,B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \, \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$
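The cosine formula can be sketched directly for dense vectors of equal length. The function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions of this example; the slides do not say how a zero vector should be handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat the undefined case as an error.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score (close to) 1.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# Vectors with no shared non-zero dimension score 0.
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Because the result depends only on direction, a long document and a short one with the same term proportions are treated as identical.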
Hadoop 
• Framework developed by Apache 
• Large-Scale Data Processing and Analytics 
• Scalable and parallel processing of data on large 
computer clusters using MapReduce 
• Runs on commodity, low-end hardware 
• Main Components: HDFS (Hadoop Distributed File 
System), MapReduce 
• Currently used by: Adobe, Yahoo!, Amazon, eBay, 
Facebook and many other companies
MapReduce 
• Programming Paradigm running on Apache Hadoop 
• One of the main components of Hadoop 
• Useful for processing of large data-sets 
• Breaks the data into key-value pairs 
• Model derived from map and reduce functions of 
Functional Programming 
• Every MR program consists of Mappers and Reducers
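The map, shuffle, and reduce steps can be imitated in a single Python process. This is only a sketch of the programming model, not of Hadoop's API: `run_mapreduce`, `length_mapper`, and `sum_reducer` are hypothetical names, and the toy job (a histogram of word lengths) is invented for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values that share a key are collected together.
            groups[key].append(value)
    # Reduce: one call per distinct key, with all of that key's values.
    return {key: reducer(key, values) for key, values in groups.items()}


def length_mapper(line):
    # Emit (word length, 1) for each word in the line.
    for word in line.split():
        yield len(word), 1


def sum_reducer(key, values):
    return sum(values)


lines = ["map and reduce", "key value pairs"]
histogram = run_mapreduce(lines, length_mapper, sum_reducer)
print(histogram)
```

On a real cluster the mapper and reducer calls run on different machines, and the shuffle moves data over the network; the program logic stays the same.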
MapReduce Diagram
CSMR 
• The proposed method, CSMR, combines all the 
above-mentioned techniques 
• Scalable algorithm for text clustering using the MapReduce model 
• Applies the MR model to TF-IDF and Cosine Similarity 
• 4 Phases: 
1. Word Counting 
2. Text Vectorization using term frequencies 
3. Apply TF-IDF on document vectors 
4. Cosine Similarity Measurement
Phase 1: Word Counting 
Algorithm 1: Word Count 
class Mapper 
    method Map( document ) 
        for each term ∈ document do 
            write ( ( term , docId ) , 1 ) 

class Reducer 
    method Reduce( ( term , docId ) , ones[ 1 , 1 , … , 1 ] ) 
        o = 0 
        for each one ∈ ones do 
            o = o + one 
        return ( ( term , docId ) , o ) 

/* o ∈ N : the number of occurrences of term in document docId */
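Phase 1 could be sketched in plain Python as follows. The in-memory grouping stands in for the MapReduce shuffle; the lowercase whitespace tokenizer, the toy corpus, and the function names are assumptions of this example.

```python
from collections import defaultdict


def word_count_map(doc_id, text):
    # Emit ((term, docId), 1) for every term occurrence.
    # Assumption: terms are lowercase whitespace-separated tokens.
    for term in text.lower().split():
        yield (term, doc_id), 1


def word_count_reduce(key, ones):
    # o = number of occurrences of the term in the document.
    return key, sum(ones)


def phase1(corpus):
    groups = defaultdict(list)  # stands in for the shuffle step
    for doc_id, text in corpus.items():
        for key, one in word_count_map(doc_id, text):
            groups[key].append(one)
    return dict(word_count_reduce(k, v) for k, v in groups.items())


corpus = {"d1": "big data big clusters", "d2": "text data"}
counts = phase1(corpus)
print(counts)
```

Keying on the pair (term, docId) rather than on the term alone keeps the counts of the same word in different documents separate.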
Phase 2: Term Frequency 
Algorithm 2: Term Frequency 
class Mapper 
    method Map( ( term , docId ) , o ) 
        write ( docId , ( term , o ) ) 

class Reducer 
    method Reduce( docId , tuples[ ( term , o ) , … ] ) 
        N = 0 
        for each ( term , o ) ∈ tuples do 
            N = N + o 
        for each ( term , o ) ∈ tuples do 
            write ( ( docId , N ) , ( term , o ) ) 

/* N : the total number of term occurrences in document docId */
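A possible Python rendering of Phase 2 is shown below. It takes the Phase 1 output as a dictionary keyed by (term, docId) and attaches the document total N to every record. The function names and the toy input are assumptions; the in-memory grouping again stands in for the shuffle.

```python
from collections import defaultdict


def tf_map(key, o):
    term, doc_id = key
    # Re-key by document so all of its terms meet in one reducer call.
    yield doc_id, (term, o)


def tf_reduce(doc_id, term_counts):
    # N = total number of term occurrences in the document.
    n_total = sum(o for _, o in term_counts)
    for term, o in term_counts:
        yield (doc_id, n_total), (term, o)


def phase2(counts):
    groups = defaultdict(list)  # stands in for the shuffle step
    for key, o in counts.items():
        for doc_id, value in tf_map(key, o):
            groups[doc_id].append(value)
    return [rec for doc_id, vals in groups.items()
            for rec in tf_reduce(doc_id, vals)]


counts = {("big", "d1"): 2, ("data", "d1"): 1, ("data", "d2"): 1}
records = phase2(counts)
print(records)
```

Carrying N in the key means the next phase can compute o / N without another pass over the document.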
Phase 3: TF-IDF 
Algorithm 3: TF-IDF 
class Mapper 
    method Map( ( docId , N ) , ( term , o ) ) 
        write ( term , ( docId , o , N ) ) 

class Reducer 
    method Reduce( term , tuples[ ( docId , o , N ) , … ] ) 
        n = 0 
        for each ( docId , o , N ) ∈ tuples do 
            n = n + 1 
        idf = log( |D| / ( 1 + n ) ) 
        for each ( docId , o , N ) ∈ tuples do 
            tf = o / N 
            write ( docId , ( term , tf × idf ) ) 

/* |D| is the number of documents in the corpus; n is the number of documents that contain term */ 
/* The IDF denominator is read here as 1 + n */
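Phase 3 might look like this in Python. The IDF denominator in the slide is ambiguous, so the sketch exposes it through a hypothetical `smoothing` parameter: `smoothing=0` gives log(|D| / n), matching the TF-IDF formula with |{d ∈ D : t ∈ d}| in the denominator, and `smoothing=1` gives log(|D| / (1 + n)). The natural logarithm, the function names, and the final grouping into one sparse vector per document are also assumptions.

```python
import math
from collections import defaultdict


def tfidf_map(key, value):
    doc_id, n_total = key
    term, o = value
    # Re-key by term so every document containing it meets in one reducer.
    yield term, (doc_id, o, n_total)


def tfidf_reduce(term, postings, num_docs, smoothing):
    # n = number of documents that contain the term.
    n = len(postings)
    idf = math.log(num_docs / (smoothing + n))
    for doc_id, o, n_total in postings:
        yield doc_id, (term, (o / n_total) * idf)


def phase3(records, num_docs, smoothing):
    groups = defaultdict(list)  # stands in for the shuffle step
    for key, value in records:
        for term, posting in tfidf_map(key, value):
            groups[term].append(posting)
    vectors = defaultdict(dict)  # docId -> {term: tf-idf weight}
    for term, postings in groups.items():
        for doc_id, (t, w) in tfidf_reduce(term, postings, num_docs, smoothing):
            vectors[doc_id][t] = w
    return dict(vectors)


records = [
    (("d1", 3), ("big", 2)),
    (("d1", 3), ("data", 1)),
    (("d2", 1), ("data", 1)),
    (("d3", 1), ("text", 1)),
]
vectors = phase3(records, num_docs=3, smoothing=0)
print(vectors)
```

With `smoothing=1` a term that occurs in every document of a small corpus receives a negative weight, which is worth keeping in mind when testing on toy data.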
Phase 4: Cosine Similarity 
Algorithm 4: Cosine Similarity 
class Mapper 
    method Map( docs ) 
        n = docs.length 
        for i = 0 to n - 2 
            for j = i + 1 to n - 1 
                write ( ( docs[i].id , docs[j].id ) , ( docs[i].tfidf , docs[j].tfidf ) ) 

class Reducer 
    method Reduce( ( docId_A , docId_B ) , ( docA.tfidf , docB.tfidf ) ) 
        A = docA.tfidf 
        B = docB.tfidf 
        cosine = sum( A × B ) / ( sqrt( sum( A^2 ) ) × sqrt( sum( B^2 ) ) ) 
        return ( ( docId_A , docId_B ) , cosine )
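The pairing step and the per-pair reducer could be sketched as below, using sparse TF-IDF vectors stored as dictionaries. The weights in `vectors` are invented toy values, the function names are hypothetical, and returning 0 for an all-zero vector is a convention chosen for this example.

```python
import math
from itertools import combinations


def pairs_map(vectors):
    # Emit each unordered document pair once, with both TF-IDF vectors.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        yield (id_a, id_b), (vec_a, vec_b)


def cosine_reduce(key, value):
    vec_a, vec_b = value
    # Only terms present in both sparse vectors contribute to the dot product.
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    # Assumption: an all-zero vector is given similarity 0.
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return key, cosine


# Hypothetical TF-IDF weights for three documents.
vectors = {
    "d1": {"big": 1.0, "data": 1.0},
    "d2": {"data": 2.0},
    "d3": {"text": 3.0},
}
similarities = dict(cosine_reduce(k, v) for k, v in pairs_map(vectors))
print(similarities)
```

For n documents the mapper emits n(n-1)/2 pairs, so this phase dominates the cost as the corpus grows, which is why distributing the pairs across reducers matters.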
Phase 4: Diagram 
Map output (input to Reduce): 
    Doc1,Doc2: [Doc1 TF-IDF], [Doc2 TF-IDF] 
    Doc1,Doc3: [Doc1 TF-IDF], [Doc3 TF-IDF] 
    Doc1,Doc4: [Doc1 TF-IDF], [Doc4 TF-IDF] 
    Doc4,Doc10: [Doc4 TF-IDF], [Doc10 TF-IDF] 
    DocM,DocN: [DocM TF-IDF], [DocN TF-IDF] 
Reduce output: 
    Doc1,Doc2: Cosine(Doc1, Doc2) 
    Doc1,Doc3: Cosine(Doc1, Doc3) 
    Doc1,Doc4: Cosine(Doc1, Doc4) 
    Doc4,Doc10: Cosine(Doc4, Doc10) 
    DocM,DocN: Cosine(DocM, DocN)
Conclusions & Future Work 
• Finalized proposed method 
• Implementation of the method 
• Experimental tests on real data and computer clusters 
• Deployment of an open-source project 
• Additional implementation using more efficient tools such 
as Apache Spark and Scala 
• Publication of test results
