Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla
Hierarchical
Clustering
Theory
• Origins & definitions
• Methods & considerations
• Hierarchical theory
• Metrics & performance
Practice
• My use case
• Python libraries
• Example
Visualisation
• Static
• Interactive
• Further ideas
All opinions expressed are my own
Who am I?
Attribution: www.alexmaclean.com
Clustering: a recap
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another, based on some notion of similarity.
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
Origins
1930s:
Anthropology
&
Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
Diverse applications
Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
Two main purposes
• Exploratory analysis: a standalone tool (data mining)
• A component of a supervised learning pipeline, in which distinct classifiers or regression models are trained for each cluster (machine learning)
Clustering considerations
• Partitioning criteria (single / multi level)
• Separation (exclusive / non-exclusive)
• Clustering space (full-space / sub-space)
• Similarity measure (distance / connectivity)
Use case: search keywords
[Diagram relating RD, P, KW, CP and CD nodes, with the labels "You", "The competition!" and "Opportunity!"]
CD = Competing domains
CP = Competitor’s pages
RD = Ranking domain
P = Your page
KW = Keyword
….x 100,000 !!
Use case: search keywords
…so we have found 100,000 new KWs. Now what?
How do we summarise and present these to a client?
Clients’ questions…
• Do search categories in general
align with my website structure?
• Which categories of opportunity
keywords have the highest
search volume, bring the most
visitors, revenue etc.?
• Which keywords are not
relevant?
Website-like structure
Requirements
• Need: visual insights; structure
• Allow targeting of problem in hand
• May develop into a semi-supervised solution
Options for text clustering?
• High-dimensional and sparse data set
• Values correspond to word frequencies
• Recommended methods include: hierarchical clustering, k-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
Hierarchical Clustering
bringing structure
2 types
• Agglomerative
• Divisive
Deterministic algorithms!
Attribution: Wikipedia
Agglomerative
Start with many
“singleton” clusters
…
Merge 2 at a time
continuously
…
Build a hierarchy
Divisive
Start with a huge “macro”
cluster
…
Iteratively split into 2
groups
…
Build a hierarchy
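The agglomerative steps above can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not the speaker's implementation: the 1-D points, the single-link distance and the function names are all assumptions made for the example.

```python
from itertools import combinations

def single_link(a, b):
    """Distance between two clusters of 1-D points: their closest pair."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Naive agglomerative clustering; returns the merges in the order made."""
    clusters = [(p,) for p in points]      # start with "singleton" clusters
    merges = []
    while len(clusters) > 1:
        # pick the closest pair of clusters and merge them
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(*pair))
        merges.append((a, b, single_link(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return merges                           # the merge list is the hierarchy

# Hypothetical data, invented for illustration.
merges = agglomerate([1.0, 2.0, 6.0, 7.0, 20.0])
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Reading the merge list from last to first gives the divisive (top-down) view of the same hierarchy.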
Agglomerative method:
Linkage types
• Single link: similarity between the two most similar elements, one from each cluster (nearest neighbour)
• Complete link: similarity between the two most dissimilar elements, one from each cluster (farthest neighbour)
Attribution: https://www.coursera.org/course/clusteranalysis
Agglomerative method:
Linkage types
• Average link: average of the similarities between all inter-cluster pairs. Computationally expensive (Na*Nb comparisons for clusters of sizes Na and Nb)
• Trick: centroid link (similarity between the centroids of the two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
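A small sketch of the four linkage types, written with distances rather than similarities (smaller means more similar). The two clusters and the function names are made up for the example; `math.dist` needs Python 3.8 or later.

```python
from itertools import product
from math import dist

def single(a, b):
    """Single link: distance between the closest inter-cluster pair."""
    return min(dist(p, q) for p, q in product(a, b))

def complete(a, b):
    """Complete link: distance between the farthest inter-cluster pair."""
    return max(dist(p, q) for p, q in product(a, b))

def average(a, b):
    """Average link: mean over all len(a) * len(b) inter-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_link(a, b):
    """Centroid link: one distance computation, whatever the cluster sizes."""
    return dist(centroid(a), centroid(b))

# Hypothetical clusters, invented for illustration.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
for name, link in [("single", single), ("complete", complete),
                   ("average", average), ("centroid", centroid_link)]:
    print(name, round(link(A, B), 3))
```

The average link always falls between the single and complete values, which is one reason it is often treated as a compromise between the two.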
Ward’s criterion
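• Minimise a function: total in-cluster variance
• As defined by, e.g., the sum of squared errors: $\mathrm{SSE} = \sum_{i} \sum_{x \in C_i} \lVert x - m_i \rVert^2$, where $m_i$ is the centroid of cluster $C_i$
• Once clusters $A$ and $B$ are merged, the SSE will increase (the cluster becomes bigger) by: $\Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \lVert m_A - m_B \rVert^2$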
https://en.wikipedia.org/wiki/Ward's_method
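A quick numerical check of Ward's criterion, assuming the standard textbook form of the merge cost (size-weighted squared distance between centroids). The clusters and function names are invented for illustration.

```python
def mean(cluster):
    """Centroid of a cluster of equal-length tuples."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    m = mean(cluster)
    return sum(sum((x - mx) ** 2 for x, mx in zip(p, m)) for p in cluster)

def ward_cost(a, b):
    """Assumed standard form: n_a * n_b / (n_a + n_b) * ||m_a - m_b||^2."""
    ma, mb = mean(a), mean(b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ma, mb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap

# Hypothetical clusters, invented for illustration.
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(5.0, 1.0), (7.0, 1.0), (6.0, 4.0)]

increase = sse(A + B) - sse(A) - sse(B)   # SSE added by merging A and B
print(increase, ward_cost(A, B))          # the two values should agree
```

Because the cost depends only on cluster sizes and centroids, an agglomerative run never needs to revisit the individual points to pick the cheapest merge.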
Divisive clustering
• Top-down approach
• Criterion to split: Ward’s criterion
• Handling noise: Use a threshold to determine
the termination criteria
Attribution: https://www.coursera.org/course/clusteranalysis
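A toy top-down sketch of the idea: split where the SSE drops the most, and stop when the drop falls below a threshold. The slide does not say how candidate splits are found; scanning every cut of sorted 1-D values, and the names `best_split`, `divide` and `min_gain`, are assumptions made for this example.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(values):
    """Try every cut of the sorted values; return (gain, left, right)."""
    values = sorted(values)
    total = sse(values)
    candidates = []
    for i in range(1, len(values)):
        left, right = values[:i], values[i:]
        candidates.append((total - sse(left) - sse(right), left, right))
    return max(candidates, key=lambda c: c[0])

def divide(values, min_gain):
    """Split recursively until the SSE reduction drops below min_gain."""
    if len(values) < 2:
        return sorted(values)
    gain, left, right = best_split(values)
    if gain < min_gain:                 # termination threshold
        return sorted(values)           # keep as a leaf cluster
    return [divide(left, min_gain), divide(right, min_gain)]

# Hypothetical data, invented for illustration.
tree = divide([1.0, 2.0, 6.0, 7.0, 20.0], min_gain=1.0)
print(tree)
```

Raising `min_gain` stops the splitting earlier and leaves coarser leaves, which is how a threshold can keep small noisy groups from being split further.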
Similarity measures
This will certainly influence the shape of the
clusters!
• Numerical: Use a variation of the Minkowski distance (e.g. Manhattan / city block, Euclidean)
• Binary: Manhattan, Jaccard coefficient, Hamming
• Text: Cosine similarity.
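The numerical and binary measures listed above, sketched in plain Python. The vectors are invented for illustration and the helper names are not from any library.

```python
def manhattan(u, v):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def jaccard(u, v):
    """Jaccard coefficient of binary vectors: shared ones / ones in either."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 1.0

# Hypothetical binary vectors, invented for illustration.
x = (1, 0, 1, 1, 0)
y = (1, 1, 0, 1, 0)
print(manhattan(x, y), euclidean(x, y), hamming(x, y), jaccard(x, y))
```

On binary vectors the Manhattan and Hamming distances coincide, while Jaccard ignores positions where both vectors are zero. That difference matters for sparse data.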
Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word/ topic/ phrase)
If d1 and d2 are two term vectors, we can calculate the similarity between them as $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
Attribution: https://www.coursera.org/course/clusteranalysis
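A minimal bag-of-terms sketch of cosine similarity, using whitespace tokenisation and raw term counts. Both are simplifying assumptions, and the keyword phrases are invented.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc1, doc2):
    """Cosine of the angle between two term-frequency vectors."""
    d1 = Counter(doc1.lower().split())
    d2 = Counter(doc2.lower().split())
    dot = sum(d1[t] * d2[t] for t in d1.keys() & d2.keys())
    norm1 = sqrt(sum(c * c for c in d1.values()))
    norm2 = sqrt(sum(c * c for c in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0                      # an empty document matches nothing
    return dot / (norm1 * norm2)

# Hypothetical keyword phrases, invented for illustration.
a = "cheap flights to london"
b = "cheap hotels in london"
c = "python clustering tutorial"

print(cosine_similarity(a, b))   # two of four terms shared -> 0.5
print(cosine_similarity(a, c))   # no shared terms -> 0.0
```

The score depends only on the direction of the vectors, so a long and a short document with the same term proportions still score 1.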
Gather word documents = keyword phrases
Aggregate search words with URL “words”
Text clustering:
preparations
• Add features where possible
o I added URL words to my word set
• Stem words
o Choose the right stemmer – too severe can be bad
• Stop words
o NLTK tokeniser
o Scikit learn TF-IDF tokeniser
• Low frequency cut-off
o 2 => drop words appearing fewer than twice in the whole corpus
• High frequency cut-off
o 0.5 => drop words that appear in more than 50% of documents
• N-grams
o Single words, bi-grams, tri-grams
• Beware of foreign languages
o Separate datasets if possible
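Several of the preparation steps above map onto parameters of scikit-learn's `TfidfVectorizer`. This is a sketch with invented keyword phrases, not the speaker's pipeline: stemming and URL words are left out, and note that `min_df` counts documents containing a term rather than occurrences in the whole corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical keyword phrases, invented for illustration.
docs = [
    "cheap flights to london",
    "cheap flights to paris",
    "london hotels near the station",
    "paris hotels with pool",
    "python clustering tutorial",
    "hierarchical clustering in python",
]

vectorizer = TfidfVectorizer(
    stop_words="english",   # built-in English stop word list
    min_df=2,               # low cut-off: keep terms found in >= 2 documents
    max_df=0.5,             # high cut-off: drop terms in > 50% of documents
    ngram_range=(1, 3),     # single words, bi-grams, tri-grams
)
X = vectorizer.fit_transform(docs)      # sparse TF-IDF matrix, one row per doc

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

With so few documents most n-grams fall below the low frequency cut-off. On a real keyword set the vocabulary would be far larger, which is where the cut-offs start to pay off.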
Text preparation
Dimensionality
• Get a sparse matrix
o Mostly zeros
• Reduce the number of dimensions
o PCA
o Spectral clustering
• The “curse” of dimensionality
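A sketch of reducing a sparse matrix to a handful of dense dimensions. It swaps in scikit-learn's `TruncatedSVD`, which accepts sparse input directly, in place of the PCA named on the slide; the random matrix and the choice of 10 components are assumptions for illustration.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a document-term matrix: 100 docs, 500 terms,
# about 1% of the entries non-zero.
X = sparse_random(100, 500, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)        # dense array, one row per document

print(X.shape, "->", X_reduced.shape)
print("share of non-zero entries before:", X.nnz / (X.shape[0] * X.shape[1]))
```

The reduced matrix is small and dense, so it can be handed to a distance-based clustering routine without the memory cost of a dense 500-column matrix.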
Results: reduced dimensions
The
dendrogram
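One way to build a dendrogram in Python is with SciPy's `scipy.cluster.hierarchy` module. The sketch below is an assumption about tooling, not necessarily what produced the original figure; the points and labels are invented, and `no_plot=True` returns the layout without drawing, so no plotting library is needed.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical 1-D observations, invented for illustration.
points = np.array([[1.0], [2.0], [6.0], [7.0], [20.0]])
names = ["a", "b", "c", "d", "e"]

Z = linkage(points, method="ward")      # (n - 1) x 4 merge table
tree = dendrogram(Z, labels=names, no_plot=True)
flat = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters

print(tree["ivl"])    # leaf labels in dendrogram order
print(flat)           # flat cluster id per observation
```

Cutting the same linkage table at different levels gives coarser or finer clusterings without re-running the clustering.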
Assess the quality of your
clusters
• External (need ground-truth labels): purity, completeness & homogeneity, Adjusted Rand index, Normalised Mutual Information
• Internal (no labels needed): e.g. silhouette coefficient
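A sketch of scoring a clustering against a reference labelling with `sklearn.metrics`. The label lists are invented, and all four scores shown here need ground-truth labels.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    normalized_mutual_info_score,
)

# Hypothetical labels, invented for illustration.
truth = [0, 0, 0, 1, 1, 1]
same_groups = [1, 1, 1, 0, 0, 0]     # identical grouping, ids swapped
over_split = [0, 0, 1, 1, 2, 2]      # three clusters, the middle one is mixed

for name, pred in [("same_groups", same_groups), ("over_split", over_split)]:
    print(
        name,
        round(adjusted_rand_score(truth, pred), 3),
        round(homogeneity_score(truth, pred), 3),
        round(completeness_score(truth, pred), 3),
        round(normalized_mutual_info_score(truth, pred), 3),
    )
```

The scores ignore the actual cluster ids, so swapping ids leaves them unchanged; only the grouping matters.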
Topic labelling
Hierarchical Clustering
Beyond Python (!?)
Life on the inside:
Elasticsearch
• Why not perform pre-processing and clustering inside Elasticsearch?
• Document store
• TF-IDF and other
• Stop words
• Language specific analysers
Elasticsearch
- try it ! -
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up
Document storage in ES
Lingo 3G algorithm
• Lingo 3G: Hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html
Elasticsearch with
clustering – Utopia?
Carrot2’s Lingo3G in action :
http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for
large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html
Limitations of hierarchical
clustering
• Can’t undo what’s done: the divisive method works on sub-clusters and cannot re-merge them, and the agglomerative method never splits a cluster again once it is merged
• Every split or merge is final, so each one must be chosen carefully
• Methods may not scale well: checking all possible pairs makes the complexity grow quickly
There are extensions: BIRCH, CURE and
CHAMELEON
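A back-of-the-envelope look at the scaling problem: a full pairwise distance table needs n(n-1)/2 entries. The sketch assumes 8 bytes per stored distance (a 64-bit float), an assumption about storage rather than a property of any particular library.

```python
def n_pairs(n):
    """Number of distinct pairs among n items."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    pairs = n_pairs(n)
    gib = pairs * 8 / 2**30             # assuming one 64-bit float per pair
    print(f"{n:>7} items: {pairs:>13,} pairs, about {gib:.2f} GiB")
```

At the 100,000 keywords of the use case, the table alone runs to tens of gibibytes, which is why sampling, dimensionality reduction or the extensions named above become necessary.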
Thank you!
A decent introductory course to clustering:
https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with Carrot2: http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / frank.kelly@cantab.net
Attribution: http://wynway.com/
Extra slide: Why work
inside the database?
1. Sharing data (management of)
Support concurrent access by multiple readers and writers
2. Data Model Enforcement
Make sure all applications see clean, organised data
3. Scale
Work with datasets too large to fit in memory (over a certain size,
need specialised algorithms to deal with the data -> bottleneck)
The database organises and exposes algorithms for you
conveniently
4. Flexibility
Use the data in new, unanticipated ways -> anticipate a broad set
of ways of accessing the data


Editor's Notes

  • #7: http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
  • #38: https://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png