Clustering for New Discovery in Data
Houston Machine Learning Meetup
2
SCR©
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning
– Convolutional neural network
– Train deep nets with open-source tools
Roadmap: Application
• Business analytics
• Recommendation system
• Natural language processing
• Computer vision
• Energy industry
Agenda
• Introduction
• Application of clustering
• K-means
• DBSCAN
• Cluster validation
What is clustering
Clustering: discovering the natural groupings of a set of objects or patterns in
unlabeled data
Application: Recommendation
Application: Document Clustering
https://www.noggle.online/knowledgebase/document-clustering/
Application: Pizza Hut Center
Delivery locations
Application: Discovering Gene functions
Important for discovering diseases
and treatments
Clustering Algorithm
• K-Means (King of clustering, many variants)
• DBSCAN (group neighboring points)
• Mean shift (locating the maxima of density)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means can be viewed as a hard-assignment special case of EM)
• Latent Dirichlet Allocation (natural language processing)
……
• K-Means
• DBSCAN
Cluster Validation
Cluster Validity
• In cluster analysis, the question is how to evaluate the
“goodness” of the resulting clusters.
• Why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
– External: Used to measure the extent to which cluster labels match
externally supplied class labels.
• Entropy
– Internal: Used to measure the goodness of a clustering structure without
respect to external information.
• Sum of Squared Error (SSE)
– Relative: Used to compare two different clusterings.
• Often an external or internal measure is used for this purpose, e.g., SSE or entropy
• Visualization
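The external entropy measure listed above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `cluster_entropy` and the toy labels are assumptions, and the score is taken to be the size-weighted average of the class entropy inside each cluster (lower is better).

```python
import math
from collections import Counter, defaultdict

def cluster_entropy(cluster_labels, class_labels):
    """Size-weighted average entropy (in bits) of the class mix inside each cluster."""
    members = defaultdict(list)
    for cluster, cls in zip(cluster_labels, class_labels):
        members[cluster].append(cls)
    n = len(cluster_labels)
    total = 0.0
    for classes in members.values():
        size = len(classes)
        # Entropy of the class distribution within this one cluster.
        h = -sum((c / size) * math.log2(c / size) for c in Counter(classes).values())
        total += (size / n) * h
    return total

# Hypothetical toy data: the first clustering matches the classes, the second mixes them.
pure = cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = cluster_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
print(pure, mixed)
```

A clustering whose clusters each contain a single class scores 0, while clusters that split two classes evenly score 1 bit.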
Internal Measures: WSS and BSS
• Cluster cohesion: measures how closely related the objects in a
cluster are
– Example: SSE
• Cluster separation: measures how distinct or well separated a
cluster is from other clusters
• Example: squared error
– Cohesion is measured by the within-cluster sum of squares (WSS, which is the SSE)
– Separation is measured by the between-cluster sum of squares (BSS)
– Here |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean
$$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$
$$\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2$$
Internal Measures: WSS and BSS
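• Example: SSE
– BSS + WSS = constant
Data points 1, 2, 4, 5 on a number line, with overall mean m = 3 and cluster centroids m_1 = 1.5 and m_2 = 4.5.
K=1 cluster:
$$\mathrm{WSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$\mathrm{BSS} = 4 \times (3-3)^2 = 0$$
$$\mathrm{Total} = 10 + 0 = 10$$
K=2 clusters:
$$\mathrm{WSS} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$\mathrm{BSS} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
$$\mathrm{Total} = 1 + 9 = 10$$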
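The arithmetic of this example can be checked with a short script. It is a sketch under the assumption that the data are the four points 1, 2, 4, 5 read from the example; the helper names `mean`, `wss` and `bss` are illustrative and do not come from the slides.

```python
def mean(values):
    return sum(values) / len(values)

def wss(clusters):
    """Within-cluster sum of squares: squared distance of each point to its own centroid."""
    return sum((x - mean(c)) ** 2 for c in clusters for x in c)

def bss(clusters):
    """Between-cluster sum of squares: size-weighted squared distance of each centroid to the overall mean."""
    overall = mean([x for c in clusters for x in c])
    return sum(len(c) * (overall - mean(c)) ** 2 for c in clusters)

# Assumed data: the same four points grouped as one cluster and as two clusters.
one_cluster = [[1, 2, 4, 5]]
two_clusters = [[1, 2], [4, 5]]
for name, clusters in [("K=1", one_cluster), ("K=2", two_clusters)]:
    print(name, "WSS =", wss(clusters), "BSS =", bss(clusters))
```

Both groupings give WSS + BSS = 10: splitting the data moves variation from the within-cluster term to the between-cluster term without changing the total.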
Internal Measures: WSS and BSS
• Can be used to estimate the number of clusters
[Figure: scatter plot of an example data set, and a plot of SSE (WSS) against the number of clusters K]
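One way to use SSE for choosing the number of clusters is to run the clustering for several values of K and look for the point where the curve flattens. The sketch below is only an illustration: it uses a plain 1-D k-means with a hypothetical deterministic initialisation (evenly spaced order statistics) and made-up data with three obvious groups.

```python
def kmeans_1d(data, k, iterations=20):
    """Plain Lloyd iterations on 1-D data; returns (centroids, sse)."""
    pts = sorted(data)
    n = len(pts)
    # Hypothetical deterministic initialisation: k evenly spaced order statistics.
    centroids = [pts[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            groups[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[j] for j, g in enumerate(groups)]
    sse = sum(min((x - c) ** 2 for c in centroids) for x in pts)
    return centroids, sse

# Made-up data with three well-separated groups.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]
sse_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

For this toy data the SSE falls steeply until K = 3 and changes very little afterwards, which is the "elbow" that suggests three clusters.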
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weights of all links within a
cluster.
• Cluster separation is the sum of the weights of the links between nodes in the
cluster and nodes outside the cluster.
[Figure: two proximity graphs, one illustrating cohesion and one illustrating separation]
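The two graph-based definitions above can be sketched as follows. The edge-weight dictionary and the node names are hypothetical; the graph is assumed to be undirected with each edge listed once.

```python
def cohesion(weights, cluster):
    """Sum of the weights of edges with both endpoints inside the cluster."""
    return sum(w for (u, v), w in weights.items() if u in cluster and v in cluster)

def separation(weights, cluster):
    """Sum of the weights of edges with exactly one endpoint inside the cluster."""
    return sum(w for (u, v), w in weights.items() if (u in cluster) != (v in cluster))

# Hypothetical undirected proximity graph: each edge appears once with its weight.
weights = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.7,
           ("c", "d"): 0.1, ("d", "e"): 0.6}
cluster = {"a", "b", "c"}
print(cohesion(weights, cluster), separation(weights, cluster))
```

Here the three internal links contribute to cohesion, the single link from c to d contributes to separation, and the link between d and e counts toward neither because both endpoints lie outside the cluster.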
Correlation between affinity matrix and
incidence matrix
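• Given the affinity (distance) matrix D = {d_11, d_12, …, d_nn} and the
incidence matrix C = {c_11, c_12, …, c_nn} from the clustering, where c_ij = 1 if points i and j are in the same cluster and 0 otherwise
• The correlation r between D and C is given by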
$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2} \, \sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$
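The correlation between the distance matrix and the incidence matrix can be computed directly from its definition. The sketch below assumes 1-D points with absolute difference as the distance, and an incidence entry of 1 when two points share a cluster label; the function name and the toy data are illustrative.

```python
import math

def incidence_correlation(points, labels):
    """Pearson correlation between distance entries d_ij and incidence entries c_ij."""
    n = len(points)
    # Flatten both matrices over all pairs (i, j), diagonal included.
    d = [abs(points[i] - points[j]) for i in range(n) for j in range(n)]
    c = [1.0 if labels[i] == labels[j] else 0.0 for i in range(n) for j in range(n)]
    d_bar = sum(d) / len(d)
    c_bar = sum(c) / len(c)
    num = sum((dij - d_bar) * (cij - c_bar) for dij, cij in zip(d, c))
    den = (math.sqrt(sum((dij - d_bar) ** 2 for dij in d))
           * math.sqrt(sum((cij - c_bar) ** 2 for cij in c)))
    return num / den  # undefined (division by zero) if all points share one label

# Hypothetical data: the first labelling follows the two groups, the second ignores them.
good = incidence_correlation([0, 1, 10, 11], [0, 0, 1, 1])
bad = incidence_correlation([0, 1, 10, 11], [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Because D holds distances, a clustering that matches the structure of the data gives a strongly negative r (same-cluster pairs have small distances), while a labelling unrelated to the distances gives r close to 0.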
Correlation with Incidence matrix
[Figure: two scatter plots of x versus y on axes from 0 to 1; one panel has r = -0.9235 and the other r = -0.5810]
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and
inspect visually.
[Figure: scatter plot of the data (x versus y, axes from 0 to 1) next to the similarity matrix (Points by Points, 100 points, similarity scale from 0 to 1) ordered by cluster label]
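Ordering the similarity matrix by cluster label takes only a few lines. This sketch assumes 1-D points and a similarity defined as one minus the distance divided by the largest distance; both the function name and that similarity definition are assumptions made for illustration.

```python
def sorted_similarity(points, labels):
    """Return the similarity matrix with rows and columns ordered by cluster label."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    dist = [[abs(points[i] - points[j]) for j in order] for i in order]
    d_max = max(max(row) for row in dist) or 1.0
    # Assumed similarity: 1 - normalised distance, so identical points score 1.
    return [[1.0 - d / d_max for d in row] for row in dist]

# Hypothetical data: two groups whose members are interleaved in the input order.
points = [0.0, 10.0, 1.0, 11.0]
labels = [0, 1, 0, 1]
sim = sorted_similarity(points, labels)
for row in sim:
    print(" ".join(f"{s:.2f}" for s in row))
```

After sorting, well-separated clusters appear as bright square blocks along the diagonal, which is the pattern to look for when inspecting the matrix visually.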
Visualization of similarity matrix
• Clusters in random data are not so crisp
[Figure: similarity matrix (Points by Points, 100 points, similarity scale from 0 to 1) ordered by cluster label, next to a scatter plot of the random data (x versus y, axes from 0 to 1)]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part
of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art
accessible only to those true believers who have experience and great
courage.”
Algorithms for Clustering Data, Jain and Dubes
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Hierarchical clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning - Yan
– Convolutional neural network
– Train deep nets with open-source tools
Thank you
Slides will be posted on SlideShare:
http://www.slideshare.net/xuyangela
