Clustering
• Clustering is an unsupervised learning method: there is no target value
(class label) to be predicted; the goal is to find common patterns or to
group similar examples.
• Differences between models/algorithms for clustering:
– Conceptual (model-based) vs. partitioning
– Exclusive vs. overlapping
– Deterministic vs. probabilistic
– Hierarchical vs. flat
– Incremental vs. batch learning
• Evaluating clustering quality: subjective approaches, objective functions
(e.g. category utility, entropy).
• Major approaches:
– Cluster/2: flat, conceptual (model-based), batch learning, possibly
overlapping, deterministic.
– Partitioning methods: flat, batch learning, exclusive, deterministic
or probabilistic. Algorithms: k-means, probability-based clustering
(EM)
– Hierarchical clustering
∗ Partitioning: agglomerative (bottom-up) or divisive (top-down).
∗ Conceptual: Cobweb, category utility function.
1 CLUSTER/2
• One of the first conceptual clustering approaches [Michalski, 83].
• Works as a meta-learning scheme – it uses a learning algorithm in its inner
loop to form categories.
• Has no practical value, but introduces important ideas and techniques
used in current approaches to conceptual clustering.
The CLUSTER/2 algorithm forms k categories by grouping the observed
objects around k seed objects. It works as follows:
1. Select k objects (seeds) from the set of observed objects (randomly or
using some selection function).
2. For each seed, using it as a positive example and all the other seeds as
negative examples, find a maximally general description that covers all
positive and none of the negative examples.
3. Classify all objects from the sample into categories according to these
descriptions. Then replace each maximally general description with a
maximally specific one that covers all objects in the category. (This
possibly avoids category overlapping.)
4. If there are still overlapping categories, then using some metric (e.g. eu-
clidean distance) find central objects in each category and repeat steps
1-3 using these objects as seeds.
5. Stop when some quality criterion for the category descriptions is satisfied.
Such a criterion might be the complexity of the descriptions (e.g. the
number of conjuncts).
6. If there is no improvement of the categories after several steps, then
choose new seeds using another criterion (e.g. the objects near the edge
of the category).
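To make the control flow concrete, here is a schematic Python sketch of steps 1–5. It is not Michalski's algorithm: maximally_general_description is a hypothetical stand-in for the concept learner used in the inner loop, and the quality criterion is reduced to "the seeds no longer change".

```python
# Schematic sketch of the CLUSTER/2 loop on objects given as attribute dicts.
import random

def maximally_general_description(seed, negative_seeds):
    # Stand-in learner: keep only the seed's attribute values that no negative seed has.
    return {a: v for a, v in seed.items()
            if all(neg.get(a) != v for neg in negative_seeds)}

def covers(description, obj):
    return all(obj.get(a) == v for a, v in description.items())

def similarity(obj, group):
    # Number of attribute values obj shares with members of the group.
    return sum(obj[a] == member.get(a) for member in group for a in obj)

def cluster2(objects, k, max_iter=10, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(objects, k)                      # step 1: pick k seeds
    for _ in range(max_iter):
        descriptions = [maximally_general_description(s, [t for t in seeds if t is not s])
                        for s in seeds]                 # step 2
        categories = [[] for _ in range(k)]             # step 3: classify the sample
        for obj in objects:
            candidates = [i for i, d in enumerate(descriptions) if covers(d, obj)] or range(k)
            categories[max(candidates, key=lambda i: similarity(obj, [seeds[i]]))].append(obj)
        # step 4: the central object of each category becomes the next seed
        new_seeds = [max(cat, key=lambda o: similarity(o, cat)) if cat else s
                     for cat, s in zip(categories, seeds)]
        if new_seeds == seeds:                          # step 5 (stub): no change, stop
            return categories
        seeds = new_seeds
    return categories
```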
2 Partitioning methods – k-means
• Iterative distance-based clustering.
• Used by statisticians for decades.
• Similarly to Cluster/2, it uses k seeds (k is predefined), but it is based on
a distance measure:
1. Select k instances (cluster centers) from the sample (usually at ran-
dom).
2. Assign instances to clusters according to their distance to the cluster
centers.
3. Find new cluster centers and go to step 2 until the process converges
(i.e. the same instances are assigned to each cluster in two consecutive
passes).
• The clustering depends greatly on the initial choice of cluster centers –
the algorithm may fall into a local minimum.
• Example of a bad choice of cluster centers: four instances at the vertices of
a rectangle, with the two initial cluster centers at the midpoints of the long
sides of the rectangle. This is a stable configuration, but not a good clustering.
• Solution to the local minimum problem: restart the algorithm with an-
other set of cluster centers.
• Hierarchical k-means: apply k = 2 recursively to the resulting clusters.
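A minimal sketch of steps 1–3 (NumPy for brevity; in practice a library implementation such as Weka's SimpleKMeans would normally be used):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign each instance to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                             # converged: same assignment twice
        labels = new_labels
        # step 3: recompute each center as the mean of its cluster
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# The rectangle example above: restarting with another `seed` value can escape
# a stable but poor configuration (the local-minimum problem).
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 1.0], [4.0, 1.0]])
print(kmeans(X, k=2))
```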
3 Probability-based clustering
Why probabilities?
• Restricted amount of evidence implies probabilistic reasoning.
• From a probabilistic perspective, we want to find the most likely clusters
given the data.
• An instance has only a certain probability of belonging to a particular
cluster.
4 Probability-based clustering – mixture models
• For a single attribute: three parameters - mean, standard deviation and
sampling probability.
• Each cluster A is defined by a mean (µA) and a standard deviation (σA).
• Samples are taken from each cluster A with a specified probability of
sampling P(A).
• Finite mixture problem: given a dataset, find the mean, standard devia-
tion and the probability of sampling for each cluster.
• If we know the classification of each instance, then:
– mean (average): µ = (1/n) ∑_{i=1}^{n} x_i ;
– standard deviation: σ² = (1/(n−1)) ∑_{i=1}^{n} (x_i − µ)² ;
– probability of sampling for class A: P(A) = proportion of instances in it.
• If we know the three parameters, the probability that an instance x
belongs to cluster A is:
P(A|x) = P(x|A) P(A) / P(x),
where P(x|A) is the density function for A:
f(x; µ_A, σ_A) = 1 / (√(2π) σ_A) · exp( −(x − µ_A)² / (2σ_A²) ).
P(x) is not necessary as we calculate the numerators for all clusters and
normalize them by dividing by their sum.
⇒ In fact, this is exactly the Naive Bayes approach.
• For more attributes: naive Bayes assumption – independence between
attributes. The joint probability of an instance is calculated as the
product of the probabilities of its attribute values.
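As a small illustration of the computation above, the sketch below evaluates P(A|x) for one numeric attribute, given assumed (made-up) values for each cluster's mean, standard deviation and sampling probability:

```python
import math

def gaussian_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def membership_probabilities(x, clusters):
    """clusters: dict name -> (mu, sigma, prior P(A)). Returns P(A|x) per cluster."""
    numerators = {name: gaussian_density(x, mu, sigma) * prior
                  for name, (mu, sigma, prior) in clusters.items()}
    total = sum(numerators.values())          # normalizing replaces P(x)
    return {name: num / total for name, num in numerators.items()}

# Two made-up clusters A and B.
clusters = {"A": (50.0, 5.0, 0.6), "B": (65.0, 2.0, 0.4)}
print(membership_probabilities(57.0, clusters))
```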
5 EM (expectation maximization)
• Similarly to k-means, first select the cluster parameters (µA, σA and
P(A)) or guess the classes of the instances, then iterate.
• Adjustment needed: we know cluster probabilities, not actual clusters for
each instance. So, we use these probabilities as weights.
• For cluster A:
µ_A = ( ∑_{i=1}^{n} w_i x_i ) / ( ∑_{i=1}^{n} w_i ), where w_i is the probability that x_i belongs to cluster A;
σ_A² = ( ∑_{i=1}^{n} w_i (x_i − µ_A)² ) / ( ∑_{i=1}^{n} w_i ).
• When to stop iterating: maximize the overall likelihood that the data come
from the dataset with the given parameters ("goodness" of clustering):
log-likelihood = ∑_i log( ∑_A P(A) P(x_i|A) )
Stop when the difference between two successive iterations becomes neg-
ligible (i.e. there is no improvement of the clustering quality).
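A compact sketch of this EM loop for a one-dimensional mixture, with made-up data and starting parameters (clusters are given as [mu, sigma, P(A)]):

```python
import math

def density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em(xs, params, max_iter=100, tol=1e-6):
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: w[i][a] = P(cluster a | x_i), numerators normalized per instance
        weights = []
        for x in xs:
            nums = [prior * density(x, mu, sigma) for mu, sigma, prior in params]
            total = sum(nums)
            weights.append([num / total for num in nums])
        # M-step: weighted estimates of mu_A, sigma_A and P(A)
        for a in range(len(params)):
            w = [row[a] for row in weights]
            sw = sum(w)
            mu = sum(wi * x for wi, x in zip(w, xs)) / sw
            sigma = max(math.sqrt(sum(wi * (x - mu) ** 2 for wi, x in zip(w, xs)) / sw), 1e-6)
            params[a] = [mu, sigma, sw / len(xs)]
        # stop when the log-likelihood improvement becomes negligible
        ll = sum(math.log(sum(p * density(x, m, s) for m, s, p in params)) for x in xs)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return params

xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
print(em(xs, params=[[0.0, 1.0, 0.5], [6.0, 1.0, 0.5]]))
```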
6 Criterion functions for clustering
1. Distance-based functions.
• Sum of squared error:
m_A = (1/n_A) ∑_{x∈A} x ,   J = ∑_A ∑_{x∈A} ||x − m_A||² ,
where n_A is the number of instances in cluster A.
• Optimal clustering minimizes J: minimal variance clustering.
2. Probability (entropy) based functions.
• Probability of an instance: P(x_i) = ∑_A P(A) P(x_i|A)
• Probability of the sample x_1, ..., x_n: ∏_{i=1}^{n} ( ∑_A P(A) P(x_i|A) )
• Log-likelihood: ∑_{i=1}^{n} log( ∑_A P(A) P(x_i|A) )
3. Category utility (Cobweb):
CU = (1/n) ∑_C P(C) ∑_A ∑_v [ P(A = v|C)² − P(A = v)² ]
4. Error-based evaluation: evaluate clusters with respect to classes using
preclassified examples (Classes to clusters evaluation mode in Weka).
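Small NumPy sketches of the first two criterion functions (the category utility function is implemented later, in the category utility section):

```python
import numpy as np

def sum_of_squared_error(clusters):
    """clusters: list of (n_A, d) arrays; J = sum_A sum_{x in A} ||x - m_A||^2."""
    return sum(((A - A.mean(axis=0)) ** 2).sum() for A in clusters)

def log_likelihood(xs, mus, sigmas, priors):
    """One-dimensional mixture: sum_i log( sum_A P(A) P(x_i|A) )."""
    xs = np.asarray(xs, dtype=float)[:, None]
    dens = np.exp(-(xs - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    return float(np.log((dens * priors).sum(axis=1)).sum())

clusters = [np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[5.0, 5.0], [6.0, 5.0]])]
print(sum_of_squared_error(clusters))                 # J for a 2-cluster split
print(log_likelihood([1.0, 5.2], mus=np.array([1.0, 5.0]),
                     sigmas=np.array([1.0, 1.0]), priors=np.array([0.5, 0.5])))
```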
7 Hierarchical partitioning methods
• Bottom-up (agglomerative): at each step merge the two closest clusters.
• Top-down (divisive): split the current set into two clusters and proceed
recursively with the subsets.
• Distance function between instances (e.g. Euclidean distance).
• Distance function between clusters (e.g. distance between centers, mini-
mal distance, average distance).
• Criteria for stopping merging or splitting:
– desired number of clusters;
– distance between the closest clusters is above (bottom-up) or below
(top-down) a threshold.
• Algorithms:
– Nearest neighbor (single-linkage) agglomerative clustering: cluster
distance = minimal distance between elements. Merging stops when
distance > threshold. In fact, this is an algorithm for generating a
minimal spanning tree.
– Farthest neighbor (complete-linkage) agglomerative clustering: cluster
distance = maximal distance between elements. Merging stops when
distance > threshold. The algorithm computes the complete subgraph
for every cluster.
• Problems: greedy algorithm (may reach only a local minimum); once created,
a subtree cannot be restructured.
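A naive sketch of single- and complete-linkage agglomerative clustering with a distance threshold (quadratic and meant only to illustrate the merge loop, not an efficient implementation):

```python
def agglomerate(points, threshold, linkage="complete", dist=None):
    dist = dist or (lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)
    link = max if linkage == "complete" else min      # complete vs. single linkage
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # cluster distance: max (complete) or min (single) pairwise element distance
        pairs = [(link(dist(a, b) for a in ci for b in cj), i, j)
                 for i, ci in enumerate(clusters)
                 for j, cj in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        if d > threshold:                             # merging stops when distance > threshold
            break
        clusters[i] = clusters[i] + clusters[j]       # merge the two closest clusters
        del clusters[j]
    return clusters

print(agglomerate([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], threshold=3.0))
```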
8 Example of agglomerative hierarchical clustering
8.1 Data
Day Outlook Temperature Humidity Wind Play
1 sunny hot high weak no
2 sunny hot high strong no
3 overcast hot high weak yes
4 rain mild high weak yes
5 rain cool normal weak yes
6 rain cool normal strong no
7 overcast cool normal strong yes
8 sunny mild high weak no
9 sunny cool normal weak yes
10 rain mild normal weak yes
11 sunny mild normal strong yes
12 overcast mild high strong yes
13 overcast hot normal weak yes
14 rain mild high strong no
8.2 Farthest neighbor (complete-linkage) agglomerative clustering,
threshold = 3
Resulting dendrogram (indentation shows the merge structure):

+ root
  + {1, 2, 4, 8, 12, 14}
    + {4, 8, 12, 14}
      + 4 - yes, 8 - no
      + 12 - yes, 14 - no
    + 1 - no, 2 - no
  + {3, 5, 10, 13}
    + 3 - yes, 13 - yes
    + 5 - yes, 10 - yes
  + {6, 7, 9, 11}
    + 9 - yes, 11 - yes
    + 6 - no, 7 - yes
Classes to clusters evaluation for the three top-level clusters:
• {4, 8, 12, 14, 1, 2} – majority class=no, 2 incorrectly classified instances.
• {3, 13, 5, 10} – majority class=yes, 0 incorrectly classified instances.
• {9, 11, 6, 7} – majority class=yes, 1 incorrectly classified instance.
Total number of incorrectly classified instances = 3, Error = 3/14
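The error above can be checked with a few lines of Python (cluster contents and Play labels copied from the table and dendrogram):

```python
from collections import Counter

play = {1: "no", 2: "no", 3: "yes", 4: "yes", 5: "yes", 6: "no", 7: "yes",
        8: "no", 9: "yes", 10: "yes", 11: "yes", 12: "yes", 13: "yes", 14: "no"}
clusters = [{4, 8, 12, 14, 1, 2}, {3, 13, 5, 10}, {9, 11, 6, 7}]

# Each cluster is labelled with its majority class; the rest count as errors.
errors = sum(len(c) - Counter(play[d] for d in c).most_common(1)[0][1] for c in clusters)
print(errors, f"error = {errors}/{len(play)}")   # 3, error = 3/14
```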
9 Hierarchical conceptual clustering: Cobweb
• Incremental clustering algorithm, which builds a taxonomy of clusters
without having a predefined number of clusters.
• The clusters are represented probabilistically, by the conditional probability
P(A = v|C) with which attribute A takes value v, given that the instance
belongs to class C.
• The algorithm starts with an empty root node.
• Instances are added one by one.
• For each instance the following options (operators) are considered:
– classifying the instance into an existing class;
– creating a new class and placing the instance into it;
– combining two classes into a single class (merging) and placing the
new instance in the resulting hierarchy;
– dividing a class into two classes (splitting) and placing the new in-
stance in the resulting hierarchy.
• The algorithm searches the space of possible hierarchies by applying the
above operators and an evaluation function based on the category utility.
10 Measuring quality of clustering – Category utility (CU) func-
tion
• CU attempts to maximize both the probability that two instances in the
same category have attribute values in common and the probability that
instances from different categories have different attribute values.
CU = ∑_C ∑_A ∑_v P(A = v) P(A = v|C) P(C|A = v)
• P(A = v|C) is the probability that an instance has value v for its at-
tribute A, given that it belongs to category C. The higher this probabil-
ity, the more likely two instances in a category share the same attribute
values.
• P(C|A = v) is the probability that an instance belongs to category C,
given that it has value v for its attribute A. The greater this probability,
the less likely instances from different categories will have attribute values
in common.
• P(A = v) is a weight, ensuring that frequently occurring attribute values
have a stronger influence on the evaluation.
11 Category utility
• After applying Bayes rule we get
CU = ∑_C ∑_A ∑_v P(C) P(A = v|C)²
• ∑_A ∑_v P(A = v|C)² is the expected number of attribute values that one
can correctly guess for an arbitrary member of class C. This expectation
assumes a probability-matching strategy, in which one guesses an attribute
value with a probability equal to its probability of occurring.
• Without knowing the cluster structure, the corresponding term is
∑_A ∑_v P(A = v)².
• The final CU is defined as the increase in the expected number of at-
tribute values that can be correctly guessed, given a set of n categories,
over the expected number of correct guesses without such knowledge.
That is:
CU = (1/n) ∑_C P(C) ∑_A ∑_v [ P(A = v|C)² − P(A = v)² ]
• The expression is divided by n, the number of categories, to allow comparing
clusterings of different sizes.
• Handling numeric attributes (Classit): assuming normal distribution and
using probability density function (based on mean and standard devia-
tion).
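A direct implementation sketch of this final CU formula for nominal attributes, illustrated on the small cell data from the example below (the two partitions passed at the end are arbitrary and only show the call):

```python
from collections import Counter

def category_utility(clustering):
    """clustering: list of clusters; each cluster is a list of attribute dicts."""
    data = [x for cluster in clustering for x in cluster]
    n_total = len(data)
    attrs = data[0].keys()
    # sum over A and v of P(A = v)^2 : expected correct guesses without clusters
    base = sum((cnt / n_total) ** 2
               for a in attrs for cnt in Counter(x[a] for x in data).values())
    cu = 0.0
    for cluster in clustering:
        p_c = len(cluster) / n_total
        # sum over A and v of P(A = v|C)^2 : expected correct guesses within C
        within = sum((cnt / len(cluster)) ** 2
                     for a in attrs for cnt in Counter(x[a] for x in cluster).values())
        cu += p_c * (within - base)
    return cu / len(clustering)            # divide by n, the number of categories

cells = [{"color": "white", "nuclei": 1, "tails": 1},
         {"color": "white", "nuclei": 2, "tails": 2},
         {"color": "black", "nuclei": 2, "tails": 2},
         {"color": "black", "nuclei": 3, "tails": 1}]
print(category_utility([cells[:2], cells[2:]]))                 # split by color
print(category_utility([[cells[0]], cells[1:3], [cells[3]]]))   # split like C2/C3/C4
```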
12 Control parameters
• Acuity: a single instance in a cluster results in zero variance, which in
turn produces an infinite value for CU. The acuity parameter is the minimum
value for the variance (it can be interpreted as a measurement error).
• Cutoff: the minimum increase of CU to add a new node to the hierarchy,
otherwise the new node is cut off.
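A minimal sketch of how acuity enters the numeric case: under the normal-distribution assumption the per-attribute term ∑_v P(A = v|C)² becomes 1/(2√π σ), and flooring the variance at the acuity keeps it finite for single-instance clusters:

```python
import math

def numeric_cu_term(values, acuity):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    variance = max(variance, acuity)                    # acuity = minimum variance
    return 1.0 / (2.0 * math.sqrt(math.pi) * math.sqrt(variance))

print(numeric_cu_term([4.2], acuity=0.01))              # finite even for one instance
print(numeric_cu_term([4.2, 4.8, 5.1], acuity=0.01))
```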
13 Example
color nuclei tails
white 1 1
white 2 2
black 2 2
black 3 1
14 Cobweb’s clustering hierarchy
The hierarchy: C1 is the root (covering all four instances); its children are
C2, C3 and C4; C3 in turn has two children, C5 and C6.

C1 (root), P(C1) = 1.0
  attr    val     P
  tails   1       0.5
          2       0.5
  color   white   0.5
          black   0.5
  nuclei  1       0.25
          2       0.50
          3       0.25

C2 (child of C1), P(C2) = 0.25
  attr    val     P
  tails   1       1.0
          2       0.0
  color   white   1.0
          black   0.0
  nuclei  1       1.0
          2       0.0
          3       0.0

C3 (child of C1), P(C3) = 0.5
  attr    val     P
  tails   1       0.0
          2       1.0
  color   white   0.5
          black   0.5
  nuclei  1       0.0
          2       1.0
          3       0.0

C4 (child of C1), P(C4) = 0.25
  attr    val     P
  tails   1       1.0
          2       0.0
  color   white   0.0
          black   1.0
  nuclei  1       0.0
          2       0.0
          3       1.0

C5 (child of C3), P(C5) = 0.25
  attr    val     P
  tails   1       0.0
          2       1.0
  color   white   1.0
          black   0.0
  nuclei  1       0.0
          2       1.0
          3       0.0

C6 (child of C3), P(C6) = 0.25
  attr    val     P
  tails   1       0.0
          2       1.0
  color   white   0.0
          black   1.0
  nuclei  1       0.0
          2       1.0
          3       0.0
15 Cobweb algorithm
cobweb(Node, Instance)
begin
• If Node is a leaf then begin
  Create two children of Node, L1 and L2;
  Set the probabilities of L1 to those of Node;
  Set the probabilities of L2 to those of Instance;
  Add Instance to Node, updating Node's probabilities.
  end
• else begin
  Add Instance to Node, updating Node's probabilities;
  For each child C of Node, compute the category utility of the clustering
  achieved by placing Instance in C;
  Calculate:
    S1 = the score for the best categorization (Instance is placed in C1);
    S2 = the score for the second-best categorization (Instance is placed in C2);
    S3 = the score for placing Instance in a new category;
    S4 = the score for merging C1 and C2 into one category;
    S5 = the score for splitting C1 (replacing it with its child categories).
  end
• If S1 is the best score then call cobweb(C1, Instance).
• If S3 is the best score then set the new category's probabilities to those of Instance.
• If S4 is the best score then call cobweb(Cm, Instance), where Cm is the result of merging
  C1 and C2.
• If S5 is the best score then split C1 and call cobweb(Node, Instance).
end