4th International Summer School
Achievements and Applications of Contemporary
Informatics, Mathematics and Physics
National Technical University of Ukraine
Kiev, Ukraine, August 5-16, 2009




     Methods from Mathematical Data Mining (Supported by Optimization)


             Gerhard-Wilhelm Weber * and Başak Akteke-Öztürk
                               Institute of Applied Mathematics
                       Middle East Technical University, Ankara, Turkey

            * Faculty of Economics, Management and Law,   University of Siegen, Germany
             Center for Research on Optimization and Control, University of Aveiro, Portugal




                                  Clustering Theory

              Cluster Number and Cluster Stability Estimation
                                          Z. Volkovich
     Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

                                            Z. Barzily
     Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

                                          G.-W. Weber
          Departments of Scientific Computing, Financial Mathematics and Actuarial Sciences,
      Institute of Applied Mathematics, Middle East Technical University, 06531, Ankara, Turkey

                                       D. Toledano-Kitai
     Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
Clustering
• An essential tool for “unsupervised” learning is
  cluster analysis, which categorizes data
  (objects, instances) into groups such that the
  similarity within a group is much higher than the
  similarity between groups.

• This resemblance is often measured by a
  distance function.


Clustering

For a given set S ⊂ ℝ^d, a clustering algorithm CL
constructs a clustered set
   CL(S, int-part, k) = Π(S) = (π1(S), …, πk(S)),
such that CL(x) = CL(y) = i if x and y are similar:
   x, y ∈ πi(S), for some i = 1,…,k;
and CL(x) ≠ CL(y) if x and y are dissimilar.

Clustering

The disjoint subsets πi(S), i = 1,…,k, are named
clusters:

   ⋃_{i=1}^{k} πi(S) = S,  and  πi(S) ∩ πj(S) = ∅ for i ≠ j.




Clustering




[Figure: two example partitions, illustrating CL(x) = CL(y) for similar points and CL(x) ≠ CL(y) for dissimilar points]


Clustering
The iterative clustering process is usually carried out in two phases:
a partitioning phase and a quality assessment phase.
In the partitioning phase, a label is assigned to each element
in view of the assumption that, in addition to the observed features,
for each data item, there is a hidden, unobserved feature
representing cluster membership.
The quality assessment phase measures the grouping quality.
The outcome of the clustering process is a partition that achieves
the highest quality score.
Apart from the data itself, two essential input parameters are
typically required: an initial partition and a suggested number of
clusters. Here, these parameters are denoted as
• int-part ;
• k.
The Problem
Partitions generated by iterative algorithms are commonly
sensitive to the initial partition fed in as an input parameter.
Selection of a “good” initial partition is an essential
clustering problem.
Another problem arising here is choosing the right number of
clusters. It is well known that this key task of cluster analysis
is ill-posed. For instance, the “correct” number of clusters in a
data set can depend on the scale in which the data are measured.

In this talk, we address the latter problem: determining
the number of clusters.

The Problem
Many approaches to this problem exploit the within-cluster
dispersion matrix (defined according to the pattern of a
covariance matrix). The span of this matrix (column space)
usually decreases as the number of groups rises, and may have
a point at which it “falls”. Such an “elbow” on the graph locates,
in several known methods, the “true” number of clusters.
Stability-based approaches to the cluster validation problem
evaluate the partitions’ variability under repeated applications
of a clustering algorithm. Low variability is understood as
high consistency of the results obtained, and the number of clusters
that maximizes cluster stability is accepted as an estimate of the
“true” number of clusters.
The Concept
In the current talk, the problem of determining the
true number of clusters is addressed by the cluster
stability approach.
We propose a method for the study of cluster stability.
This method assesses the geometric stability of a
partition.
• We draw samples from the source data and estimate
   the clusters by means of each of the drawn samples.
• We compare pairs of the partitions obtained.
• A pair is considered to be consistent if the obtained
   divisions are close.
The Concept
• We quantify this closeness by the number of edges
  connecting points from different samples in a
  minimal spanning tree (MST) constructed for each one
  of the clusters.
• We use the Friedman–Rafsky two-sample test
  statistic, which measures these quantities. Under the
  null hypothesis of homogeneity of the source data,
  this statistic is approximately normally distributed.
  So, the case of well-mingled samples within the clusters
  leads to an approximately normal distribution of the statistic.
The Concept
Examples of MSTs produced by samples within a cluster:

[Figure: two MST panels, described on the next slide]
The Concept
The left-side picture is an example of “a good cluster”,
where the number of edges connecting points from
different samples (marked by solid red lines) is
relatively large.
The right-side picture depicts a “poor situation”, in which
only one (long) edge connects the (sub-)clusters.




The Two-Sample MST-Test
Henze and Penrose (1999) considered the asymptotic behavior of
Rmn: the number of edges of the pooled-sample MST V which
connect a point of S to a point of T.
Suppose that |S| = m → ∞ and |T| = n → ∞ such that
m/(m+n) → p ∈ (0, 1).
Introducing q = 1 − p and r = 2pq, they obtained:

   (1/√(m+n)) · ( Rmn − 2mn/(m+n) ) → N(0, σd²),

where the convergence is in distribution and N(0, σd²) denotes
the normal distribution with expectation 0 and variance

   σd² := r (r + Cd (1 − 2r)),

for some constant Cd depending only on the space’s dimension d.
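
The statistic is straightforward to compute. The following sketch is our illustration, not part of the talk: it builds the MST of the pooled sample with SciPy, counts the cross edges Rmn, and standardizes them as above. It assumes equal sample sizes m = n, for which r = 1/2, the Cd term vanishes, and σd² = 1/4.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def fr_statistic(S, T):
    """Friedman-Rafsky cross-edge count R and its z-score (for |S| = |T|)."""
    m, n = len(S), len(T)
    pooled = np.vstack([S, T])
    # MST of the pooled sample, built from the Euclidean distance matrix.
    mst = minimum_spanning_tree(cdist(pooled, pooled)).tocoo()
    # An MST edge is a cross edge when its endpoints come from different samples.
    R = int(np.sum((mst.row < m) != (mst.col < m)))
    p = m / (m + n)
    sigma2 = (2 * p * (1 - p)) ** 2          # r^2; exact only when m = n
    z = (R - 2 * m * n / (m + n)) / np.sqrt((m + n) * sigma2)
    return R, z

Under homogeneity the score z is approximately standard normal; a strongly negative z signals that the two samples are poorly mingled.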
Concept
• Resting upon this fact, the standard score

     Yj := √(2K/m) · ( Rj − m/K )

  of the mentioned edge quantity is calculated
  for each cluster j = 1,…,K,
  where m is the sample size and
  K denotes the number of clusters.
• The partition quality Ỹ is represented by the
  worst cluster, corresponding to the
  minimal standard score value obtained:
  Ỹ := min_{j=1,…,K} Yj.
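
As a quick sanity check of this score (using the formula as reconstructed above): with m = 200 and K = 4, a well-mingled cluster is expected to contain Rj ≈ m/K = 50 cross edges, and Yj = √(2·4/200)·(Rj − 50) = 0.2·(Rj − 50); a cluster with only Rj = 30 cross edges thus scores Yj = −4, flagging a poorly mingled, unstable cluster.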
Concept
• It is natural to expect that the true number of
  clusters is characterized by the empirical
  distribution of the partition standard score
  having the shortest left tail.
• The proposed methodology thus sequentially creates
  this distribution for each candidate number of clusters
  and estimates its left asymmetry.




Concept
One of the important problems appearing here is the
so-called cluster coordination problem:
the same cluster can be tagged differently
across repeated runs of the algorithm.
This fact results from the inherent symmetry of
partitions with respect to their cluster labels.




Concept
We solve this problem in the following way.
Let S = S1 ∪ S2. Consider three categorizations:
   ΠK := Cl(S, K),
   ΠK,1 := Cl(S1, K),
   ΠK,2 := Cl(S2, K).
Thus, we get two partitions for each of the samples
Si, i = 1,2: the first one is induced by ΠK, and the
second one is ΠK,i, i = 1, 2.
Concept
For each one of the samples i = 1,2, our purpose is
to find the permutation ψ of the set {1,…,K} which
minimizes the quantity of misclassified items:

   ψi* = arg min_ψ Σ_{x ∈ X} I( ψ(αK,i(x)) ≠ αK^(i)(x) ),   i = 1, 2,

where I(z) is the indicator function of the event z, and
αK^(i), αK,i are the assignments defined by ΠK, ΠK,i,
correspondingly.
Concept

The well-known Hungarian method for solving
this problem has computational complexity O(K³).
After changing the cluster labels of the partitions
ΠK,i, i = 1, 2, consistently with ψi*, i = 1, 2,
we can assume that these partitions are coordinated,
i.e., the clusters are consistently designated.
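
As an illustration of this step (our sketch; it assumes SciPy's implementation of the Hungarian method and label arrays with values in 0,…,K−1), the optimal permutation ψ* can be found from a K × K agreement table:

import numpy as np
from scipy.optimize import linear_sum_assignment

def coordinate(ref, labels, K):
    """Relabel `labels` so that it agrees as much as possible with `ref`."""
    # agreement[a, b] = number of items labeled a in `labels` and b in `ref`
    agreement = np.zeros((K, K), dtype=int)
    for a, b in zip(labels, ref):
        agreement[a, b] += 1
    # The Hungarian method minimizes cost, so negate agreement to maximize it.
    _, perm = linear_sum_assignment(-agreement)
    return perm[labels]                      # apply psi* elementwise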




Algorithm
1. Choose the parameters: K*, J, m, Cl.
2. For K = 2 to K*
3.    For j = 1 to J
4.       Sj,1 = sample(X, m),  Sj,2 = sample(X \ Sj,1, m)
5.       Calculate
         ΠK,j = Cl(S(j), K), where S(j) := Sj,1 ∪ Sj,2,
         ΠK,j,1 = Cl(Sj,1, K),
         ΠK,j,2 = Cl(Sj,2, K).
6.       Solve the coordination problem.
6.      Solve the coordination problem.
Algorithm
7.       Calculate Yj(k), k = 1,…,K, and Ỹj(K).
8.    end for j
9.    Calculate an asymmetry index (percentile) IK
      for { Ỹj(K) | j = 1,…,J }.
10. end for K
11. The “true” number of clusters is selected as the K
    which yields the maximal value of the index.

Here, sample(S,m) is a procedure which selects a
random sample of size m from the set S, without
replacement.
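
Putting the pieces together, a compact end-to-end sketch of steps 1–11 could look as follows (our code, under the stated assumptions: fr_statistic and coordinate are the helper functions sketched earlier, cluster is any procedure playing the role of Cl(data, K) → labels, and the score formula is the one reconstructed above):

import numpy as np

def estimate_true_k(X, K_star, J, m, cluster, percentile=25, seed=0):
    """Return the K in 2..K_star maximizing the asymmetry index I_K."""
    rng = np.random.default_rng(seed)
    index = {}
    for K in range(2, K_star + 1):                       # step 2
        Y_tilde = []
        for _ in range(J):                               # step 3
            i1 = rng.choice(len(X), m, replace=False)    # step 4: S_{j,1}
            rest = np.setdiff1d(np.arange(len(X)), i1)
            i2 = rng.choice(rest, m, replace=False)      #         S_{j,2}
            S1, S2 = X[i1], X[i2]
            labels = cluster(np.vstack([S1, S2]), K)     # step 5: Pi_{K,j}
            lab1 = coordinate(labels[:m], cluster(S1, K), K)   # step 6
            lab2 = coordinate(labels[m:], cluster(S2, K), K)
            scores = []                                  # step 7
            for k in range(K):
                A, B = S1[lab1 == k], S2[lab2 == k]
                if len(A) and len(B):
                    R, _ = fr_statistic(A, B)
                    scores.append(np.sqrt(2 * K / m) * (R - m / K))
            Y_tilde.append(min(scores))                  # worst cluster
        index[K] = np.percentile(Y_tilde, percentile)    # step 9
    return max(index, key=index.get)                     # step 11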
Numerical Experiments
We have carried out various numerical experiments on synthetic
and real data sets. We choose K* = 7 in all tests, and we perform
10 trials for each experiment.
The results are presented via error-bar plots of the mean of the
sample percentiles within the trials. The sizes of the error bars
equal two standard deviations, computed within the trials.
The standard version of the Partitioning Around Medoids (PAM)
algorithm has been used for clustering.
The empirical percentiles of 25%, 75% and 90% have been used
as the asymmetry indexes.
Numerical Experiments – Synthetic Data
The synthesized data are mixtures of 2-dimensional
Gaussian distributions with independent coordinates
having the same standard deviation σ.
The mean values of the components are placed on the
unit circle at equal angular spacing 2π/k̂.

Each data set contains 4000 items.
Here, we took J=100 (J: number of samples) and
m=200 (m: size of samples).
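
A sketch of such a generator (our illustration of the description above; the function name is ours):

import numpy as np

def make_circle_mixture(k_hat=4, sigma=0.3, n_items=4000, seed=0):
    """k_hat isotropic 2-D Gaussians, means spaced 2*pi/k_hat on the unit circle."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(k_hat) / k_hat
    means = np.column_stack([np.cos(angles), np.sin(angles)])
    comp = rng.integers(k_hat, size=n_items)     # component of each item
    return means[comp] + sigma * rng.standard_normal((n_items, 2))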

Synthetic Data - Example 1
The first data set has the parameters k̂ = 4 and σ = 0.3.

[Figure: error-bar plots of the three percentile indexes versus K]

As we see, all three indexes clearly indicate
four clusters.
Synthetic Data - Example 2
The second synthetic data set has the parameters k̂ = 5
and σ = 0.3.

[Figure: scatter plot of the five-component data set]

The components obviously overlap in this case.
Synthetic Data - Example 2




[Figure: error-bar plots of the three percentile indexes versus K]

As can be seen, the true number of clusters has been
successfully found by all indexes.
Numerical Experiments – Real-World Data
  First Data Sets
The first real data set was chosen from the text collection
http://ftp.cs.cornell.edu/pub/smart/ .

This set consists of the following three sub-collections:
DC0: Medlars Collection (1033 medical abstracts),
DC1: CISI Collection (1460 information science abstracts),
DC2: Cranfield Collection (1400 aerodynamics abstracts).


Numerical Experiments – Real-World Data
  First Data Sets
We picked the 600 “best” terms, following the common
bag-of-words method.
It is known that this collection is well separated
by means of its first two leading principal components.
Here, we also took J = 100 and m = 200.




Real-World Data - First Data Sets




[Figure: error-bar plots of the three percentile indexes versus K]

All the indexes attain their maximal values at K = 3,
i.e., the number of clusters is properly determined.
Numerical Experiments – Real-World Data
  Second Data Set
Another considered data set is the famous
Iris Flower Data Set, available, for example, at
http://archive.ics.uci.edu/ml/datasets/Iris .
This data set is composed of 150 four-dimensional
feature vectors from three equally sized sets of iris flowers.
We choose J = 200, and the sample size equals m = 70.
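
For illustration, this experiment could be reproduced along the following lines (our sketch; estimate_true_k is the function sketched after the algorithm, and scikit-learn's KMeans stands in for PAM, so results will differ in detail):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                                   # 150 x 4 feature matrix
cl = lambda data, K: KMeans(n_clusters=K, n_init=10).fit_predict(data)
print(estimate_true_k(X, K_star=7, J=200, m=70, cluster=cl))  # expect 3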


Real-World Data – Iris Flower Data Set




[Figure: error-bar plots of the three percentile indexes for the Iris data]

Our method reveals a three-cluster structure.
Conclusions -
 The Rationale of Our Approach
• In this paper, we propose a novel approach, based on
  the minimal spanning tree two-sample test, for
  cluster stability assessment.
• The method quantifies the partitions’ features
  through the test statistic computed within the clusters
  built by means of sample pairs.
• The worst cluster, determined by the lowest
  standardized statistic value, characterizes the
  partition quality.

Conclusions -
 The Rationale of Our Approach
• The departure from the theoretical model, which
  assumes well-mingled samples within the clusters,
  is described by the left tail of the score distribution.
• The shortest tail corresponds to the “true” number
  of clusters.
• All presented experiments detect the true number
  of clusters.



Conclusions

• In the case of the five-component Gaussian data set,
  the true number of clusters was found even though
  a certain overlap of the clusters exists.
• The four-component Gaussian data set contains
  sufficiently separated components. Therefore,
  it is no surprise that the true number of clusters
  is attained here.




Conclusions

• The analysis of the abstracts data set was carried out
  with 600 terms, and the true number of clusters
  was also detected.
• The Iris Flower data set is quite difficult to
  analyze because two of its clusters are not
  linearly separable. However, the true number
  of clusters was found here as well.




References
Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., Cluster stability using minimal spanning trees,
ISI Proceedings of 20th Mini-EURO Conference Continuous Optimization and Knowledge-Based Technologies
(Neringa, Lithuania, May 20-23, 2008) 248-252.

Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., On a minimal spanning tree approach in the
cluster validation problem, to appear in the special issue of INFORMATICA at the occasion of 20th Mini-EURO
Conference Continuous Optimization and Knowledge Based Technologies (Neringa, Lithuania, May 20-23, 2008),
Dzemyda, G., Miettinen, K., and Sakalauskas, L., guest editors.

Volkovich, V., Barzily, Z., Weber, G.-W., and Toledano-Kitai, D., Cluster stability estimation based on a minimal
spanning trees approach, Proceedings of the Second Global Conference on Power Control and Optimization, AIP
Conference Proceedings 1159, Bali, Indonesia, 1-3 June 2009, Subseries: Mathematical and Statistical Physics; ISBN
978-0-7354-0696-4 (August 2009) 299-305; Hakim, A.H., Vasant, P., and Barsoum, N., guest eds.
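
Friedman, J.H., and Rafsky, L.C., Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests,
The Annals of Statistics 7, 4 (1979) 697-717.

Henze, N., and Penrose, M.D., On the multivariate runs test, The Annals of Statistics 27, 1 (1999) 290-298.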



