INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 6, November-December (2013), pp. 78-82
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

DIVISIVE HIERARCHICAL CLUSTERING USING PARTITIONING
METHODS
Megha Gupta
M.Tech Scholar, Computer Science & Engineering,
Arya College of Engg & IT
Jaipur, Rajasthan, India
Vishal Shrivastava
Professor, Computer Science & Engineering
Arya College of Engg & IT
Jaipur, Rajasthan, India

ABSTRACT
Clustering is the process of partitioning a set of data into subsets, so that one set of data is collected on one side and another set of data on the other. Clustering can be done using many methods, such as partitioning methods, hierarchical methods, and density-based methods. A hierarchical method creates a hierarchical decomposition of the given set of data objects: in successive iterations, a cluster is split into smaller clusters, until eventually each object is in its own cluster or a termination condition holds. In this paper, a partitioning method has been combined with a hierarchical method, using several algorithms, to form better and improved clusters.
Keywords: Clustering, Hierarchical, Partitioning methods.
I. INTRODUCTION
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to
the nontrivial extraction of implicit, previously unknown and potentially useful information from
data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of the knowledge discovery process.


The Knowledge Discovery in Databases process comprises a few steps leading from raw
data collections to some form of new knowledge [1]. The iterative process consists of the following
steps:
• Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are
  removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into
  a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the
  data collection.
• Data transformation: also known as data consolidation, a phase in which the selected data is
  transformed into forms appropriate for the mining procedure.
• Data mining: the crucial step in which clever techniques are applied to extract potentially useful
  patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified
  based on given measures.
• Knowledge representation: the final phase in which the discovered knowledge is visually
  represented to the user. This essential step uses visualization techniques to help users understand
  and interpret the data mining results.
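The preprocessing stages above can be sketched as a small chain of functions. This is only an illustrative sketch; all function names and the sample records are hypothetical, not from the paper.

```python
# Illustrative sketch of the KDD preprocessing pipeline as composable steps.
# The function names and sample data are hypothetical placeholders.

def clean(records):
    # Data cleaning: drop records with missing (noisy) values.
    return [r for r in records if None not in r.values()]

def select(records, fields):
    # Data selection: keep only the attributes relevant to the analysis.
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    # Data transformation: consolidate the chosen attribute into floats.
    return [float(r[field]) for r in records]

raw = [{"age": "34", "city": "Jaipur"}, {"age": None, "city": "Delhi"}]
prepared = transform(select(clean(raw), ["age"]), "age")
print(prepared)  # [34.0]
```

Each step consumes the output of the previous one, mirroring the iterative nature of the process described above.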
Clustering is the organization of data into classes. However, unlike classification, in clustering
the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes.
Clustering is also called unsupervised classification, because the classification is not dictated by
given class labels. There are many clustering approaches, all based on the principle of maximizing the
similarity between objects in the same class (intra-class similarity) and minimizing the similarity
between objects of different classes (inter-class similarity) [2].
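The intra-class versus inter-class principle can be made concrete with average pairwise distances: within a good cluster the average distance is small, and across clusters it is large. This is a minimal sketch with made-up 2-D points, not data from the paper.

```python
import math

def dist(p, q):
    # Euclidean distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def avg_pairwise(points_a, points_b):
    # Average distance over all pairs (excluding a point paired with itself).
    pairs = [(p, q) for p in points_a for q in points_b if p is not q]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

c1 = [(0.0, 0.0), (0.0, 1.0)]      # one tight cluster
c2 = [(5.0, 5.0), (5.0, 6.0)]      # another, far away
intra = avg_pairwise(c1, c1)       # intra-class distance (similarity is high)
inter = avg_pairwise(c1, c2)       # inter-class distance (similarity is low)
print(intra < inter)  # True
```

A clustering objective rewards exactly this configuration: small `intra`, large `inter`.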

II. RELATED WORK
Hierarchical Clustering for Data-mining
Hierarchical methods for supervised and unsupervised data mining give a multilevel
description of data. They are relevant to many applications related to information extraction, retrieval,
navigation and organization. Two interpretation techniques have been used for description of the
clusters:
1. Listing of prototypical data examples from the cluster.
2. Listing of typical features associated with the cluster.
The Generalizable Gaussian Mixture model (GGM) and the soft Generalizable Gaussian
Mixture model (SGGM) are addressed for supervised and unsupervised learning. These two models
estimate the parameters of the Gaussian clusters with a modified EM procedure operating on two
disjoint sets of observations, which helps ensure high generalization ability [3].
Procedure
The agglomerative clustering scheme starts from the k clusters at level j = 1, as given by the
optimized GGM model of p(x). At each higher level in the hierarchy, two clusters are merged based
on a similarity measure between pairs of clusters. The procedure is repeated until the top level: at
level j = 1 there are k clusters, and at the final level, j = 2k − 1, there is a single cluster. The natural
distance measure between the cluster densities is the Kullback-Leibler (KL) divergence, since it
reflects dissimilarity between the densities in the probabilistic space.
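For univariate Gaussians the KL divergence has a well-known closed form, KL(N(μ0, σ0²) ‖ N(μ1, σ1²)) = log(σ1/σ0) + (σ0² + (μ0 − μ1)²)/(2σ1²) − 1/2, which makes the first merge level easy to compute. A minimal sketch (the paper's mixture setting is multivariate; this one-dimensional case is only for illustration):

```python
import math

def kl_gauss(mu0, sd0, mu1, sd1):
    # Closed-form KL divergence KL(N(mu0, sd0^2) || N(mu1, sd1^2))
    # between two univariate Gaussian densities.
    return (math.log(sd1 / sd0)
            + (sd0 ** 2 + (mu0 - mu1) ** 2) / (2 * sd1 ** 2)
            - 0.5)

# Identical densities have zero divergence; distant means diverge more,
# which is why KL works as a dissimilarity between cluster densities.
print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
print(kl_gauss(0.0, 1.0, 3.0, 1.0))  # 4.5
```

Note that KL is asymmetric; a merging scheme would typically symmetrize it, e.g. by summing the two directions.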


Limitations
The drawback is that the KL divergence admits an analytical expression only at the first level
of the hierarchy, while distances at subsequent levels have to be approximated.
Automatically Labeling Hierarchical Clustering
A simple algorithm has been used that automatically assigns labels to hierarchical clusters.
The algorithm evaluates candidate labels using information from the cluster, the parent cluster and
corpus statistics. A trainable threshold enables the algorithm to assign just a few high-quality labels
to each cluster. First, it is assumed that the algorithm has access to a general collection of documents
E, representing the word distribution of general English. This English corpus is used in selecting
label candidates [4].
Procedure
Given a cluster S and its parent cluster P, which includes all documents in S and in the sibling
clusters of S, the algorithm selects labels for the cluster with the following steps:
1. Collect phrase statistics: for every unigram, bigram and trigram phrase p occurring in the
   cluster S, calculate the document frequency and term frequency statistics for the cluster, the
   parent cluster and the general English corpus.
2. Select label candidates: select the label candidates from the unigram, bigram and trigram
   phrases based on document frequency in the cluster and in general English.
3. Calculate the descriptive score: calculate the descriptive score for each label candidate, and then
   sort the label candidates by these scores.
4. Calculate the cutoff point: decide how many label candidates to display based on the descriptive
   scores.
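The intuition behind the descriptive score can be sketched with a simple frequency ratio: a phrase is a good label when it is much more frequent in the cluster than in the general English corpus. The scoring function below is a hypothetical simplification (the paper's actual score combines several weighted cluster, parent and corpus statistics) and the documents are made up.

```python
from collections import Counter

def label_scores(cluster_docs, corpus_docs):
    # Hypothetical descriptive score: ratio of a term's relative frequency
    # in the cluster to its (smoothed) relative frequency in the corpus.
    cluster_tf = Counter(w for d in cluster_docs for w in d.split())
    corpus_tf = Counter(w for d in corpus_docs for w in d.split())
    n_cluster = sum(cluster_tf.values())
    n_corpus = sum(corpus_tf.values())
    return {w: (cluster_tf[w] / n_cluster) / ((corpus_tf[w] + 1) / n_corpus)
            for w in cluster_tf}

cluster = ["protein protein expression", "protein folding"]
corpus = ["the cat sat", "protein news", "weather report today"]
scores = label_scores(cluster, corpus)
best = max(scores, key=scores.get)
print(best)  # protein
```

Sorting candidates by such a score and cutting off at a trained threshold corresponds to steps 3 and 4 above.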
Limitation
Most errors come from clusters containing small numbers of documents. The small
number of observations in small clusters can make good and bad labels indistinguishable; minor
variations in vocabulary can also produce statistical features with high variance.
Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs
This work presents data structures for dynamic closest-pair problems with arbitrary distance
functions, based on a technique used for Euclidean closest pairs. It shows how to insert or delete an
object from an n-object set while maintaining the closest pair, in O(n log² n) time per update and O(n)
space. The purpose of the paper is to show that much better bounds are possible using data structures
that are simple. If linear space is required, this represents an order-of-magnitude improvement.
Procedure
The data structure consists of a partition of the dynamic set S into k ≤ log n subsets S1,
S2, …, Sk, together with a digraph Gi for each subset Si. Initially all points are in S1 and G1 has n − 1
edges. Gi may contain edges with neither endpoint in Si; if the number of edges in all graphs grows
to 2n, the data structure is rebuilt by moving all points to S1 and recomputing G1. The closest
pair is represented by an edge in some Gi, so the pair can be found by scanning the edges
in all graphs [5].
To create Gi for a new subset Si, start with a single path: choose the first vertex of the path to
be any object in Si, then extend the path one edge at a time. When the last vertex of the path P is in
Si, choose the next vertex to be its nearest neighbor in S − P; when the last vertex is in S − Si, choose
the next vertex to be its nearest neighbor in Si − P. Continue until the path can no longer be extended
because S − P or Si − P is empty.
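The alternating path construction can be sketched directly from that description. This is a plain quadratic-time sketch for illustration, using 1-D points and absolute difference as the distance; the paper's structure adds the bookkeeping needed for the stated update bounds.

```python
def build_path(S, Si, dist):
    # Sketch of the nearest-neighbor path construction described above.
    # From a vertex in Si, step to the nearest unvisited point of all of S;
    # from a vertex outside Si, step to the nearest unvisited point of Si.
    in_si = set(Si)
    path = [Si[0]]
    on_path = {Si[0]}
    while True:
        last = path[-1]
        pool = [p for p in (S if last in in_si else Si) if p not in on_path]
        if not pool:           # S - P or Si - P is empty: stop extending
            return path
        nxt = min(pool, key=lambda p: dist(last, p))
        path.append(nxt)
        on_path.add(nxt)

# 1-D example: S = {0, 1, 2, 10}, new subset Si = {0, 10}.
print(build_path([0, 1, 2, 10], [0, 10], lambda a, b: abs(a - b)))
# [0, 1, 10, 2]
```

Each path edge records a nearest-neighbor relation, which is why the closest pair always appears as an edge of some Gi.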

Merge partitions: the update operations can cause k to become too large relative to n. If so,
choose subsets Si and Sj as close to equal in size as possible, i.e. with |Si| ≤ |Sj| and |Sj| / |Si|
minimized, merge these two subsets into one, and create the graph for the merged subset.
To insert x, create a new subset Sk+1 = {x} in the partition of S, create Gk+1, and merge
partitions as necessary until k ≤ log n.
To delete x, create a new subset Sk+1 consisting of all objects y such that (y, x) is a directed
edge in some Gi. Remove x and all its adjacent edges from all the graphs Gi. Create the graph Gk+1
for Sk+1, and merge partitions as necessary until k ≤ log n.
Theorem: the data structure above maintains the closest pair in S in O(n) space, amortized time
O(n log n) per insertion, and amortized time O(n log² n) per deletion.
Limitations
The methods tested involve sequential scans through memory, a behavior known to
reduce the effectiveness of cached memory.
Motivations
Using hierarchical clustering, better clusters will be formed: the resulting clusters will be
better separated, with tight bonding among the members of each cluster. That is, the clusters formed
will be refined using the various algorithms of hierarchical clustering.
III. PROBLEM STATEMENT
The objective of the proposed work is to perform hierarchical clustering to obtain more
refined clusters with a strong relationship between members of the same cluster.
IV. PROPOSED APPROACH
In this paper, we have used K-means algorithm and CFNG to find better and improved clusters.
K-means Algorithm
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute
the objects into k clusters, C1, …, Ck, such that Ci ⊂ D and Ci ∩ Cj = Ø for 1 ≤ i, j ≤ k, i ≠ j. An
objective function is used to assess the partitioning quality, so that objects within a cluster are similar
to one another but dissimilar to objects in other clusters. That is, the objective function aims for high
intracluster similarity and low intercluster similarity [6].
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that
cluster. The centroid of a cluster is its center point, and can be defined in various ways, such as by
the mean or medoid of the objects assigned to the cluster. The difference between an object p
and Ci, the representative of the cluster, is measured by dist(p, Ci), where dist(x, y) is the Euclidean
distance between two points x and y.
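The centroid-based scheme just described is exactly the k-means loop: assign each object to the nearest centroid, then recompute each centroid as the mean of its assigned objects. A minimal sketch (naive seeding and toy 2-D points chosen for illustration; production implementations use better initialization such as k-means++):

```python
import math

def kmeans(points, k, iters=20):
    # Minimal k-means sketch: alternate assignment and centroid update.
    centroids = points[:k]  # naive seeding, for illustration only
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the cluster whose centroid is nearest (Euclidean).
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Recompute each centroid as the coordinate-wise mean of its cluster.
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
cents, cls = kmeans(pts, 2)
print(sorted(cents))  # [(0.0, 0.5), (10.0, 10.5)]
```

The two recovered centroids sit at the means of the two obvious groups, minimizing the within-cluster distances that the objective function measures.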
CFNG
The colored farthest neighbor graph (CFNG) shares many characteristics with the shared
farthest neighbors (SFN) method of Rovetta and Masulli [7]. The CFNG algorithm yields binary
partitions of objects into subsets, whereas the number of subsets obtained by SFN can vary. The SFN
algorithm can easily split a cluster where no natural partition exists, while the CFNG often avoids
such splits.
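The flavor of a farthest-neighbor binary split can be sketched as follows. This is a deliberate simplification, not the actual CFNG construction (which the paper does not detail here): it takes the two mutually farthest points as the two "colors" and assigns every point to the nearer seed, yielding a binary partition.

```python
import math

def farthest_pair_split(points):
    # Illustrative binary split in the spirit of farthest-neighbor methods:
    # seed with the two mutually farthest points, then assign each point
    # to the nearer seed. NOT the paper's CFNG algorithm, only a sketch.
    a, b = max(((p, q) for p in points for q in points),
               key=lambda pq: math.dist(*pq))
    left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
    return left, right

left, right = farthest_pair_split([(0, 0), (1, 0), (9, 0), (10, 0)])
print(left, right)  # [(0, 0), (1, 0)] [(9, 0), (10, 0)]
```

Always producing exactly two subsets per split is what makes such a method a natural building block for divisive hierarchical clustering.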


V. RESULTS
To observe the effect of hierarchical clustering, the k-means and CFNG algorithms are used, and
to observe the results, the experimental setup was designed using Java and MySQL. The obtained
results are compared with K-means and CFNG when each is executed individually.

Figure 1: Comparison of the Proposed Algorithm with K-means and CFNG
VI. CONCLUSION AND FUTURE SCOPE
We have obtained better and improved clusters using the K-means and CFNG algorithms
hierarchically. The final clusters obtained are tightly bonded with each other.
In this paper, we have used two different algorithms for hierarchical clustering. Instead of
CFNG, another hierarchical clustering algorithm could have been used.
REFERENCES
[1]  Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", 1999.
[2]  Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques".
[3]  Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", 1999.
[4]  Pucktada Treeratpituk, Jamie Callan, "Automatically Labeling Hierarchical Clusters".
[5]  David Eppstein, "Fast Hierarchical Clustering and Other Applications of Dynamic Closest
     Pairs".
[6]  G. Plaxton, "Approximation Algorithms for Hierarchical Location Problems", Proceedings of
     the 35th ACM Symposium on the Theory of Computing, 2003.
[7]  A. Borodin, R. Ostrovsky and Y. Rabani, "Subquadratic Approximation Algorithms for
     Clustering Problems in High Dimensional Spaces", Proceedings of the 31st ACM Symposium
     on Theory of Computing, 1999.
[8]  Rinal H. Doshi, Dr. Harshad B. Bhadka and Richa Mehta, "Development of Pattern
     Knowledge Discovery Framework using Clustering Data Mining Algorithm", International
     Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013,
     pp. 101-112, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9]  Deepika Khurana and Dr. M.P.S Bhatia, "Dynamic Approach to K-Means Clustering
     Algorithm", International Journal of Computer Engineering & Technology (IJCET),
     Volume 4, Issue 3, 2013, pp. 204-219, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[10] Meghana N. Ingole, M.S. Bewoor, S.H. Patil, "Context Sensitive Text Summarization using
     Hierarchical Clustering Algorithm", International Journal of Computer Engineering &
     Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322-329, ISSN Print: 0976-6367,
     ISSN Online: 0976-6375.

More Related Content

PDF
Scalable and efficient cluster based framework for
eSAT Publishing House
 
PDF
Scalable and efficient cluster based framework for multidimensional indexing
eSAT Journals
 
PDF
Dynamic approach to k means clustering algorithm-2
IAEME Publication
 
PDF
50120130406022
IAEME Publication
 
PDF
Top-K Dominating Queries on Incomplete Data with Priorities
ijtsrd
 
PDF
K-means Clustering Method for the Analysis of Log Data
idescitation
 
PDF
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
PDF
Cg33504508
IJERA Editor
 
Scalable and efficient cluster based framework for
eSAT Publishing House
 
Scalable and efficient cluster based framework for multidimensional indexing
eSAT Journals
 
Dynamic approach to k means clustering algorithm-2
IAEME Publication
 
50120130406022
IAEME Publication
 
Top-K Dominating Queries on Incomplete Data with Priorities
ijtsrd
 
K-means Clustering Method for the Analysis of Log Data
idescitation
 
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
Cg33504508
IJERA Editor
 

What's hot (20)

PDF
Premeditated Initial Points for K-Means Clustering
IJCSIS Research Publications
 
PDF
A046010107
IJERA Editor
 
PDF
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
PDF
Comparative Analysis of Algorithms for Single Source Shortest Path Problem
CSCJournals
 
PDF
Af4201214217
IJERA Editor
 
PDF
A frame work for clustering time evolving data
iaemedu
 
PDF
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
PDF
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
PDF
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
IOSR Journals
 
PDF
UNIT III NON LINEAR DATA STRUCTURES – TREES
Kathirvel Ayyaswamy
 
PDF
Analysis and implementation of modified k medoids
eSAT Publishing House
 
PDF
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET Journal
 
PDF
IRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET Journal
 
PDF
Big Data Clustering Model based on Fuzzy Gaussian
IJCSIS Research Publications
 
PDF
F04463437
IOSR-JEN
 
PDF
Finding Relationships between the Our-NIR Cluster Results
CSCJournals
 
PDF
Privacy preserving clustering on centralized data through scaling transf
IAEME Publication
 
PDF
50120140505013
IAEME Publication
 
PDF
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
IRJET Journal
 
Premeditated Initial Points for K-Means Clustering
IJCSIS Research Publications
 
A046010107
IJERA Editor
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
Comparative Analysis of Algorithms for Single Source Shortest Path Problem
CSCJournals
 
Af4201214217
IJERA Editor
 
A frame work for clustering time evolving data
iaemedu
 
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
IOSR Journals
 
UNIT III NON LINEAR DATA STRUCTURES – TREES
Kathirvel Ayyaswamy
 
Analysis and implementation of modified k medoids
eSAT Publishing House
 
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET Journal
 
IRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET Journal
 
Big Data Clustering Model based on Fuzzy Gaussian
IJCSIS Research Publications
 
F04463437
IOSR-JEN
 
Finding Relationships between the Our-NIR Cluster Results
CSCJournals
 
Privacy preserving clustering on centralized data through scaling transf
IAEME Publication
 
50120140505013
IAEME Publication
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
IRJET Journal
 
Ad

Viewers also liked (6)

PDF
Comparative analysis of various data stream mining procedures and various dim...
Alexander Decker
 
PDF
QUALITY OF CLUSTER INDEX BASED ON STUDY OF DECISION TREE
IJORCS
 
PDF
Improved Performance of Unsupervised Method by Renovated K-Means
IJASCSE
 
PPTX
Trabajo de musica 3 diver adrian naranjo hernandez ajajjajajajja
adriannaranjo3
 
PDF
WHAT DOES MEAN “GOD”?... (A New theory on “RELIGION”)
IJERD Editor
 
Comparative analysis of various data stream mining procedures and various dim...
Alexander Decker
 
QUALITY OF CLUSTER INDEX BASED ON STUDY OF DECISION TREE
IJORCS
 
Improved Performance of Unsupervised Method by Renovated K-Means
IJASCSE
 
Trabajo de musica 3 diver adrian naranjo hernandez ajajjajajajja
adriannaranjo3
 
WHAT DOES MEAN “GOD”?... (A New theory on “RELIGION”)
IJERD Editor
 
Ad

Similar to 50120130406008 (20)

PDF
Clustering Approach Recommendation System using Agglomerative Algorithm
IRJET Journal
 
PDF
Bs31267274
IJMER
 
PDF
Paper id 26201478
IJRAT
 
PDF
Ir3116271633
IJERA Editor
 
PDF
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
PDF
A h k clustering algorithm for high dimensional data using ensemble learning
ijitcs
 
PDF
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
PDF
50120140505015 2
IAEME Publication
 
PDF
Du35687693
IJERA Editor
 
PDF
Similarity distance measures
thilagasna
 
PDF
A new link based approach for categorical data clustering
International Journal of Science and Research (IJSR)
 
PDF
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
csandit
 
PDF
Survey on traditional and evolutionary clustering
eSAT Publishing House
 
PDF
Survey on traditional and evolutionary clustering approaches
eSAT Journals
 
PDF
4 image segmentation through clustering
IAEME Publication
 
PDF
4 image segmentation through clustering
prjpublications
 
PDF
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
PDF
Multilevel techniques for the clustering problem
csandit
 
PDF
An Analysis On Clustering Algorithms In Data Mining
Gina Rizzo
 
PDF
A comprehensive survey of contemporary
prjpublications
 
Clustering Approach Recommendation System using Agglomerative Algorithm
IRJET Journal
 
Bs31267274
IJMER
 
Paper id 26201478
IJRAT
 
Ir3116271633
IJERA Editor
 
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
A h k clustering algorithm for high dimensional data using ensemble learning
ijitcs
 
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
50120140505015 2
IAEME Publication
 
Du35687693
IJERA Editor
 
Similarity distance measures
thilagasna
 
A new link based approach for categorical data clustering
International Journal of Science and Research (IJSR)
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
csandit
 
Survey on traditional and evolutionary clustering
eSAT Publishing House
 
Survey on traditional and evolutionary clustering approaches
eSAT Journals
 
4 image segmentation through clustering
IAEME Publication
 
4 image segmentation through clustering
prjpublications
 
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
Multilevel techniques for the clustering problem
csandit
 
An Analysis On Clustering Algorithms In Data Mining
Gina Rizzo
 
A comprehensive survey of contemporary
prjpublications
 

More from IAEME Publication (20)

PDF
IAEME_Publication_Call_for_Paper_September_2022.pdf
IAEME Publication
 
PDF
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
IAEME Publication
 
PDF
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
IAEME Publication
 
PDF
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
IAEME Publication
 
PDF
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
IAEME Publication
 
PDF
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
IAEME Publication
 
PDF
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
IAEME Publication
 
PDF
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
IAEME Publication
 
PDF
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
IAEME Publication
 
PDF
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
IAEME Publication
 
PDF
GANDHI ON NON-VIOLENT POLICE
IAEME Publication
 
PDF
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
IAEME Publication
 
PDF
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
IAEME Publication
 
PDF
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
IAEME Publication
 
PDF
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
IAEME Publication
 
PDF
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
IAEME Publication
 
PDF
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
IAEME Publication
 
PDF
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
IAEME Publication
 
PDF
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
IAEME Publication
 
PDF
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
IAEME Publication
 
IAEME_Publication_Call_for_Paper_September_2022.pdf
IAEME Publication
 
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
IAEME Publication
 
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
IAEME Publication
 
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
IAEME Publication
 
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
IAEME Publication
 
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
IAEME Publication
 
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
IAEME Publication
 
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
IAEME Publication
 
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
IAEME Publication
 
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
IAEME Publication
 
GANDHI ON NON-VIOLENT POLICE
IAEME Publication
 
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
IAEME Publication
 
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
IAEME Publication
 
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
IAEME Publication
 
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
IAEME Publication
 
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
IAEME Publication
 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
IAEME Publication
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
IAEME Publication
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
IAEME Publication
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
IAEME Publication
 

Recently uploaded (20)

PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
PDF
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PDF
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
PPTX
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
PDF
Google I/O Extended 2025 Baku - all ppts
HusseinMalikMammadli
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PDF
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
PDF
Software Development Methodologies in 2025
KodekX
 
PDF
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
PDF
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
PDF
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PDF
Advances in Ultra High Voltage (UHV) Transmission and Distribution Systems.pdf
Nabajyoti Banik
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
Google I/O Extended 2025 Baku - all ppts
HusseinMalikMammadli
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
Software Development Methodologies in 2025
KodekX
 
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
Advances in Ultra High Voltage (UHV) Transmission and Distribution Systems.pdf
Nabajyoti Banik
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 

50120130406008

  • 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), © IAEME TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 6, November - December (2013), pp. 78-82 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com IJCET ©IAEME DIVISIVE HIERARCHICAL CLUSTERING USING PARTITIONING METHODS Megha Gupta M.Tech Scholar, Computer Science & Engineering, Arya College of Engg & IT Jaipur, Rajasthan, India Vishal Shrivastava Professor, Computer Science & Engineering Arya College of Engg & IT Jaipur, Rajasthan, India ABSTRACT Clustering is the process of partitioning a set of data so that the data can be divided into subsets. Clustering is implemented so that same set of data can be collected on one side and other set of data can be collected on the other end. Clustering can be done using many methods like partitioning methods, hierarchical methods, density based method. Hierarchical method creates a hierarchical decomposition of the given set of data objects. In successive iteration, a cluster is split into smaller clusters, until eventually each object is in one cluster, or a termination condition holds. In this paper, partitioning method has been used with hierarchical method to form better and improved clusters. We have used various algorithms for getting better and improved clusters. Keywords: Clustering, Hierarchical, Partitioning methods. I. INTRODUCTION Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. 78
  • 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), © IAEME The Knowledge Discovery in Databases process comprises of a few steps leading from raw data collections to [1] some form of new knowledge. The iterative process consists of the following steps: Data cleaning: also known as data cleansing, it is a phase in which noise data and irrelevant data are removed from the collection. • Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source. • Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection. • Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure. • Data mining: it is the crucial step in which clever techniques are applied to extract patterns potentially useful. • Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures. • Knowledge representation: is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results. Clustering is the organization of data in classes. However, unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches all based on the principle of maximizing the similarity between objects in a same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity) [2]. • II. 
II. RELATED WORK

Hierarchical Clustering for Data Mining

Hierarchical methods for supervised and unsupervised data mining give a multilevel description of data. They are relevant for many applications related to information extraction, retrieval, navigation and organization. Two interpretation techniques have been used for describing the clusters:
1. Listing of prototypical data examples from the cluster.
2. Listing of typical features associated with the cluster.
The Generalizable Gaussian Mixture model (GGM) and the Soft Generalizable Gaussian Mixture model (SGGM) are addressed for supervised and unsupervised learning. These two models estimate the parameters of the Gaussian clusters with a modified EM procedure from two disjoint sets of observations, which helps in ensuring high generalization ability [3].

Procedure

The agglomerative clustering scheme is started with k clusters at level j = 1, as given by the optimized GGM model of p(x). At each higher level in the hierarchy, two clusters are merged based on a similarity measure between pairs of clusters. The procedure is repeated until the top level is reached. That is, at level j = 1 there are k clusters, and there is 1 cluster at the final level, j = 2k − 1. The natural distance measure between the cluster densities is the Kullback-Leibler (KL) divergence, since it reflects dissimilarity between the densities in the probabilistic space.
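The merging criterion above can be illustrated with the closed-form KL divergence between two Gaussian cluster densities. For simplicity this sketch uses univariate Gaussians, and the symmetrized KL divergence is an assumption of the sketch (a common choice), since the exact merge measure is not spelled out here.

```python
import math

def kl_gaussian(m0, s0, m1, s1):
    """Closed-form KL divergence KL(N0 || N1) between two univariate
    Gaussians N(m, s^2), a natural dissimilarity between cluster densities."""
    return math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2 * s1 ** 2) - 0.5

def most_similar_pair(clusters):
    """Pick the pair with the smallest symmetrized KL divergence --
    the two clusters an agglomerative step would merge next."""
    best, best_d = None, float("inf")
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            (m0, s0), (m1, s1) = clusters[i], clusters[j]
            d = kl_gaussian(m0, s0, m1, s1) + kl_gaussian(m1, s1, m0, s0)
            if d < best_d:
                best, best_d = (i, j), d
    return best

# Three Gaussian clusters (mean, std): the first two are close, the third far.
clusters = [(0.0, 1.0), (0.5, 1.0), (10.0, 1.0)]
```

An agglomerative level would merge the pair returned by `most_similar_pair` and repeat until one cluster remains.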
Limitations

The drawback is that the KL divergence only admits an analytical expression at the first level of the hierarchy, while distances for the subsequent levels have to be approximated.

Automatically Labeling Hierarchical Clusters

A simple algorithm has been used that automatically assigns labels to hierarchical clusters. The algorithm evaluates candidate labels using information from the cluster, the parent cluster and corpus statistics. A trainable threshold enables the algorithm to assign just a few high-quality labels to each cluster. First, it is assumed that the algorithm has access to a general collection of documents E, representing the word distribution in general English. This English corpus is used in selecting label candidates [4].

Procedure

Given a cluster S and its parent cluster P, which includes all of the documents in S and in the sibling clusters of S, the algorithm selects labels for the cluster with the following steps:
1. Collect phrase statistics: for every unigram, bigram and trigram phrase p occurring in the cluster S, calculate the document frequency and term frequency statistics for the cluster, the parent cluster and the general English corpus.
2. Select label candidates: select the label candidates from the unigram, bigram and trigram phrases based on document frequency in the cluster and in general English.
3. Calculate the descriptive score: calculate the descriptive score for each label candidate, and then sort the label candidates by these scores.
4. Calculate the cutoff point: decide how many label candidates to display based on the descriptive scores.

Limitation

Most errors come from clusters containing small numbers of documents.
The small number of observations in small clusters can make good and bad labels indistinguishable; minor variations in vocabulary can also produce statistical features with high variance.

Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs

This work presents data structures for dynamic closest pair problems with arbitrary distance functions, based on a technique used for Euclidean closest pairs. It shows how to insert or delete an object from an n-object set while maintaining the closest pair, in O(n log² n) time per update and O(n) space. The purpose of the paper is to show that much better bounds are possible, using data structures that are simple. When linear space is required, this represents an order-of-magnitude improvement.

Procedure

The data structure consists of a partition of the dynamic set S into k ≤ log n subsets S1, S2, …, Sk, together with a digraph Gi for each subset Si. Initially all points are in S1 and G1 has n − 1 edges. Gi may contain edges with neither endpoint in Si; if the number of edges in all graphs grows to 2n, the data structure is rebuilt by moving all points to S1 and recomputing G1. The closest pair is represented by an edge in some Gi, so the pair can be found by scanning the edges in all graphs [5].

Create Gi for a new partition Si: initially, Gi consists of a single path. Choose the first vertex of the path to be any object in Si. Then extend the path one edge at a time: when the last vertex in the path P is in Si, choose the next vertex to be its nearest neighbor in S \ P, and when the last vertex is in S \ Si, choose the next vertex to be its nearest neighbor in Si \ P. Continue until the path can no longer be extended because S \ P or Si \ P is empty.
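The path construction and edge-scan above can be sketched as follows. This is a deliberately simplified, single-level illustration (no insertions, deletions, merges or rebuilds), with objects represented as 2-D tuples and Euclidean distance standing in for the arbitrary distance function.

```python
import math

def build_path(Si, S):
    """Build the path graph G_i for subset S_i: start at some object of
    S_i, then repeatedly extend the path to the nearest neighbor,
    alternating between S \\ P and S_i \\ P as described above."""
    start = min(Si)                 # any object of S_i; min() for determinism
    path, in_path = [start], {start}
    while True:
        last = path[-1]
        pool = (S - in_path) if last in Si else (Si - in_path)
        if not pool:                # S \ P or S_i \ P is empty: stop
            break
        nxt = min(pool, key=lambda y: math.dist(last, y))
        path.append(nxt)
        in_path.add(nxt)
    return list(zip(path, path[1:]))   # the edges of the path

def closest_pair(graphs):
    """The closest pair appears as an edge of some G_i, so it is found
    by scanning the edges of all graphs."""
    return min((e for g in graphs for e in g),
               key=lambda e: math.dist(*e), default=None)

S = {(0.0, 0.0), (5.0, 5.0), (5.5, 5.0), (9.0, 1.0), (2.0, 8.0)}
edges = build_path(S, S)            # single-subset case: S_1 = S
pair = closest_pair([edges])
```

In the single-subset case the construction degenerates to a greedy nearest-neighbor path, which always contains the closest pair as an edge: whichever member of the pair enters the path first, the other is its nearest remaining neighbor and is appended immediately after.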
Merge partitions: the update operations can cause k to become too large relative to n. If so, choose subsets Si and Sj as close to equal in size as possible, with |Si| ≤ |Sj| and |Sj| / |Si| minimized. Merge these two subsets into one and create the graph Gi for the merged subset.

To insert x, create a new subset Sk+1 = {x} in the partition of S, create Gk+1, and merge partitions as necessary until k ≤ log n. To delete x, create a new subset Sk+1 consisting of all objects y such that (y, x) is a directed edge in some Gi. Remove x and all its adjacent edges from all the graphs Gi. Create the graph Gk+1 for Sk+1, and merge partitions as necessary until k ≤ log n.

Theorem: the data structure above maintains the closest pair in S in O(n) space, amortized time O(n log n) per insertion, and amortized time O(n log² n) per deletion.

Limitations

The methods that were tested involve sequential scans through memory, a behavior known to reduce the effectiveness of cached memory.

Motivation

Using hierarchical clustering, better clusters can be formed: the resulting clusters are more clearly separated, with tight bonding between the members of each cluster. That is, the clusters formed are refined using the various algorithms of hierarchical clustering.

III. PROBLEM STATEMENT

The objective of the proposed work is to perform hierarchical clustering to obtain more refined clusters with a strong relationship between members of the same cluster.

IV. PROPOSED APPROACH

In this paper, we have used the K-means algorithm and the CFNG to find better and improved clusters.

K-means Algorithm

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects into k clusters, C1, …, Ck, that is, Ci ⊂ D and Ci ∩ Cj = Ø for 1 ≤ i, j ≤ k, i ≠ j.
An objective function is used to assess the partitioning quality, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster similarity and low intercluster similarity [6]. A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects assigned to the cluster. The difference between an object p and Ci, the representative of the cluster, is measured by dist(p, Ci), where dist(x, y) is the Euclidean distance between two points x and y.

CFNG

The colored farthest-neighbor graph (CFNG) shares many characteristics with the shared farthest neighbors (SFN) approach of Rovetta and Masulli [7]. The CFNG algorithm yields binary partitions of objects into subsets, whereas the number of subsets obtained by SFN can vary. The SFN algorithm can easily split a cluster where no natural partition exists, while the CFNG often avoids such splits.
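The centroid-based partitioning described above can be sketched with a minimal K-means. The data and parameters here are illustrative only, not this paper's experimental setup (which used Java and MySQL).

```python
import math
import random

def kmeans(points, k, iters=100, seed=1):
    """Minimal K-means: assign each object to its nearest centroid under
    Euclidean dist(p, Ci), then recompute each centroid as the cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        new = [tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:        # converged: assignments are stable
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.3, 7.9), (7.8, 8.2)]
centroids, clusters = kmeans(points, k=2)
```

A divisive hierarchical scheme like the one proposed here would then apply a further binary split (e.g. via the CFNG) to each resulting cluster.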
V. RESULTS

To observe the effect of hierarchical clustering, the K-means and CFNG algorithms are used, and to observe the results, the experimental setup was designed using Java and MySQL. The obtained results are compared with K-means and CFNG when executed individually.

Figure 1: Comparison of Proposed Algorithm with K-means and CFNG

VI. CONCLUSION AND FUTURE SCOPE

We have obtained better and improved clusters by using the K-means and CFNG algorithms hierarchically. The final clusters obtained are tightly bonded. In this paper, we have used 2 different algorithms for hierarchical clustering. Instead of using CFNG, we could have used another hierarchical clustering algorithm.

REFERENCES

[1] Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", 1999.
[2] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques".
[3] Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", 1999.
[4] Pucktada Treeratpituk, Jamie Callan, "Automatically Labeling Hierarchical Clusters".
[5] David Eppstein, "Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs".
[6] G. Plaxton, "Approximation Algorithms for Hierarchical Location Problems", Proceedings of the 35th ACM Symposium on Theory of Computing, 2003.
[7] A. Borodin, R. Ostrovsky and Y. Rabani, "Subquadratic Approximation Algorithms for Clustering Problems in High Dimensional Spaces", Proceedings of the 31st ACM Symposium on Theory of Computing, 1999.
[8] Rinal H. Doshi, Dr. Harshad B. Bhadka and Richa Mehta, "Development of Pattern Knowledge Discovery Framework using Clustering Data Mining Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp.
101 - 112, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[9] Deepika Khurana and Dr. M.P.S. Bhatia, "Dynamic Approach to K-Means Clustering Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 204 - 219, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[10] Meghana N. Ingole, M.S. Bewoor, S.H. Patil, "Context Sensitive Text Summarization using Hierarchical Clustering Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322 - 329, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.