
its nearest center. These steps are repeated until some
convergence condition is met. See Faber [18] for descriptions
of other variants of this algorithm. For points in general
position (in particular, if no data point is equidistant from two
centers), the algorithm will eventually converge to a point
that is a local minimum for the distortion. However, the result
is not necessarily a global minimum. See [8], [40], [47], [49] for
further discussion of its statistical and convergence proper-
ties. Lloyd's algorithm assumes that the data are memory
resident. Bradley et al. [10] have shown how to scale k-means
clustering to very large data sets through sampling and
pruning. Note that Lloyd's algorithm does not specify the
initial placement of centers. See Bradley and Fayyad [9], for
example, for further discussion of this issue.
Because of its simplicity and flexibility, Lloyd's algorithm
is very popular in statistical analysis. In particular, given any
other clustering algorithm, Lloyd's algorithm can be applied
as a postprocessing stage to improve the final distortion. As
we shall see in our experiments, this can result in significant
improvements. However, a straightforward implementation
of Lloyd's algorithm can be quite slow. This is principally due
to the cost of computing nearest neighbors.
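To make this cost concrete, the following Python sketch (ours, not the paper's code; the function name lloyd_stage is illustrative) shows one stage of such a straightforward implementation. The nearest-center computation in the first step accounts for essentially all of the work, requiring on the order of nk distance evaluations per stage for n points and k centers.

```python
# A minimal sketch of one stage of a straightforward Lloyd's implementation.
import numpy as np

def lloyd_stage(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """points: (n, d) array; centers: (k, d) array. Returns the updated centers."""
    # Distance from every data point to every center (the expensive part).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)            # index of each point's nearest center
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = points[nearest == j]
        if len(members) > 0:                  # leave centers with no members fixed
            new_centers[j] = members.mean(axis=0)   # move center to the centroid
    return new_centers
```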
In this paper, we present a simple and efficient
implementation of Lloyd's algorithm, which we call the
filtering algorithm. This algorithm begins by storing the data
points in a kd-tree [7]. Recall that, in each stage of Lloyd's
algorithm, the nearest center to each data point is computed
and each center is moved to the centroid of the associated
neighbors. The idea is to maintain, for each node of the tree,
a subset of candidate centers. The candidates for each node
are pruned, or "filtered," as they are propagated to the
node's children. Since the kd-tree is computed for the data
points rather than for the centers, there is no need to update
this structure with each stage of Lloyd's algorithm. Also,
since there are typically many more data points than
centers, there are greater economies of scale to be realized.
Note that this is not a new clustering method, but simply an
efficient implementation of Lloyd's k-means algorithm.
The idea of storing the data points in a kd-tree for clustering
was considered by Moore [42] in the context of estimating the
parameters of a mixture of Gaussian clusters. He gave an
efficient implementation of the well-known EM algorithm.
The application of this idea to k-means was discovered
independently by Alsabti et al. [2], Pelleg and Moore [45], [46]
(who called their version the blacklisting algorithm), and
Kanungo et al. [31]. The purpose of this paper is to present a
more detailed analysis of this algorithm. In particular, we
present a theorem that quantifies the algorithm's efficiency
when the data are naturally clustered and we present a
detailed series of experiments designed to advance the
understanding of the algorithm's performance.
In Section 3, we present a data-sensitive analysis which
shows that, as the separation between clusters increases, the
algorithm runs more efficiently. We have also performed a
number of empirical studies, both on synthetically generated
data and on real data used in applications ranging from color
quantization to data compression to image segmentation.
These studies, as well as a comparison we ran against the
popular clustering scheme BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies) [50], are reported in
Section 4. Our experiments show that the filtering algorithm is
quite efficient even when the clusters are not well-separated.
2 THE FILTERING ALGORITHM
In this section, we describe the filtering algorithm. As
mentioned earlier, the algorithm is based on storing the
multidimensional data points in a kd-tree [7]. For complete-
ness, we summarize the basic elements of this data structure.
Define a box to be an axis-aligned hyper-rectangle. The
bounding box of a point set is the smallest box containing all the
points. A kd-tree is a binary tree, which represents a
hierarchical subdivision of the point set's bounding box
using axis-aligned splitting hyperplanes. Each node of the
kd-tree is associated with a closed box, called a cell. The root's
cell is the bounding box of the point set. If the cell contains at
most one point (or, more generally, fewer than some small
constant), then it is declared to be a leaf. Otherwise, it is split
into two hyper-rectangles by an axis-orthogonal hyperplane.
The points of the cell are then partitioned to one side or the
other of this hyperplane. (Points lying on the hyperplane can
be placed on either side.) The resulting subcells are the
children of the original cell, thus leading to a binary tree
structure. There are a number of ways to select the splitting
hyperplane. One simple way is to split orthogonally to the
longest side of the cell through the median coordinate of the
associated points [7]. Given n points, this produces a tree with
O(n) nodes and O(log n) depth.
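The following sketch (illustrative Python, with our own names KDNode, build_kdtree, and leaf_size, not the authors' implementation) follows this construction: the root cell is the bounding box of the data, and each larger cell is split orthogonally to its longest side at the median coordinate of its points.

```python
import numpy as np

class KDNode:
    """A kd-tree node; its cell is the axis-aligned box with corners lo and hi."""
    def __init__(self, points, lo, hi):
        self.points = points               # kept at every node here for simplicity
        self.lo, self.hi = lo, hi
        self.left = self.right = None

def build_kdtree(points, lo=None, hi=None, leaf_size=1):
    if lo is None:                         # the root cell is the bounding box
        lo, hi = points.min(axis=0), points.max(axis=0)
    node = KDNode(points, lo, hi)
    if len(points) <= leaf_size:           # small cells become leaves
        return node
    dim = int(np.argmax(hi - lo))          # split orthogonal to the longest side
    order = np.argsort(points[:, dim])
    mid = len(points) // 2
    median = points[order[mid], dim]       # median coordinate along that side
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[dim] = right_lo[dim] = median  # the splitting hyperplane
    node.left = build_kdtree(points[order[:mid]], lo.copy(), left_hi, leaf_size)
    node.right = build_kdtree(points[order[mid:]], right_lo, hi.copy(), leaf_size)
    return node
```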
We begin by computing a kd-tree for the given data
points. For each internal node u in the tree, we compute the
number of associated data points, u.count, and the weighted
centroid, u.wgtCent, which is defined to be the vector sum of
all the associated points. The actual centroid is just
u.wgtCent/u.count. It is easy to modify the kd-tree con-
struction to compute this additional information in the
same space and time bounds given above. The initial
centers can be chosen by any method desired. (Lloyd's
algorithm does not specify how they are to be selected. A
common method is to sample the centers at random from
the data points.) Recall that, for each stage of Lloyd's
algorithm, for each of the k centers, we need to compute the
centroid of the set of data points for which this center is
closest. We then move this center to the computed centroid
and proceed to the next stage.
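A minimal sketch of the augmentation just described, reusing the KDNode sketch above (the field names count and wgtCent follow the text; computing them bottom-up after construction, as here, or during construction gives the same result):

```python
def augment(node):
    """Attach node.count and node.wgtCent (the vector sum of the node's points)."""
    if node.left is None:                  # leaf: sum its own points directly
        node.count = len(node.points)
        node.wgtCent = node.points.sum(axis=0)
    else:                                  # internal node: combine the children
        augment(node.left)
        augment(node.right)
        node.count = node.left.count + node.right.count
        node.wgtCent = node.left.wgtCent + node.right.wgtCent
    return node
```

As in the text, the centroid of the points under node u is then u.wgtCent / u.count.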
For each node of the kd-tree, we maintain a set of candidate
centers. This is defined to be a subset of center points that
might serve as the nearest neighbor for some point lying
within the associated cell. The candidate centers for the root
consist of all k centers. We then propagate candidates down
the tree as follows: For each node u, let C denote its cell and let
Z denote its candidate set. First, compute the candidate z* ∈ Z
that is closest to the midpoint of C. Then, for each of the
remaining candidates z ∈ Z \ {z*}, if no part of C is closer to z
than it is to z*, we can infer that z is not the nearest center to
any data point associated with u and, hence, we can prune, or
"filter," z from the list of candidates. If u is associated with a
single candidate (which must be z*), then z* is the nearest
neighbor of all its data points. We can assign them to z* by
adding the associated weighted centroid and counts to z*.
Otherwise, if u is an internal node, we recurse on its children.
If u is a leaf node, we compute the distances from its
associated data point to all the candidates in Z and assign the
data point to its nearest center. (See Fig. 1.)
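The following sketch (again ours, building on the KDNode and augment sketches above; the names is_farther, filter_candidates, sums, and counts are illustrative) organizes one such filtering pass. The pruning helper uses the standard box test of comparing distances at the cell vertex extremal in the direction z - z*, which implements the check described next with reference to Fig. 2.

```python
import numpy as np

def is_farther(z, z_star, lo, hi):
    """True if no part of the cell [lo, hi] is closer to z than to z_star."""
    # The cell vertex extremal in the direction z - z_star is the part of the
    # cell most favorable to z; if even it is closer to z_star, z can be pruned.
    v = np.where(z >= z_star, hi, lo)
    return np.dot(v - z, v - z) >= np.dot(v - z_star, v - z_star)

def filter_candidates(node, candidates, centers, sums, counts):
    if node.left is None:                  # leaf: assign its point(s) by brute force
        for p in node.points:
            j = min(candidates,
                    key=lambda c: float(np.dot(centers[c] - p, centers[c] - p)))
            sums[j] += p
            counts[j] += 1
        return
    mid = (node.lo + node.hi) / 2
    # z*: the candidate closest to the midpoint of the cell.
    z_star = min(candidates,
                 key=lambda c: float(np.dot(centers[c] - mid, centers[c] - mid)))
    # Prune ("filter") every candidate that cannot be nearest anywhere in the cell.
    survivors = [c for c in candidates
                 if c == z_star or not is_farther(centers[c], centers[z_star],
                                                  node.lo, node.hi)]
    if len(survivors) == 1:                # only z* survives: assign the whole subtree
        sums[z_star] += node.wgtCent
        counts[z_star] += node.count
    else:
        filter_candidates(node.left, survivors, centers, sums, counts)
        filter_candidates(node.right, survivors, centers, sums, counts)
```

One stage of Lloyd's algorithm would then call filter_candidates on the root with all k candidate indices and zeroed sums and counts, and move each center j with counts[j] > 0 to sums[j] / counts[j].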
It remains to describe how to determine whether there is
any part of cell C that is closer to candidate z than to z*. Let
H be the hyperplane bisecting the line segment zz*. (See
Fig. 2.) H defines two halfspaces; one that is closer to z and