
its nearest center. These steps are repeated until some
convergence condition is met. See Faber [18] for descriptions
of other variants of this algorithm. For points in general
position (in particular, if no data point is equidistant from two
centers), the algorithm will eventually converge to a point
that is a local minimum for the distortion. However, the result
is not necessarily a global minimum. See [8], [40], [47], [49] for
further discussion of its statistical and convergence proper-
ties. Lloyd's algorithm assumes that the data are memory
resident. Bradley et al. [10] have shown how to scale k-means
clustering to very large data sets through sampling and
pruning. Note that Lloyd's algorithm does not specify the
initial placement of centers. See Bradley and Fayyad [9], for
example, for further discussion of this issue.
Because of its simplicity and flexibility, Lloyd's algorithm
is very popular in statistical analysis. In particular, given any
other clustering algorithm, Lloyd's algorithm can be applied
as a postprocessing stage to improve the final distortion. As
we shall see in our experiments, this can result in significant
improvements. However, a straightforward implementation
of Lloyd's algorithm can be quite slow. This is principally due
to the cost of computing nearest neighbors.
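To make this cost concrete, the following Python sketch (ours, not the paper's code; the function name lloyd_stage is illustrative) shows one stage of such a straightforward implementation. The nearest-center computation in the first step accounts for essentially all of the work, requiring on the order of nk distance evaluations per stage for n points and k centers.

```python
# A minimal sketch of one stage of a straightforward Lloyd's implementation.
import numpy as np

def lloyd_stage(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """points: (n, d) array; centers: (k, d) array. Returns the updated centers."""
    # Distance from every data point to every center (the expensive part).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)            # index of each point's nearest center
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = points[nearest == j]
        if len(members) > 0:                  # leave centers with no members fixed
            new_centers[j] = members.mean(axis=0)   # move center to the centroid
    return new_centers
```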
In this paper, we present a simple and efficient
implementation of Lloyd's algorithm, which we call the
filtering algorithm. This algorithm begins by storing the data
points in a kd-tree [7]. Recall that, in each stage of Lloyd's
algorithm, the nearest center to each data point is computed
and each center is moved to the centroid of the associated
neighbors. The idea is to maintain, for each node of the tree,
a subset of candidate centers. The candidates for each node
are pruned, or "filtered," as they are propagated to the
node's children. Since the kd-tree is computed for the data
points rather than for the centers, there is no need to update
this structure with each stage of Lloyd's algorithm. Also,
since there are typically many more data points than
centers, there are greater economies of scale to be realized.
Note that this is not a new clustering method, but simply an
efficient implementation of Lloyd's k-means algorithm.
The idea of storing the data points in a kd-tree for clustering
was considered by Moore [42] in the context of estimating the
parameters of a mixture of Gaussian clusters. He gave an
efficient implementation of the well-known EM algorithm.
The application of this idea to k-means was discovered
independently by Alsabti et al. [2], Pelleg and Moore [45], [46]
(who called their version the blacklisting algorithm), and
Kanungo et al. [31]. The purpose of this paper is to present a
more detailed analysis of this algorithm. In particular, we
present a theorem that quantifies the algorithm's efficiency
when the data are naturally clustered and we present a
detailed series of experiments designed to advance the
understanding of the algorithm's performance.
In Section 3, we present a data-sensitive analysis which
shows that, as the separation between clusters increases, the
algorithm runs more efficiently. We have also performed a
number of empirical studies, both on synthetically generated
data and on real data used in applications ranging from color
quantization to data compression to image segmentation.
These studies, as well as a comparison we ran against the
popular clustering scheme BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies) [50], are reported in
Section 4. Our experiments show that the filtering algorithm is
quite efficient even when the clusters are not well-separated.
2 THE FILTERING ALGORITHM
In this section, we describe the filtering algorithm. As
mentioned earlier, the algorithm is based on storing the
multidimensional data points in a kd-tree [7]. For complete-
ness, we summarize the basic elements of this data structure.
Define a box to be an axis-aligned hyper-rectangle. The
bounding box of a point set is the smallest box containing all the
points. A kd-tree is a binary tree, which represents a
hierarchical subdivision of the point set's bounding box
using axis-aligned splitting hyperplanes. Each node of the
kd-tree is associated with a closed box, called a cell. The root's
cell is the bounding box of the point set. If the cell contains at
most one point (or, more generally, fewer than some small
constant), then it is declared to be a leaf. Otherwise, it is split
into two hyper-rectangles by an axis-orthogonal hyperplane.
The points of the cell are then partitioned to one side or the
other of this hyperplane. (Points lying on the hyperplane can
be placed on either side.) The resulting subcells are the
children of the original cell, thus leading to a binary tree
structure. There are a number of ways to select the splitting
hyperplane. One simple way is to split orthogonally to the
longest side of the cell through the median coordinate of the
associated points [7]. Given n points, this produces a tree with
O(n) nodes and O(log n) depth.
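The following sketch (illustrative Python, with our own names KDNode, build_kdtree, and leaf_size, not the authors' implementation) follows this construction: the root cell is the bounding box of the data, and each larger cell is split orthogonally to its longest side at the median coordinate of its points.

```python
import numpy as np

class KDNode:
    """A kd-tree node; its cell is the axis-aligned box with corners lo and hi."""
    def __init__(self, points, lo, hi):
        self.points = points               # kept at every node here for simplicity
        self.lo, self.hi = lo, hi
        self.left = self.right = None

def build_kdtree(points, lo=None, hi=None, leaf_size=1):
    if lo is None:                         # the root cell is the bounding box
        lo, hi = points.min(axis=0), points.max(axis=0)
    node = KDNode(points, lo, hi)
    if len(points) <= leaf_size:           # small cells become leaves
        return node
    dim = int(np.argmax(hi - lo))          # split orthogonal to the longest side
    order = np.argsort(points[:, dim])
    mid = len(points) // 2
    median = points[order[mid], dim]       # median coordinate along that side
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[dim] = right_lo[dim] = median  # the splitting hyperplane
    node.left = build_kdtree(points[order[:mid]], lo.copy(), left_hi, leaf_size)
    node.right = build_kdtree(points[order[mid:]], right_lo, hi.copy(), leaf_size)
    return node
```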
We begin by computing a kd-tree for the given data
points. For each internal node u in the tree, we compute the
number of associated data points, u.count, and the weighted
centroid, u.wgtCent, which is defined to be the vector sum of
all the associated points. The actual centroid is just
u.wgtCent/u.count. It is easy to modify the kd-tree con-
struction to compute this additional information in the
same space and time bounds given above. The initial
centers can be chosen by any method desired. (Lloyd's
algorithm does not specify how they are to be selected. A
common method is to sample the centers at random from
the data points.) Recall that, for each stage of Lloyd's
algorithm, for each of the k centers, we need to compute the
centroid of the set of data points for which this center is
closest. We then move this center to the computed centroid
and proceed to the next stage.
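A minimal sketch of the augmentation just described, reusing the KDNode sketch above (the field names count and wgtCent follow the text; computing them bottom-up after construction, as here, or during construction gives the same result):

```python
def augment(node):
    """Attach node.count and node.wgtCent (the vector sum of the node's points)."""
    if node.left is None:                  # leaf: sum its own points directly
        node.count = len(node.points)
        node.wgtCent = node.points.sum(axis=0)
    else:                                  # internal node: combine the children
        augment(node.left)
        augment(node.right)
        node.count = node.left.count + node.right.count
        node.wgtCent = node.left.wgtCent + node.right.wgtCent
    return node
```

As in the text, the centroid of the points under node u is then u.wgtCent / u.count.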
For each node of the kd-tree, we maintain a set of candidate
centers. This is defined to be a subset of center points that
might serve as the nearest neighbor for some point lying
within the associated cell. The candidate centers for the root
consist of all k centers. We then propagate candidates down
the tree as follows: For each node u, let C denote its cell and let
Z denote its candidate set. First, compute the candidate z* ∈ Z
that is closest to the midpoint of C. Then, for each of the
remaining candidates z ∈ Z \ {z*}, if no part of C is closer to z
than it is to z*, we can infer that z is not the nearest center to
any data point associated with u and, hence, we can prune, or
"filter," z from the list of candidates. If u is associated with a
single candidate (which must be z*), then z* is the nearest
neighbor of all its data points. We can assign them to z* by
adding the associated weighted centroid and counts to z*.
Otherwise, if u is an internal node, we recurse on its children.
If u is a leaf node, we compute the distances from its
associated data point to all the candidates in Z and assign the
data point to its nearest center. (See Fig. 1.)
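The following sketch (again ours, building on the KDNode and augment sketches above; the names is_farther, filter_candidates, sums, and counts are illustrative) organizes one such filtering pass. The pruning helper uses the standard box test of comparing distances at the cell vertex extremal in the direction z - z*, which implements the check described next with reference to Fig. 2.

```python
import numpy as np

def is_farther(z, z_star, lo, hi):
    """True if no part of the cell [lo, hi] is closer to z than to z_star."""
    # The cell vertex extremal in the direction z - z_star is the part of the
    # cell most favorable to z; if even it is closer to z_star, z can be pruned.
    v = np.where(z >= z_star, hi, lo)
    return np.dot(v - z, v - z) >= np.dot(v - z_star, v - z_star)

def filter_candidates(node, candidates, centers, sums, counts):
    if node.left is None:                  # leaf: assign its point(s) by brute force
        for p in node.points:
            j = min(candidates,
                    key=lambda c: float(np.dot(centers[c] - p, centers[c] - p)))
            sums[j] += p
            counts[j] += 1
        return
    mid = (node.lo + node.hi) / 2
    # z*: the candidate closest to the midpoint of the cell.
    z_star = min(candidates,
                 key=lambda c: float(np.dot(centers[c] - mid, centers[c] - mid)))
    # Prune ("filter") every candidate that cannot be nearest anywhere in the cell.
    survivors = [c for c in candidates
                 if c == z_star or not is_farther(centers[c], centers[z_star],
                                                  node.lo, node.hi)]
    if len(survivors) == 1:                # only z* survives: assign the whole subtree
        sums[z_star] += node.wgtCent
        counts[z_star] += node.count
    else:
        filter_candidates(node.left, survivors, centers, sums, counts)
        filter_candidates(node.right, survivors, centers, sums, counts)
```

One stage of Lloyd's algorithm would then call filter_candidates on the root with all k candidate indices and zeroed sums and counts, and move each center j with counts[j] > 0 to sums[j] / counts[j].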
It remains to describe how to determine whether there is
any part of cell C that is closer to candidate z than to z*. Let
H be the hyperplane bisecting the line segment zz*. (See
Fig. 2.) H defines two halfspaces; one that is closer to z and