K-Means Clustering

Brian T. Luke, Ph.D.

   

K-Means Clustering and Fuzzy Clustering are different than Hierarchical Clustering and Diversity Selection in that the number of clusters, K, needs to be determined at the onset. The goal is to divide the objects into K clusters such that some metric relative to the centroids of the clusters is minimized. If the clustering is over binary objects, medoids need to be used instead of centroids. A medoid is is just a bit string that minimizes the sum of distances to all objects in the cluster [1] (see Clustering Binary Objects for further details).

Various metrics to the centroids that can be minimized include

  • maximum distance to its centroid for any object.
  • sum of the average distance to the centroids over all clusters.
  • sum of the variance over all clusters.
  • total distance between all objects and their centroids.

The metric to minimize and the choice of a distance measure will determine the shape of the optimium clusters.

Two procedures are available to search for the optimum set of clusters. The first assigns each object to a cluster and the second sets initial positions for the cluster centroids.

In the first procedure, the objects are randomly assigned to one of the K clusters. Once this is done, the position of the K centroids are determined, as is the value of the metric to minimize. A global optimization method is then used to reassign some of the objects to different clusters. New centroids are determined, as is the metric to minimize. This procedure is continued until (hopefully) the optimum assignment of objects to clusters is found.

In the second procedure for K-means clustering, placement of the K centroids can be done by the following procedure.

  1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

A global optimization method is then used to move the position of one or more of the centroids. The above procedure is repeated and new metric is determined. The object is to move the centroids into a position such that an optimum separation of objects into groups occurrs.

This latter method of K-means clustering is the one that has to be used if you are clustering binary objects.

Reference:
V. Guralnik and G. Karypis, Workshop on Data Mining in Bioinformatics (2001) 73-80.

Example of K-Means Clustering