Jarvis-Patrick Clustering

Brian T. Luke, Ph.D.

   

Jarvis-Patrick Clustering [1] uses a nearest neighbor approach to clustering objects. It is the method used by Daylight Systems to cluster a chemical database since studies by Dr. Peter Willett [2] found it to be the best method (see Clustering Binary Objects for further information).

To run this type of clustering, you need a measure of the distance between two objects and two integers, J and K. J is the size of the neighbor list, and K is the number of common neighbors. This method is as follows.

  • Determine the J nearest neighbors for each object in the set to be clustered.
  • Two objects are placed in the same cluster if
    1. they are contained in each other's neighbor list.
    2. they have at least K nearest neighbors in common.

Several points need to be made about this procedure. The first is that it is not hierarchical. Only a single pass through all pairs of objects needs to be made. This produces clusters that are similar to those produced by a Simple Linkage Agglomerative Clustering, only it is much faster. In other words, if objects L and M pass the criteria listed above, as do objects L and N, all three will be placed in the same cluster whether or not objects M and N pass the criteria.

The number of clusters produced is dependent upon the input values of J and K. Therefore, if a large number of objects needs to be clustered, several values may have to be tried before a reasonable number of clusters are formed and the number of singletons (or entropy clusters) is reasonably small.

Daylight Systems suggests several methods to reduce the number of singletons. They also found that removing the first criterion "usually results in improved clustering" with a significant increase in the CPU-cost.

This method has the advantage of identifying tight clusters within normal clusters by using two K-values, K and K' (K < K'). Two objects lie in the same cluster if the number of common neighbors is at least K, while they are in a tight cluster if this is at least K'.

References:
[1] R.A. Jarvis, E.A. Patrick "Clustering Using a Similarity Measure Based on Shared Near Neighbors", IEEE Transactions on Computers, C22, 1025-1034 (1973).
[2] "Similarity and Clustering in Chemical Information Systems," P. Willett, Research Studies Press, Wiley, New York, 1987.

Example of Jarvis-Patrick Clustering