Clustering Binary Objects

Brian T. Luke, Ph.D.


Binary objects are simply objects that are described by a finite-length bit string. They are often used in chemical databases because they can be very useful in searching for a particular compound: if the search string has a particular bit turned ON and a database compound does not, that structure can be passed over. Two common representations are Molecular Fingerprints and Structural Keys.

The bit strings that characterize two objects can also be used to calculate a "distance." This effective distance can then be used with a clustering algorithm to place the objects into groups. One of the most popular methods for calculating the distance between two objects is to first determine their similarity. If a general similarity metric is denoted S, it usually has a value between 0.0 (the bit strings have no ON bits in common) and 1.0 (the two bit strings are identical). A distance can then be given by A(1-S)^M, where M is a non-negative exponent and A is a constant. Alternatively, each bit string can be thought of as the components of an L-dimensional vector and the similarity as the cosine of the angle between them (which is actually the case for the Ochiai similarity). A distance can therefore be A·arccos(S), i.e. just a constant times the value of this angle.
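
To make the two conversions concrete, here is a minimal Python sketch; the function names and the default A = 1, M = 1 are illustrative choices of mine, not from any standard library.

    import math

    def similarity_to_distance(s, a=1.0, m=1.0):
        """Power-law conversion: D = A*(1 - S)^M for a similarity S in [0, 1]."""
        return a * (1.0 - s) ** m

    def angular_distance(s, a=1.0):
        """Angular conversion: D = A*arccos(S), treating S as a cosine."""
        return a * math.acos(s)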

If the bit strings have a length of L, it is possible to go down the strings and count the number of positions where a bit is ON in both strings, ON in one and OFF in the other, or OFF in both. These four sums are presented in the table below.

                Object j
                 0     1
  Object i  0   B00   B01
            1   B10   B11

The first subscript refers to the value of the bit for object i and the second for object j, summed over all L bits. Therefore, B10 is the number of times a bit is ON in i and OFF in j.

Some other symbols that will be used in this section are

  • Bi ( = B10 + B11) is the total number of ON bits in object i.
  • Bj ( = B01 + B11) is the total number of ON bits in object j.
  • BC ( = B11) is the total number of times a bit is ON in both bit strings.
  • BI ( = B00 + B11) is the total number of times the two bit strings agree.
  • L ( = B00 + B01 + B10 + B11) is the length of the bit string.
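
As a concrete illustration, the four counts and the derived sums can be computed as in the sketch below (an assumption of mine: bit strings are stored as equal-length Python lists of 0/1 integers, and the name bit_counts is not from any standard library).

    def bit_counts(x, y):
        """Tally B00, B01, B10, B11 for two equal-length 0/1 sequences; the
        first subscript is the bit of x (object i), the second of y (object j)."""
        if len(x) != len(y):
            raise ValueError("bit strings must have the same length")
        b00 = b01 = b10 = b11 = 0
        for xk, yk in zip(x, y):
            if xk and yk:
                b11 += 1
            elif xk:
                b10 += 1
            elif yk:
                b01 += 1
            else:
                b00 += 1
        return b00, b01, b10, b11

    b00, b01, b10, b11 = bit_counts([1, 0, 1, 1], [1, 1, 0, 1])
    bi = b10 + b11               # Bi: ON bits in object i
    bj = b01 + b11               # Bj: ON bits in object j
    bc = b11                     # BC: bits ON in both strings
    bI = b00 + b11               # BI: positions where the strings agree
    L  = b00 + b01 + b10 + b11   # length of the bit string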

With these definitions, commonly used similarity metrics are presented in the following table.

Label                                          Equation
Simple Matching (Sokal & Michener)             SM = BI/L
Russell & Rao                                  RR = BC/L
Tanimoto Coefficient                           TC = BC/(Bi + Bj - BC)
Dice, Czekanowski, or Sorensen                 DCS = 2BC/(Bi + Bj)
  (also Nei & Li's genetic distance)
Rogers & Tanimoto                              RT = BI/(2L - BI)
Jaccard's Coefficient of Community             JCC = BC/(L - B00)
Kulczynski                                     KS = (1/2)(BC/Bi + BC/Bj)
Ochiai (cosine similarity function [1])        OC = BC/sqrt(BiBj)
R Package, v4.0                                S03 = 2BI/(L + BI)
                                               S09 = 3BC/(3BC + B10 + B01)
                                               S10 = BC/(BC + 2B10 + 2B01)
Jaccard's Similarity                           JSi-j = BC/Bi
                                               JSj-i = BC/Bj
Tversky Index                                  TIab = BC/(aB10 + bB01 + BC)

The last two metrics are different from the rest. Jaccard's Similarity is different in that it is not symmetric; the similarity of i to j differs from the similarity of j to i unless the two bit strings have exactly the same number of ON bits (Bi = Bj). This metric is useful if you want to measure a "substructure similarity." If i is a substructure of j, then every bit that is ON in i will also be ON in j, though j will have other bits ON as well. The other metrics will yield a similarity less than 1.0, while JSi-j produces a similarity of exactly 1.0. Similarly, if j has fewer bits ON than i, JSj-i can be used to measure a substructure similarity.

The Tversky Index is unique in that it has two adjustable parameters, a and b. If a = 1 and b = 0, TI10 = JSi-j. If a + b = 1, the expression can be rewritten (using B10 = Bi - BC and B01 = Bj - BC) as

TIab = BC/(aBi + bBj)

and if a and b both equal one-half, this is exactly the Dice/Sorensen similarity, DCS. Note that if this rewritten form is used with a + b ≠ 1, identical bit strings yield a similarity of 1/(a + b) rather than 1.0, so the index must be rescaled.
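
A few of the tabulated similarities, written directly in terms of the four counts, might look like the following sketch (the function names are mine; each divides by zero when both strings have no ON bits, so real code should guard that case):

    import math

    def tanimoto(b01, b10, b11):
        """TC = BC/(Bi + Bj - BC) = B11/(B11 + B10 + B01)."""
        return b11 / (b11 + b10 + b01)

    def dice(b01, b10, b11):
        """DCS = 2BC/(Bi + Bj)."""
        return 2 * b11 / (2 * b11 + b10 + b01)

    def ochiai(b01, b10, b11):
        """OC = BC/sqrt(BiBj): the cosine of the angle between the bit vectors."""
        return b11 / math.sqrt((b11 + b10) * (b11 + b01))

    def tversky(b01, b10, b11, a=0.5, b=0.5):
        """TIab = BC/(aB10 + bB01 + BC); a = b = 1/2 reproduces dice()."""
        return b11 / (a * b10 + b * b01 + b11)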

A dissimilarity metric, D, is a measure of how different two bit strings are, and can therefore be scaled by any constant to yield a distance. Such a metric has a value of 0.0 for identical bit strings and 1.0 for maximally different strings. Dissimilarities can easily be formed by D = (1 - S), where S is a similarity value from the table above, but certain dissimilarities presented in the literature are given in the table below. They are proportional to either the sum or the product of the disagreements between the bit strings, B10 and B01 (note that B10 + B01 = L - BI).

Label                                          Equation
Euclidean Dissimilarity                        DE = sqrt((L - BI)/L)
Squared Euclidean Dissimilarity                DSE = (L - BI)/L
Pattern Difference                             DP = 4B10B01/L^2
Lance & Williams                               DLW = (L - BI)/(Bi + Bj)
  (Bray-Curtis nonmetric coefficient)
Squared Euclidean Substructure Dissimilarity   DSESi-j = (Bi - BC)/Bi = B10/Bi
                                               DSESj-i = (Bj - BC)/Bj = B01/Bj

In chemical databases, these metrics can yield a significant dissimilarity (distance) when a substructure of a molecule is compared to the full molecule. To remove this problem, Daylight Chemical Information Systems developed the Squared Euclidean Substructure Dissimilarity, DSES. If compound i is smaller than compound j (fewer bits turned ON), DSESi-j may yield a more accurate distance. Similarly, if object j has fewer bits turned ON than i (Bj < Bi), DSESj-i can be used.
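
A sketch of this idea in Python follows; choosing the direction of the smaller structure automatically is a convenience of mine, not part of the published definition.

    def substructure_dissimilarity(b01, b10, b11):
        """Squared Euclidean substructure dissimilarity, evaluated from the
        side with fewer ON bits so that a true substructure scores 0.0."""
        bi, bj = b11 + b10, b11 + b01
        if bi <= bj:
            return b10 / bi    # DSESi-j = (Bi - BC)/Bi
        return b01 / bj        # DSESj-i = (Bj - BC)/Bj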

With any of these distance metrics, a clustering algorithm can be used to place binary objects into groups. Since it is not possible to calculate an average position between two binary objects (unless they are identical), one cannot determine a centroid for a group of binary objects. To circumvent this, a medoid [2] can be used instead. A medoid is simply the member of the group that minimizes the sum of the distances to all other objects in the group, and it can be used with Ward's Method (described in Agglomerative Linkages) and K-means clustering.
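
A minimal sketch of medoid selection, reusing bit_counts() from the earlier sketch (the helper names are mine):

    def medoid(group, dist):
        """Return the member of `group` whose summed distance to all other
        members is smallest; `dist` is any pairwise distance function."""
        return min(group, key=lambda g: sum(dist(g, h) for h in group))

    def tanimoto_distance(x, y):
        """1 - TC, computed directly from two 0/1 sequences."""
        _, b01, b10, b11 = bit_counts(x, y)
        return 1.0 - b11 / (b11 + b10 + b01)

    center = medoid([[1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 0]], tanimoto_distance)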

In 1987, Peter Willett [3] studied the clustering of chemical databases and found that Jarvis-Patrick clustering [4] produced the best results.

I used a similar procedure to try to cluster nucleotide sequences. Because of the large variation in the number of 1's for different bit strings, I found that the following distance metric produced better results.

D'SE = Bj(Bi - BC)/Bi^2
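
In code, again reusing bit_counts() from the earlier sketch (the function name is mine):

    def d_prime_se(x, y):
        """D'SE = Bj(Bi - BC)/Bi^2 for bit strings x (object i) and y (object j)."""
        _, b01, b10, b11 = bit_counts(x, y)
        bi, bj = b11 + b10, b11 + b01
        return bj * (bi - b11) / bi ** 2

Note that, like Jaccard's Similarity, this measure is not symmetric in i and j.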

One final point is that it is possible to weight each bit in the string. If bik is the bit in position k (k = 1, 2, ..., L) for object i, it can be weighted by wk. The counts in the above expressions then become weighted sums over the bit positions:

B11 = Σk wk bik bjk
B10 = Σk wk bik (1 - bjk)
B01 = Σk wk (1 - bik) bjk
B00 = Σk wk (1 - bik)(1 - bjk)

so that L = B00 + B01 + B10 + B11 = Σk wk.
The weight for each bit can be related to the "inverse frequency" of being turned ON. In other words, if this bit is ON for all N objects to be clustered, it cannot be used as a discriminator. Conversely, if it is ON in only a few of the objects, it may be an important discriminator and should be given a relatively large weight. One such inverse frequency weight is given by the expression

wk = log(N/fk)
where fk is the number of objects that have this bit turned on.
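
A sketch of this weighting scheme (the function names are mine; assigning weight 0 to bits that are never ON is an assumption of mine, since the expression above only defines wk for fk > 0):

    import math

    def inverse_frequency_weights(objects):
        """wk = log(N/fk), where fk counts how many of the N objects have
        bit k ON.  A bit ON in every object gets log(1) = 0; bits never ON
        are also assigned 0 (assumption: they discriminate nothing either)."""
        n = len(objects)
        weights = []
        for k in range(len(objects[0])):
            f = sum(obj[k] for obj in objects)
            weights.append(math.log(n / f) if f else 0.0)
        return weights

    def weighted_counts(x, y, w):
        """Weighted B00, B01, B10, B11: position k contributes wk instead
        of 1 to whichever of the four sums it falls in."""
        b00 = b01 = b10 = b11 = 0.0
        for xk, yk, wk in zip(x, y, w):
            if xk and yk:
                b11 += wk
            elif xk:
                b10 += wk
            elif yk:
                b01 += wk
            else:
                b00 += wk
        return b00, b01, b10, b11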

References:
[1] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[2] V. Guralnik and G. Karypis, "A Scalable Algorithm for Clustering Protein Sequences," in Workshop on Data Mining in Bioinformatics, (2001) 73-80.
[3] "Similarity and Clustering in Chemical Information Systems," P. Willett, Research Studies Press, Wiley, New York, 1987.
[4] R.A. Jarvis, E.A. Patrick "Clustering Using a Similarity Measure Based on Shared Near Neighbors", IEEE Transactions on Computers, C22, 1025-1034 (1973).