## Clustering Binary Objects

Brian T. Luke, Ph.D.

Binary objects are simply objects that are described by a finite-length bit string. They are often used in chemical databases because they can be very useful in searching for a particular compound. If the search string has a particular bit turned ON and a database compound does not, then that structure can be passed over. Two possible representations are Molecular Fingerprints and Structural Keys.
The bit strings that characterize two objects can also be used to calculate
a "distance." This effective distance can then be used with a clustering
algorithm to place the objects into groups.
One of the most popular methods for calculating the distance between
two objects is to first determine their similarity.
If a general similarity metric is denoted S, it usually has a value
between 0.0 (the bit strings have no ON bits in common) and 1.0
(the two bit strings are identical).
A distance can then be given by D = A(1 − S), where A is a scaling constant.

If the bit string has a length of L, it is possible to go down this string and count the number of times a bit is ON in both strings, ON in one and OFF in the other, or OFF in both strings. The four sums are presented in the table below; the first subscript refers to the value of the bit for object *i* and the second to the value for object *j*.

|                 | Bit ON in *j* | Bit OFF in *j* |
|-----------------|---------------|----------------|
| Bit ON in *i*   | B_11          | B_10           |
| Bit OFF in *i*  | B_01          | B_00           |

Some other symbols that will be used in this section are:

- B_i ( = B_10 + B_11) is the total number of ON bits in object *i*.
- B_j ( = B_01 + B_11) is the total number of ON bits in object *j*.
- B_C ( = B_11) is the number of times a bit is ON in both bit strings.
- B_I ( = B_00 + B_11) is the total number of times the two bit strings agree.
- L ( = B_00 + B_01 + B_10 + B_11) is the length of the bit string.
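These counts can be sketched in a few lines of Python (the function name `bit_counts` and the use of `'0'`/`'1'` strings are illustrative choices, not from the original):

```python
def bit_counts(x, y):
    """Count bit agreements and disagreements between two equal-length bit strings.

    x and y are sequences of 0/1 values (e.g. strings of '0'/'1').
    Returns (B00, B01, B10, B11), where the first subscript is the bit
    of object i (x) and the second is the bit of object j (y).
    """
    assert len(x) == len(y), "bit strings must have equal length L"
    b00 = b01 = b10 = b11 = 0
    for xi, yi in zip(x, y):
        xi, yi = int(xi), int(yi)
        if xi and yi:
            b11 += 1
        elif xi:
            b10 += 1
        elif yi:
            b01 += 1
        else:
            b00 += 1
    return b00, b01, b10, b11

# Derived quantities from the definitions above
b00, b01, b10, b11 = bit_counts("110100", "100110")
bi = b10 + b11   # ON bits in object i
bj = b01 + b11   # ON bits in object j
bc = b11         # ON in both strings
bI = b00 + b11   # positions where the two strings agree
L = b00 + b01 + b10 + b11
```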
With these definitions, commonly used similarity metrics are presented in the following table.
The last two metrics are different from the rest. Jaccard's Similarity is different in that it is not symmetrical: the similarity of object *i* to object *j* is not, in general, the same as the similarity of object *j* to object *i*. The Tversky Index is unique in that it has two adjustable parameters, a and b:

S_ab = B_C / (aB_i + bB_j)

Depending on the values of a and b, the Tversky Index may have to be scaled so that identical bit strings produce a similarity of 1.0 (for identical strings B_i = B_j = B_C, so S_ab = 1/(a + b)).
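The Tversky Index can be sketched as follows (the function name, the default a = b = 0.5, and the `scale` flag are my assumptions; the rescaling multiplies by (a + b) so that identical strings give 1.0):

```python
def tversky(b_i, b_j, b_c, a=0.5, b=0.5, scale=True):
    """Tversky Index S_ab = B_C / (a*B_i + b*B_j).

    For identical bit strings, B_i = B_j = B_C, so S_ab = 1/(a + b);
    multiplying by (a + b) rescales identical strings to exactly 1.0.
    """
    s = b_c / (a * b_i + b * b_j)
    return (a + b) * s if scale else s

# With a = b = 0.5 the scaled index becomes 2*B_C/(B_i + B_j), the Dice form
print(tversky(4, 4, 4))  # identical strings -> 1.0
```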
A dissimilarity metric, D, is a measure of how different two bit strings
are, and can therefore be scaled by any constant to yield a distance.
This metric has a value of 0.0 for identical bit strings and 1.0 for
strings with no ON bits in common. A dissimilarity can easily be formed as
D = (1 − S), where S is a similarity value from the table above, but
certain dissimilarities presented in the literature are given in the
table below. They are proportional to either the sum or the product
of the disagreements between the bit strings, B_01 and B_10.
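As one concrete illustration of a disagreement-based dissimilarity (a sketch; this particular normalized form is my example and not necessarily a row of the table), the fraction of positions where the two strings differ:

```python
def hamming_dissimilarity(b00, b01, b10, b11):
    """D = (B01 + B10) / L: fraction of positions where the strings differ.

    Proportional to the sum of the disagreements, and equal to 1 - S for
    the simple-matching similarity S = (B00 + B11) / L.
    """
    length = b00 + b01 + b10 + b11
    return (b01 + b10) / length
```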
In chemical databases, these metrics can yield a significant dissimilarity
(distance) when a substructure of a molecule is compared to the full
molecule. To remove this problem,
Daylight Systems developed a Squared Euclidean Substructure Distance,
D_SES. If compound *i* is a substructure of compound *j*, every ON bit of *i* is also ON in *j*, so this distance is zero.

With any of these distance metrics, a clustering algorithm can be used to place binary objects into groups. Since it is not possible to calculate an average position between two binary objects (unless they are identical), one cannot determine a centroid for a group of binary objects. To circumvent this, a medoid [2] can be used instead. A medoid is simply a bit string that minimizes the sum of the distances to all objects in the group, and can be used with Ward's Method (described in Agglomerative Linkages) and K-means clustering.

In 1987, Peter Willett [3] studied the clustering of chemical databases and found that Jarvis-Patrick clustering [4] produced the best results. I used a similar procedure to try to cluster nucleotide sequences. Because of the large variation in the number of 1's in different bit strings, I found that the following distance metric produced better results:

D_SE = B_j (B_i − B_C) / B_i^2
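The distance metric above, together with a brute-force medoid search, can be sketched as follows (function names are illustrative; restricting the medoid to members of the group is a common simplification of the definition):

```python
def d_se(x, y):
    """D_SE = B_j * (B_i - B_C) / B_i**2.

    x is the bit string of object i (assumed to have at least one ON bit),
    y is that of object j; both are strings of '0'/'1'. Note the metric is
    asymmetric: d_se(x, y) != d_se(y, x) in general.
    """
    b_i = sum(c == "1" for c in x)
    b_j = sum(c == "1" for c in y)
    b_c = sum(cx == cy == "1" for cx, cy in zip(x, y))
    return b_j * (b_i - b_c) / b_i ** 2

def medoid(group, dist):
    """Member of `group` with the smallest summed distance to all members."""
    return min(group, key=lambda x: sum(dist(x, y) for y in group))
```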
One final point is that it is possible to weight each bit in the string. The weight for each bit can be related to the "inverse frequency" of its being turned ON. In other words, if a bit is ON for all N objects to be clustered, it cannot be used as a discriminator. Conversely, if it is ON in only a few of the objects, it may be an important discriminator and should be given a relatively large weight. One such inverse-frequency weight is given by the expression

w_k = log(N / f_k)

where f_k is the number of objects that have bit k turned ON.
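This weighting can be sketched as follows (the function name and the treatment of bits that are never ON are my assumptions; the text only specifies the log(N/f) form):

```python
import math

def inverse_frequency_weights(objects):
    """w_k = log(N / f_k), where f_k is the number of the N objects with bit k ON.

    A bit that is ON in every object gets weight log(1) = 0 -- it cannot
    discriminate -- while a rarely ON bit gets a large weight. Bits that
    are never ON are assigned 0.0 here (a choice; log(N/0) is undefined).
    """
    n = len(objects)
    weights = []
    for k in range(len(objects[0])):
        f_k = sum(obj[k] == "1" for obj in objects)
        weights.append(math.log(n / f_k) if f_k else 0.0)
    return weights
```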
References: