Molecular Fingerprints

Brian T. Luke, Ph.D.


Fingerprints are similar to structural keys in that they contain information about the atoms and substructures contained within a molecule, but do not suffer from the inflexability of the latter. All of the chemical information on a molecule can be contained in its fingerprint, but this packing of information into a finite string means that specific information about the molecule is lost. The net effect of this is that when comparing a substructure to a database compound, it is possible to obtain false positives, something that does not happen with a structural key. The procedure for constructing a structure's fingerprint, which is stored as a string of bits of length L, is as follows:

  1. All L bits of the fingerprint are initially set to 0.
  2. Every substructure of the compound, starting from each atom and extending down the bonds (including bond type) until the entire molecule is represented, is assigned a unique pattern described by a small number positions (say 4 or 5) along the fingerprint. Instead of allowing the chain length to increase until the entire structure is represented, it can be stopped at a given depth (e.g. 8).
  3. These sets of bits in the fingerprint are set to 1. In other words, each substructure corresponds to a partial fingerprint and the total fingerprint is constructed by summing them, using the logical <OR>.

As you can see from the recipe given above, it is possible for a substructure pattern found later in the parsing of a molecule to represent bits that have already been turned on in the overall fingerprint by summing all of the previous patterns. When a substructure's fingerprint is compared to that from a given database molecule, the verification process requires that all bits turned on in the substructure also be turned on in the molecule. This in no way ensures that the substructure is present in the molecule, but if the substructure contains a 1 in a particular position and the database molecule contains a 0, you can be absolutely sure that the substructure is missing.

The final point is that the probability of finding a false positive is related to the length of the fingerprint array, L. If L is very large, the probability of finding a false positive is quite small. On the other hand, the information content of a given fingerprint may be small (dominated by 0's) and the comparison of the arrays may take too long. The length can be reduced by folding the array; which simply means to split the array in half and add the two halfs together using the logical . Though this reduces the length of the fingerprint array and decreases the time needed in the compare, it increases the probability of finding a false positive. On the other hand, this process does not nullify the exclusion comparison; if the substructure contains a 1 in a bit position that has a 0 in the database compound, that substructure is not present in the compound no matter how many times the fingerprint has been folded. If folding is done too many times, you may reach a situation where a given database molecule contains a 1 in all positions, which means that no substructure can be ruled out.

Return to Clustering Binary Objects