Skip to main content


next previous up

Next 4- How Good is the Algorithm?
Previous 3.1- Generating candidate signatures
Up 3- The Extraction/Evaluation Algorithm

3.2- Choosing the best signature

Before describing the selection process, it is worth noting two things. First, the aforementioned procedure for generating the candidate signatures is not crucial. Any other technique, including manual labor by a human expert, could be used. Second, the number of candidate signatures could be very small -- even just one. This would be appropriate if one were using the signature extractor to evaluate a signature that has been chosen previously, most likely by a human expert.

The key idea behind the algorithm is to estimate the probability that each of the candidate signatures will match a randomly-chosen block of bytes in the machine code of a randomly-chosen program, either exactly or with some specified number or pattern of mismatches. In the case of extraction, we select one or more signatures with the lowest estimated ``false-positive'' probabilities of all the candidates, making sure that this probability is less than some established threshold. In the case of evaluation, we just place a seal of approval on a signature if its estimated false-positive probability is less than the established threshold. Thus the problem of signature extraction or evaluation is reduced to the following problem: For a given sequence of S bytes tex2html_wrap_inline456 (call it tex2html_wrap_inline458 for short), estimate the probability tex2html_wrap_inline460 for tex2html_wrap_inline462 to occur in a large body of normal, uninfected code.

The most obvious way to compute tex2html_wrap_inline464 is to tally the number of occurrences of tex2html_wrap_inline466 in a large corpus of uninfected programs. Call this quantity tex2html_wrap_inline468 . Then tex2html_wrap_inline470 could be estimated as simply

 

equation68

where tex2html_wrap_inline472 is the number of S-byte sequences in the corpus.

However, there is a serious problem with this technique. Suppose that the corpus is reasonably large -- on the order of a gigabyte or so. For relatively common sequences (ones that could be expected to appear several times per gigabyte), the probability estimate given by Eq. 1 would be reasonably accurate. Somewhat common sequences (ones that could be expected to appear once or twice per gigabyte, or once in every few gigabytes) might or might not appear in the corpus. Extremely rare sequences almost certainly would not appear in the corpus. It is readily apparent from these considerations that the technique prescribed in Eq. 1 has very little ability to discriminate between somewhat common and very uncommon sequences. Another way to express this is that the dynamic range of possible probability estimates yielded by this method is inadequate; it cannot produce estimated probabilities less than tex2html_wrap_inline476 , or about tex2html_wrap_inline478 for a gigabyte corpus.

A more illustrative way to understand the inadequacy of this technique is by making an analogy to a problem encountered in training machines to understand human speech. In order to make sense of a stream of phonemes that have been extracted from an utterance, it is often useful to have a language model that is derived from a large corpus of utterances. The problem is similar to that of estimating the probability that a given sentence will be uttered, given a large corpus of previous utterances. For example, suppose that we have access to a recording of all press interviews with United States senators during the 1980's, and that we would like to estimate the probability for each of the following three sentences to be spoken by a U.S. Senator sometime during the 1990's:

1. God bless America.

2. It's all true -- we are space aliens.

3. We bless all true American space god aliens.

The 1980's corpus would probably consist of tens or hundreds of millions of sentences. Within that corpus, we might expect sentence 1 to appear several times. It would be somewhat surprising to find an instance of sentence 2 in the corpus, but not absolutely inconceivable. Sentence 3 seems extremely unlikely. It is not the sort of sentence that one would expect to occur naturally in ordinary human discourse; it has the look of something generated by a somewhat incompetent computer program.

The most likely scenario is that one would find several instances of sentence 1 in the 1980's corpus, but no instances of sentences 2 and 3. Under these circumstances, the method of Eq. 1 would assign an estimated probability of zero to sentences 2 and 3. It would fail to distinguish sentence 2 as being unusual, but significantly more likely than sentence 3.

In fact, less than halfway through the 1990's, we find that sentence 1 has indeed been uttered several times. But we also find that sentence 2 has been uttered at least once -- by Senator Phil Gramm of Texas, who made his stunning confession on June 7, 1994, after the question of his extraterrestial origin was posed by a reporter from The Weekly World News [2]. As far as we are aware, the world is still waiting for sentence 3.

There is a critical need for an algorithm capable of distinguishing the likelihood of sentence 2 as much greater than that of sentence 3, even if neither one appears in the corpus. Fortunately, researchers in the field of speech recognition have been dealing with just this sort of problem for many years. The solution is to collect statistics on short sequences of adjacent words. Trigrams (sequences of three adjacent phonemes or words) are fairly popular in the speech community. The key insight is that reasonably good statistics can be obtained for sufficiently short sequences. Then, a simple approximation formula can be used estimate the probability of a long sequence by combining the measured frequencies of the shorter sequences from which it is composed.

In sentence 2, the sequences ``It's all true'' and ``we are'' are quite common, and ``space aliens'' is only somewhat unusual, so the overall sentence can be classified as unusual, but not terrifically so. In sentence 3, none of the individual words are very unusual, but the sequences ``American space god'', and ``space god aliens'' are quite unusual, and contribute to a very low overall estimated probability for the sentence. In effect, the trigram technique (which is easily generalized to higher-order n-grams) allows one to extrapolate beyond an existing corpus to a vastly larger universe of statistically similar utterances. This permits one to discriminate between somewhat unusual and extremely unusual utterances.

The technique that we have developed for automatic signature extraction is completely analogous to this standard speech recognition technique. Suppose that trigram statistics were tallied for a large corpus. Then the estimated probability for the sequence tex2html_wrap_inline482 would be calculated from

 

equation87

where tex2html_wrap_inline484 is the number of trigrams in the corpus.

To get some insight into the derivation of Eq. 2, consider the simpler problem of estimating the probability of a 4-byte sequence tex2html_wrap_inline486 from measured frequencies of its constituents. First, one can re-write tex2html_wrap_inline488 in the form:

 

equation95

where p(A|B) is to be interpreted as the probability of byte sequence A having been preceded by byte sequence B. Eq. 3 is exact up to this point. However, if we suppose the correlation between byte tex2html_wrap_inline496 and byte tex2html_wrap_inline498 is sufficiently weak that it can be ignored, the term tex2html_wrap_inline500 can be replaced as follows:

 

equation99

Inserting Eq. 4 into Eq. 3 yields

 

equation106

a special case of Eq. 2.

The extension of Eq. 2 to higher-order n-grams is conceptually trivial. The technique used to extract and evaluate IBM AntiVirus's signatures is a more sophisticated variant of Eq. 2 that incorporates measured frequencies of 1-, 2-, 3-, 4-, and 5-grams. A few additional tricks are used to solve the problem of adequate storage for the 3-, 4-, and 5-gram frequenciesgif.

The more sophisticated variant of Eq. 2 has some additional useful capabilities. Signatures with fixed wildcards are handled by letting the wildcards serve as demarcations between non-wildcarded regions. The estimated probabilities of all non-wildcarded regions are multiplied to obtain the overall estimated probability of the signature. To calculate estimated probabilities of signatures for which m mismatches are permitted, one can (conceptually) generate all of the tex2html_wrap_inline506 possible mismatch configurations, treat each configuration individually as a signature with m fixed wildcards, and then add the probabilities of all configurations together. If implemented naıvely, and if m is more than 4 or 5, the combinatorics can make the computation painfully slow, even on a fast workstation. Fortunately, I have developed a recursive algorithm that bypasses the combinatorics, and obviates the need to generate each mismatch configuration explicitly. The recursive algorithm permits the estimated probability of a signature to be computed very quickly, regardless of the number of allowed mismatches.

Note that, in the speech recognition example, the trigram technique achieved good discrimination between the likelihood of sentences 2 and 3 without having any real knowledge of grammar and semantics. Similarly, the automatic signature extraction technique makes no attempt to understand code in the same way that a human expert in machine code/assembly language would. It does not know what instructions mean; in fact, it does not even know that instructions exist. It sees machine code as nothing more than a long sequence of bytes, and it applies a blind, brute-force force statistical technique to those bytes. Remarkably, this lack of understanding is more than offset by the algorithm's ability to acquire a knowledge of machine code statistics to a depth to which no human expert could or would hope to aspire. This allows the algorithm to perform extremely well, as will now be seen.


next previous up

Next 4- How Good is the Algorithm?
Previous 3.1- Generating candidate signatures
Up 3- The Extraction/Evaluation Algorithm


Back To Index