Skip to main content


next previous up

Next Immunological memory
Previous Automatic virus analysis
Up A Computer Immune System

Automatic signature extraction

The basic goal of automatic signature extraction is to choose a signature that is very likely to be found in all instances of the virus, and very unlikely to be found accidentally in uninfected programs. In other words, we wish to minimize false negatives and false positives. False negatives are dangerous because they leave the user vulnerable to attack. False positives are extremely annoying to customers, and so infuriating to vendors of falsely-accused software that they have led to at least one lawsuit.

To minimize false negatives, we first start with the contents of the invariant regions that have been identified by the automatic virus analysis procedure. However, it is quite conceivable that not all of the potential variation has been captured within the samples. As a general rule, non-executable ``data'' portions of programs, which can include representations of numerical constants, character strings, work areas for computations, etc., are inherently more likely to vary from one instance of the virus to another than are ``code'' portions, which represent machine instructions. The origin of the variation may be internal to the virus, or a virus hacker might deliberately change a few data bytes in an effort to elude virus scanners. To be conservative, ``data'' areas are excluded from further consideration as possible signatures. Although the task of separating code from data is in principle somewhat ill-defined, there are a variety of methods, such as running the virus through a debugger or virtual interpreter, which perform reasonably well.

At this point, there are one or more sequences of invariant machine code bytes from which viral signatures could be selected. We take the set of candidate signatures to be all possible contiguous blocks of S bytes found in these byte sequences, where S is a signature length that is predetermined or determined by the algorithm itself. (Typically, S ranges between approximately 12 and 36.) The remaining goal is to select from among the candidates one or perhaps a few signatures that are least likely to lead to false positives.

We have formulated the problem of minimizing the false positive probability as follows. For each candidate signature, estimate the probability for it to match a random sequence of length S that is generated by the same probability distribution that generates legitimate software on the relevant platform. (Of course, machine code is written by people or compilers, not probability distributions, so such a probability distribution is a theoretical and somewhat ill-defined construct, but we estimate its statistics from a set of over 10,000 DOS and OS/2 programs, constituting half a gigabyte of code.) Then, we select the candidate signature for which the estimated probability is the smallest.

In slightly more detail, the key steps of the algorithm are as follows:

  1. Form a list of all n-grams (sequences of n bytes; tex2html_wrap_inline573 ) contained in the input data. ( tex2html_wrap_inline575 is typically 5 or 8.)
  2. Calculate the frequency of each such n-gram in the ``self'' collection.
  3. Use a simple formula that chains together conditional probabilities based on the measured n-gram frequencies to form a ``false-positive'' probability estimate for each candidate signature, i.e. the probability that it matches a random S-byte sequence chosen from code that is statistically similar to ``self''.
  4. Select the signature with the lowest estimated false-positive probability.

Characterizations of this method [Kephart and Arnold1994] show that the probability estimates are poor on an absolute scale, due to the fact that code tends to be correlated on a longer scale than 5 or 8 bytes. However, the relative ordering of candidate signatures is rather good, so the method generally selects one of the best possible signatures. In fact, judging from the relatively low false-positive rate of the IBM AntiVirus signatures (compared with that of other anti-virus vendors), the algorithm's ability to select good signatures may be better than that achieved by human experts.

In a sense, the signature extraction algorithm combines elements of outmoded and current theories of how the vertebrate immune system develops antibodies and immune-cell receptors to newly encountered antigen. The template theory, which held sway from the mid-1930's until the early 1960's, was that antibodies and receptors molded themselves around the antigen. The clonal selection theory holds that a vast, random repertoire of antibodies and receptors is generated, and those that recognize self are eliminated during the maturation phase. Of the remaining antibodies and receptors, at least a few will match any foreign antigen that is encountered. The clonal selection theory gained favor in the 1960's, and is currently accepted [Paul1991].

Our automatic signature extraction method starts out looking like the template theory. Instead of generating a large random collection of signatures that might turn out to be useful someday, we take the collection of code for a particular virus as our starting point in choosing a signature. However, we do share one important element with the clonal selection theory: elimination of self-recognizing signatures. In fact, the automatic signature extraction method is even more zealous in this endeavor than clonal selection, in that it only retains the ``best'' signature.


next previous up

Next Immunological memory
Previous Automatic virus analysis
Up A Computer Immune System

 

  back to index