Generic Detection of VirusesTwo methods of computer virus identification have already been introduced: the overly broad, ex post facto detection provided by activity monitors and integrity management systems, and the overly specific detection offered by virus scanners. Somewhere in between is the ideal ``generic detector'': taking a program's code as input, it determines whether the program is viral or non-viral. Perfect generic detection is an algorithmically ``undecidable'' problem: as observed by [Cohen1987], it is reducible to the halting problem. However, imperfect generic detection that is good in practice is possible, and is naturally viewed as a problem in automatic pattern classification. Standard classification techniques encompass linear methods and non-linear ones such as nearest-neighbor classification, decision trees, and multi-layer neural networks. Within the problem of the generic detection of viruses, detection of ``boot sector viruses'' is both an important and relatively tractable sub-problem. A boot sector is a small sequence of code that tells the computer how to ``pick itself up by its bootstraps''. For IBM-compatible PC's, boot sectors are exactly 512 bytes long; their main function is to load and execute additional code stored elsewhere. Although there are over 4,000 different file-infecting viruses and only about 250 boot-sector viruses, of the 20 viruses most commonly seen 19 are boot viruses, and account for over 80% of all virus incidents. Boot viruses similarly dominate the rolls of newly observed viruses, so an ability to detect new boot sector viruses is significant in the war against viruses. Detecting boot viruses is a relatively limited pattern classification task. Legitimate boot sectors all perform similar functions. Viral boot sectors also all perform similar functions, before passing control to a legitimate boot sector loaded from elsewhere. For this application, false positives are critical. False negatives mean missed viruses, and since viruses occur fairly rarely, so will false negatives. Also, if a classifier does let a virus slip by, the outcome is no worse than if no virus protection were in place. On the other hand, false positives can occur any time, and will leave a user worse off than she would have been without virus protection. Moreover, a false positive on one legitimate boot sector will mean false-positive events on thousands of computers. False positives are not tolerable. Nearest-neighbor classification might seem to be a simple, attractive approach to the classification of legitimate and viral boot sectors. Natural measures of the difference between two boot sectors include the Hamming distance between them (considered as 512-element vectors), or the edit distance [Crochemore1994] between them (considered as text strings). To classify a new boot sector, the procedure would find the ``nearest'' of the 250 known boot sector viruses and 100 legitimate boot sectors (a representative if not comprehensive set that we have collected), and classify the new boot sector as viral if its nearest neighbor is viral, legitimate if its nearest neighbor is legitimate. Unfortunately, nearest-neighbor classification performs poorly for this problem. A viral boot sector can be just a short string of viral code written over a legitimate boot sector, so in any overall comparison, the virus will be more similar to the legitimate boot sector it happened to overwrite than to any other virus. This says that what makes viral boot sectors viral is not any overall quality but the presence of specific viral functions. These functions can be used to construct a virus classifier. For example, one common action is for a virus to reduce the apparent size of available memory so that space taken up by the virus will not be noticed. Although this action may be variously implemented in machine code, most machine code implementations match one of a few simple patterns. (A fictitious pattern typifying the form is C31B****AC348F**90D3D217 -- about 10 fixed bytes and some wildcards.) Of the viruses that lower memory in less conventional ways, most still contain a 2-byte pattern weakly indicative of the same function, but more prone to false positives. Similar strong and weak patterns describe other common viral functions. Using expert knowledge of viral and non-viral boot sectors and several days of extensive experimentation, we hand-crafted an ad hoc classifier (see Figure 1).
Figure: A hand-crafted multi-level classifier network. Eliminating the ``MAX'' boxes produces a more conventional neural network, but it is inferior, even when the seven weights are optimized. The classifier scans a boot sector for the presence of patterns that provide strong or weak evidence for any of four viral functions. One point is credited for weak evidence, and two points for strong evidence. A boot sector is classified as viral if its total score is 3 or higher. This classifier performed well on the 350 examples, with a false-negative rate of about 18% and a false-positive rate too small to measure over the 100 negative examples. That is, 82% of viruses were detected, and no legitimate boot sector was classified as viral. We hoped to develop a procedure for automatically constructing a virus classifier, using similar features as inputs to a neural network. Since the ad hoc classifier incorporated knowledge of all of the available boot sectors, there was a possibility that it suffered from overfitting, in which case it would generalize poorly on new data. It would be much easier to assess the generalization performance of an automatically constructed classifier. Also, we hoped that algorithmic extraction of features and optimization of network weights might give even better classification performance, especially in the false-positive measure. Finally, we believed that an automated procedure would adapt much more readily to new trends in boot sector viruses. If substantially new types of boot sector viruses became common, we could simply retrain the classifier -- a much easier task than hacking on an ad hoc classifier, or re-writing it from scratch. Essentially, what we did was this. We extracted a set of 3-byte strings, or ``trigrams'', appearing frequently in viral boot sectors but infrequently in legitimate ones. The presence (1) or absence (0) of the strings defined the input vector to a single-layer neural network. (See Figure 2.)
Figure: Single-layer neural classifier with about 50 input features and weights determined algorithmically. The trigrams and weights are fictitious. Its weights were ``trained'' over about half the examples, and the resulting network's performance tested on the other half. During the development of the automatic classifier, we encountered novel challenges in feature pruning and ill-defined learning that we think represent interesting general issues in learning. These will be introduced in the course of a more detailed description of the classifier's construction.
|