Next A Computer Immune System
Previous Feature selection
Up Generic Detection of Viruses

Classifier training and performance

By construction, the selected trigrams are very good features: within the training set, no legitimate boot sector contains any of them, and most of the viral boot sectors contain at least 4. Paradoxically, the high quality of the features poses the second challenge, what we have called the problem of ill-defined learning. Since no negative example contains any of the features, any "positive" use of the features gives a perfect classifier.

Specifically, the neural network classifier of Figure 2 with a threshold of 0 and any positive weights will give perfect classification on the training examples, but since even a single feature can trigger a positive, it may be susceptible to false positives on the test set and in real-world use. The same problem shows up as an instability when the usual back-propagation [Rumelhart et al. 1986] training procedure is used to optimize the weights: larger weights are always better, because they drive the sigmoid function's outputs closer to the asymptotic ideal values of -1 and 1.
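The instability can be illustrated on a toy problem. The sketch below is a minimal illustration, not the network of Figure 2: the three-feature data set, the `tanh` output unit, and the names `weighted_sum`, `accuracy`, and `positive_error` are assumptions made for this example. Because no negative example contains a feature, every positive weight vector classifies perfectly at threshold 0, and scaling the weights up only lowers the squared error on the positives.

```python
import math

# Hypothetical toy data: 1 marks a feature (trigram) present in the sector.
# As in the text, no negative (legitimate) example contains any feature.
POSITIVES = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
NEGATIVES = [[0, 0, 0], [0, 0, 0]]

def weighted_sum(weights, features):
    """Weighted sum of the binary feature inputs."""
    return sum(w * x for w, x in zip(weights, features))

def accuracy(weights, threshold=0.0):
    """Fraction of toy examples classified correctly (viral if sum > threshold)."""
    correct = sum(weighted_sum(weights, x) > threshold for x in POSITIVES)
    correct += sum(weighted_sum(weights, x) <= threshold for x in NEGATIVES)
    return correct / (len(POSITIVES) + len(NEGATIVES))

def positive_error(weights):
    """Mean squared distance of tanh outputs from the ideal target of 1."""
    return sum((math.tanh(weighted_sum(weights, x)) - 1.0) ** 2
               for x in POSITIVES) / len(POSITIVES)

# Use the same positive weight on every feature and make it steadily larger.
scales = [0.5, 1.0, 2.0, 4.0]
accuracies = [accuracy([s, s, s]) for s in scales]
errors = [positive_error([s, s, s]) for s in scales]

for s, acc, err in zip(scales, accuracies, errors):
    print(f"weight={s:4.1f}  accuracy={acc:.2f}  error on positives={err:.6f}")
```

Accuracy stays perfect at every scale while the error keeps shrinking, so nothing in the training data stops the weights from growing.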

In fact all that will keep a feature's ideal weighting from being infinite is the feature's presence in some negative example. Since none of the features were present in any negative example, our solution was to introduce new examples. One way is to add a set of examples defined by an identity matrix. That is, for each feature in turn, an artificial negative example is generated in which that feature's input value is 1 and all other inputs are 0. This adds one artificial example for each trigram feature; it might be better to emphasize features which are more likely to appear by chance.
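The identity-matrix construction is simple to write down. The following is a sketch only: the function name `identity_negatives` and the use of -1 as the target for a negative example (matching the sigmoid's lower asymptote) are assumptions for illustration.

```python
def identity_negatives(num_features):
    """Return one artificial negative example per feature.

    Example i has input 1 for feature i and 0 for every other feature,
    so the input vectors are the rows of an identity matrix.
    """
    examples = []
    for i in range(num_features):
        inputs = [1 if j == i else 0 for j in range(num_features)]
        examples.append((inputs, -1))  # -1 is the assumed "legitimate" target
    return examples

artificial = identity_negatives(4)
for inputs, target in artificial:
    print(inputs, target)
```

Every feature receives exactly one negative example here, which is why this construction cannot by itself emphasize the features that are more likely to appear by chance.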

To do so, we used 512 bytes of code taken from the initial "entry point" portions of many PC programs to stand in as artificial legitimate boot sectors; the thought was that these sections of code, like real boot sectors, might be oriented to machine setup rather than performance of applications. Of 5,000 such artificial legitimate boot sectors, 100 contained some viral feature. (This is about as expected. Each selected trigram had general-code frequency of under 1/200,000, implying that the chance of finding any of 50 trigrams among 512 bytes is at most 13%; the observed rate for the artificial boot sectors was 2%.) Since some of the 50 trigrams did not occur in any artificial boot sector, we used this approach in combination with the "identity matrix" one.
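The 13% figure is a union bound, and the arithmetic can be checked directly. This sketch uses only the numbers quoted above; the variable names are illustrative, and treating every one of the 512 byte offsets as a possible trigram start is an assumption that only loosens the bound slightly.

```python
NUM_TRIGRAMS = 50
SECTOR_BYTES = 512
MAX_FREQUENCY = 1 / 200_000  # upper limit on each trigram's general-code frequency

# Union bound: P(any trigram at any offset) <= number of (trigram, offset) pairs
# times the largest per-pair probability. A 512-byte sector really has only 510
# trigram start offsets, so using 512 overstates the bound a little.
union_bound = NUM_TRIGRAMS * SECTOR_BYTES * MAX_FREQUENCY

# Fraction implied by the quoted counts: 100 hits among 5,000 artificial sectors.
observed_fraction = 100 / 5_000

print(f"union bound:       {union_bound:.1%}")
print(f"observed fraction: {observed_fraction:.1%}")
```

The bound evaluates to 12.8%, which rounds to the quoted 13%. Being a union bound, it is an upper limit and not an estimate, so an observed fraction well below it is unsurprising.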

At this point the problem is finally in the form of the most standard sort of (single-layer) feed-forward neural network training, which can be done by back-propagation. In typical training and testing runs, we find that the network has a false-negative rate of 10-15%, and a false-positive rate of 0.02% as measured on artificial boot sectors. (Given the trigrams' frequencies of under 1/200,000, if their occurrences were statistically independent, the probability of finding two within some 512 bytes would be at most 0.8%.) Consistent with the 0.02% false-positive rate, there were no false positives on any of the 100 genuine legitimate boot sectors.
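Once artificial negatives are included, training reduces to ordinary gradient descent on a single layer. The sketch below is a toy illustration under stated assumptions, not the authors' training code: the three-feature data set, the `tanh` output with targets of -1 and 1, the bias term, the learning rate, and the epoch count are all hypothetical choices.

```python
import math

def train_single_layer(examples, num_features, epochs=2000, learning_rate=0.1):
    """Fit out = tanh(w . x + b) by per-example gradient descent on squared error."""
    weights = [0.0] * num_features
    bias = 0.0
    for _ in range(epochs):
        for features, target in examples:
            out = math.tanh(sum(w * x for w, x in zip(weights, features)) + bias)
            # Gradient of (out - target)^2 / 2 with respect to the pre-tanh sum.
            delta = (out - target) * (1.0 - out * out)
            for i, x in enumerate(features):
                weights[i] -= learning_rate * delta * x
            bias -= learning_rate * delta
    return weights, bias

def classify(weights, bias, features):
    """Return the network output; positive means viral, negative means legitimate."""
    return math.tanh(sum(w * x for w, x in zip(weights, features)) + bias)

# Hypothetical toy data: viral sectors contain at least two features.
viral = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
# Negatives: one feature-free sector plus the "identity matrix" examples.
legitimate = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
training_set = [(x, 1.0) for x in viral] + [(x, -1.0) for x in legitimate]

weights, bias = train_single_layer(training_set, num_features=3)
for features, target in training_set:
    print(features, target, round(classify(weights, bias, features), 3))
```

In this toy setting a single feature is no longer enough to trigger a positive, since each single-feature input now appears as a negative training example.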

There was one eccentricity in the network's learning. Even though all the features are indicative of viral behavior, most training runs produced one or two slightly negative weights. We are not completely sure why this is so, but the simplest explanation is that if two features were perfectly correlated (and some are imperfectly correlated), only their total weight is important, so one may randomly acquire a negative weight and the other a correspondingly larger positive weight.
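The explanation can be checked on a small example. In the sketch below, which uses hypothetical feature vectors and weights chosen only for illustration, features 0 and 1 always occur together, so any two weight vectors that give the pair the same total weight produce identical outputs, even when one of the weights is negative.

```python
import math

def network_output(weights, features):
    """Single-layer output: tanh of the weighted sum of feature inputs."""
    return math.tanh(sum(w * x for w, x in zip(weights, features)))

# Hypothetical inputs in which features 0 and 1 are perfectly correlated:
# they are either both present or both absent in every example.
examples = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]

balanced = [1.0, 1.0, 0.5]     # pair weights sum to 2.0
lopsided = [2.25, -0.25, 0.5]  # same pair total, but one weight is negative

for x in examples:
    print(x, network_output(balanced, x), network_output(lopsided, x))
```

Since the training error cannot distinguish the two weight vectors, nothing prevents a training run from settling on the lopsided one.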

For practical boot virus detection, the false-negative rate of 15% or less and false-positive rate of 0.02% are an excellent result: 85% of new boot sector viruses will be detected, with a tiny chance of false positives on legitimate boot sectors. In fact the classifier, incorporated into IBM AntiVirus, has caught several new viruses. There has also been at least one false positive, on a "security" boot sector that had virus-like qualities and did not fit the probabilistic model of typical code. Rather than specifically exempting that boot sector, we re-trained the neural network for less than an hour, after which it classified the sector negatively; this may help to reduce similar false positives.

Of the 10-15% of viruses that escape detection, most do so not because they fail to contain the feature trigrams, but because the code sections containing them are obscured in various ways. If the obscured code is captured by independent means, the trigrams can be passed on to the classifier and these viruses too will be detected.

