Feature selectionThe first step in the construction was the selection of byte strings to act as features. Where a human expert is able to use high-level understanding of viruses, knowledge of machine code, and natural intelligence to select complex feature patterns containing wildcards, for algorithmic feature generation we contented ourselves with simple 3-byte features. A training set with 150 512-byte viral boot sectors includes 76,500 ``trigrams'', of which typically 25,000 are distinct. This is where the first challenge, feature pruning, comes in. A well known principle in machine learning states that the number of training examples must be considerably larger than the number of adjustable parameters to reliably give good generalization to test examples [Hertz et al. 1991]. With 150 viral and 45 non-viral training examples, a network must have well fewer than 195 weights -- say about 50 -- implying a lesser or equal number of inputs. Somehow the 25,000 trigrams must be winnowed down to 50. Since what is desired are trigrams that are indicative of viral as opposed to legitimate behavior, it is natural to remove trigrams appearing too frequently in legitimate boot sectors. Eliminating all trigrams which appear even once in the 45 legitimate training examples reduces the 25,000 candidate trigrams by only about 8%. On the reasoning that trigrams occurring frequently in PC programs in general are analogous to the English word ``the'' and not salient features, further winnowing can be done. Eliminating trigrams with frequency over 1/200,000 (occurring on average more than once in 200,000 bytes) again reduces the number about 8%, leaving about 21,000 of the original 25,000 candidate features. Much more drastic pruning is required. It is provided by selecting trigram features which figure importantly in the viral training set. One way to do this would be to select trigrams occurring at least some number of times in the viral training set, but this leaves some viral samples unrepresented by any trigrams. A better approach comes from selecting a ``cover'' of trigrams: a set of trigrams with at least one trigram representing each of the viral samples. In fact, we can afford something close to a 4-cover, so that each viral sample is represented by 4 different trigrams in the set. (A few samples have fewer than 4 representatives in the full set of 21,000 trigrams, in which case they are only required to be 3-covered, 2-covered, or singly covered, as possible.) Four-covering produces a set of about 50 trigram features, few enough to be used as input to a neural net. (Even so, a complete two-layer network with h hidden nodes would have h times as many weights as inputs, which here is prohibitive even for an h of 2 or 3; this is why we used a single-layer network.) Reassuringly, most of the trigrams were substrings of or otherwise similar to the more complex patterns of the ad hoc classifier. However, there were a few trigrams that could not be related to any of these patterns, and on expert inspection they turned out to define a meaningful new feature class.
|