3.1- Generating candidate signatures

In our virus isolation laboratory, we use the following procedure to identify portions of the virus that are likely to be invariant from one instance to another. An automatic algorithm runs the infected sample on a DOS machine, and then tries to lure the virus into infecting a diverse suite of ``decoy'' programs. A decoy's sole purpose in life is to become infected. To increase the chances of success in this noble, selfless endeavor, decoys are designed to be as attractive as possible to those types of viruses that spread most successfully.

A good strategy for a virus to follow is to infect programs that are touched by the operating system in some way. Such programs are most likely to be executed by the user, and thus serve as the most successful vehicle for further spread. Therefore, the algorithm entices a putative virus to infect the decoy programs by executing, reading, writing to, copying, or otherwise manipulating each of them. Such activity tends to attract the attention of many viruses that remain active in memory even after they have returned control to their host. To catch viruses that do not remain active in memory, the decoys are placed in locations where the most commonly used programs in the system are typically found, such as the root directory, the current directory, and other directories in the path. The next time the infected file is run, it is very likely to select one of the decoys as its victim.

From time to time, each of the decoy programs is examined to see if it has been modified. Any that have been modified are assumed to have been infected with the virus, and are stored in a special directory, where they await the next processing step. After several infected decoys have been obtained, their infected regions are compared with one another to establish which regions of the virus are constant from one instance to another.
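The comparison of infected regions across several decoys can be illustrated with a minimal sketch. It assumes the viral region has already been extracted from each infected decoy and that the extracted regions are aligned at the same starting offset; the text does not describe how extraction or alignment is done, so the function name `invariant_regions` and its `min_len` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def invariant_regions(samples: List[bytes], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges over which all samples agree byte for byte.

    Assumption: the samples are viral regions already aligned at offset 0.
    Only the first min(len(sample)) bytes are compared.
    """
    if not samples:
        return []
    length = min(len(s) for s in samples)
    regions: List[Tuple[int, int]] = []
    start = None  # start offset of the constant run currently being tracked
    for i in range(length):
        same = all(s[i] == samples[0][i] for s in samples)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a constant run that extends to the end of the compared bytes.
    if start is not None and length - start >= min_len:
        regions.append((start, length))
    return regions


# Three toy "infected regions" that differ only at offsets 4 and 5.
toy = [b"ABCDxxEFGH", b"ABCDyzEFGH", b"ABCDqxEFGH"]
print(invariant_regions(toy))  # [(0, 4), (6, 10)]
```

Raising `min_len` discards constant runs that are too short to be useful, which loosely mirrors the case in which a virus leaves only very short constant regions.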
Usually, most of the virus is constant, with one or more small regions that vary. In some cases, there is a fairly short constant region near the beginning of the virus, followed by a large variable region; this is indicative of a simple self-garbling virus. In a small percentage of cases, the constant regions are so short as to be useless for the purpose of extracting signatures. Such a situation indicates that the virus is at least moderately polymorphic, and in this case the algorithm gives up, and a human expert performs the analysis. Further improvements to the algorithm could be made to handle certain types of polymorphism, but there will always be a place for human virus experts!

Provided that the virus is not overly polymorphic, there are at this point one or more sections of the virus which tentatively have been classified as being invariant. However, it is quite conceivable that not all of the potential variation has been captured within the samples. Various heuristics are employed to identify portions of the ``invariant'' sections of the virus which by their nature are unlikely to vary from one instance of the virus to another. In particular, ``code'' portions of the virus which represent machine instructions (with the possible exception of bytes representing addresses) are typically invariant. ``Data'' portions of the virus, which for example could represent numerical constants, character strings, screen images, work areas for computations, addresses, etc. are often invariant as well, but are much more vulnerable to modification by the virus itself when it replicates itself or by humans who intentionally modify viruses so as to help them elude virus scanners. We use a variety of techniques to segregate code and data portions, and only the code portions are retained for further processing.

At this point, there are one or more sequences of invariant machine code bytes from which viral signatures could be selected.
We take the set of candidate signatures to be all possible contiguous blocks of S bytes found in these byte sequences, where S is a signature length specified by the user or determined by the algorithm itself. (Typically, S ranges between approximately 12 and 36.) The remaining goal is to select from among the candidates one or perhaps a few signatures that are least likely to lead to false positives.
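As a rough illustration of how the candidate set is formed, the sketch below slides a window of S bytes over each invariant code sequence. A sequence of n bytes with n >= S contributes n - S + 1 windows, and shorter sequences contribute none. The function name `candidate_signatures` is hypothetical, and dropping duplicate blocks is an assumption made here for convenience rather than something the text specifies.

```python
from typing import Iterable, List


def candidate_signatures(sequences: Iterable[bytes], s: int) -> List[bytes]:
    """Return every distinct contiguous block of s bytes found in the sequences.

    Blocks are returned in order of first appearance. Removing duplicates is
    an assumption of this sketch, not part of the described procedure.
    """
    if s <= 0:
        raise ValueError("signature length s must be positive")
    seen = set()
    candidates: List[bytes] = []
    for seq in sequences:
        # range() is empty when the sequence is shorter than s.
        for i in range(len(seq) - s + 1):
            block = seq[i:i + s]
            if block not in seen:
                seen.add(block)
                candidates.append(block)
    return candidates


# Toy example with S = 3; real signature lengths are typically about 12 to 36 bytes.
print(candidate_signatures([b"ABCDE", b"XY"], 3))  # [b'ABC', b'BCD', b'CDE']
```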