Skip to main content


next previous up

Next 2- Virus Scanners and Signatures
Previous Automatic Extraction of Computer Virus Signatures
Up Automatic Extraction of Computer Virus Signatures

1- Introduction

One of the most widely-used methods for the detection of computer viruses is the virus scanner, which uses short strings of bytes to identify particular viruses in executable files, boot records, or memory. The byte strings (referred to as signatures) for a particular virus must be chosen such that they always discover the virus if it is present, but seldom give a false alarm. Typically, a human expert makes this choice by converting the binary code of the virus to assembler, analyzing the assembler code, picking sections of code that appear to be unusual, and identifying the corresponding bytes in the machine code.

Unfortunately, new viruses and new variations on previously-known viruses are appearing at a high rate, as illustrated by Fig. 1. For the last several years, IBM's High Integrity Computing Laboratory has been collecting new viral strains from anti-virus researchers around the world. In June, 1991, the rate at which new viruses were added to the collection was approximately 0.6 per day. By June, 1994, the rate had quadrupled to approximately 2.4 per day. Other researchers, using a somewhat different (but equally valid) method for counting the number of distinct viruses, report that the current rate at which new viruses are written is about 5 per day. In any case, it cannot be denied that the high rate of new viruses is creating a heavy burden for human experts, who must spend an increasing proportion of their valuable time performing a task that demands both a high level of skill and a high tolerance for tedium.

  

figure48

Figure: Cumulative number of viruses for which signatures have been obtained by IBM's High Integrity Computing Laboratory vs. time.

In order to alleviate this problem, we have developed a statistical method for automatically extracting near-optimal signatures from computer virus code. The algorithm examines each sequence of code in the virus and estimates the probability for that code to be found in legitimate software. The code sequence with the lowest ``false-positive'' probability is chosen as the signature. IBM has used this technique to extract nearly 2000 virus signatures. In addition, it has been used to evaluate and improve hundreds of signatures which had been chosen previously by human experts.

The existence of an automatic technique for signature extraction also opens the possibility of ameliorating a long-criticized deficiency of virus scanners: the delay between when a virus first appears in the world and when a signature capable of recognizing that virus is distributed to an appreciable fraction of the population. At the High Integrity Computing Laboratory, we are planning to eliminate this delay for a large class of viruses by incorporating automatic signature extraction into an automatic immune system for computer networks [1].

First, in section 2, we shall explain briefly some of the philosophy and pragmatics underlying the use of virus scanners and signatures. Then, in section 3, we shall describe the signature extraction/evaluation algorithm in some detail. As will be seen, the probability estimates generated by the algorithm must be calibrated before the algorithm can be used to extract or evaluate signatures; a method for doing this is described in section 4. Section 5 presents some results which demonstrate that the algorithm does an excellent job of discriminating between good and bad signatures. In section 6, I describe briefly an automatic immune system that we are planning to implement, of which the signature extractor is a vital component. We conclude in section 7 with a brief summary.


next previous up

Next 2- Virus Scanners and Signatures
Previous Automatic Extraction of Computer Virus Signatures
Up Automatic Extraction of Computer Virus Signatures


Back To Index