5- Comparison to Previous Surveys

The 1991 Dataquest survey [4] and the 1990 [3] and 1991 Certus surveys attempted to answer some questions relating to the prevalence of computer viruses. In addition, they addressed some issues relating to the costs and other consequences of incidents. As was mentioned in the introduction, these issues also have an important bearing on anti-virus policies, but lie beyond the scope of this paper. Where possible, we compare the results of the Dataquest survey to our own, reinterpreting their data where necessary. Details of the Certus surveys are not directly available to us, but fortunately a paper by Tippett [3] gives some overall results for the 1990 survey, and a few details of the 1991 Certus survey are reported in the Dataquest survey. Thus in a few cases we are able to compare our results to the Certus surveys as well.

The Dataquest survey reported 48 different PC viruses in actual incidents during seven quarters of observation (from the first quarter of 1990 through the third quarter of 1991). In our somewhat smaller population, we observed 83 different viruses during 1991 alone. The Dataquest survey found the Stoned and 1813 (Jerusalem) viruses to be the two most common, as did we. However, they found that these two alone accounted for a surprising 89% of all incidents -- much more than our figures of 34% for 1991 and 51% for 1990. We are not able to account for this discrepancy completely, but a partial explanation for the significantly greater variety of viruses observed in our population is that it is international, whereas the Dataquest population was limited to North America. In order to provide a better basis for comparison to Dataquest's results, we have segregated our data on the relative frequency of viruses (presented earlier in Fig. 5) into North-American and non-North-American components in Fig. 8.
Figure 8: Relative frequency of virus incidents in 1991 in North America (left) and the rest of the world (right).
When we focus on North America only, the proportion of incidents due to Stoned and Jerusalem is increased from 34% to 55%, which is still considerably lower than Dataquest's 89%. The general character of the distribution is somewhat more in line with the Dataquest results -- a few viruses dominate, and only 35 different types are observed. The top ten viruses account for all but 13% of the incidents.

We are in rather good agreement with Dataquest as to which viruses are the most prevalent. According to Fig. 5, the relative newcomer Joshi is next in line after Stoned and Jerusalem. Our lists of the ``Top Ten'' share eight viruses in common. Dataquest has seen more 4096 incidents than we have (it just missed our Top Ten), and we have seen considerably more instances of Michelangelo and 1575 (Green Caterpillar). The agreement is better than might be expected, given the problems inherent in the statistical sampling of rare events.

An unexpected benefit of segregating our data as we have done in Fig. 8 is that it brings to light a substantial difference in character between the relative prevalence of viruses in North America and that which exists in the rest of the world. A glance at Fig. 8 immediately demonstrates that the distribution of viruses is much more egalitarian outside of North America. Outside of North America, Stoned and Jerusalem are still among the most common (demoted slightly to second and third place, respectively), but together they account for only 18% of the incidents. Nearly twice as many different viruses are observed (69 vs. 35), and the top ten viruses fail to account for over 37% of the incidents -- about three times the figure for North America.

In addition, the specifics of which viruses are most common are somewhat different. The Cascade family (principally composed of 1701 and 1704), which is the seventh most common in North America, narrowly edges out Stoned for first place outside of North America.
The Form, Tequila, and Flip-2153 viruses appear to be much more common outside North America. We do not as yet have a convincing explanation for the huge discrepancies between relative virus prevalence within and outside of North America. We pose this puzzle as a challenge to anyone (including ourselves) who wishes to develop a theoretical model of computer virus propagation.

Unfortunately, although we believe that quarterly statistics on the number of incidents involving some of the most common viruses were collected by Dataquest, an analysis of them is not yet available to us. Once it is, it will be very interesting to see how it compares with our Fig. 6.

A superficial look at the Dataquest results might lead one to conclude that they measured the total number of virus incidents as a function of time. Unfortunately, a closer look reveals that the statistic they report cannot easily be interpreted in this way. What they actually report is the fraction of organizations that experienced one or more virus incidents during a given time period. This is not a very useful statistic as it stands, because it fails to distinguish between organizations with 100 PCs and organizations with 100,000 PCs. It would be very surprising and admirable if an organization with 100,000 PCs were to experience only one virus incident during a year! In the case of large organizations, one would like to know how many virus incidents there were, not just whether there were none or some.

Unfortunately, the statistic is not merely useless. It practically begs to be misinterpreted as a per capita figure by the press, since such a measure is much more useful and natural. Here is an excerpt from a recent New York Times article:
A recent survey of the computer virus problem by Dataquest, a San Jose, Calif., market research concern, showed that of 600,000 personal-computer users from North American businesses, 63 percent had experienced a computer virus and 38 percent of those reported a loss of data. [5]

The New York Times is not alone by any means. We are aware of other articles (e.g. one in the March 3, 1992 edition of World News Today) which make the same assumption. The unfortunate result of this is that the public is being told that the computer virus problem is about 1000 times worse than it really is.
We can try to salvage the Dataquest result by assuming that each of the 618,000 PCs in their sample population was equally likely to be infected by a virus of external origin. Fortunately, Dataquest includes in their analysis and summary a coarse-grained distribution of the size of the organizations they surveyed.
In the appendix, we derive a formula which uses this distribution

We can now use our derived virus incident rates to reconstruct approximately how many virus incidents one could expect in organizations containing a specified number of PCs. For example, in 1990, the derived incident rate of 0.42 per 1000 PCs means that a population of 100 PCs would have had about a 4.1% chance of having one or more incidents, a population of 1000 PCs would have had a 34.3% chance, and a population of 10,000 PCs would have had a 98.5% chance. An average over the distribution of organization sizes reported by Dataquest yields the 26% figure that they quote. This illustrates why their statistic is so very misleading -- it depends strongly on the organization size, as one would expect.

Pursuing our reconstruction a bit further, we can derive approximately the average number of incidents one would expect among organizations of a specified size that experienced at least one incident. Again, using the 1990 figure, we can expect the average number of incidents in a group of 100 PCs that experienced at least one virus to be 1.02, 1.22 for 1000 PCs, and 4.26 for 10,000 PCs. An average over the distribution of organization sizes reported by Dataquest yields 1.27 incidents per organization among those which experienced at least one. The figure of 4.26 for 10,000 PCs particularly illustrates the importance of asking how many incidents were experienced by a particular organization, rather than simply whether there were any.

Apparently, Dataquest did attempt to ask how many incidents each organization experienced. However, in the very same figure in which they present the above results, Dataquest claims that the average number of incidents per organization that experienced one or more was 5 for 1990. What can explain the immense discrepancy between this and our derived figure of 1.27? Let us abandon our mathematical transformation for just a moment and think about the problem intuitively.
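As a rough check on the reconstruction above, the quoted 1990 figures can be reproduced by assuming that incidents in an organization of N PCs follow a Poisson distribution with mean 0.42 N / 1000. This Poisson form is an assumption made here for illustration; it happens to match the quoted numbers, but the formula derived in the appendix may be stated differently. The function names are hypothetical.

```python
import math

# Derived 1990 incident rate quoted in the text (incidents per 1000 PCs).
RATE_PER_1000_PCS_1990 = 0.42


def expected_incidents(num_pcs, rate_per_1000=RATE_PER_1000_PCS_1990):
    """Mean number of incidents, assuming every PC is equally likely to be hit."""
    return rate_per_1000 * num_pcs / 1000.0


def prob_at_least_one(num_pcs, rate_per_1000=RATE_PER_1000_PCS_1990):
    """P(one or more incidents) under the assumed Poisson model: 1 - exp(-mu)."""
    mu = expected_incidents(num_pcs, rate_per_1000)
    return -math.expm1(-mu)  # numerically stable form of 1 - exp(-mu)


def mean_incidents_given_some(num_pcs, rate_per_1000=RATE_PER_1000_PCS_1990):
    """Average incident count among organizations that had at least one."""
    mu = expected_incidents(num_pcs, rate_per_1000)
    return mu / prob_at_least_one(num_pcs, rate_per_1000)


if __name__ == "__main__":
    for size in (100, 1000, 10000):
        print(
            f"{size:>6} PCs: "
            f"P(>=1 incident) = {100 * prob_at_least_one(size):.1f}%, "
            f"mean incidents given >=1 = {mean_incidents_given_some(size):.2f}"
        )
```

Under this assumption the three organization sizes give 4.1%, 34.3% and 98.5% for the chance of at least one incident, and 1.02, 1.22 and 4.26 for the conditional average, in agreement with the text. Reproducing the 26% and 1.27 averages would additionally require Dataquest's distribution of organization sizes, which is not reproduced in this section.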
What could explain why 74% of all organizations were virus-free, while the other unfortunate 26% experienced an average of 5 incidents each during 1990? Divine wrath is a distinct possibility. However, we believe there is a rational explanation for this mystery. The problem can be traced to a set of ambiguously worded questions in the survey, of which the following is an example:
A careful reading of the survey and conversations with Dataquest personnel and Dr. David Stang of the National Computer Security Association (which co-sponsored the survey) made it clear that the intent of this question was to determine how many virus incidents there were during the specified time period. However, the question can easily be interpreted to mean ``How many infected PCs or diskettes were there during 1990?'' If most of the interviewees interpreted the question this way, the reported statistic would reflect the number of infected PCs and diskettes, not the number of incidents. When alerted to this possibility, Stang quickly found further evidence supporting our conjecture: some respondents had given answers ranging from a few dozen to 100. He agreed that at least these particular answers probably represented the number of PCs and diskettes, not the number of incidents.

We conclude that the wording of this set of questions in the survey blurred the vital distinction between the number of virus incidents and the number of infected PCs. It is clear from Dataquest's analysis and from conversations with Dataquest personnel that the designers of the survey were well aware of this distinction, but unfortunately it was not conveyed clearly to the respondents.

Forging ahead bravely, we can try to salvage the Dataquest results in the light of this reinterpretation. Suppose for the sake of argument that everyone misinterpreted the survey question, so that 5 actually represents the average number of infected PCs and diskettes for each organization that experienced a virus in 1990. Then, using our derived figure of 1.27 incidents for each such organization, we find that the average number of infected PCs and diskettes per incident was approximately 4. This is about 2.5 times the typical incident size in our sample population for 1991.
Interestingly, it is reasonably close to the figure of 3.4 PCs that we observed in our population when anti-virus strategies were just being put in place. It may be that an average incident size of 3 to 4 PCs is typical of environments which are only weakly armed against viruses. This would offer further evidence that the anti-virus measures put in place in our sample population succeeded in reducing the virus problem substantially. It is possible that this seeming agreement is mere coincidence. Some respondents may have interpreted the survey question correctly, which would bias our estimate of the average incident size towards a value smaller than the actual one.

Now let us try to extract from the Dataquest data a notion of how the computer virus problem is changing with time. Dataquest succumbed to a number of pitfalls which we would like to point out so that future studies can avoid them. They recorded their statistic (which as we have discussed is not really meaningful without making further assumptions) during three or four time periods and fit the trend using Tippett's model of exponential growth. Unfortunately, Tippett's model applies to the increase in the number of copies of one particular virus, while the Dataquest statistic was (erroneously!) attempting to measure the growth in the number of incidents due to all viruses. As we noted in the previous section, the growth in the total number of incidents per quarter in Fig. 7 is due to both the increased prevalence of individual viruses and the increase in the number of viruses. The latter is purely a psychosociological phenomenon, and it is doubtful that any reliable theory explaining the number of new viruses as a function of time could be devised.

Furthermore, even if the data were for one particular virus rather than all viruses combined, application of Tippett's model would be questionable for two reasons.
First, there is good reason to doubt the model itself, both on theoretical grounds [2, 6] and on the basis of Fig. 6 of this paper (see previous section for a more detailed discussion). Second, even if Tippett's model were approximately correct, it would be extremely dangerous to fit an exponential to just three or four data points.

We illustrate this last point in Fig. 9. Suppose that, in Fig. 7, we only had data from the first three quarters of 1990. This is comparable to the amount of data upon which Dataquest and Certus based their estimates of the future course of the computer virus problem. The best exponential fit to these three points would give a ``doubling time'' of 4.7 months. In fact, this would fit comfortably between the values derived by Dataquest (5.2 months) and Certus (4.3 months). Then, we could use this doubling time to try to predict the number of incidents for each quarter through the end of 1991. How would we do? Terribly! The prediction gets worse and worse with time, and is off by nearly a factor of four after five quarters.

The problem is not simply that the estimated doubling time is inaccurate. The data indicate that the entire concept of a doubling time is fundamentally flawed. This is demonstrated by the dramatic growth of the ``doubling time'' (which is supposed to be a constant quantity) with time. In an earlier Certus survey reported in [3], the doubling time for the virus problem as a whole between June, 1989 and June, 1990 was given as just 1.8 months. During the first three quarters of 1990, our data and that of Certus and Dataquest indicate a ``doubling time'' of 4 or 5 months. If we attempt to fit a doubling time to our 1991 data, we find that it is 12 months! We confidently predict that, if anyone still insists on measuring a doubling time for 1992, it will turn out to be noticeably longer than any that have been quoted so far.
Figure 9: Moral on the dangers of extrapolating from a small number of data points. The data are the same as in Fig. 7. An exponential curve with a ``doubling time'' of 4.7 months provided an excellent fit to the number of observed incidents in our sample population during the first three quarters of 1990. Extrapolation from this curve predicts that the number of incidents should increase by a factor of 9.0 between the third quarter of 1990 and the fourth quarter of 1991. In fact, the number of incidents increased by a factor of only 2.5 during that time period, and the qualitative shape of the predicted curve is entirely wrong. A linear fit would have been considerably better.
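The extrapolation arithmetic behind Fig. 9 can be sketched in a few lines. The sketch assumes pure exponential growth with the rounded doubling time of 4.7 months and takes the interval from the third quarter of 1990 to the fourth quarter of 1991 to be five quarters (15 months). The function names are hypothetical, and because the doubling time is rounded, the predicted factor comes out slightly above the 9.0 quoted in the caption.

```python
import math


def growth_factor(months, doubling_time_months):
    """Factor by which a quantity grows over `months` if it doubles every
    `doubling_time_months` (pure exponential growth)."""
    return 2.0 ** (months / doubling_time_months)


def implied_doubling_time(months, observed_factor):
    """Doubling time that a single exponential would need in order to
    produce `observed_factor` over `months`."""
    return months * math.log(2.0) / math.log(observed_factor)


# Figures taken from the text and the caption of Fig. 9.
DOUBLING_TIME_MONTHS = 4.7       # fit to the first three quarters of 1990
MONTHS_ELAPSED = 15              # assumed: five quarters, Q3 1990 to Q4 1991
OBSERVED_FACTOR = 2.5            # actual growth over the same interval

predicted = growth_factor(MONTHS_ELAPSED, DOUBLING_TIME_MONTHS)
overshoot = predicted / OBSERVED_FACTOR
implied = implied_doubling_time(MONTHS_ELAPSED, OBSERVED_FACTOR)

if __name__ == "__main__":
    print(f"predicted growth factor: {predicted:.1f}")
    print(f"overshoot vs. observed:  {overshoot:.1f}x")
    print(f"doubling time implied by observed growth: {implied:.1f} months")
```

The predicted factor is roughly 9, which overshoots the observed factor of 2.5 by a little under four. The doubling time implied by the observed growth over these five quarters is more than twice the fitted 4.7 months, consistent with the lengthening ``doubling time'' discussed in the text.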
In summary, here is some advice for anyone who is planning to conduct a computer virus survey.
We hope that future surveys will take this advice, and make use of the suggestions for useful statistics that we have listed and discussed in this paper. With some debugging, future surveys could be a valuable complement to the study that we have conducted, giving us access to data from a larger and more diverse population.