Introduction
Electronic circuits function by identifying small packets of
charge as elemental bits of information. Any noise which modifies these
small packets of charge may cause the stored information to change.
This electrical noise may come from well-known sources such as a noisy
power supply or radiation from lightning [1].
Extensive design effort
is spent making electronics immune to such noise. Many experts think
that computer soft fails¹ are due mostly to electronic noise.
Since 1978, it has been known that energetic nuclear particles can also
create such electronic noise (the history is reviewed later). These
nuclear particles come from two sources: 1) the decay of
radioactive atoms which exist in trace amounts in all materials
(Figure 1)
and 2) extraterrestrial cosmic rays
which bombard the earth constantly from the far depths of the galaxy
(Figure 2). This paper and the accompanying
papers in this issue are focused primarily on soft fails caused at the
computer chip level by both types of nuclear radiation.
Figure 1
Figure 2
Soft fails at the chip level do not necessarily affect the computer
user. For instance, the system may be turned off before an incorrect
bit of memory is used, or the incorrect bit may be overwritten before
it is used. Furthermore, for applications which demand the highest
reliability, the computer is often designed so that both hard and soft
fails of its electronics are undetectable to the user. Detection and
correction of computer errors are possible with extra circuitry.
Electronic designers may meet desired reliability goals by either
choosing very reliable components (insensitive to radiation) or
including extra circuits to detect and correct errors.
This review concentrates on major scientific and technical advances
which have not previously been published. Development of extensive
theoretical modeling of soft errors is discussed elsewhere [2].
In retrospect, the fact that trace radioactivity might cause soft fails
in computer circuitry could have been predicted in the 1960s or 1970s.
For a decade, electronic components had been getting smaller, with
lower voltages and fewer electrons in a charge packet indicating a zero
or one. With the introduction of 16Kb memory chips in 1977, the storage
charge in memory cells had been reduced to about 1M electrons from
about 4M electrons for 4Kb LSI (Large-Scale Integration) circuits.²
The radioactive decay particle which would prove
to be the most upsetting was an alpha-particle, a decay product found
mostly from the decay chain of uranium or thorium atoms. An alpha-particle
can cause a sudden burst of 1M electrons in a semiconductor
over a path length of a few microns. This was the dimension of the new
16Kb FET memory cells. For the first time, the memory cell information
quantum was capable of being altered by radioactive decay
products.
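The vulnerability of the 16Kb cell can be checked with a rough estimate. The sketch below is illustrative only: it assumes a mean energy of about 3.6 eV per electron-hole pair in silicon and a 5.3-MeV alpha-particle (typical of polonium-210 decay) that stops entirely in the silicon. Neither value is taken from this paper, and the function names are hypothetical. The estimate ignores charge-collection efficiency, so it bounds the charge generated, not the charge actually collected by a cell.

```python
# Rough estimate of the charge an alpha-particle generates in silicon.
# Assumptions (not from the paper): ~3.6 eV per electron-hole pair and a
# 5.3-MeV alpha-particle that deposits all of its energy in the silicon.

EV_PER_PAIR_SI = 3.6      # assumed mean energy per electron-hole pair, eV
ALPHA_ENERGY_EV = 5.3e6   # assumed alpha-particle energy, eV


def electrons_generated(energy_ev: float, ev_per_pair: float = EV_PER_PAIR_SI) -> float:
    """Number of electron-hole pairs created when energy_ev is fully absorbed."""
    return energy_ev / ev_per_pair


def upset_possible(generated: float, stored: float) -> bool:
    """Crude criterion: the burst is at least as large as the stored charge packet."""
    return generated >= stored


burst = electrons_generated(ALPHA_ENERGY_EV)

# Stored charge per cell as quoted in the text: ~4M electrons (4Kb), ~1M (16Kb).
for label, stored in (("4Kb cell", 4e6), ("16Kb cell", 1e6)):
    print(f"{label}: stored {stored:.0e} e-, alpha burst {burst:.2e} e-, "
          f"upset possible: {upset_possible(burst, stored)}")
```

Under these assumptions the burst is roughly 1.5M electrons, which is larger than the 1M-electron packet of a 16Kb cell but smaller than the 4M-electron packet of the earlier 4Kb cell.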
Naturally occurring radioactive isotopes in an LSI circuit may have
concentrations far below the level of possible screening and still
affect the chip reliability. For example, Po-210
is a radioactive isotope which is widely used in both academia and industry
as a radiation source because it is very cheap and not controlled (in
moderate quantities) by government regulations.
Po-210 occurs naturally as a daughter of radon gas and is commonly found in
basements and in water pumped from wells. But a very small amount (far
less than one part of Po-210 per billion atoms) can cause
sensitive LSI circuits to fail several times per minute. It is
fortunate that many common industrial processes, such as carbon
filtration of water supplies, screen out Po-210 to a
remarkable degree, and limit its contamination. This paper later
reviews IBM incidents of contamination of LSI manufacturing by
Po-210 (see later discussions for soft-fail problems in 1983
and 1987).
The second type of radiation of concern is that from cosmic rays.
Cosmic rays come from deep in our galaxy, and are of unknown origin.
They have immense energies and bombard the earth from all sides. A flux
of about 1600/m²-s bombards all of the earth's outer
atmosphere with enough energy to penetrate down to sea level
(Figure 2). They are so energetic that they can
penetrate our atmosphere (equivalent to 13 feet of concrete), and
then penetrate through the ceiling into a multistory building.
Although Figure 2 shows many types of sea-level
particles, the only two
which have been shown to cause significant LSI problems are neutrons
and pions. Both of these particles work in the same way: They hit a
silicon nucleus and fracture it into a star of exploding fragments.
Each of these fragments generates a stream of electrical charges which
can upset almost any circuit. We have found that all circuits are
susceptible to soft fails due to cosmic rays. The scientific problem is
to determine accurately what is the probability of upset, and how to
contain this within the desired standards of reliability. Electronics
may be partially shielded from cosmic rays by concrete ceilings: The
attenuation is a factor of about
$2^{d/1.4}$, where d
is the concrete thickness in feet (e.g., 3 ft of concrete = 50%
reduction) [3].
Summary of conclusions
In this summary, the abbreviation SER refers to soft-error
rate, usually in fails per unit time (e.g., fails per year).
This rate consists of two parts: cosmic SER and radioactive
SER, referring to the two sources of particles. Topics in italics
are fully described in later sections.
- The cosmic ray intensity is greatest at high terrestrial altitudes,
and approaches zero under extensive shielding. IBM has conducted
extensive field testing³ of components at high
altitudes (10,000 ft), at moderate altitudes (5000 ft),
at sea level, and under shielding of 50 ft of
concrete. All elevated-altitude tests showed cosmic-ray-induced
fails in electronic components.
- In all tests, the observed fail rate scaled directly with the cosmic
ray intensity, over a total observed change of more than 1000×.
- Accelerated testing techniques have been developed to
measure the cosmic SER of LSI circuits in a few hours
[4]. The
results are shown to be quite close to those measured by field
testing of thousands of parts for many months at various altitudes
(after contributions from other sources of soft fails are removed)
[5].
- As LSI memory devices become smaller, they become more sensitive to
nuclear-radiation-induced soft fails.
- The circuit types most resistant to soft errors are CMOS and n-MOS
static arrays, and DRAMs with deep buried layers to prevent
charge funneling. The circuits most sensitive to cosmic rays
are bipolar SRAMs. For recent circuits, the SER differences between
various chips span more than four decades of sensitivity.
- Bipolar logic circuits were measured in 1990 and found to be
sensitive to cosmic-ray-induced errors [4].
Since this is the
earliest date of extensive logic testing, the fail rate of previous
generations of logic chips is unknown.
- IBM manufacturing process variations historically introduced a spread
of circuit SER of about ±1.5× between identical chips, but recent
chips from 1988 to 1992 show a SER variation of about 2×.
Some non-IBM commercial LSI circuits show chip-to-chip variation >200×
(>20,000%), which increases the need for detailed testing of
many parts. See Reference [4] for details and examples.
- The measured SER cross sections of all chips are almost independent of
the angle of the cosmic ray particles to the circuit plane. This leads
to the conclusion that the cosmic SER of a device depends mostly on its
sensitive volume, and not on its individual device dimensions. Mounting
orientation is not a significant factor in a chip's SER.
Historical perspective of soft errors
This historical review is limited to those events which affected
the understanding of radiation-induced soft fails of LSI
electronic components at terrestrial altitudes, and follows
the outline given in
Table 1.
Table 1 Table of contents.
| Historical perspective of soft errors |
| 1954-1957 | Discovery of soft fails in digital electronics |
| 1975 | Soft errors in satellites from solar particles |
| 1978 | Discovery of soft errors from alpha-particles |
| IBM Historical development: 1978-1988 |
| 1978 | IBM predictions of soft errors from cosmic rays |
| 1984 | Reports of memory reliability problems in Denver |
| 1984 | Testing of sensitivity of bipolar memory chips to cosmic ray particles |
| 1985 | Accelerated testing of sensitivity of LSI chips to cosmic rays |
| 1986 | Field testing of bipolar SRAM chips for soft errors |
| 1985 | Mobile cosmic ray monitor |
| 1987 | Radioactive contamination of a semiconductor factory |
| 1988 | Completion of bipolar SRAM field testing |
| Cosmic ray SER studies: 1989-1992 |
- 1954-1957--Discovery of soft fails in digital electronics
From 1954 to 1957, there were reports of electronic problems
during above-ground nuclear bomb tests when many kinds of electronic
anomalies occurred in monitoring equipment. The term "hard fail"
described a device which ceased to function after a heavy dose of
radiation. Other aberrant or spurious electronic signals were
attributed to electronic noise from the bomb's electromagnetic shock
wave, and one estimate was that "mistakes" began at particle
doses of about 10¹¹ particles/cm².
In 1957, the International Geophysical Year, the first
extensive studies were made of cosmic rays and their origin. This first
year of organized international science was followed in 1963 by the
International Quiet Sun Year, which established the first
comprehensive picture of cosmic ray identities and fluxes. The results
from the first satellite experiments showed that about 1600
particles/m²-s bombarded satellites with particle energies
above 1 GeV. The solar cycle was also a contributor to the total cosmic
ray flux, with a flux intensity that changed by a factor of 10⁶ between
the active sun and the quiet sun, but these particles were of such low
energy that almost none made cascades which would reach sea level.
Hence, the solar cycle causes only a moderate variation in observed SER
[6].
From 1962 to 1970, early satellite electronics were found to be
unreliable, and considerable redundancy had to be built into the
circuits. A major satellite problem was differential satellite charging
in the solar wind, which led to noise and arcing between satellite
modules. One solution was to cover the satellite with a blanket of
MYLAR* coated with gold to minimize both differential charging and
heating. Further, data transmission to earth was noisy, and
electronic soft fails could not be separated from transmission
errors. To counter this, most transmissions were broken into small data
streams, with parity checks and handshaking, similar to the current
methods used for secure telephone data transmission and some
FAX transmissions.
- 1975--Soft errors in satellites from solar particles
In 1975, Binder et al. of Hughes Aircraft Co. published the first
analysis of four "anomalies" which had occurred on satellites,
which they did not believe were due to satellite charging problems
[7]. These four anomalies were found
in an analysis of 17
satellite-years of operation. The authors proposed that these events
might be due to heavy ions in the solar wind, striking the electronics
and making dense electron-hole tracks in the transistor
semiconductors. They suggested that 100-MeV iron atoms in the solar
wind might be responsible. Since the satellite fail data they quoted
seemed so trivial, with a satellite system error rate of one
fail in four years, their paper seemed to establish that this
mechanism was not a serious problem. This probably was a valid
conclusion for the discrete component electronics of their time. The
100-MeV iron particles which were considered are still a principal
source of satellite errors, but since these particles cannot
penetrate the atmosphere, they are of no consequence to terrestrial
electronics.
- 1978--Discovery of soft errors from alpha-particles
In 1978, the first evidence of sea-level soft fails from energetic
particle impact was given in a famous paper by May and Woods of Intel
[8]. This paper resulted from a serious industrial problem at Intel
concerning operational errors in their 2107 series 16Kb DRAMs. It was
discovered that the problem was trace radioactivity in the memory
packaging materials. The May and Woods paper was submitted to a device
reliability conference. Although the paper was presented in June 1978
and was not published until early 1979, the preprint was rapidly
circulated throughout the industry, and within two months articles were
appearing in trade newspapers.
Since the May and Woods event was so important, their company continued
backtracking to identify the cause of the problem. The source of the
contamination proved to be quite instructive. Because of the dramatic
increase in demand for LSI ceramic packaging in the 1970s, a new
factory had been built on the Green River in Colorado. Unfortunately,
it was built just downstream from the tailings of an old uranium mine.
The water used by the factory proved to have high levels of radioactive
elements, which contaminated the ceramic LSI packages.⁴
IBM had begun to have evidence of its own soft-fail problems, and with
the circulation of the paper by May and Woods, the first IBM task force
on soft fails was created in mid-1978. It found that alpha-particles
were indeed one source of an IBM reliability problem, and this group
initiated the first modeling and screening efforts on the effects of
radiation on IBM chips.⁵
IBM historical development: 1978-1988
- 1978--IBM predictions of soft errors from cosmic rays
In 1978, Ziegler of IBM realized that if alpha-particles could
generate circuit soft fails, there must be some possibility that cosmic
rays could do the same thing, even though at sea level there were no
heavy ions or alpha-particles in the cosmic ray flux. All of the previous studies of
satellite problems, discussed above, had presumed that light ions in
the solar wind, such as protons, could not be a source of significant
soft fails, since they produced less than 20,000 electrons per
micron of track length in silicon (while memory cells were storing
about 1M electrons for their logic state). However, nuclear reactions
between cosmic ray particles and LSI materials might be a possible
problem. This interaction might cause a silicon nucleus to fission and
fragment, giving off several simultaneous fast heavy particles which
could generate a localized charge burst of more than 4M electrons in a
micron volume (see schematic in
Figure 3).
Figure 3
Ziegler joined with W. Lanford, a nuclear physicist at Yale University,
and they worked for almost a year on the problem of how each of the
components of cosmic rays might interact with integrated circuits. They
found that the few scientific papers which discussed cosmic particles
and the solar wind of particles at satellite altitudes had little
bearing on terrestrial electronics, since these particles could not
penetrate the earth's atmosphere. Only intergalactic particles,
with a mean energy of more than 2 GeV and a flux of 0.2/cm²-s,
could penetrate to sea level. But
evaluating these particles was a complex task because none of them
actually penetrate to sea level--they interact with atmospheric atoms
and create cascades of new and different particles. Typically, what
hits sea level is the sixth generation of cascade particles, with none
of the original particles left
(Figure 2). This cascade contains every
strange particle known to physicists, including not only the stable
particles of protons, neutrons, and electrons (with a typical energy of
more than 100 MeV), but transient particles such as pions and muons.
The paper by Ziegler and Lanford, published early in 1979, was the
first detailing the mechanisms by which sea-level cosmic rays could
cause upsets in electronics [9]. They
analyzed most of the sea-level particles and their interaction with
silicon, combining the flux of each particle with its probability of
causing bursts of charge in silicon. Their problem was to isolate
what, if anything, might be important in the complex interactions
between these various particles and the many materials that make up
LSI circuits. They predicted that the effect could be detected in
current chips such as 64Kb DRAMs (mainly through alpha-particles
generated in the silicon reactions) and would be significant in 256Kb
RAMs (Figure 4). This figure summarized how
sea-level cosmic rays might cause soft fails in devices at a rate of
10/Mhr of operation. Ziegler and Lanford followed this paper with a
more detailed study and considered the upset charge necessary for the
LSI circuit structure of the period. They assumed that the new large
computers of that period would have a 64Mb memory and estimated that,
in 64 Mb of memory, cosmic radiation could lead to about one soft fail
per day [10].
Figure 4
In late 1979, Kolasinski et al. of Aerospace Co. conducted experimental
studies of the SER of satellite electronics under bombardment by heavy
ions (iron and krypton at 100+ MeV) [11]
based on the predictions of
Binder et al. They tested a wide variety of chips and found cross
sections for both soft and hard fails. As in the Binder paper, they
presumed that the only significant mechanism was direct energy loss by
the ion in the silicon, so they ignored light ions such as H and He.
Also in 1979, the first experimental work based in part on the
predictions of Ziegler and Lanford was performed. Guenzer et al. tested
a variety of chips and found that nuclear reactions could cause chip
fails [12]. They used proton beams,
and to confirm that the fails
they observed were due to nuclear reactions, they also put the chips in
a neutron beam and observed similar fail rates. This paper is also
notable in that it rejected the term "soft fail," and introduced
"single-event upset" to mean the same thing. This terminology is
still used in papers about satellite systems [12].
Also, various
other groups concerned with satellite and military
reliability reported studies of radiation-induced upsets
[13].
The first major paper from IBM on the analysis and modeling
of the SER of chips was published by Kirkpatrick of IBM Research
[14]. This paper developed
a formalism for modeling the
diffusion and collection of charge from ionizing particles in silicon;
it was followed by a more complete approach, by Sai-Halasz and
Wordeman, that considered some of the details of radiation interactions
with integrated circuits [15].
In 1980, Hitachi announced that some of their bipolar RAMs failed under
alpha-particle bombardment. This was the first time that bipolar
circuits had been found to fail from radioactive contamination.
In 1981, IBM discovered reliability problems with 16Kb DRAM memory
chips. It was found that radioactive Kr-85 was being trapped
in the 16Kb DRAM packages during a special test of package
integrity against moisture. A module testing machine was built and
ultimately screened four million of these modules for trace
radioactivity, with a fallout of about 2% contaminated
modules.
One of the most significant discoveries in the understanding of how
charged particles, such as alpha-particles, upset circuits was made
by Hsieh, Murley, and O'Brien of
IBM. Their surprising result was that the charge developed along a
particle's track would so distort local electrical fields that the
charge would sometimes be pulled back up the track toward the silicon
surface rather than diffusing into the silicon bulk. This greatly
increased the SER of circuits. The authors called the effect
"funneling," and it has been the subject of many scientific
papers which have proven its validity [16].
In 1981, a method was proposed to reduce the sensitivity of DRAM
circuits to ionizing particles [17].
Wordeman, Dennard, and
Sai-Halasz suggested that the funneling effect could be clipped by
introducing an n-grid below the active circuit elements. This proposal
suggested a junction below the circuit, perhaps introduced by
high-energy ion implantation, which would clip the fields produced by an
ionizing particle, preventing the full charge from being sucked back
into the active circuit volume. They suggested a grid, rather than a
plane, so that the reference potential of the chip back side could
still be felt by the circuit.
In 1982, IBM Burlington reported that an IBM 18Kb bipolar control-store
chip failed under alpha-particle bombardment. Within two months the
chip was redesigned with an increased resistance to radiation noise.⁶
G. Sai-Halasz of IBM completed a comprehensive modeling program to
predict the sensitivity of circuits to alpha-particles
[17]. This
program was so successful that it is the basis for many SER
modeling programs. This program formed one cornerstone of the cosmic
ray SER modeling program by Srinivasan, O'Brien, and Murley of
Fishkill, called SEMM, which is widely used within
IBM [2].
In 1983, IBM became aware of many chip reliability problems which
were reported to be due to nuclear radiation noise. Since there
was a wide variety of equivalent LSI chips available, allowing the same
electronic function to be performed using different chips, a fast
solution was to switch to more resistant chips as soon as a
problem surfaced.⁷,⁸
T. O'Gorman began the first field test of the cosmic ray SER
of chips.⁹ He designed a portable tester which contained 248
chips, each with 288 Kb, a total of 71 Mb. His watershed experiment was
the first to demonstrate two facts: 1) Natural cosmic rays do cause
soft fails, and 2) his IBM chips had been contaminated with
Po-210 sometime during manufacturing (see the
Hera radioactivity problem in 1987). His experiments over
the next three years involved SER measurements deep underground, at sea
level, and at altitudes of one and two miles. He measured the background
SER of chips (from radioactive contamination on the chip) by making a
SER measurement deep underground. Then, since the cosmic ray flux
increases in intensity with altitude, any change in SER
with altitude would be due to the sensitivity of the chips to cosmic
rays. His results showed a distinct altitude-dependent SER,
with a SER increase of more than 10× going
from sea level to two miles up
(Figure 5).
Figure 5
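The subtraction underlying these field tests can be sketched as follows. All fail counts and chip-hours are invented for illustration and are not the measured data, and the function and variable names are hypothetical. Only the tester capacity (248 chips of 288 Kb each) comes from the text.

```python
# Separating cosmic SER from radioactive (background) SER in a field test.
# All fail counts and chip-hours are invented for illustration.

TESTER_CHIPS = 248                      # chips in the portable tester (from the text)
KB_PER_CHIP = 288                       # capacity of each chip in Kb (from the text)
TOTAL_KB = TESTER_CHIPS * KB_PER_CHIP   # 71,424 Kb, i.e. about 71 Mb


def fails_per_mhr(fails: int, chip_hours: float) -> float:
    """Soft-error rate in fails per million chip-hours."""
    return fails * 1e6 / chip_hours


# Hypothetical observations: site -> (fails, chip-hours of testing).
observations = {
    "underground": (20, 1_000_000),
    "sea level": (30, 1_000_000),
    "two miles": (130, 1_000_000),
}

total_ser = {site: fails_per_mhr(*obs) for site, obs in observations.items()}

# Deep underground the cosmic flux is negligible, so the SER measured there
# is taken as the radioactive background common to all sites.
background = total_ser["underground"]
cosmic_ser = {site: rate - background for site, rate in total_ser.items()}

ratio = cosmic_ser["two miles"] / cosmic_ser["sea level"]
print(f"tester capacity: {TOTAL_KB} Kb")
for site, rate in cosmic_ser.items():
    print(f"{site}: total {total_ser[site]:.1f}, cosmic {rate:.1f} fails/Mhr")
print(f"cosmic SER ratio, two miles / sea level: {ratio:.1f}")
```

With these invented counts the total SER rises by only about 4× between sea level and two miles, while the cosmic component alone rises by 11×. The altitude dependence is therefore understated unless the underground background is removed first.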
In 1983, the first SER experiment with metastable particles was
performed. J. F. Dicello and co-workers (Clarkson University) showed
that pions (lifetime about 26 ns for charged pions) could cause soft
fails in LSI components. After this experiment, it could be
concluded that all types of particle radiation caused fails at some
probability [18].
- 1984--Reports of memory reliability problems in Denver
In 1984, the first exhaustive review of IBM operational logs found
a definite "altitude" effect.¹⁰,¹¹ These reports showed that one type
of cache-memory module installed
at locations above 2600 feet had twice as many memory
parity-check errors as similar modules at sea level. A study of
those memory units returned for repair showed that most from Denver
tested to be without defect. See Figure 6.
Figure 6
A later intensive review of the Denver operational logs indicated some
multiple (simultaneous) memory errors. This had never been observed in
sea-level tests, and implied that some special mechanism must be
involved which changed several nearby bits at the same instant. It was
pointed out that the cosmic ray flux is much harder (more energetic) in
Denver than at sea level; hence, the cosmic rays induced larger noise
bursts. A detailed effort was made to model the memory chips, and the
calculation showed a very low SER, about 1/1000 of that observed in
Denver.¹² Clearly, there was a long way to go in modeling
cosmic SER.
- 1984--Testing of sensitivity of bipolar memory chips to cosmic ray particles
Experiments were set up by Yourke, Wortzman, Tolat, and Enger to
expose two cache modules, each containing 72 bipolar memory chips, to a
dilute neutron beam in California.¹³
The first memory module
was set up in a quiescent "radiation cave" and left running for a
complete week (no radiation was present during this period). It
operated flawlessly, with no fails. Then, a dilute beam of neutrons,
about 6 in. (15 cm) in diameter (about the size of the 72 chips of the
cache module) and with a flux of less than 10⁵/cm²-s,
was introduced into the cave, with
the neutron beam aimed to pass through the cache
(Figure 7).
Within 20 seconds of the beam incidence, the tester
registered a cache parity check, and it recovered within a few seconds
to normal operation. At 40 seconds into the test it took another
"hit," and while it was recovering a third hit occurred at 42
seconds into the experiment. The beam of neutrons was further reduced
until accurate quantitative measurements could be made of the cross
section for neutron-induced soft fails. Both memory modules, designed
several years apart, showed the same sensitivity to radiation.
Figure 7
- 1985--Accelerated testing of sensitivity of LSI chips to cosmic rays
Beginning around 1975, satellite electronics had been tested using
particle beams from accelerators. These tests used primarily heavy ion
beams, since the particles most disruptive to electronics in orbit were
typically 100-MeV iron nuclei. These particles do not occur in
terrestrial cosmic ray cascades, and hence are not a problem for
terrestrial electronics. However, the experimental protocol had been
well established for testing chips in accelerator beams to determine
their sensitivity to radiation. Chips were isolated on thin sockets and
set into a beam of particles. A remote tester filled the chips with
logic patterns, and then constantly interrogated them for possible fails
during the tests. The IBM accelerated testing program is described in
Reference [4].
These experiments had to be carefully modeled, since there is no
practical way to generate a flux of particles with the same mixture of
particles and energies which are found in the cosmic ray cascades at
terrestrial altitudes. No beams of pions or muons existed which might
be used for routine chip testing, and neutron beams were difficult to
calibrate for accurate quantitative measurements. Further, since the
mean cosmic ray particle energy which would create an upset was
estimated to be about 1 GeV, investigators were limited to just a few
available accelerators in the United States. Experiments were first set
up and run at Harvard University (20-150 MeV) and at Los Alamos
(250-800 MeV) by J. Ziegler, H. Muhlfeld, C. Montrose, and H.
Curtis. Initial tests used both neutron and proton beams, and then
theoretical considerations allowed the investigators to concentrate
exclusively on measurements with proton beams, with scaling
considerations allowing the results to be used for neutron
and pion particles.
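The arithmetic behind such a beam measurement can be sketched as follows, assuming the usual definition of a cross section (fails per unit particle fluence) and a simple product of cross section and natural flux. The fail count, the fluence, and the natural flux are hypothetical placeholders, as are the function names. The sketch ignores the energy dependence of the cross section and the proton-to-neutron and proton-to-pion scaling described above.

```python
# Converting an accelerated beam test into a predicted field SER.
# Every number here is a placeholder chosen for illustration.


def cross_section_cm2(fails: int, fluence_per_cm2: float) -> float:
    """SER cross section: observed fails divided by beam fluence (particles/cm^2)."""
    return fails / fluence_per_cm2


def predicted_ser_per_mhr(sigma_cm2: float, flux_per_cm2_hr: float) -> float:
    """Fails per million chip-hours for a given natural particle flux."""
    return sigma_cm2 * flux_per_cm2_hr * 1e6


# Hypothetical beam run: 50 fails for 1e10 protons/cm^2 through the chip.
sigma = cross_section_cm2(fails=50, fluence_per_cm2=1e10)

# Assumed natural flux of upset-capable particles at the site of interest
# (a placeholder, not a measured value).
ASSUMED_FLUX_PER_CM2_HR = 20.0

ser = predicted_ser_per_mhr(sigma, ASSUMED_FLUX_PER_CM2_HR)
print(f"cross section: {sigma:.1e} cm^2 per chip")
print(f"predicted SER: {ser:.2f} fails per million chip-hours")
```

Because the cross section is a property of the chip and not of the beam intensity, a few hours in an intense beam can stand in for months of field exposure, provided the flux assumed for the field is realistic.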
All IBM chips were tested for variations in sensitivity based on
logical state (whether the cell held a zero or a one), on circuit
voltage, on chip temperature, on angle of the beam to the circuit
plane, on refresh cycle time (for DRAMs), and on manufacturing
variables (a survey of many chips of the same chip type).
In 1986, the IBM report "Accelerated Testing of IBM
Circuits for Cosmic Ray Soft Error Rates" was completed. This
reviewed the experimental procedures for predicting the fail rates of
IBM chips at terrestrial altitudes, and predicted the SER for
25 chips fabricated from 1973 to 1986. This report became the
benchmark of the IBM experimental SER program, and is reviewed later
in this journal [4].
- 1986--Field testing of bipolar SRAM chips for soft errors
To obtain SER "engineering data," it was necessary to
conduct both accelerated testing (discussed above) and
field testing of chips. Accelerated testing of chips
for cosmic SER was an untested procedure which could only be validated
by finding the true SER of chips from naturally occurring cosmic rays.
The field testers would test hundreds of chips at sea level, then at
various altitudes. This would establish their absolute cross section
for failure, and would also establish the variation which would be
expected at different terrestrial altitudes. Ziegler calculated the
expected altitude variation on the basis of the assumption that only
protons, neutrons, and pions would be capable of causing
upsets. His results are shown in
Figure 8.
Figure 8
An extensive search to locate a high-altitude site for testing was
severely constrained by the need for more than 50 kW of power to run
the SRAM testers. Finally, a site was found in Leadville, CO (altitude
10,152 ft), which is the highest-altitude incorporated city in
the United States. During 1985, a controlled-ambient laboratory was
built in an abandoned laundromat by L. LaFave.
- 1985--Mobile cosmic ray monitor
From 1984 to 1987, IBM attempted to evaluate quantitatively the
effect of cosmic rays on terrestrial electronics. It was decided to
evaluate the altitude dependence of LSI sensitivity to cosmic rays on
the basis of the predictions of Figure 8.
As memory chip testers
were being built, IBM scientists began to search for a way to
simultaneously monitor cosmic ray intensity and test LSI chips at
various sites. A contract was made with the University of New Hampshire
to build a neutron spectrometer in order to evaluate the cosmic ray
flux and the energy spectrum of incident neutrons. But as the design
progressed, it became apparent that only a small portion of the neutron
flux could be analyzed. Historical research on previous attempts to
measure terrestrial cosmic ray fluxes showed that one experiment
dominated this field. During the United Nations' International
Quiet Sun Years of 1965-1968, the Canadian Atomic Energy
Authority built a mobile cosmic ray laboratory to obtain the altitude
and latitude dependence of cosmic rays. This mobile laboratory was
operated at many locations in North America and produced comprehensive
measurements which ended at Mount Haleakala in Hawaii (see Reference
[3] for details). Upon investigation, it was found that all of the
authors of these twenty-year-old papers were retired or dead, and the
whereabouts of the 17-ton, million-dollar mobile cosmic ray laboratory
were unknown. When an inquiry was made to IBM Hawaii, one field
engineer, D. Fujimoto, became interested in the idea, and spent several
months looking for the trailer. He found it abandoned on a mountainside
in Maui, in good shape except for numerous bullet-like holes whose
alignment indicated that the projectiles had been fired from the
volcano at Wailuku. The basic equipment was found to be operable, and
papers were filed with the titular owner, the U.S. National
Oceanic and Atmospheric Administration, for a long-term joint
experiment. The trailer was moved to New York, reconditioned, and then
moved to Leadville, CO, the initial location for the IBM cosmic ray
experiments. The equipment became a crucial part of the experiment, for
its data experimentally verified the predictions shown in
Figure 8. An
example of its data is shown in
Figure 9, along
with preliminary data from the field testing of SRAMs used in memory
caches.
Figure 9
- 1987--Radioactive contamination of a semiconductor factory
No IBM SER historical review would be complete without mentioning
the "Hera problem." During the year 1986, there was an anomalous
increase in LSI memory problems. Electronics in early 1987 appeared to
have problem rates approaching 20 times higher than predicted. In
contrast, identical LSI memories being manufactured in Europe showed no
anomalous problems. Because of knowledge of the
radioactivity problem with the Intel** 2107 RAMs
[9], it was thought that the LSI package
probably was at fault, since
the IBM chips were mounted on similar ceramic materials. LSI ceramic
packages made by IBM in Europe and in the U.S. were exchanged, but
the European computer modules (with European chips and U.S. packaging)
showed no fails, while the U.S. chips with European packages
still failed at a high rate. This indicated that the problem
was undoubtedly in the U.S.-manufactured LSI chips. In April 1987,
significant design changes had
been made to the memory chip with the most problems, a 4Kb bipolar RAM.
The newer chip had been given the nickname Hera, and so at
an early stage the incident became known as the "Hera
problem."14
By June 1987, the problem was very
serious.15 A group was
organized to investigate the problem. The first breakthrough in
understanding occurred with the analysis of "carcasses" from the
memory chips (the term carcasses refers to the chips on an
LSI wafer which do not work correctly, and are not used but saved in
case some problem occurs at a future time). Some of these carcasses
were shown to have significant
radioactivity16
(Figure 10).
Figure 10
Six weeks was spent in the manufacturing process lines, looking for
radioactivity, and traces were found inside various processing
units. However, it could not be determined whether these traces came
from the raw materials used, or whether they were transferred from the
chips themselves, which might have been contaminated earlier in their
processing. Further, it was discovered that radioactive filaments
(containing radioactive thorium) were commonly used in some
evaporators. A detailed analysis by T. Zabel of some of the "hot"
chips revealed that the radioactive contamination came from a single
source: Po210. This isotope is found in the uranium decay
chain, which contains about twelve different radioactive species.
The surprising fact was that Po210 was the only
contaminant on the LSI chips, and all the other expected decay-chain
elements were missing. Hundreds of chips were analyzed for
radioactivity, and Po210 contamination was found going back
more than a year. Then it was found that whatever caused the
radioactivity problem disappeared on all wafers started
after May 22, 1987. After this precise date, all new wafers
were free of contamination (Figure 10),
except for a few wafers which
were probably contaminated by older chips being processed by
the same equipment. Since it takes about four months for chips to be
manufactured, the pipeline was still full of "hot"
chips in July and August 1987. Further sweeps of the
manufacturing lines showed trace radioactivity, but the plant was
essentially clean. The contamination had appeared in 1985, increased by
more than 1000 times until May 22, 1987, and then totally
disappeared!
Several months passed, with widespread testing of manufacturing
materials and tools, but no radioactive contamination was discovered.
All memory chips in the manufacturing lines were spot-screened for
radioactivity, but they were clean. The radioactivity reappeared in the
manufacturing plant in early December 1987, mildly contaminating
several hundred wafers, then disappeared again. A search of all the
materials used in the fabrication of these chips found no source of the
radioactivity. With further screening, and a lot of luck, a new and
unused bottle of nitric acid was identified by J. Hannah as
radioactive.17
One surprising aspect of this discovery was
that, of twelve bottles in the single lot of acid, only one was
contaminated. Since all screening of materials assumed lot-sized
homogeneity, this discovery of a single bad sample in a large lot
probably explained why previous scans of the manufacturing line had
been negative. The unopened bottle of radioactive nitric acid led
investigators back to a supplier's factory, and it was found
that the radioactivity was being injected by a bottle-cleaning machine
for semiconductor-grade acid
bottles.18
This bottle cleaner
used radioactive Po210 material to ionize an air jet which
was used to dislodge electrostatic dust inside the bottles after
washing. The jets were leaking radioactivity because of a change in the
epoxy used to seal the Po210 inside the air jet capsule.
Since these jets gave off infrequent and random bursts of
radioactivity, only a few bottles out of thousands were
contaminated.19
Once the contamination was identified and the source pinpointed to the
acid etch bottles, contaminated etch bottles were replaced with clean
bottles and the problem completely disappeared. All Hera chips from
"hot" lots were recalled from the field and were replaced with
clean Hera chips.
- 1988--Completion of bipolar SRAM field testing
In 1988, the four-year testing of memory parts was completed.
Tests had been run at E. Fishkill, NY (sea level), at Leadville, CO
(10,155 ft), at Boulder, CO (5255 ft) and deep underground at
Kansas City, MO (under 5000 g/cm² of rock). Further, during
the Leadville testing, concrete shields had been constructed over the
testers to determine the effect of shielding on the cosmic ray flux.
The result of the lengthy effort is shown in
Figure 11.
Figure 11
Cosmic ray SER studies: 1989-1992
By 1989, a thorough picture of the effects of cosmic rays on IBM
electronics had been obtained from statistical studies and specific
testing. Nine reports, from various IBM divisions, determined the
actual SER of various chips from natural cosmic rays. The cosmic ray
component was isolated by looking for fail rates as a
function of the altitude of the memory modules, since this was presumed
to be the unique signature of cosmic ray fails (Denver has four times
the cosmic ray intensity of New York City). These internal reports
showed agreement with the previous predictions made by accelerated
testing in 1986, and validated the procedures used to predict the
cosmic SER of IBM chips. The reports are reviewed in
Table 2.
In all but one case, the results are within 2×
of the predicted values.
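The altitude method described above, together with the sea-level normalization used for the SER values in Table 2, can be illustrated with a short Python sketch. It assumes a simplified model in which the fail rate is the sum of an altitude-independent part and a part proportional to the cosmic ray flux; this model is an assumption made for illustration, not a procedure taken from the IBM reports. The function names, the intermediate flux value, and all fail rates are hypothetical. Only the sea-level reference of 1 and the Denver factor of 4 come from the text.

```python
"""Sketch: separate cosmic and altitude-independent fail rates by a line fit.

Assumed model (an illustration, not taken from the IBM reports):
    fail_rate = background + cosmic_at_sea_level * relative_flux
"""


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all flux values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def cosmic_ser_at_sea_level(site_ser, relative_flux, background=0.0):
    """Scale the cosmic part of a site measurement back to sea level."""
    if relative_flux <= 0:
        raise ValueError("relative_flux must be positive")
    return (site_ser - background) / relative_flux


# Hypothetical data: relative flux (sea level = 1) and fail rate per chip.
# Only the values 1.0 (sea level) and 4.0 (Denver) are taken from the text.
relative_flux = [1.0, 2.5, 4.0]
fail_rate = [1200.0, 2700.0, 4200.0]

# Slope = cosmic component at sea level; intercept = altitude-independent part.
cosmic, background = fit_line(relative_flux, fail_rate)

if __name__ == "__main__":
    print(f"cosmic component at sea level: {cosmic:.0f}")
    print(f"altitude-independent component: {background:.0f}")
    denver = cosmic_ser_at_sea_level(4200.0, 4.0, background)
    print(f"Denver reading normalized to sea level: {denver:.0f}")
```

In this sketch, any fail mechanism that does not rise with flux ends up in the intercept, which is why a dependence on altitude can serve as the signature of cosmic ray fails.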
Table 2 Accuracy of predicted SER from accelerated testing.
| Chip type ¹ | Chip size ² | Accelerated SER ³ | Field test SER ³ | Type of field test ⁴ | Typical application |
| SRAM | 4096 | 1590 | 1118 | Stat. anal. | Cache memory |
| SRAM | 4096 | 1720 | 1770 | Field test | Fast memory |
| SRAM | 4096 | 1720 | 1300 | Stat. anal. | Fast memory |
| SRAM | 9216 | 1670 | 618 | Stat. anal. | Cache memory |
| SRAM | 9216 | 1670 | 1340 | Field test | Cache memory |
| DRAM | 288k | 1300000 | 126000 | Field test | Main memory |
| DRAM | 1M | 2700 | 3000 | Field test | Main memory |
| SRAM | 2034 | 1600 | 998 | Stat. anal. | I/O channels |
| CMOS | 144k | 250 | 210 | Field test | Cache memory |
¹ Chips labeled SRAM were bipolars, while those labeled DRAM were FETs.
² Chip size is given in bits.
³ SER results are all normalized to sea level. The SER is NOT given on a per-bit basis, but on a per-chip basis to allow for easy comparison of chips from different generations.
⁴ "Stat. anal." = statistical analysis of chip problems. This number shows results for chips in actual use, but the SER must be inferred from logs which were recorded for a different purpose.
"Field test" = controlled experiments in which hundreds of chips were tested in noise-free environments for extended periods of time (usually several months). Results were corrected to sea level. This result gives the most accurate measurement of the SER sensitivity of a chip, but assumes a quiescent operational environment.
By 1989, IBM began full testing for cosmic SER on all chips. The
success of this attention to chip SER is shown in two figures which
compare chip sensitivities for circuits manufactured over a ten-year
period. Figure 12 shows DRAM sensitivity to
cosmic rays for chips entering manufacturing from 1983 to the present.
These memory chips increase in complexity and size with time, going
from a 288Kb in 1982 to the 4Mb in 1988. Of note is the accuracy of the
modeling of these chips, which is within a factor of 3
of the later measured sensitivity.
Figure 12
Shown in Figure 13 is a summary of the cosmic
ray soft-fail data for bipolar chips manufactured over a twenty-year period. The
data show a steady increase in sensitivity until 1985, when
the first substantial data were collected showing that there was a
problem with cosmic ray upsets. After 1985, the chips show a broad
range of sensitivity, with the "hardest" chips decreasing in
sensitivity by about 20×
from the 1984 chips. The broad range came
from engineering trade-offs among cosmic SER sensitivity, chip speed,
and function. For some applications, chips which were required to be
very fast (which usually meant that they were sensitive to cosmic ray
radiation) could be backed up by circuits which could detect and
correct errors. Hence, these chips were designed to be as fast as
possible and were allowed to be sensitive to cosmic rays, since any
errors would be corrected. Other chips, without as much error
protection, might have to be redesigned to decrease their sensitivity
to radiation.
Figure 13
From 1990 to 1993, a joint study between A. Taber (IBM) and E. Normand
(Boeing Co.) resulted in the extension of the sea-level work into
upsets in avionics [19, 20]. This important work
established that the increase in soft-fail rate for
electronic devices scaled with the cosmic ray flux up to altitudes of
20 km (65,000 ft). It also demonstrated that airborne
device and system SER increases with the increasing cosmic ray flux at
higher latitudes (up to 70° N).
Also, a microbeam was developed to study in detail how
individual devices of an LSI circuit responded to
high-energy particles. A group led by D. Heidel and L. Geppert
built an LSI tester connected to a particle accelerator. They
used precisely positioned apertures to create a micron-sized beam so
that individual devices would be exposed with no disturbance to nearby
circuit components. This way, the complex dynamic response of
integrated circuits to energetic charged particles could be measured
with precision [21].
Conclusion
This review paper has described the experimental work at IBM over
the last fifteen years in evaluating the effect of cosmic rays on
terrestrial electronic components. This work originated in 1978, went
through several years of research to verify its magnitude, and became a
significant factor in IBM's efforts toward improved product
reliability. Other papers in this issue of the IBM Journal of
Research and Development expand on most of the major
elements of this effort.
*MYLAR is a registered trademark of E. I. du Pont de Nemours &
Company.
**Intel is a registered trademark of Intel Corporation.
References
Footnotes
1
The term hard fail refers to the permanent failure of
some electronic element. Originally it derived from the term
hardware failure. In complex LSI circuits, it can mean the
failure of just one basic functional block of the complex chip. A
soft fail (or soft error) is defined as a
spontaneous error or change in stored information which cannot be
reproduced. Such errors in LSI chips are usually caused by excessive
electronic noise.
2
There currently is no consensus about how to combine the notation
of computer units with physical units in print. Computer science
usually discusses bits using numerical bases of 2 or 8 or 16, while
physical units are always in base 10. In this journal, the numerical
abbreviation K is used to mean units of 1024, while the notation k
means units of 1000. The notation b means single bits, while B means
bytes (8 bits). The notation M means about a million in both units;
however, if it is modified with a b or a B, it means
1,048,576 (1024 ×
1024); otherwise, it means
1,000,000. In the above sentence, 1M electrons =
1,000,000 electrons, while 16Kb = 16,384
bits.
3
The term field testing refers to all testing of
computer chips after exposure to natural background radiation. This
includes testing with special chip testers which evaluate hundreds of
chips under controlled conditions.
4
T. C. May, private communication.
5
D. B. Eardley, IBM internal report, 1978.
6
R. Roche, IBM General Technology Division, internal report, 1982.
7
S. H. Voldman and G. C. Fung, IBM Burlington, internal report,
1983.
8
F. Read, IBM General Technology Division, internal report, 1983.
9
T. J. O'Gorman, IBM General Technology Division, internal
reports, August 1986 and March 1990.
10
R. Sussman, IBM internal report, 1984.
11
J. Pantalone and N. N. Tendolkar, IBM Data Systems Division,
internal report, 1984.
12
H. Yourke, IBM Data Systems Division, and S. H. Voldman, IBM
General Technology Division, internal report, 1985.
13
H. Yourke, D. Wortzman, V. Tolat, and T. Enger, IBM internal
report, 1984.
14
R. Elam, D. Grose, and R. Lange, IBM General Technology Division,
internal report, 1987.
15
B. Messina and J. Gerardi, IBM Data Systems Division, internal
report, 1987.
16
R. L. Patrick, IBM internal report, 1987.
17
J. Hannah, IBM internal report, 1987.
18
J. F. Ziegler, T. H. Zabel, and J. Hannah, IBM internal report,
1988.
19
For further details, see The New York Times, June 13,
1990, p. B6.
Received April 29, 1994; accepted for publication March 6, 1995
|