cdh@BBNCCX.ARPA (08/03/84)
From: "Carl D. Howe" <cdh@BBNCCX.ARPA>
The issue of whether error-detecting circuitry should be provided on
memory systems seems to be a bit of a religious argument. However,
given that one knows the architecture of THE SPECIFIC SYSTEM BEING
IMPLEMENTED, the tradeoffs can be computed numerically. Here are a
couple of examples:
NOTE: the following examples are based on real products. However,
details on chip counts and failure rates are rough estimates at best.
DO NOT TAKE THE RESULTS OF EITHER OF THESE EXAMPLES AS BEING
INDICATIVE OF THE TRUE RELIABILITY OF THESE PRODUCTS. Rather, use
them as examples of how you might do your own calculations.
Example 1:
Assume a system like the old IBM PC (i.e. the one that used 16K RAMs).
Such a system with 128K of memory requires 64 16K RAMs. Parity
checking at the byte level adds at least another 8 RAMs for that
amount of memory. A single parity checker is required (since the bus
path is only 8 bits wide). In addition to this memory system, assume
that the PC itself contains another 80 chips for display and another
20 chips or so for processor and EPROM. There is also a floppy
controller which we will ignore for the moment.
Assume the following order-of-magnitude failure rates for
components:
   16K RAMs                        20000 failures per billion hours
   64K RAMs                          200 failures per billion hours
   256K RAMs                         150 failures per billion hours
   LS SSI-MSI logic                   10 failures per billion hours
   LSI and VLSI other than memory     50 failures per billion hours
If you were designing a system, you would try to get the real numbers
from the manufacturers. The numbers for RAMs are representative of
the industry averages and include soft failures. Note that RAMs are
getting LOTs better.
Compute the failure rate of an IBM PC containing 20 LSI and VLSI
chips, 80 SSI/MSI chips, and 64 16K RAMs (i.e. ignore parity for the
moment):
   20 chips @    50 failures per billion hours =    1000
   80 chips @    10 failures per billion hours =     800
   64 chips @ 20000 failures per billion hours = 1280000
                                                 -------
   System failures per billion hours
     without parity checking                   = 1281800
or about 11.23 failures a year or one failure a month.
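For anyone who wants to plug in their own numbers, the arithmetic
above is easy to script. Here is a rough Python sketch of the
no-parity calculation; the chip counts and per-chip rates are the
same estimates given above, and the per-year conversion just assumes
8760 hours in a year.

    # Rough estimates from the table above, in failures per billion hours.
    VLSI   = 50        # LSI/VLSI other than memory
    SSI    = 10        # LS SSI-MSI logic
    RAM16K = 20000     # 16K dynamic RAM, soft failures included

    HOURS_PER_YEAR = 8760

    def per_year(rate_per_billion_hours):
        # Convert failures per 10^9 hours to failures per year.
        return rate_per_billion_hours * HOURS_PER_YEAR / 1e9

    pc_no_parity = 20 * VLSI + 80 * SSI + 64 * RAM16K
    print(pc_no_parity)                       # 1281800 failures per billion hours
    print(round(per_year(pc_no_parity), 2))   # 11.23 failures a year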
Again, this includes soft failures. The RAMs definitely dominate that
reliability equation. Now add parity checking to the reliability
numbers:
   System failure rate without parity               = 1281800
   8 chips @ 20000 failures per billion hours       =  160000
   1 parity checker @ 10 failures per billion hours =      10
                                                      -------
   System failures per billion hours
     with parity checking                           = 1441810
or about 12.63 failures a year.
Note that your system failure rate has increased by about 12% (see, parity
doesn't come for free), but it now catches failures in the part of the
system that accounts for 99.87% of your failures. Failures not
detected by parity (i.e. everything other than the RAMs themselves)
are now only 1810 per billion hours or about 1 per 60 years.
If we assume that system cost is linear with respect to chip count, we
have increased the cost of the system by about 5% (chips are free
compared to the cost of board space, packaging, testing, etc.).
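Continuing the sketch (same estimated rates, and again only rough
numbers), the parity overhead, the fraction of failures parity can
catch, the undetected-failure interval, and the chip-count increase
all fall out directly:

    RAM16K, SSI, VLSI = 20000, 10, 50    # failures per billion hours, as above
    HOURS_PER_YEAR = 8760

    no_parity   = 20 * VLSI + 80 * SSI + 64 * RAM16K   # 1281800, computed above
    parity_adds = 8 * RAM16K + 1 * SSI                 # 8 parity RAMs + 1 checker
    with_parity = no_parity + parity_adds

    print(with_parity)                                    # 1441810
    print(round(with_parity * HOURS_PER_YEAR / 1e9, 2))   # 12.63 failures a year
    print(round(100.0 * parity_adds / no_parity))         # about a 12% rate increase

    covered = (64 + 8) * RAM16K           # parity watches all of the RAM chips
    print(round(100.0 * covered / with_parity, 2))        # 99.87% of all failures
    undetected = with_parity - covered                    # 1810 per billion hours
    print(round(1e9 / undetected / HOURS_PER_YEAR))       # about one per 63 years

    print(round(100.0 * 9 / (20 + 80 + 64)))              # 9 added chips: about 5%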
Now you see why IBM recently changed their system to use 64K RAMs
exclusively; it reduces their RAM soft error rate by two orders of
magnitude, making parity errors much rarer.
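As a rough check on that claim, compare the RAM contribution for the
same 128K of memory built out of 16K parts versus 64K parts, using
the per-chip rates estimated above:

    RAM16K, RAM64K = 20000, 200       # failures per billion hours, as above

    old = 64 * RAM16K                 # 128K bytes from 64 16K RAMs
    new = 16 * RAM64K                 # 128K bytes from 16 64K RAMs
    print(old, new, round(old / new))     # 1280000  3200  400

The per-chip rate alone accounts for the two orders of magnitude;
needing only a quarter as many chips for the same amount of memory
pushes the RAM contribution down even further.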
Example 2:
Assume an Apple Macintosh. In contrast to the IBM, a Macintosh never
has more than 16 memory chips, and they are either 64K or 256K RAMs.
Other VLSI on the board comes to about 40 chips (again, a guesstimate)
with about 10 chips of SSI/MSI. System reliability without parity is
computed as follows:
   16 64K RAMs      @ 200 failures per billion hours = 3200
   40 VLSI chips    @  50 failures per billion hours = 2000
   10 SSI/MSI chips @  10 failures per billion hours =  100
                                                       ----
   System failures per billion hours                 = 5300
or about .05 failures per year. This still includes soft errors.
Now add parity:
   System reliability without parity                 = 5300
   2 64K RAM chips for parity                        =  400
   2 parity generators/checkers (16-bit data path)   =   20
                                                       ----
   System failures per billion hours                 = 5720
or about .05 failures per year.
System failure rate has been increased by 8% by the addition of
parity. Moreover, this parity catches only 63% of all system
failures. Furthermore, the added chips constitute an increase in cost
of 6% (again, assuming system cost is linear to chip count). The
trade-off is much less obvious here.
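The same sort of sketch for the Macintosh numbers (again, the chip
counts are rough guesses) shows why the tradeoff is less clear:

    RAM64K, VLSI, SSI = 200, 50, 10    # failures per billion hours, as above
    HOURS_PER_YEAR = 8760

    no_parity   = 16 * RAM64K + 40 * VLSI + 10 * SSI   # 5300
    parity_adds = 2 * RAM64K + 2 * SSI                 # 2 parity RAMs + 2 checkers
    with_parity = no_parity + parity_adds              # 5720

    print(round(with_parity * HOURS_PER_YEAR / 1e9, 2))  # 0.05 failures a year
    print(round(100.0 * parity_adds / no_parity))        # about an 8% rate increase

    covered = (16 + 2) * RAM64K          # parity only watches the RAM chips
    print(round(100.0 * covered / with_parity))          # catches about 63%
    print(round(100.0 * 4 / (16 + 40 + 10)))             # 4 added chips: about 6%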
Summary:
Reliability is a function of system design. Whether parity is worth
the cost it imposes on the system (both in terms of money and the
DECREASE in uptime) is a delicate tradeoff and should be weighed
carefully with some real failure rate numbers.
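If you do gather your own numbers, the whole exercise reduces to a
few lines. Here is a minimal sketch of the sort of helper you might
use; the function names are just for illustration, and the part list
shown is the PC-with-parity example from above, which you would
replace with your own counts and rates.

    HOURS_PER_YEAR = 8760

    def failure_rate(parts):
        # parts: list of (chip_count, failures_per_billion_hours) pairs.
        return sum(count * rate for count, rate in parts)

    def per_year(rate_per_billion_hours):
        return rate_per_billion_hours * HOURS_PER_YEAR / 1e9

    # The IBM PC numbers used earlier, with parity.
    parts = [(20, 50), (80, 10), (64, 20000),    # base system
             (8, 20000), (1, 10)]                # parity RAMs and checker
    total = failure_rate(parts)
    print(total, round(per_year(total), 2))      # 1441810 12.63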
Let's avoid getting too emotional about hospitals and fire-station
applications. Anyone doing life-support equipment uses VERY different
design techniques than those in commercial equipment. Whether you use
parity or not in commercial applications is as much a result of
marketing push as of technical justification; when your failure rate
is down around 1 soft error in 50 years, it is hard to justify parity
on a strictly technical basis. On the other hand, the warm feeling it
gives customers may justify a 6% cost increase.
I encourage people to get true, hard reliability numbers for the parts
they use. Manufacturers rarely publish such data in spec sheets, but
usually have some idea of the real numbers. Avoid "accelerated"
failure rates (i.e. failure numbers taken under thermal stress and
multiplied by a stress factor). MIL Handbook 217 has good pessimistic
numbers for failure rates of all kinds of components as well as
equations for figuring out your own numbers. The numbers I presented
here are only rules of thumb and vary according to manufacturer, part
type, etc.
I hope this has been of some help and interest.
Carl