cdh@BBNCCX.ARPA (08/03/84)
From: "Carl D. Howe" <cdh@BBNCCX.ARPA> The issue of whether error-detecting circuitry should be provided on memory systems seems to be a bit of a religious argument. However, given that one knows the architecture of THE SPECIFIC SYSTEM BEING IMPLEMENTED, the tradeoffs can be computed numerically. Here are a couple of examples: NOTE: the following examples are based on real products. However, details on chip counts and failure rates are rough estimates at best. DO NOT TAKE THE RESULTS OF EITHER OF THESE EXAMPLES AS BEING INDICATIVE OF THE TRUE RELIABILITY OF THESE PRODUCTS. Rather, use them as examples of how you might do your own calculations. Example 1: Assume a system like the old IBM PC (i.e. the one that used 16K RAMs). Such a system with 128K of memory requires 64 16K RAMs. Parity checking at the byte level adds at least another 8 RAMs for that amount of memory. A single parity checker is required (since the bus path is only 8 bits wide). In addition to this memory system, assume that the PC itself contains another 80 chips for display and another 20 chips or so for processor and EPROM. There is also a floppy controller which we will ignore for the moment. Assume the following order-of-magnitude failure rates for components: 16K RAMs 20000 failures per billion hours 64K RAMs 200 failures per billion hours 256K RAMs 150 failures per billion hours LS SSI-MSI logic 10 failures per billion hours LSI and VLSI other than memory 50 failures per billion hours If you were designing a system, you would try to get the real numbers from the manufacturers. The numbers for RAMs are representative of the industry averages and include soft failures. Note that RAMs are getting LOTs better. Compute the failure rate of an IBM PC containing 20 LSI and VLSI chips, 80 SSI and 64 16K RAMs (i.e. ignore parity for the moment): 20 chips @ 50 failures per billion hours = 1000 80 chips @ 10 failures per billion hours = 800 64 chips @ 20000 failures per billion hours = 1280000 ------- System failures per billion hours without parity checking = 1281800 or about 11.23 failures a year or one failure a month. Again, this includes soft failures. The RAMs definitely dominate that reliability equation. Now add parity checking to the reliability numbers: System failure rate without parity = 1281800 8 chips @ 20000 failures per billion hours = 160000 1 parity checker @ 10 failures s/billionhrs = 10 ------- System failues per billion hours with parity checking = 1441810 or about 12.63 failures a year. Note that your system failure rate has increased 12% (see, parity doesn't come for free), but it now catches failures in the part of the system that accounts for 99.87% of your failures. Failures not detected by parity (i.e. everything other than the RAMs themselves) are now only 1810 per billion hours or about 1 per 60 years. If we assume that system cost is linear with respect to chip count, we have increase the cost of the system by about 5% (chips are free compared to the cost of board space, packaging, testing, etc.). Now you see why IBM recently changed their system to use 64K RAMS exclusively; it reduces their RAM soft error rate by two orders of magnitude, making parity errors much rarer. Example 2: Assume an Apple Macintosh. In contrast to the IBM, a Macintosh never has more than 16 memory chips, and they are either 64K or 256K RAMs. Other VLSI on the board comes to about 40 chips (again, a guesstimate) with about 10 chips of SSI/MSI. 
Example 2:  Assume an Apple Macintosh.  In contrast to the IBM PC, a
Macintosh never has more than 16 memory chips, and they are either 64K
or 256K RAMs.  Other VLSI on the board comes to about 40 chips (again, a
guesstimate), with about 10 chips of SSI/MSI.

System reliability without parity is computed as follows:

    16 64K RAMs      @  200 failures per billion hours = 3200
    40 VLSI chips    @   50 failures per billion hours = 2000
    10 SSI/MSI chips @   10 failures per billion hours =  100
                                                          ----
    System failures per billion hours                  = 5300

or about .05 failures per year.  This still includes soft errors.  Now
add parity:

    System failure rate without parity                 = 5300
    2 64K RAM chips for parity @ 200                   =  400
    2 parity generators/checkers (16-bit data path)    =   20
                                                          ----
    System failures per billion hours with parity      = 5720

or about .05 failures per year.  The system failure rate has been
increased by 8% by the addition of parity.  Moreover, this parity
catches only 63% of all system failures.  Furthermore, the added chips
constitute an increase in cost of about 6% (again, assuming system cost
is linear with chip count).  The trade-off is much less obvious here.

Summary:  Reliability is a function of system design.  Whether parity is
worth the cost it imposes on the system (both in terms of money and the
DECREASE in uptime) is a delicate tradeoff and should be weighed
carefully with some real failure rate numbers.  Let's avoid getting too
emotional about hospitals and fire-station applications.  Anyone doing
life-support equipment uses VERY different design techniques than those
used in commercial equipment.  Whether you use parity or not in
commercial applications is as much a result of marketing push as of
technical justification; when your failure rate is down around one soft
error in 50 years, it is hard to justify parity on a strictly technical
basis.  On the other hand, the warm feeling it gives customers may
justify a 6% cost increase.

I encourage people to get true, hard reliability numbers for the parts
they use.  Manufacturers rarely publish such data in spec sheets, but
usually have some idea of the real numbers.  Avoid "accelerated" failure
rates (i.e. failure numbers taken under thermal stress and multiplied by
a stress factor).  MIL Handbook 217 has good, pessimistic numbers for
failure rates of all kinds of components, as well as equations for
figuring out your own numbers.  The numbers I presented here are only
rules of thumb and vary according to manufacturer, type of parts, etc.

I hope this has been of some help and interest.

Carl