cdh@BBNCCX.ARPA (08/03/84)
From: "Carl D. Howe" <cdh@BBNCCX.ARPA>
The issue of whether error-detecting circuitry should be provided on
memory systems seems to be a bit of a religious argument. However,
given that one knows the architecture of THE SPECIFIC SYSTEM BEING
IMPLEMENTED, the tradeoffs can be computed numerically. Here are a
couple of examples:
NOTE: the following examples are based on real products. However,
details on chip counts and failure rates are rough estimates at best.
DO NOT TAKE THE RESULTS OF EITHER OF THESE EXAMPLES AS BEING
INDICATIVE OF THE TRUE RELIABILITY OF THESE PRODUCTS. Rather, use
them as examples of how you might do your own calculations.
Example 1:
Assume a system like the old IBM PC (i.e. the one that used 16K RAMs).
Such a system with 128K of memory requires 64 16K RAMs. Parity
checking at the byte level adds at least another 8 RAMs for that
amount of memory. A single parity checker is required (since the bus
path is only 8 bits wide). In addition to this memory system, assume
that the PC itself contains another 80 chips for display and another
20 chips or so for processor and EPROM. There is also a floppy
controller which we will ignore for the moment.
Assume the following order-of-magnitude failure rates for
components:
   16K RAMs                        20000 failures per billion hours
   64K RAMs                          200 failures per billion hours
   256K RAMs                         150 failures per billion hours
   LS SSI-MSI logic                   10 failures per billion hours
   LSI and VLSI other than memory     50 failures per billion hours
If you were designing a system, you would try to get the real numbers
from the manufacturers. The numbers for RAMs are representative of
the industry averages and include soft failures. Note that RAMs are
getting LOTs better.
Compute the failure rate of an IBM PC containing 20 LSI and VLSI
chips, 80 SSI/MSI chips, and 64 16K RAMs (i.e. ignore parity for the
moment):
   20 chips @    50 failures per billion hours =    1000
   80 chips @    10 failures per billion hours =     800
   64 chips @ 20000 failures per billion hours = 1280000
                                                 -------
   System failures per billion hours
     without parity checking                   = 1281800
or about 11.23 failures a year or one failure a month.
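For anyone who wants to plug in their own numbers, the arithmetic
above is easy to script. Here is a rough Python sketch of the
no-parity calculation; the chip counts and per-chip rates are the
same estimates given above, and the per-year conversion just assumes
8760 hours in a year.

    # Rough estimates from the table above, in failures per billion hours.
    VLSI   = 50        # LSI/VLSI other than memory
    SSI    = 10        # LS SSI-MSI logic
    RAM16K = 20000     # 16K dynamic RAM, soft failures included

    HOURS_PER_YEAR = 8760

    def per_year(rate_per_billion_hours):
        # Convert failures per 10^9 hours to failures per year.
        return rate_per_billion_hours * HOURS_PER_YEAR / 1e9

    pc_no_parity = 20 * VLSI + 80 * SSI + 64 * RAM16K
    print(pc_no_parity)                       # 1281800 failures per billion hours
    print(round(per_year(pc_no_parity), 2))   # 11.23 failures a year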
Again, this includes soft failures. The RAMs definitely dominate that
reliability equation. Now add parity checking to the reliability
numbers:
   System failure rate without parity               = 1281800
   8 chips @ 20000 failures per billion hours       =  160000
   1 parity checker @ 10 failures per billion hours =      10
                                                      -------
   System failures per billion hours
     with parity checking                           = 1441810
or about 12.63 failures a year.
Note that your system failure rate has increased by about 12% (see, parity
doesn't come for free), but it now catches failures in the part of the
system that accounts for 99.87% of your failures. Failures not
detected by parity (i.e. everything other than the RAMs themselves)
are now only 1810 per billion hours or about 1 per 60 years.
If we assume that system cost is linear with respect to chip count, we
have increased the cost of the system by about 5% (chips are free
compared to the cost of board space, packaging, testing, etc.).
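Continuing the sketch (same estimated rates, and again only rough
numbers), the parity overhead, the fraction of failures parity can
catch, the undetected-failure interval, and the chip-count increase
all fall out directly:

    RAM16K, SSI, VLSI = 20000, 10, 50    # failures per billion hours, as above
    HOURS_PER_YEAR = 8760

    no_parity   = 20 * VLSI + 80 * SSI + 64 * RAM16K   # 1281800, computed above
    parity_adds = 8 * RAM16K + 1 * SSI                 # 8 parity RAMs + 1 checker
    with_parity = no_parity + parity_adds

    print(with_parity)                                    # 1441810
    print(round(with_parity * HOURS_PER_YEAR / 1e9, 2))   # 12.63 failures a year
    print(round(100.0 * parity_adds / no_parity))         # about a 12% rate increase

    covered = (64 + 8) * RAM16K           # parity watches all of the RAM chips
    print(round(100.0 * covered / with_parity, 2))        # 99.87% of all failures
    undetected = with_parity - covered                    # 1810 per billion hours
    print(round(1e9 / undetected / HOURS_PER_YEAR))       # about one per 63 years

    print(round(100.0 * 9 / (20 + 80 + 64)))              # 9 added chips: about 5%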
Now you see why IBM recently changed their system to use 64K RAMs
exclusively; it reduces their RAM soft error rate by two orders of
magnitude, making parity errors much rarer.
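As a rough check on that claim, compare the RAM contribution for the
same 128K of memory built out of 16K parts versus 64K parts, using
the per-chip rates estimated above:

    RAM16K, RAM64K = 20000, 200       # failures per billion hours, as above

    old = 64 * RAM16K                 # 128K bytes from 64 16K RAMs
    new = 16 * RAM64K                 # 128K bytes from 16 64K RAMs
    print(old, new, round(old / new))     # 1280000  3200  400

The per-chip rate alone accounts for the two orders of magnitude;
needing only a quarter as many chips for the same amount of memory
pushes the RAM contribution down even further.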
Example 2:
Assume an Apple Macintosh. In contrast to the IBM, a Macintosh never
has more than 16 memory chips, and they are either 64K or 256K RAMs.
Other VLSI on the board comes to about 40 chips (again, a guesstimate)
with about 10 chips of SSI/MSI. System reliability without parity is
computed as follows:
   16 64K RAMs      @ 200 failures per billion hours = 3200
   40 VLSI chips    @  50 failures per billion hours = 2000
   10 SSI/MSI chips @  10 failures per billion hours =  100
                                                       ----
   System failures per billion hours                 = 5300
or about .05 failures per year. This still includes soft errors.
Now add parity:
   System reliability without parity                 = 5300
   2 64K RAM chips for parity                        =  400
   2 parity generators/checkers (16-bit data path)   =   20
                                                       ----
   System failures per billion hours                 = 5720
or about .05 failures per year.
System failure rate has been increased by 8% by the addition of
parity. Moreover, this parity catches only 63% of all system
failures. Furthermore, the added chips constitute an increase in cost
of 6% (again, assuming system cost is linear to chip count). The
trade-off is much less obvious here.
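The same sort of sketch for the Macintosh numbers (again, the chip
counts are rough guesses) shows why the tradeoff is less clear:

    RAM64K, VLSI, SSI = 200, 50, 10    # failures per billion hours, as above
    HOURS_PER_YEAR = 8760

    no_parity   = 16 * RAM64K + 40 * VLSI + 10 * SSI   # 5300
    parity_adds = 2 * RAM64K + 2 * SSI                 # 2 parity RAMs + 2 checkers
    with_parity = no_parity + parity_adds              # 5720

    print(round(with_parity * HOURS_PER_YEAR / 1e9, 2))  # 0.05 failures a year
    print(round(100.0 * parity_adds / no_parity))        # about an 8% rate increase

    covered = (16 + 2) * RAM64K          # parity only watches the RAM chips
    print(round(100.0 * covered / with_parity))          # catches about 63%
    print(round(100.0 * 4 / (16 + 40 + 10)))             # 4 added chips: about 6%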
Summary:
Reliability is a function of system design. Whether parity is worth
the cost it imposes on the system (both in terms of money and the
DECREASE in uptime) is a delicate tradeoff and should be weighed
carefully with some real failure rate numbers.
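If you do gather your own numbers, the whole exercise reduces to a
few lines. Here is a minimal sketch of the sort of helper you might
use; the function names are just for illustration, and the part list
shown is the PC-with-parity example from above, which you would
replace with your own counts and rates.

    HOURS_PER_YEAR = 8760

    def failure_rate(parts):
        # parts: list of (chip_count, failures_per_billion_hours) pairs.
        return sum(count * rate for count, rate in parts)

    def per_year(rate_per_billion_hours):
        return rate_per_billion_hours * HOURS_PER_YEAR / 1e9

    # The IBM PC numbers used earlier, with parity.
    parts = [(20, 50), (80, 10), (64, 20000),    # base system
             (8, 20000), (1, 10)]                # parity RAMs and checker
    total = failure_rate(parts)
    print(total, round(per_year(total), 2))      # 1441810 12.63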
Let's avoid getting too emotional about hospitals and fire-station
applications. Anyone doing life-support equipment uses VERY different
design techniques than those in commercial equipment. Whether you use
parity or not in commercial applications is as much a result of
marketing push as of technical justification; when your failure rate
is down around 1 soft error in 50 years, it is hard to justify parity
on a strictly technical basis. On the other hand, the warm feeling it
gives customers may justify a 6% cost increase.
I encourage people to get true, hard reliability numbers for the parts
they use. Manufacturers rarely publish such data in spec sheets, but
usually have some idea of the real numbers. Avoid "accelerated"
failure rates (i.e. failure numbers taken under thermal stress and
multiplied by a stress factor). MIL Handbook 217 has good pessimistic
numbers for failure rates of all kinds of components as well as
equations for figuring out your own numbers. The numbers I presented
here are only rules of thumb and vary according to manufacturer, part
type, etc.
I hope this has been of some help and interest.
Carl