[net.micro] net.digital: RE: Is parity worth it?

krup@ihuxb.UUCP (Kevin O'Brien) (08/09/84)

>[] <- sacrificial offering.
>
>	While designing a 32032 processor board, it became apparent
>	that parity checking on a 512K/2048K memory system was
>	going to cost us 8 memory chips, 4 parity generator/checkers,
>	and a handful of buffers & control logic.  The actual merits of 
>	memory parity checking have always been somewhat dubious to me given
>	the generally inept way a fault is typically treated (system crash), 
>	the fact that one trades away some reliability in adding parity
>	circuitry (who checks the checkers?), and the relatively low
>	soft error rate claimed by semi manufacturers (some claim 1 error
>	per ~6 months/mbyte).  At our particular site, with 40 mBytes 
>	installed, we see a soft error rate of about 1 per 3-4 weeks.  
>
>	The question to net.digital: Given that the application is a general
>	purpose workstation (e.g. non critical in the sense that medical/
>	control system applications are) is parity worth the costs in board 
>	realestate, $$ to the end user, and the rather questionable 
>	increase in reliability?  I would especially like to hear from
>	any manufacturers who have put some research into this trade off..


Memory errors were never really a serious problem with 8-bit microprocessors
since you were limited to 64K. If your system exhibited erratic operation,
such as trashed data or unexplained crashes, the first thing that came to
mind was "memory problem." Since the total amount of memory was probably
less than 64K, you could afford the time required to exhaustively test each
sub-block. When memory size increases, say to 1 Mbyte, diagnostic programs
as a practical matter must become either impractically slow or less thorough.
How many people can afford to tie up their machines for weeks looking for a
possible memory problem? There really is no good software test for finding
soft errors in a large memory array. The only sensible solution is to design
the hardware to check itself as opposed to dreaming up ever more complicated
diagnostic algorithms.
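To make the point concrete, here is a minimal sketch (my own illustration,
nothing to do with the 32032 board in question) of the kind of software
diagnostic I mean. Even this trivial two-pass pattern test grows linearly
with memory size, and it can only catch an error that is present while the
test happens to be running:

#include <stdio.h>
#include <stdlib.h>

/* write 'pattern' into every word, read it all back;
 * return the index of the first mismatch, or -1 if none
 */
static long pattern_test(volatile unsigned long *mem, long words,
                         unsigned long pattern)
{
    long i;

    for (i = 0; i < words; i++)          /* write the pattern everywhere */
        mem[i] = pattern;
    for (i = 0; i < words; i++)          /* then read it all back        */
        if (mem[i] != pattern)
            return i;
    return -1;
}

int main(void)
{
    long words = 512L * 1024L / sizeof(unsigned long);   /* a 512K buffer */
    volatile unsigned long *mem = malloc(words * sizeof(unsigned long));

    if (mem == NULL)
        return 1;

    if (pattern_test(mem, words, 0x55555555UL) >= 0 ||
        pattern_test(mem, words, 0xAAAAAAAAUL) >= 0)
        printf("hard failure located\n");
    else
        printf("no stuck bits found -- which proves very little\n");

    free((void *)mem);
    return 0;
}

A soft error that strikes an hour after the test finishes goes completely
unnoticed, which is exactly why the checking belongs in the hardware.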

From a systems design standpoint, you have to consider how this thing is
going to be manufactured and who your customers are going to be. What happens
when it doesn't work in your factory? How about when it's in your customers'
hands? I can personally attest that troubleshooting any 32-bit minicomputer
or super microcomputer with intermittent failure modes is often like finding
a needle in a haystack. I view adding a few extra DIPs so that the hardware
flags its own failures as an important system requirement, one that
ultimately saves customers and designers alike much grief.

As for soft errors that do not recur, they are infrequent and there's
really nothing you can do about them. They can be caused by, among other
things, cosmic rays. In any case, designing the system to freeze up on a
parity error is not necessarily an appropriate response. The important
thing is to warn the user, somehow, that he has a problem.
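What that might look like, in rough outline: a trap routine that records the
event and tells the user instead of halting the machine. The register names
below are purely hypothetical; the real hookup depends entirely on your
memory controller and operating system.

#include <stdio.h>

#define PARITY_ADDR (*(volatile unsigned long *)0xFF0010)  /* hypothetical */
#define PARITY_ACK  (*(volatile char *)0xFF0014)           /* hypothetical */

void parity_trap(void)      /* entered from the NMI/trap vector */
{
    unsigned long bad = PARITY_ADDR;   /* address of the failing word */

    PARITY_ACK = 1;                    /* re-arm the checker          */
    fprintf(stderr, "warning: memory parity error at %08lx\n", bad);
    /* return and keep running; escalate only if errors recur */
}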

Parity checking is well suited for detecting single-bit errors in large
memories. Eight data bits plus one parity bit comprise a 9-bit code, each
bit of which is checked by the other eight. Yes, the parity bit is checked
by the eight data bits. There is an 8-in-9 (about 89%) probability that a
single-bit error within the 9-bit code will land on a data bit rather than
the parity bit, hence you GAIN reliability, not trade it away.
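For the curious, here is a quick sketch (again my own illustration, not any
particular board's logic) of the even-parity scheme described above:

#include <stdio.h>

/* return 1 if the byte has an odd number of one bits, else 0 */
static unsigned odd_ones(unsigned b)
{
    unsigned count = 0;

    while (b) {
        count += b & 1;
        b >>= 1;
    }
    return count & 1;
}

int main(void)
{
    unsigned data   = 0x5A;            /* an arbitrary example byte      */
    unsigned parity = odd_ones(data);  /* even parity: the 9th bit       */
    unsigned bad    = data ^ 0x08;     /* simulate a one-bit soft error  */

    printf("stored  %02X parity=%u  %s\n", data, parity,
           odd_ones(data) == parity ? "ok" : "PARITY ERROR");
    printf("read    %02X parity=%u  %s\n", bad, parity,
           odd_ones(bad) == parity ? "ok" : "PARITY ERROR");
    return 0;
}

A flip of the stored parity bit itself trips the same check as a flip of
any data bit, which is the sense in which the checker gets checked.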

The final point is that any computer whose mean time to repair is lower
than its competitors' is a better product. And parity checking, being
cheap to implement, can only help in this regard. Now if you wanted to use
Hamming error correction, then the reliability/cost issue would be more
significant.

In general, I think it's a little presumptuous for a system designer to
judge the integrity of his customers' data as not being worthy of any
safeguards whatsoever just because an error won't directly cause someone
to die. A soft memory failure that recurs often, unbeknownst to the user,
can over time corrupt his entire database. While this is almost never a
matter of life and death, the guy who has to explain to his boss how
some hardware problem changed the figures around in the payroll records
(and in the backup copies too!?) may wish he WERE dead. If I worked for a
company that was in the market for such a workstation and I found out
that the designers of the system I was considering didn't think my
applications were important enough to merit safeguarding, I'd think about
talking to some other vendors.
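To put the parity-versus-Hamming cost point above in rough numbers (my own
arithmetic, not figures from any product): byte-wide parity adds 1 check bit
per 8 data bits, a 12.5% overhead in memory chips, and the checking logic is
a handful of parts. A single-error-correcting, double-error-detecting
Hamming code over a 32-bit word needs 7 check bits, roughly 22% more memory,
plus the syndrome-generation and correction logic sitting in the data path.
That is why parity is close to free while ECC is a real cost/benefit decision.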


DISCLAIMER-
THE ABOVE ARTICLE IS A PERSONAL OPINION ONLY AND BEARS NO RELATION TO
ANY RESEARCH, EXISTING PRODUCTS OR PLANNED PRODUCTS OF MY COMPANY.