krup@ihuxb.UUCP (Kevin O'Brien) (08/09/84)
>[] <- sacrificial offering.
>
> While designing a 32032 processor board, it became apparent
> that parity checking on a 512K/2048K memory system was
> going to cost us 8 memory chips, 4 parity generator/checkers,
> and a handful of buffers & control logic. The actual merits of
> memory parity checking have always been somewhat dubious to me given
> the generally inept way a fault is typically treated (system crash),
> the fact that one trades away some reliability in adding parity
> circuitry (who checks the checkers?), and the relatively low
> soft error rate claimed by semi manufacturers (some claim 1 error
> per ~6 months/Mbyte). At our particular site, with 40 Mbytes
> installed, we see a soft error rate of about 1 per 3-4 weeks.
>
> The question to net.digital: Given that the application is a general
> purpose workstation (i.e. non-critical in the sense that medical/
> control system applications are) is parity worth the costs in board
> real estate, $$ to the end user, and the rather questionable
> increase in reliability? I would especially like to hear from
> any manufacturers who have put some research into this trade-off.

Memory errors were never really a serious problem with 8-bit
microprocessors, since you were limited to 64K. If your system
exhibited erratic operation, such as trashed data or unexplained
crashes, the first thing that came to mind was "memory problem."
Since the total amount of memory was probably less than 64K, you
could afford the time required to test each sub-block exhaustively.
When memory size grows, say to 1 Mbyte, diagnostic programs must as
a practical matter become less thorough. How many people can afford
to tie up their machines for weeks looking for a possible memory
problem? There really is no good test for finding soft errors in a
large memory array. The only sensible solution is to design the
hardware to check itself, rather than dreaming up ever more
complicated diagnostic algorithms.

From a systems design standpoint, you have to consider how this thing
is going to be manufactured and who your customers are going to be.
What happens when it doesn't work in your factory? How about when
it's in your customers' hands? I can personally attest that
troubleshooting any 32-bit minicomputer or super microcomputer with
intermittent failure modes is often like finding a needle in a
haystack. I view adding a few extra DIPs so that the hardware flags
its own failures as an important system requirement, one that
ultimately saves customers and designers alike much grief.

As for soft errors that do not recur, they are infrequent and there
is really nothing you can do about them. They can be caused by, among
other things, cosmic rays. In any case, designing the system to
freeze up on a parity error is not necessarily the appropriate
response. The important thing is to warn the user, somehow, that he
has a problem.

Parity checking is well suited to detecting single-bit errors in
large memories. Eight data bits plus one parity bit comprise a 9-bit
code, each bit of which is checked by the other eight. Yes, the
parity bit is checked by the eight data bits. A single-bit error
within the 9-bit code will land on a data bit 8 times out of 9 (89%),
hence you GAIN reliability, you do not trade it away. The final point
is that any computer whose mean time to repair is lower than its
competitors' is the better product, and parity checking, being cheap
to implement, can only help in this regard.
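To make the 9-bit code concrete, here is a minimal C sketch (my own
illustration, not anything from an actual memory controller): even
parity is generated over the eight data bits on a write, stored as
the ninth bit, and re-checked on readback. Flipping any one of the
nine bits, data or parity, produces a mismatch.

#include <stdio.h>

/* Even parity over 8 data bits: returns the XOR of the bits, i.e. 1
 * when the byte has an odd number of 1s, so that (data, parity)
 * together always contain an even number of 1s. */
static unsigned parity8(unsigned char b)
{
    unsigned p = 0;
    while (b) {
        p ^= b & 1;
        b >>= 1;
    }
    return p;
}

int main(void)
{
    unsigned char data = 0x5A;       /* value "written" to memory   */
    unsigned check = parity8(data);  /* 9th bit stored alongside it */
    int i;

    /* Flip each of the 9 code bits in turn: the readback check
     * catches every single-bit error, including one in the parity
     * bit itself. */
    for (i = 0; i < 9; i++) {
        unsigned char d = data;
        unsigned c = check;

        if (i < 8)
            d ^= (unsigned char)(1 << i); /* soft error in a data bit */
        else
            c ^= 1;                       /* soft error in parity bit */

        printf("bit %d flipped: %s\n", i,
               parity8(d) != c ? "parity error flagged" : "MISSED");
    }
    return 0;
}

In hardware the same function is nine XOR gates per byte of width, or
a single 9-bit parity generator/checker such as the 74S280, which is
essentially what the parity generator/checkers in the original
question are.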
Now if you wanted to use Hamming, then the reliability/cost issue
would be more significant. In general, I think it's a little
presumptuous for a system designer to judge the integrity of his
customers' data as not being worthy of any safeguards whatsoever just
because an error won't directly cause someone to die. A soft memory
failure that recurs often, unbeknownst to the user, can over time
corrupt his entire database. While this is almost never a matter of
life and death, the guy who has to explain to his boss how some
hardware problem changed the figures around in the payroll records
(and the backup copies too!?) may wish he WERE dead. If I worked for
a company that was in the market for such a workstation and I found
out that the designers of the system I was considering didn't think
my applications were important enough to merit safeguarding, I'd
think about talking to some other vendors.

DISCLAIMER- THE ABOVE ARTICLE IS A PERSONAL OPINION ONLY AND BEARS NO
RELATION TO ANY RESEARCH, EXISTING PRODUCTS OR PLANNED PRODUCTS OF MY
COMPANY.