[comp.arch] Parity

henry@utzoo.uucp (Henry Spencer) (09/08/89)

In article <7851@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
>Most PCs use parity checked memory, which lets them detect a memory error...

If they notice it.  Most PC applications either disable parity-error traps
or ignore them.  And at least one manufacturer was a little bit horrified
to discover that his parity checker didn't work and had never worked, and
nobody had noticed.  (The standard hardware does not provide any way to
*test* the parity subsystem.)
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

baum@Apple.COM (Allen J. Baum) (02/08/90)

[]
>In article <1911@sunquest.UUCP> terry@sunquest.UUCP (Terry Friedrichsen) writes:
>
>Now wait.  I've seen a couple of instances of memory parity errors on
>fairly new PCs.  Are you really trying to say that I'd be better off
>not knowing about memory parity errors, and I should just let programs
>quietly screw up?  Or am I missing your point in some way?  I know your
>article addressed soft chip errors exclusively, but your conclusion seems
>a bit strong.
>
>Memory reliability involves more than just the RAM chip itself; there's
>many a slip twixt the chip and the CPU.  I'll buy the idea that errors
>are so infrequent that ECC is unnecessary overkill, but sorry, I just
>GOTTA have that parity check ...

Actually, especially when it comes to PCs, parity might be a loss rather
than a win, but not for the reasons you'd suspect. Often, it's the parity
generating circuitry that's the critical path, and it fails more often than
the memories. Thus, you get lots of parity errors that aren't really errors.
It's the parity checking and generating circuitry that's the weak point.
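
For readers who want the mechanics spelled out, here is a minimal sketch of
per-byte parity in Python. It assumes even parity over 8 data bits plus one
stored parity bit; the polarity and the helper names (parity_bit, store,
check) are chosen for illustration only and are not taken from any
particular PC design. It shows a single-bit flip being caught, a double-bit
flip slipping through, and a fault in the parity bit itself flagging data
that is actually fine.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 when the 8-bit value has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") & 1


def store(byte: int) -> tuple:
    """Model a 9-bit memory cell as (data, parity)."""
    return (byte & 0xFF, parity_bit(byte))


def check(word: tuple) -> bool:
    """Return True when the stored parity agrees with the data."""
    data, p = word
    return parity_bit(data) == p


stored = store(0x5A)

one_flip = (stored[0] ^ 0x08, stored[1])     # one data bit flipped
two_flips = (stored[0] ^ 0x18, stored[1])    # two data bits flipped
parity_flip = (stored[0], stored[1] ^ 1)     # data intact, parity bit wrong

print(check(stored))       # True: clean read
print(check(one_flip))     # False: single-bit error detected
print(check(two_flips))    # True: double-bit error goes unnoticed
print(check(parity_flip))  # False: error reported although the data is good
```

The last case is the one at issue here: a fault anywhere in the parity
path looks, to the machine, exactly like a memory error.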


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/08/90)

In article <38420@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:

| Actually, especially when it comes to PCs, parity might be a loss rather
| than a win, but not for the reasons you'd suspect. Often, it's the parity
| generating circuitry that's the critical path, and it fails more often than
| the memories. Thus, you get lots of parity errors that aren't really errors.
| It's the parity checking and generating circuitry that's the weak point.

  I can't say you're wrong about that because I don't have the detailed
stats on all the computers in the world, but I did ask one of the repair
guys here about memory vs. parity circuit failures, and he said it was
memory virtually all the time. They maintain at least 800 machines which
are PCs or PC clones with parity checking.

  Given a choice I wouldn't buy a computer without parity, because a
wrong answer is a lot worse than no answer at all to me.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

frazier@oahu.cs.ucla.edu (Greg Frazier) (02/09/90)

In article <2102@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
+In article <38420@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
+
+| Actually, especially when it comes to PCs, parity might be a loss rather
+| than a win, but not for the reasons you'd suspect. Often, it's the parity
+| generating circuitry that's the critical path, and it fails more often than
+| the memories. Thus, you get lots of parity errors that aren't really errors.
+| It's the parity checking and generating circuitry that's the weak point.
+
+  I can't say you're wrong about that because I don't have the detailed
+stats on all the computers in the world, but I did ask one of the repair
+guys here about memory vs. parity circuit failures, and he said it was
+memory virtually all the time. They maintain at least 800 machines which
+are PCs or PC clones with parity checking.

Correct me if I'm wrong, but aren't we talking about soft
errors, here?  If so, the repair man isn't going to see
them.  I would be a bit surprised if parity hardware experienced
more soft errors than memory, but I'd be shocked if parity
hardware had more hard errors.

Greg Frazier
..................................................................
"They thought to use and shame me but I win out by nature, because a true
freak cannot be made.  A true freak must be born." - Geek Love

Greg Frazier	frazier@CS.UCLA.EDU	!{ucbvax,rutgers}!ucla-cs!frazier

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (02/09/90)

In article <38420@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>Actually, especially when it comes to PCs, parity might be a loss rather
>than a win, but not for the reasons you'd suspect. Often, it's the parity
>generating circuitry that's the critical path, and it fails more often than
>the memories. Thus, you get lots of parity errors that aren't really errors.
>It's the parity checking and generating circuitry that's the weak point.

But surely the point is Total-undetected-error-count. Yes, it is annoying
to have errors introduced by the checking circuitry, but that is secondary
to catching all real errors.

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh185
Me? Represent other people? Don't make them laugh so hard.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/09/90)

In article <31673@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:

| Correct me if I'm wrong, but aren't we talking about soft
| errors, here?  If so, the repair man isn't going to see
| them.  

  You make a good point. If I see an error infrequently, not in the
same location, I can accept that as a soft error. If I see failure at
one location, even only a few times a year, I assume that it is a
marginal part.

  There are several programs which allow changing the refresh rate on a
PC style machine. I find that "soft errors" go away when memory tests
are run on a system with slowed refresh, and the marginal parts are
replaced. I talked to the repair guy again, and he said that people are
not reporting intermittent errors. I don't know if that means they don't
mention them, or the memory tests find them. We do tend to replace
memory as bad if the error rate is non-zero, easier now that prices are down.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

chase@Ozona.orc.olivetti.com (David Chase) (02/09/90)

In article <2127@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>But surely the point is Total-undetected-error-count. Yes, it is annoying
>to have errors introduced by the checking circuitry, but that is secondary
>to catching all real errors.

If you really mean what you say, then it is ok to have a parity
checker that (incorrectly, of course) always indicates that an error
has occurred.  I can guarantee you that it will catch every single
error that occurs, which is better than *all* error-detecting schemes
used in computers today (which will only guarantee that the
probability of a missed error is very low, which is not the same as
catching all errors).

Of course, my error-detector won't let you get much done with your
computer, but we've already agreed that that is a "secondary" concern.
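
To put rough numbers on that, here is a small simulation sketch in Python.
Everything in it is an assumption made for illustration: the error rate is
absurdly high so that the counts are visible, errors are single-bit flips in
the data only, and the names (parity_check, always_alarm, evaluate) are
hypothetical. It counts two things per checker: real errors missed, and
clean reads flagged as errors.

```python
import random


def parity(byte: int) -> int:
    """Even-parity bit of an 8-bit value."""
    return bin(byte).count("1") & 1


def parity_check(data: int, p: int) -> bool:
    """Flag an error when the stored parity disagrees with the data."""
    return parity(data) != p


def always_alarm(data: int, p: int) -> bool:
    """The degenerate checker: every read is reported as an error."""
    return True


def evaluate(detector, n_reads=10_000, error_rate=0.01, seed=1):
    """Return (missed_errors, false_alarms) over n_reads simulated reads."""
    rng = random.Random(seed)
    missed = false_alarms = 0
    for _ in range(n_reads):
        data = rng.randrange(256)
        p = parity(data)                      # parity written with the data
        corrupted = rng.random() < error_rate
        if corrupted:
            data ^= 1 << rng.randrange(8)     # single-bit flip in the data
        flagged = detector(data, p)
        if corrupted and not flagged:
            missed += 1
        if flagged and not corrupted:
            false_alarms += 1
    return missed, false_alarms


print("parity check:", evaluate(parity_check))
print("always alarm:", evaluate(always_alarm))
```

Under these assumptions both checkers miss zero errors, so the
undetected-error count alone cannot tell them apart; only the false-alarm
count shows that one of them makes the machine unusable.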

Seriously, there are tradeoffs.  What we're probably talking about
here is "given (a) parity checking, what trade-off would you make
between (b) a slower clock and (c) frequent crashes due to
misidentified errors?"  Probably (a) and (c) is unacceptable -- people
would decide your machine was flaky, so the choice is (a) and (b),
which means that your machine is slower.  (How much slower I don't
know -- one could also spend more money on the checking circuits, but
then the computer gets more expensive).

Either way, a computer without parity looks better most of the time
(until that error comes along which silently trashes bezillions of
dollars worth of data), which means that the computer without parity
is the one that sells.  Tandem, of course, makes money selling
reliable computers, but they aren't selling PCs.

David

terry@uts.amdahl.com (Lewis T. Flynn) (02/22/90)

In article <4139@ganymede.inmos.co.uk> davidb@inmos.co.uk (David Boreham) writes:
>No. Parity is intended to find soft errors. The majority of failures
>may well be memory problems but I'd bet that they were nothing to do 
>with soft errors. 
>
>You can very easily check on the state of your memory chips using a
>power-on test. A decent power-on-selftest can also look for marginal
>failures.
>
>Still no parity needed.

This works well for some cases but not others. Some applications require
7x24 uptime. These folks are happy to pay for single bit error correction and
double bit error detection, and get real upset if they have to power their
machines down for any reason. One system with which I was personally associated
was booted for any reason only three times in 19 months and the processor was
never powered down during that time.
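
As a rough illustration of what "single bit error correction and double bit
error detection" buys over plain parity, here is a toy sketch in Python
using an extended Hamming code on a 4-bit value (8 stored bits). Real ECC
memory protects much wider words with more check bits; the code size, bit
layout, and function names here are assumptions picked to keep the example
short, not a description of any vendor's hardware.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indices 1..7 are Hamming(7,4) positions (parity at 1, 2, 4)."""
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    for p in (1, 2, 4):
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    for i in range(1, 8):
        code[0] ^= code[i]
    return code


def decode(code: list) -> tuple:
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i            # XOR of the positions holding a 1
    overall = 0
    for bit in code:
        overall ^= bit               # 0 when overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single error; position 0 is the parity bit
        status = "corrected"
    else:
        return None, "double-bit error detected"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
single = list(word)
single[6] ^= 1
double = list(word)
double[2] ^= 1
double[5] ^= 1

print(decode(word))
print(decode(single))
print(decode(double))
```

With plain parity the single-bit case halts the machine; here it is
repaired on the fly, and only the double-bit case has to be reported.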

Terry

Disclaimer: I'm not a hardware type and I don't know what Amdahl's views on
this subject might be (although I bet I could guess 8-).

davidb@braa.inmos.co.uk (David Boreham) (02/28/90)

In article <08b.02x48aub01@amdahl.uts.amdahl.com> terry@amdahl.uts.amdahl.com (Lewis T. Flynn) writes:
>double bit error detection, and get real upset if they have to power their
>machines down for any reason. One system with which I was personally associated
>was booted for any reason only three times in 19 months and the processor was
>never powered down during that time.
>this subject might be (although I bet I could guess 8-).


Absolutely. This, however, does nothing for the argument for simple parity
checking. There you get only a red light and a halted machine.
As I said in my original posting, the soft-error rates exhibited by modern
DRAMs are so low that there are plenty of other similarly likely sources
of error in PCs and workstations, and therefore I don't believe that parity
is necessary in those kinds of machines.

If you have a Gbyte of DRAM then you should expect a soft error every year or so.
If you want to have a good chance of staying up for three months then you need
ECC on your DRAM. However, without all kinds of other redundancy you should not
expect the rest of your system to stay up for three months either.
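
The arithmetic behind those figures can be sketched as follows, in Python.
The rate of one soft error per Gbyte per year is the figure quoted above;
treating errors as a Poisson process whose rate scales linearly with memory
size is an added assumption, as is the 4 Mbyte machine used for comparison.

```python
import math

ERRORS_PER_GBYTE_YEAR = 1.0   # "a soft error every year or so" per Gbyte


def p_no_error(mem_gbytes: float, years: float,
               rate: float = ERRORS_PER_GBYTE_YEAR) -> float:
    """Probability of zero soft errors, assuming Poisson arrivals."""
    expected_errors = rate * mem_gbytes * years
    return math.exp(-expected_errors)


# 1 Gbyte running for three months: about 0.78
print(round(p_no_error(1.0, 0.25), 2))

# Hypothetical 4 Mbyte PC running for a full year: about 0.996
print(round(p_no_error(4 / 1024, 1.0), 3))
```

Under these assumptions the large machine has roughly a one in five chance
of taking a soft error within three months, while the small one sees about
one error in 256 years of operation.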

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |     (us): uunet!inmos.com!davidb
+44 454 616616 ex 547        | Internet: davidb@inmos.com