[comp.sys.amiga.tech] Parity Checking / ECC RAM on the A3000

lphillips@lpami.wimsey.bc.ca (Larry Phillips) (05/27/90)

In <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
>
>The Amiga 3000 is capable of holding (at least supporting) much more memory
>than a Cray 1, and the size of gates in modern memory is much smaller and
>thus they are more susceptible to alpha radiation induced parity errors than
>were the gates of the Cray memory.
>
>To take an Amiga seriously as a commercial machine in the "workstation,
>large memory" market, I'd guess error correcting code will turn out to be
>vital.  It would be a shame to have a big production run of the hardware
>installed and on the street, only to have parity problems give the machine
>a reputation as an unreliable machine, to be avoided in droves.  Better if
>the problem is solved before the reputation is besmirched.

Right.. ECC is not parity, and vice versa. Parity checking is totally,
completely, and utterly useless.

>But then, what do I know after 29 years in the field about what people who
>buy the machines look for in a large processor?  My computer purchases were
>limited to a couple of million bucks worth, down in the noise level in the
>marketplace.  ;-)

Geez... out-yeared me by 3. :-)

-larry

--
The raytracer of justice recurses slowly, but it renders exceedingly fine.
+-----------------------------------------------------------------------+ 
|   //   Larry Phillips                                                 |
| \X/    lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
|        COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com        |
+-----------------------------------------------------------------------+

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/27/90)

In article <756@bilver.UUCP> alex@bilver.UUCP (Alex Matulich) writes:
>
>However, the fast RAM chips are replaceable by the 4 meg variety.  You can
>stick 16 of them in the A3000.  That's a potential for 16 parity errors
>per day!
>
>I think I would worry about that.  I wouldn't want a scientific experiment
>or financial program go off using erroneous data without knowing it.
>Three extra parity bits per byte would allow detection of up to 2 bits and
>correction of one (per byte).  This is very complicated to design, however.
>Single-bit parity error detection (like one has on IBM compatibles) is
>relatively easy.

Computer folk wisdom has it (actually, I had this from someone at the NCAR
site where the beast was then installed) that Seymour Cray built Cray 1
number 1 without parity checking.  The error rate in that much memory was
insufferable, so the machine became a sort of demo machine; when you ordered
a Cray 1, you first got serial number 1 installed, on which you could build
and test your code in a lossy environment, until a machine with parity could
be built for you and swapped for the useless toy first delivered.

The Amiga 3000 is capable of holding (at least supporting) much more memory
than a Cray 1, and the size of gates in modern memory is much smaller and
thus they are more susceptible to alpha radiation induced parity errors than
were the gates of the Cray memory.

To take an Amiga seriously as a commercial machine in the "workstation,
large memory" market, I'd guess error correcting code will turn out to be
vital.  It would be a shame to have a big production run of the hardware
installed and on the street, only to have parity problems give the machine
a reputation as an unreliable machine, to be avoided in droves.  Better if
the problem is solved before the reputation is besmirched.

But then, what do I know after 29 years in the field about what people who
buy the machines look for in a large processor?  My computer purchases were
limited to a couple of million bucks worth, down in the noise level in the
marketplace.  ;-)

Kent, the man from xanth, now zooming to the net from Zorch.
(xanthian@zorch.sf-bay.org)

lphillips@lpami.wimsey.bc.ca (Larry Phillips) (05/29/90)

In <dillon.3992@overload.UUCP>, dillon@overload.UUCP (Matthew Dillon) writes:
>>In article <3620@tymix.UUCP> pnelson@hobbes.uucp (Phil Nelson) writes:
>>In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>>
>>>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>>>completely, and utterly useless.
>>
>>Oh really? Please explain why parity checking would not have saved me much
>>
>> The advantage of parity checking is diagnostic, intermittent problems on
>
>
>    I have a tendency to agree.  ECC is cute but expensive.  It takes 7
>    bits of ECC to detect and correct 1 bit in a 32 bit wide word.  The way
>    you think about 1-bit-ECC is that you need enough codes to generate the
>    address of the incorrect bit, plus a no-error code, plus a parity bit.
>    Unfortunately, that no-error code takes us from 5 to 6 bits, then one
>    more to parity-check the ECC code itself.  A 1-bit ECC can correct 1 bit
>    errors and detect 2-bit errors.

In the context of the posting I replied to, that of a life-supporting or Very
Important Application implementation, parity is indeed useless, and ECC can be
seen as coming as close to mandatory as anything can be.  An exception to this
might be an implementation that uses multiple identical computers in a
'majority rules' configuration.

The fact that a properly designed ECC scheme can correct errors in the ECC bits
themselves makes it far more desirable for reliability and recoverability,
though at a greater cost.

Parity schemes, on the other hand, cannot detect the failure of a parity bit
itself, and thus reduce overall reliability as a tradeoff for knowing when
you had an error, even if that error is meaningless and would not have happened
without the parity bit being present.  Statistically speaking, if parity is
checked on a byte basis, 1/9 of all single-bit errors could be safely ignored,
and that takes into account ONLY the parity RAMs themselves, without considering
the current contents of the memory, the importance of the application taking
the hit, etc.
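
A quick back-of-the-envelope check of that 1/9 figure, as a small C sketch
(the two schemes listed are just the ones discussed in this thread, and the
percentages are simple ratios, nothing measured):

#include <stdio.h>

/* Back-of-the-envelope check of the 1/9 figure: with one parity bit per
 * 8-bit byte, 1 of every 9 stored bits is a parity bit, so roughly 1/9 of
 * random single-bit failures hit a bit that carries no data.  The 32+7
 * SEC-DED case is shown alongside for comparison. */
int main(void)
{
    struct { const char *name; int data_bits, check_bits; } schemes[] = {
        { "byte parity    (8+1) ",  8, 1 },
        { "32-bit SEC-DED (32+7)", 32, 7 },
    };
    int i;

    for (i = 0; i < 2; i++) {
        int total = schemes[i].data_bits + schemes[i].check_bits;
        printf("%s: %2d%% extra RAM, %4.1f%% of single-bit errors hit check bits\n",
               schemes[i].name,
               100 * schemes[i].check_bits / schemes[i].data_bits,
               100.0 * schemes[i].check_bits / total);
    }
    return 0;
}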

>    A simple 1-bit parity check is sufficient to detect the problem that
>    ECC would have corrected, and allow the processor to map the page out
>    with its MMU.  In any case, this kind of failure occurs less often than
>    you think.  What most people come up against is a BAD DRAM (i.e., the cause
>    of the problem is not alpha radiation), in which case it is not reliable
>    anyway and you simply have to replace the chip.

Assuming that an MMU is in place, and assuming that the error was a random
event caused by external forces (cosmic rays, whatever), the page may or may
not require mapping out, though with parity checking only, you really don't
have a lot of choice. With an ECC scheme, the system can make note of the error
and keep using the memory, allowing it to map the page out when the number of
errors exceeds a threshold over a predefined period of time. It will also
allow reporting of single-bit errors to the operator, who can make a good
judgement as to the root cause, and take action as appropriate.
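
A minimal sketch in C of that sort of policy (the threshold, window, and
names are made up for illustration, not anything a real OS does):

#include <stdio.h>
#include <string.h>

/* Sketch of the policy described above (threshold, window size and names
 * are invented for illustration): count corrected single-bit errors per
 * page and only retire a page once it crosses a threshold within an
 * observation window. */

#define PAGES          1024
#define ERR_THRESHOLD  3        /* corrected errors tolerated per window */

static unsigned err_count[PAGES];

/* Called each time the ECC hardware reports a corrected single-bit error
 * for 'page'.  Returns 1 if the page should now be mapped out. */
static int note_corrected_error(unsigned page)
{
    return ++err_count[page] > ERR_THRESHOLD;
}

/* Called once per observation window (say, daily) to forget old errors. */
static void reset_window(void)
{
    memset(err_count, 0, sizeof err_count);
}

int main(void)
{
    int i;
    for (i = 1; i <= 5; i++)
        printf("corrected error %d on page 7 -> %s\n", i,
               note_corrected_error(7) ? "map it out" : "keep using it");
    reset_window();
    return 0;
}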

Parity is a heavy-handed beast, telling you little, and treating all memory as
equal. Should video memory be parity checked, assuming that you can readily
identify where the video is being displayed from? If so, should you crash and
burn because a picture has a pixel showing a bad colour? If not, can you trust
the figures your spreadsheet shows? If you don't crash and burn, should you
ignore the red light? Should you panic? Either way, you could be wrong.

Hardware is getting cheaper all the time. ECC is a little more expensive than
parity. In some ways, it can be said to be cheaper, if you count the lost
productivity when an error occurs that cannot be corrected but would not have
mattered to the application.

In Very Important Applications, I would go for ECC.  In other situations, I
would go for no checking at all. Parity is useless.

>    DRAMs these days are much more reliable than 10 years ago... even 5 years
>    ago.

You rest my case. :-)

-larry

--
The raytracer of justice recurses slowly, but it renders exceedingly fine.
+-----------------------------------------------------------------------+ 
|   //   Larry Phillips                                                 |
| \X/    lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
|        COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com        |
+-----------------------------------------------------------------------+

pnelson@hobbes.uucp (Phil Nelson) (05/29/90)

In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:

>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>completely, and utterly useless.

 Oh really? Please explain why parity checking would not have saved me much
time and trouble in 1986, when I bought a flaky Pacific Cypress RAM expansion
and then spent the next 2 months convincing them that the problem was not in
the then-buggy Amiga software. It turned out that they needed larger bypass
caps in their memory array; since the problem was intermittent, it did not
show up in their memory tests. If that box had had parity, I and everyone else
who bought one before they were finally convinced that they had a problem
and fixed it could have been saved a whole lot of wasted time.

 The advantage of parity checking is diagnostic: intermittent problems on
complex systems can be very difficult to diagnose, particularly by end
users like me who, even if they have a certain amount of expertise, have
neither the time nor the equipment to isolate the tough ones.





Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508

	The words of the wicked lie in wait for blood,
           but the mouth of the upright delivers men.   -Proverbs 12:6

sysop@tlvx.UUCP (SysOp) (05/29/90)

In article <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
> In article <756@bilver.UUCP> alex@bilver.UUCP (Alex Matulich) writes:
> >
> >However, the fast RAM chips are replaceable by the 4 meg variety.  You can
> >stick 16 of them in the A3000.  That's a potential for 16 parity errors
> >per day!
...
[story about Cray 1 deleted]
> The Amiga 3000 is capable of holding (at least supporting) much more memory
> than a Cray 1, and the size of gates in modern memory is much smaller and
> thus they are more susceptible to alpha radiation induced parity errors than
> were the gates of the Cray memory.

Even "more susceptable"?  Ok, why?  How?  If it were really that bad, then
people using the A3000 now should be at least occasionally noticing weird
things happening, right?  But are they?  What about my A1000 with 2.5 megs?
(Of course, I don't have parity, so how can I tell?  Sigh.)

> 
> To take an Amiga seriously as a commercial machine in the "workstation,
> large memory" market, I'd guess error correcting code will turn out to be
....
> But then, what do I know after 29 years in the field about what people who
> buy the machines look for in a large processor?  My computer purchases were
> limited to a couple of million bucks worth, down in the noise level in the
> marketplace.  ;-)

It's not that I don't believe you, but... I would like some real concrete
information.

Explain to me this:  I've used a 20 MHz AST 386 for at least 1.5 years at
work, the last year or so with a total of 3 megs.  While I've had problems
(I AM developing software :-), no parity errors have ever appeared.  Is it
possible that the AST has no parity checking, or is it possible that parity
errors are much more rare than some people think?

Sure, I see nothing wrong with making a system more "reliable", but if it's
not really doing any good, then it's a waste of time and money.  If parity is
truly necessary, perhaps concrete proof is going to be needed to convince
others that it's necessary.  The Commodore engineers read these newsgroups, and
I'm sure they've thought about this.  Since there doesn't seem to be a large
cry for it, and Commodore doesn't think it's necessary, just one or two people
saying, "You need Parity or else you're not a Real Machine (TM)," isn't going
to change anything.  This isn't a flame; it's just that there were already
a lot of messages on this subject, and as with any subject, past a certain
point you need to go beyond opinion and start with the cold, hard facts.  Since I
don't know the rate of errors, I could learn something myself.  (Earlier
messages didn't convince me either way; my mind is still open to debate.
Convince me!!! :-)

If it's only the denser chips that have the errors, then the question is,
will such memory improve with technology such that this won't be a concern,
like the less-dense chips (after some period of time)?

> 
> Kent, the man from xanth, now zooming to the net from Zorch.
> (xanthian@zorch.sf-bay.org)

--
Gary Wolfe
uflorida!unf7!tlvx!sysop, unf7!tlvx!sysop@bikini.cis.ufl.edu

jesup@cbmvax.commodore.com (Randell Jesup) (05/29/90)

>In <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

>>The Amiga 3000 is capable of holding (at least supporting) much more memory
>>than a Cray 1, and the size of gates in modern memory is much smaller and
>>thus they are more susceptible to alpha radiation induced parity errors than
>>were the gates of the Cray memory.

	I think you're making a pretty tenuous comparison here.  I could be
wrong, but I don't see suns or the like using ECC (I'm not even certain all
of them are using parity).  I don't see that the marketplace has shown it
to be important for desktop machines in the sun/whatever class, let alone
the low-end sun/high-end amiga level.

>>To take an Amiga seriously as a commercial machine in the "workstation,
>>large memory" market, I'd guess error correcting code will turn out to be
>>vital.  It would be a shame to have a big production run of the hardware
>>installed and on the street, only to have parity problems give the machine
>>a reputation as an unreliable machine, to be avoided in droves.  Better if
>>the problem is solved before the reputation is besmirched.

	I think ECC is one of our less important problems at the moment.
If people care, they can drop in an ECC memory card (cpu slot for max
speed or Z-III) and put all their fast ram there.  An opportunity for 3rd-
party hardware companies - or it would be if anyone cared about ECC, which
they (for the most part) don't.

>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>completely, and utterly useless.

	Yup (or very close).

>>But then, what do I know after 29 years in the field about what people who
>>buy the machines look for in a large processor?  My computer purchases were
>>limited to a couple of million bucks worth, down in the noise level in the
>>marketplace.  ;-)

	Amigas are "large processors", then?  ;-)  BTW, Commodore just sold
about $10M of Amigas to the government (as reported in WSJ, I think).  We
(as part of a Sears business center deal) won a subcontract for supplying
multitasking computers to the government.  Apparently this is surprising, since
we've only been trying to sell to the government for 6 months, and many firms
don't make sales for 18 months, due to long product cycles.  (Taken from the
WSJ article, not anything internal.)

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

dillon@overload.UUCP (Matthew Dillon) (05/29/90)

>In article <3620@tymix.UUCP> pnelson@hobbes.uucp (Phil Nelson) writes:
>In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>
>>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>>completely, and utterly useless.
>
>Oh really? Please explain why parity checking would not have saved me much
>
> The advantage of parity checking is diagnostic, intermittent problems on


    I have a tendency to agree.  ECC is cute but expensive.  It takes 7
    bits of ECC to detect and correct 1 bit in a 32 bit wide word.  The way
    you think about 1-bit-ECC is that you need enough codes to generate the
    address of the incorrect bit, plus a no-error code, plus a parity bit.
    Unfortunately, that no-error code takes us from 5 to 6 bits, then one
    more to parity-check the ECC code itself.  A 1-bit ECC can correct 1 bit
    errors and detect 2-bit errors.

    2+ bit correct is MUCH more difficult (think of the # of codes required...
    at least double the number of bits as for 1 bit ECC but the analogy I
    used above no longer holds, so it's even more!).  When you get into >1
    bit ECC you generally switch to burst-error correction (which requires
    fewer correct-codes and thus fewer bits of ECC).  Unfortunately, burst
    error correction is useless when the medium is memory.
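
    For anyone who wants to check the bit counts above, here is a small C
    sketch using the usual Hamming condition 2^r >= data_bits + r + 1, plus
    one extra bit for double-error detection (illustrative only):

#include <stdio.h>

/* Check bits needed for single-error-correct, double-error-detect
 * (SEC-DED) ECC over 'data_bits' bits of data.  The Hamming condition is
 * 2^r >= data_bits + r + 1; one extra bit buys double-error detection.
 * This is a generic sketch, not any particular memory controller. */
static int secded_check_bits(int data_bits)
{
    int r = 1;
    while ((1 << r) < data_bits + r + 1)
        r++;
    return r + 1;                       /* +1 overall parity bit for DED */
}

int main(void)
{
    int widths[] = { 8, 16, 32, 64 };   /* 8->5, 16->6, 32->7, 64->8 */
    int i;

    for (i = 0; i < 4; i++)
        printf("%2d data bits need %d SEC-DED check bits\n",
               widths[i], secded_check_bits(widths[i]));
    return 0;
}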

    A simple 1-bit parity check is sufficient to detect the problem that
    ECC would have corrected, and allow the processor to map the page out
    with its MMU.  In any case, this kind of failure occurs less often than
    you think.  What most people come up against is a BAD DRAM (i.e., the cause
    of the problem is not alpha radiation), in which case it is not reliable
    anyway and you simply have to replace the chip.

    DRAMs these days are much more reliable than 10 years ago... even 5 years
    ago.

--


    Matthew Dillon	    uunet.uu.net!overload!dillon
    891 Regal Rd.
    Berkeley, Ca. 94708
    USA

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/30/90)

In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>In <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
>>
[you already saw my part twice]
>
>Right.. ECC is not parity, and vice versa.

Sort of.  Error Correcting Circuitry always (?) contains Error
Detecting Circuitry, which is an extension of the parity error
detection concept to more complex errors.

>Parity checking is totally, completely, and utterly useless.

I disagree!

That depends on what you're trying to accomplish.  Even parity
checking that does nothing more than crash the machine with a
"parity fault at 0Xnnnnnnnn", which at least tells you that you
there is a problem, beats all hollow having a bit flipped in a
critical datum, and receiving no warning at all, potentially
until after you have made a costly (and wrong) decision based
on the erroneous result.  Since parity checking is so well
known a part of the state of the hardware engineering art, it
is not clear to me that a company could escape unscathed from
a lawsuit for consequential damages if it were omitted from
the design of a machine offered for commercial use in
this day and age.  Most of the losers in those suits were the
folks who thought it was safe to ignore Best Engineering
Practice.

>>But then, what do I know after 29 years in the field [...]
>
>Geez... out-yeared me by 3. :-)

I started at 17, young in those days for a beginning programmer,
over the hill today. ;-)

Kent, the man from xanth.
(xanthian@zorch.sf-bay.org)

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/30/90)

In article <321@tlvx.UUCP> sysop@tlvx.UUCP (SysOp) writes:
>kent> [...] the size of gates in modern memory is much smaller and
>kent> thus they are more susceptible to alpha radiation induced parity
>kent> errors than were the gates of the Cray memory.
>
>Even "more susceptable"?  Ok, why?  How?  If it were really that bad, then
>people using the A3000 now should be at least occasionally noticing weird
>things happening, right?  But are they?  What about my A1000 with 2.5 megs?
>(Of course, I don't have parity, so how can I tell?  Sigh.)

Sorry; it's amazing how susceptible I am to thinking that just because I
have known something for a decade or more, that it therefore is common
knowledge.  Herewith the (stupidly) omitted explanation:

Alpha radiation (a fast moving, stripped helium nucleus) originates within
the naturally occurring radioactive impurities of the memory chip itself.  By
their nature (big, bumbling and slow compared to other kinds of radioactivity)
alpha particles have very limited penetrating power; they do all their
mischief near their point of origin.  For our purposes, the thing of interest
about their action is that, being an extremely positively charged particle
among atoms essentially in neutral balance, they have a large effect on
the outer shell (conduction) electrons, pulling large numbers after them in
their wake.  They cause a parity error when these entrained electrons are
deposited in a spot that causes a gate to shift its state from 0 to 1 or
vice versa, for instance on one of the control lines.

The older memory chips had very little susceptibility to alpha radiation
induced parity errors.  Although the alpha radiation exists constantly
at a low level in every chip, "large number of electrons" above must be
considered relative to the number of electrons required to switch a gate.
The older memory chips, with larger "wiring" and component sizes, used
larger switching currents; significantly larger than the amount of charge
moved by one alpha particle.

Since dynamic RAM means the memory is refreshed repeatedly by renewing the
control charges holding the state (0 or 1) of each gate, there is not usually
time for the charges carried by individual alpha particles to accumulate
from several events to switch a gate, before the refresh cycle sets the charge
back to its nominal value.

In contrast, in denser, newer memory chips with smaller "wiring" and
components, the charge delivered by a rogue alpha particle is of
comparable size to the holding charge on a gate, and so the gate may
be switched before a refresh cycle can correct the problem.  Making
the refresh cycles faster (than they are, not than the old circuits) is not
an option, because most computer chips these days are heat limited, and
more refreshes means more heat.

So for an individual gate, denser memory means a larger chance of a bit
being flipped _from_this_one_cause_.  Still, chips are not highly
radioactive, so for a single bit, this is a very low probability.

The problem comes when you accumulate megabytes of these bits together;
the chances of all of them avoiding errors tail off rapidly as their
number increases, in math similar to that of the birthday paradox.

I'm a bit shaky on the numbers here, since I was last a hardware
practitioner in 1972 and things have changed a trifle, but to the
best of my understanding, with today's component sizes, speeds, and
numbers of megabytes, you can expect to get in trouble somewhere
between 1 and 100 megabytes.  I defer to today's hardware practitioners
for better data.
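
Here is a tiny C sketch of that scaling argument, for anyone who wants to
plug in their own numbers; the per-bit upset rate in it is purely a made-up
illustration, not a measured figure, so only the trend matters:

#include <stdio.h>
#include <math.h>

/* The per-bit upset rate below is a made-up illustration (1e-12 upsets
 * per bit per hour), not a measured figure; the point is only how the
 * chance of at least one flipped bit grows with memory size. */
int main(void)
{
    const double p_bit_per_hour = 1e-12;          /* assumed, illustrative */
    const double hours = 24.0 * 30.0;             /* a month of uptime */
    const double megabytes[] = { 1, 4, 16, 64, 100 };
    int i;

    for (i = 0; i < 5; i++) {
        double bits  = megabytes[i] * 1024.0 * 1024.0 * 8.0;
        /* P(no flip anywhere) = (1-p)^(bits*hours); take the complement. */
        double p_any = 1.0 - pow(1.0 - p_bit_per_hour, bits * hours);
        printf("%4.0f MB: P(at least one flipped bit in a month) = %.3f\n",
               megabytes[i], p_any);
    }
    return 0;
}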

As to why you don't see problems in your 3Meg AT: well, for one thing, as
you mentioned, you don't have parity checking, so errors could get by
unnoticed.  Next, most of the memory in the box (or at least in the 5 Meg
'386 box I was using) goes unused by most applications, which are still
stuck at the 640K limit.  Again, at least in my Amiga, about 5 megabytes is
loaded with software I may not use from boot to boot, but keep around
because it is convenient.  Moreover, in running code, much of the code
space is never touched (use a file zapper; lots of it is huge blocks of
zeros).  Again, with stuff such as screen memory, if you get a bit flipped
you may never notice it before you switch screens or windows and rewrite
the soft error with good data.  Similarly, would you really be likely
to notice a one-bit error in a sampled sound data block?

Besides the above, your machine may sit idle 20 hours a day, not even
powered up.  In summary, there are lots of reasons why alpha induced
parity errors would not be a big enough problem to become noticeable.

Yet.

But like the birthday paradox, you don't have too far to go in terms of
bigger applications exercising more of the machine, full time unattended
operation (e.g. raytracing, doing accounts), more memory, more critical
applications, and so on, before you run into Seymour Cray's problem.
Parity checking is a necessity in large machines, just to be able to
rely on the results the machine gives you.  Error correcting circuitry
is a necessity in large machines, to get the kind of uptime and through-
put the machine's raw speed and memory size seem to promise.

That's probably more than you wanted to know, and please excuse any
details that might not be "just so".  Since I stopped doing this stuff
for a living, I'm a fairly casual student of the art.  More is available
in IEEE pubs, Scientific American, and so on.

Kent, the man from xanth.
(xanthian@zorch.sf-bay.org)

kevin@cbmvax.commodore.com (Kevin Klop) (05/30/90)

Please excuse this disagreement.  I'm not a hardware designer, but am making
what seem to me to be logical inferences and deductions.  If I err, please
correct me gently 8^).

In article <1990May29.204550.27961@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

	[ Explanation of how alpha particles affect chips omitted
	  in the interests of brevity ]

>The problem comes when you accumulate megabytes of these bits together;
>the chances of all of them avoiding errors tail off rapidly as their
>number increases, in math similar to that  the birthday paradox employs.
>
>I'm a bit shakey on the numbers here, since I was last a hardware
>practicioner in 1972 and things have changed a trifle, but to the
>best of my understanding, with today's component sizes, speeds, and
>numbers of megabytes, you can expect to get in trouble somewhere
>between 1 and 100 megabytes.  I defer to today's hardware practitioners
>for better data.
>
>As to why you don't see problems in your 3Meg AT, well, for one thing, as
>you mentioned, you don't have parity checking

Actually, ATs _do_ have parity-checked RAM.  I used to run one system
with 8 megs of RAM, all of it parity checked (that's why there were 9
chips per bank rather than 8).

>, so they could get by.
>Next, most of the software you run (or at least what I ran when using a
>5 Meg '386 box) is unused by most applications, still stuck at the 640K
>limit.

True, but that's immaterial to parity-checked memory such as what's in an
AT - if a bit in a memory chip flips, then the parity check on that row will
reveal a problem, regardless of whether your current application is using
that memory or not.

And once you DO start using that memory, any bit flips prior to your usage
are immaterial, since the first thing a program should do is initialize the
memory it is using; thus it won't know that a bit got flipped, assuming
there's no parity check to have discovered this first.

	[ stuff about unused memory and/or machines not showing
	  errors ]

>But like the birthday paradox, you don't have too far to go in terms of
>bigger applications exercising more of the machine, full time unattended
>operation (e.g. raytracing, doing accounts), more memory, more critical
>applications, and so on, before you run into Seymour Cray's problem.
>Parity checking is a necessity in large machines, just to be able to
>rely on the results the machine gives you.  Error correcting circuitry
>is a necessity in large machines, to get the kind of uptime and through-
>put the machine's raw speed and memory size seem to promise.

I ran an 8 meg AT as a XENIX system that was using all of its memory
constantly.  In 4 years of operation, I never once got a memory parity
error (although a second AT with a lot less memory seemed to get them
regularly - but once I got one, they would show up in droves until I
replaced the memory card or chip that was causing the problem).

Now, I admit that arguing from a statistical sample of two machines
can hardly be thought of as a valid sampled universe; however, it
does make me wonder whether the chances of such errors happening are
all that great.  Yes, ERCC circuitry would make things more reliable,
but I wonder whether all that many applications truly require it, and if
they do, then whether there's a market for add-on memory that does its own
ERCC.

>Kent, the man from xanth.
>(xanthian@zorch.sf-bay.org)

				-- Kevin --


Kevin Klop		{uunet|rutgers|amiga}!cbmvax!kevin
Commodore-Amiga, Inc.

The number, 111-111-1111 has been changed.  The new number is:
134-253-2452-243556-678893-3567875645434-4456789432576-385972

Disclaimer: _I_ don't know what I said, much less my employer.

jesup@cbmvax.commodore.com (Randell Jesup) (05/30/90)

In article <1990May29.204550.27961@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
>alpha particles have very limited penetrating power; they do all their
>mischief near their point of origin.
...
>The older memory chips had very little susceptibility to alpha radiation
>induced parity errors.
...
>Since dynamic RAM means the memory is refreshed repeatedly by renewing the
>control charges holding the state (0 or 1) of each gate, there is not usually
>time for the charges carried by individual alpha particles to accumulate
>from several events to switch a gate, before the refresh cycle sets the charge
>back to its nominal value.

	All well and true; however, advances have been made in reducing
susceptibility to alpha errors, I think perhaps enough to offset the reduction
in charge storage.  For example, plastic-packed parts have fewer alpha
problems, as I remember, due to lower radioactivity rates.  All 1Mb and
higher RAM I've seen is plastic, though there could well be some ceramic
out there somewhere.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

LEEK@QUCDN.QueensU.CA (05/31/90)

From the articles in the ECC/parity bit thread that I have been reading so far,
it seems to me that memory errors are far less likely to be the cause of system
reliability concerns...  The reliability of a system is only as good as its
weakest component.

  I have seen my machine crash more often due to programming bugs and bad
programming.  How clean is the power source that we plug our trusty machines
into?  Can we trust the power company to deliver 100% regulated power free of
power surges and brownouts?  (The answer is NO!!)  Can we trust the other
appliances in the building not to produce power surges?  That's the reason why
some companies make a fortune selling UPSes and power conditioners.

  How much do we trust our CPU not to fail?  Are there hidden bugs in
the CPU or peripheral chips that would fail under some conditions?

Intel had a few of these nasty bugs in their early batches of 386 CPUs,
etc. (I am sure things like this pop up once in a while.)  Some
companies insist on running parts outside their specified range.  This might
potentially cause problems when mixed with other out-of-spec designs.

The list of things that can go wrong can go on forever.  My point is that the
memory system is one of the less probable causes of system failure.  Given the
cost of ECC, it might be more worthwhile to spend that money preventing other,
more likely causes of failure...

K. C. Lee

lphillips@lpami.wimsey.bc.ca (Larry Phillips) (06/01/90)

In <dillon.4072@overload.UUCP>, dillon@overload.UUCP (Matthew Dillon) writes:
>>Parity schemes, on the other hand, cannot detect the failure of a parity bit
>>itself, and thus reduces the overall reliability as a tradeoff for knowing when
>
>    A parity scheme will detect all one bit errors, even if the bit that
>    error'd is the parity bit itself.  The parity scheme does not know *which*
>    bit err'd, or whether it was the parity bit itself that err'd, but it
>    will detect any single bit error.
>
>    A reasonable ECC scheme (7 bits to correct 32 bits as I mentioned in my
>    previous posting) will detect and correct all 1 bit errors where that 1
>    bit is any one of the 32 bits.  It will detect any single bit error in
>    the ECC code itself in which case the real data is assumed to be valid
>    and no other action is taken.  I believe the scheme will also detect any
>    two bit errors (through all 39 bits).
>
>    One should never think of an ECC scheme in terms of whether the erroneous
>    bits are in the ECC part or the real-data part.  Or, at least, I never
>    think of it that way.  You tend to produce weak algorithms when you
>    consider cases that depend on the meaning of bits rather than work on
>    a general algorithm that can do a better job all around.

Right. One does not need to think about which bit has failed in an ECC memory
access, because the data is always assumed to be correct when it arrives at the
destination. With ECC, all bits must be assumed to be equally important, since
you are depending on all bits in order to make the above assumption.

My comment had to do with parity schemes, where there is no choice. Since you
cannot know which bit failed, you cannot assume the data is intact at the
destination. From this, you can unequivocally state that the addition of 1/8
more bits to the memory has done one thing for you, and one thing to you. The
thing it has done for you is to tell you that _some_ bit did not read out
correctly. The thing it has done _to_ you is to increase the chances of a bit
being read out wrong. The point is not whether you should know if a parity bit
fails, but that if it did happen to be a parity bit that failed, it was a
'useless' error; one that would not have happened if you did not have parity
checking. Again, on average, 1/9 of the errors will fall into this category,
though you will not know which ones they are.

>>you had an error, even if that error is meaningless and would not have happened
>>without the parity bit being present.	Statistically speaking, if parity is
>
>    Thinking of things that way will wind you into a corner fast!

Not at all. It will only get you into trouble if you try to divine which bit
failed, and act upon it.  We don't get ourselves in trouble just because we
possess the knowledge that approximately 1/9 of all errors are spurious. :-)

>>have a lot of choice. With an ECC scheme, the system can make note of the error
>>and keep using the memory, allowing it to map the page out when the number of
>>errors exceeds a threshhold over a predefined period of time. It will also
>>allow reporting of single bit errors to the operator, who can make a good
>>judgement as to the root cause, and take action as appropriate.
>
>    This is one good use of ECC.. .to detect failing memory.

Some of the memories I have worked on have had literally thousands of memory
chips (imagine 96 megs' worth of 4Kbit chips), and the ECC was invaluable for
detecting a degenerating chip. We kept accurate records of all single-bit
errors, provided by the memory itself and stored for readout at PM time, and it
was quite easy to discard the 'random event' failures, and to catch anything
that was on its way to becoming a solid error. Replacing the chip before it got
to be a solid problem meant a saving of many hours trying to track down
double-bit errors, which were MUCH harder to isolate.

-larry

--
The raytracer of justice recurses slowly, but it renders exceedingly fine.
+-----------------------------------------------------------------------+ 
|   //   Larry Phillips                                                 |
| \X/    lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
|        COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com        |
+-----------------------------------------------------------------------+

dillon@overload.UUCP (Matthew Dillon) (06/01/90)

>The fact that a properly designed ECC scheme can correct errors in the ECC bits
>themselves makes it far more desirable for reliability and recoverability,
>though at a greater cost.
>
>Parity schemes, on the other hand, cannot detect the failure of a parity bit
>itself, and thus reduces the overall reliability as a tradeoff for knowing when

    A parity scheme will detect all one bit errors, even if the bit that
    error'd is the parity bit itself.  The parity scheme does not know *which*
    bit err'd, or whether it was the parity bit itself that err'd, but it
    will detect any single bit error.
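
    A minimal illustration of that point in C (the data value is arbitrary,
    and this is just the even-parity bookkeeping, not any particular piece
    of hardware):

#include <stdio.h>
#include <stdint.h>

/* Store 8 data bits plus one even-parity bit (9 bits total).  Flipping
 * any single one of the 9 makes the overall parity odd, so the error is
 * detected whether it hit a data bit or the parity bit itself -- but
 * nothing says *which* bit flipped. */

static int xor_of_bits(uint16_t w, int nbits)     /* XOR of the low nbits */
{
    int p = 0, i;
    for (i = 0; i < nbits; i++)
        p ^= (w >> i) & 1;
    return p;
}

int main(void)
{
    uint8_t  data   = 0xA7;                               /* arbitrary */
    uint16_t stored = data | ((uint16_t)xor_of_bits(data, 8) << 8);
    int bit;

    for (bit = 0; bit < 9; bit++) {
        uint16_t corrupted = stored ^ (uint16_t)(1u << bit);
        /* Even parity: the XOR of all 9 stored bits should be 0. */
        printf("flip bit %d -> %s\n", bit,
               xor_of_bits(corrupted, 9) ? "error detected" : "missed");
    }
    return 0;
}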

    A reasonable ECC scheme (7 bits to correct 32 bits as I mentioned in my
    previous posting) will detect and correct all 1 bit errors where that 1
    bit is any one of the 32 bits.  It will detect any single bit error in
    the ECC code itself in which case the real data is assumed to be valid
    and no other action is taken.  I believe the scheme will also detect any
    two bit errors (through all 39 bits).

    One should never think of an ECC scheme in terms of whether the erroneous
    bits are in the ECC part or the real-data part.  Or, at least, I never
    think of it that way.  You tend to produce weak algorithms when you
    consider cases that depend on the meaning of bits rather than work on
    a general algorithm that can do a better job all around.

    An interesting extension to ECC for anybody interested is to consider
    the general-expansion case... to correct N bits of error in the data
    portion of the code (32 bits), and to detect and ignore one and two bit
    errors in the ECC itself.

	32 bits + 7 bits ECC		    corrects any single bit error in
					    the 32 bits (7 = lg(32+1) + 1)

	\__________________/ + 7 bits ECC   corrects any single bit error in
					    the 40 bits, which means this
					    corrects any two-bit errors that
					    occur in the first 39 bits, since
					    it will correct one and the 7 bit
					    ECC will correct the other.

					    (7 = lg(39+1) + 1)


    And so on.  The number of bits of ECC required for each level goes up
    according to the log of the number of bits requiring correction.  To
    correct, you start at the outermost level and move inward.  Also, there is
    another term, which I have not described, that needs to be added to
    detect multi-bit errors in the outer ECC codes to keep the algorithm
    a general N-bit detect and correct.  It can get messy.
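
    A small C sketch of how the levels stack up under that rule, using
    ceil(lg(N+1)) + 1 check bits per level as in the figures above (purely
    illustrative):

#include <stdio.h>
#include <math.h>

/* Each level protects everything beneath it and needs
 * ceil(lg(bits_protected + 1)) + 1 check bits: the address of the bad
 * bit, a no-error code, and a parity bit.  Illustrative sketch only. */
int main(void)
{
    int bits_protected = 32;            /* start with the data word */
    int level;

    for (level = 1; level <= 4; level++) {
        int ecc_bits = (int)ceil(log2(bits_protected + 1)) + 1;
        printf("level %d: %2d bits protected -> %d ECC bits\n",
               level, bits_protected, ecc_bits);
        bits_protected += ecc_bits;     /* the next level covers these too */
    }
    return 0;
}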

>you had an error, even if that error is meaningless and would not have happened
>without the parity bit being present.	Statistically speaking, if parity is

    Thinking of things that way will wind you into a corner fast!

>have a lot of choice. With an ECC scheme, the system can make note of the error
>and keep using the memory, allowing it to map the page out when the number of
>errors exceeds a threshhold over a predefined period of time. It will also
>allow reporting of single bit errors to the operator, who can make a good
>judgement as to the root cause, and take action as appropriate.

    This is one good use of ECC... to detect failing memory.

>In Very Important Applications, I would go for ECC.  In other situations, I
>would go for no checking at all. Parity is useless.

    If the machine must stay up for months at a time, ECC does get to be
    important.

>|   //   Larry Phillips						 |
>| \X/	  lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
>|	  COMPUSERVE: 76703,4322  -or-	76703.4322@compuserve.com	 |
>+-----------------------------------------------------------------------+

				-Matt

--


    Matthew Dillon	    uunet.uu.net!overload!dillon
    891 Regal Rd.
    Berkeley, Ca. 94708
    USA

eachus@linus.mitre.org (Robert I. Eachus) (06/07/90)

In article <1655@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:

> In Very Important Applications, I would go for ECC.  In other situations, I
> would go for no checking at all. Parity is useless.

>> DRAMs these days are much more reliable than 10 years ago... even 5 years
>> ago.

>  You rest my case. :-)

   This is more in the nature of an agreement than a flame, but there
are circumstances where ECC is LESS reliable than no checking
currently... specifically when speed limits are being pushed.

    Modern DRAMs are fairly well protected against cosmic-ray-induced
errors and other transients, but if you use ECC circuitry, the overall
reliability of a memory system has to include the possibility that the
ECC circuitry returns the wrong value or (much more common) does not
assert the correct value soon enough.  On most ECC memory systems this
is the most frequent cause of uncorrected error (even though it most
frequently occurs when a bit has been flipped).  Worse, if such an
error occurs when the memory is read correctly, it is not even detected
by most transient fault counters.  It takes a lot of extra logic to
look for signal changes on the bus immediately after the correct
value is expected to be asserted.

    I used to work at Stratus Computer, and (due to duplexed ECC
memory boards) we could detect and count both types of faults.  (It's
only a failure if the program sees bad data...)  That is the minimum I
would recommend for life-critical systems.  We did see both kinds of
faults, and I would imagine that without very careful design, today's
ECC memory does not significantly improve reliability.

    (Before the flames start...If a system has a transient memory
failure every ten months with ECC and every three months without, that
is not a significant difference.  As a fault-tolerant designer, I
would want to push it out to several years, preferably a century or
two.  You can't do that with simple ECC.)

    If you have a friend with an IBM compatible with parity, ask him
when he last had a parity error.  My guess is that with 256K or 1 Meg
parts, it should be significantly less than 1 per megabyte per year.

--

					Robert I. Eachus

   Amiga 3000 - The hardware makes it great, the software makes it
                awesome, and the price will make it ubiquitous.

pnelson@hobbes.uucp (Phil Nelson) (06/08/90)

In article <90151.123059LEEK@QUCDN.BITNET> LEEK@QUCDN.QueensU.CA writes:
|From the articles in the ECC/Parity bit thread that I have been reading so far,
|it seems to me that memory error is far less likely to be the cause of system
|realibility concern...  The realibility of a system is only as good as the
|weakest component.

 You may want to consider that what you have been reading is the opinion of
some people that memory chips are so reliable that "parity is useless". The
facts (if we had any) may be otherwise. If the Amiga had parity, it would be
easy to get good data on the reliability of the memory IN THE BOX (not in some
chip test lab) and IN THE FIELD (not some clean, quiet final test area).

|  I have seen my machine crashing more often due to programming bugs and bad
|programming.  How clean is the power source that we plug our trusty machine into
|?   Can we trust the power company to deliver 100% regulated power free of power
|surges and brown outs ?  (Answer is NO !!)  Can we trust the other appliances
|in the building for not producing power surges ?  That's the reason why some
|companies make a fortune selling UPS and power conditioner.

These are good points. I think it very likely that the memory system is not
the greatest cause of unreliability on the Amiga - certainly not if you
include software bugs. This does not prove that parity checking is useless,
but that other measures are needed too. The order in which to take measures
to improve reliability is not determined exclusively by which problem is the
worst; it may be reasonable to start with a problem that is not the worst,
if a solution is easily implemented (memory parity checking, for example).

|  How much do we trust our CPU to not to fail ?  Is there some hidden bugs in
|the CPU or perpherial chips that would fail under some conditions ?
|
|Intel got a few of these nasty bugs in their early batches of 386 CPUs
|chips etc.. (I am sure things like this would pop up once in a while.) Some
|companies insists on running parts outside their specified range.  This might
|potentially cause problems when mixed with other out of spec designs.
|
|The list of things that can go wrong can go on forever.  My point is that the
|memory system is one of the less probable cause of system failure.  Given the
|cost of ECC, it might be more worth while to spend that money to prevent other
|more likely cause of failure...

 I think the cost of ECC cannot be justified on the Amiga, except for special
applications. The added cost of simple parity checking (not very great) might
easily be justified because it would help by allowing the early detection
and repair of machines with memory problems. It would be especially useful
for machines with flaky, intermittent memory.


|K. C. Lee


Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508

  If you thought prohibition was fun, you're gonna LOVE gun control.

charles@hpcvca.CV.HP.COM (Charles Brown) (06/08/90)

>    This is more in the nature of an agreement than a flame, but there
> are circumstatnces where ECC is LESS relaible than no checking
> currently...specifically when speed limits are being pushed.

>     Modern DRAMs are fairly well protected against cosmic ray induced
> errors, and other transients, but if you use ECC circutry, the overall
> reliability of a memory system has to include the possibility that the
> ECC circutry returns the wrong value or (much more common) does not
> assert the correct value soon enough.
	...
If the ECC RAM returns the correct value but too late, it is not
designed correctly.  Part of the task of design is to make sure there
is enough margin.  So you have not demonstrated your point.  What you
have demonstrated is that poorly designed ECC RAM is sometimes less
reliable than well-designed RAM w/o ECC.  So what.

>     If you have a friend with an IBM compatible with parity ask him
> when he last had a parity error.  My guess is that with 256K or 1 Meg
> parts, it should be significantly less than 1 per Megabyte per year.
> --
> 					Robert I. Eachus

But I agree that a well designed RAM should have few errors.  The
HP9000/350 with 16MB RAM that I use at work has parity.  (Later models
come with ECC.)  It seems to be averaging one parity error every six
months.  For my computer uses that is good.  For life critical uses
that would not be good enough.
--
	Charles Brown	charles@cv.hp.com or charles%hpcvca@hplabs.hp.com
			or hplabs!hpcvca!charles or "Hey you!"
	Not representing my employer.

lphillips@lpami.wimsey.bc.ca (Larry Phillips) (06/08/90)

In <3649@tymix.UUCP>, pnelson@hobbes.uucp (Phil Nelson) writes:
>
> You may want to consider that what you have reading is the opinion of some
>people that memory chips are so reliable that "parity is useless". The facts
>(if we had any) may be otherwise. If the Amiga had parity, it would easy to
>get good data on the reliability of the memory IN THE BOX (not in some chip
>test lab) and IN THE FIELD (not some clean, quiet final test area).

Since I am the one who used the words "parity is useless", I think I will say
that you should refrain from putting words in my mouth that were never there. I
did _not_ make that statement because I think that memory is too reliable.
I said it because I see no real use for adding extra memory, at extra cost,
thereby statistically reducing reliability, for the sole purpose of either (a)
informing the user that a parity error has occurred, or (b) crashing the
program or system.


>These are good points. I think it very likely that the memory system is not
>the greatest cause of unreliability on the Amiga. Certainly not if you
>include software bugs. This does not prove that parity checking is useless,
>but that other measures are needed too. The order in which to take measures
>to improve reliability is not determined exclusively by which is the worst
>problem, it may be reasonable to start with a problem that is not the worst,
>if a solution is easily implimented (memory parity checking, for example).

In what way do you see parity checking as 'measures to improve reliability'?
I think you are confusing reliability with some other parameter. Parity
checking, if it only informs you of a parity error, does not change the
reliability of a system at all. If it is used to halt a task or a system, it
does, in fact, reduce reliability.

You might want to ask yourself what the benefits of parity checking are, vs.
the cost of it.

Benefits:

  Information. You know you had a memory error, and have the option of
rerunning anything that might possibly have been affected by it.

  Information. You know that if you were not informed of a parity error while
running any particular program, any errors you did get were caused by
something else. Note that the lack of a parity error says nothing about the
accuracy of your results, and that the presence of a parity error likewise says
nothing about the accuracy of your results.

Costs:
  Parts.

  Wasted time/resources. If a parity error occurred in a non-important part of
memory (including the parity bit memory itself), you have no way of knowing
that you didn't need to rerun a program. The mere presence of a parity error
indication tells you nothing but that there was a parity error, yet it
encourages users to rerun things, and lulls them when the little light
doesn't come on.

> I think the cost of ECC cannot be justified on the Amiga, unless for special
>applications. The added cost of simple parity checking (not very great) might
>easily by justified because it would help by allowing the early detection
>and repair of machines with memory problems. It would be especially useful
>for machines with flaky, intermittent memory.

The most useful thing for machines with flaky, intermittent memory is a trip to
the repair shop. Flaky, intermittent memory will show up in other ways, without
having to add more flaky, intermittent memory.

-larry

--
The raytracer of justice recurses slowly, but it renders exceedingly fine.
+-----------------------------------------------------------------------+ 
|   //   Larry Phillips                                                 |
| \X/    lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
|        COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com        |
+-----------------------------------------------------------------------+

<LEEK@QUCDN.QueensU.CA> (06/08/90)

In article <3649@tymix.UUCP>, pnelson@hobbes.uucp (Phil Nelson) says:

>(if we had any) may be otherwise. If the Amiga had parity, it would easy to
>get good data on the reliability of the memory IN THE BOX (not in some chip
>test lab) and IN THE FIELD (not some clean, quiet final test area).
>

some stuff deleted...
>
>These are good points. I think it very likely that the memory system is not
>the greatest cause of unreliability on the Amiga. Certainly not if you
>include software bugs. This does not prove that parity checking is useless,
>but that other measures are needed too. The order in which to take measures
>to improve reliability is not determined exclusively by which is the worst
>problem, it may be reasonable to start with a problem that is not the worst,
>if a solution is easily implimented (memory parity checking, for example).
>
> I think the cost of ECC cannot be justified on the Amiga, unless for special
>applications. The added cost of simple parity checking (not very great) might
>easily by justified because it would help by allowing the early detection
>and repair of machines with memory problems. It would be especially useful
>for machines with flaky, intermittent memory.
>
>
>|K. C. Lee
>
>
>Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508
>
>  If you thought prohibition was fun, you're gonna LOVE gun control.

The problem with a parity bit (vs. ECC) is that it only does single-bit error
detection.  If an even number of bits are in error, it doesn't know.
The other thing has to do with design/economics/space.  I can give you
an example for my particular setup.

 I have 4 megs of RAM for my 18MHz 020: 32 chips of the 256K*4 variety.  For a
parity scheme, I would need 1 parity bit per byte.  The 680x0 is
byte-addressable, so the proper way to do this is to have a parity bit for each
8-bit group.  Due to design and reliability problems, I would need a separate
chip for each 8-bit group.

Reasoning...
The alternative is to squeeze the 4 parity bits into a 256K*4 chip, with a
separate scheme for addressing those 4 parity bits and multiplexing/
demultiplexing them to allow byte/16-bit word/32-bit longword access.  Sorry, I
can't do that without violating timing constraints unless faster chips than
60nS are used, since the parity bits are accessed in a different way - a
different gate delay, which results in slightly different timing.  The hardware
should also latch/synchronize the data bits before doing the check...
All this mess to save space is not funny.  To complicate things a bit more,
I am running the 32 chips in a 4-bank page-interleave mode, so I can't
use 4 1Meg*1 chips for the whole group either.

That's 1 256K*1 chip for every 2 256K*4 chips, excluding the extra hardware
(and the software exception handler..).  That takes up 50% more space on my
particular memory board.. and all this trouble just to be able to
warn me of a possible parity error???  If I (or someone at C=)
were to go through the trouble of adding parity, I might as well go for ECC.
A couple more bits per 32-bit word and you get automatic error detection
and correction (available with almost-off-the-shelf parts, e.g. a DRAM
controller and ECC chip set from National Semiconductor).

Using ECC/memory parity bits to boost reliability is like replacing all the
wiring of a stereo (with cheap speakers) with solid silver strips - the sound
certainly improves, but it might be more cost effective to replace the
speakers.  The speaker is a mechanical system, and it generates more
distortion than the electronic components.

Most people would agree with the above example.  Now substitute the words
below (stereo -> computer system, silver wiring -> ECC, speakers -> programs,
mechanical system -> software, distortion -> crashes).

Sure, if one has plenty of $$$ after upgrading the speakers to match the amp,
one can spend $$$ on superconducting wire :) to hook up the speakers, and
maybe a UPS for the stereo too. :)

K. C. Lee
(I don't know much about stereos (I only know the electronic side), so
don't flame me if I have used the above example incorrectly.)

eachus@linus.mitre.org (Robert I. Eachus) (06/09/90)

In article <1410047@hpcvca.CV.HP.COM> charles@hpcvca.CV.HP.COM (Charles Brown) writes:

> If the ECC RAM returns the correct value but too late, it is not
> designed correctly.  Part of the task of design is to make sure there
> is enough margin.  So you have not demonstrated your point.  What you
> have demonstrated is that: Poorly designed ECC RAM is sometimes less
> reliable than well designed RAM w/o ECC.  So what.

     Way off... With modern memory parts everything is quantized and
statistical, since the charges being moved are on the order of 20,000
times the charge of an electron.  In theory (but very, very rarely)
sometimes you will get no electrons willing to move and read a one as a
zero.  Much more likely is that the charge in a particular cell is
represented by significantly fewer than the average number of
electrons.  This will result in a slower rise time when the cell is
read, so you must allow some slack in your design for this "jitter" in
the rise time.  How much? 6 sigma? 10 sigma?  Whatever number you
choose, there is some statistical chance that an error will occur
because the values were latched too early.

     EDAC is usually done without latching the values, since latching
will usually add a clock cycle to the memory delay.  If there is a
complete error in a single bit in EDAC memory, no problem.  However, a
"slow" bit can result in the output of an EDAC PAL being wrong at
precisely the wrong time, even though it was correct a nanosecond
earlier, and will be correct a nanosecond later. (The logic paths
through a PAL are often different lengths, and other effects can also
add jitter to the signal.)  This was the effect I was referring to
when I said that these late bit errors more often occur when a bit is
being corrected.

     Since the EDAC circuitry adds timing uncertainty to a memory
system, and also slows it down, it is much more difficult to allow a
10 sigma margin on EDAC circuitry.  (Six sigma gives you one error per
billion, or about one every 3 minutes on a memory read at 5 megahertz.
At nine sigma - a fifty percent increase in MARGIN, which might be an
added 3 ns delay without EDAC, or an added 10 ns with EDAC - an error
will occur every 20,000 years.)
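
     A quick sanity check of those figures in C, using nothing beyond the
numbers already quoted above (mean time between errors = 1 / (error
probability per read * read rate)):

#include <stdio.h>

/* Mean time between latching errors, given an error probability per read
 * of about 1e-9 ("six sigma", roughly) and 5 million reads per second. */
int main(void)
{
    double p_error   = 1e-9;     /* per-read probability, ~6 sigma */
    double reads_sec = 5e6;      /* 5 MHz memory read rate */
    double seconds   = 1.0 / (p_error * reads_sec);

    printf("one error every %.0f seconds (about %.1f minutes)\n",
           seconds, seconds / 60.0);
    return 0;
}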

     To repeat myself, a good EDAC memory today is unlikely to be
significantly better than a well designed memory with no checking.

     But now to change the subject: if you have all accounting data
entered and then verified by a separate operator, you can get the error
rate down to 1 in 10,000 for keyboard input, so increasing the
probability of error by one part in a million by using a machine without
parity checking is in the noise, especially since, when accounting is
done on PCs, the operator usually checks his or her own input.  So
when Bob Silverman down the hall wants to factor a 100+ digit number
using several decades of machine time, he cares about self-checking
and memory error rates.  Accountants don't.
--

					Robert I. Eachus

with STANDARD_DISCLAIMER;
use  STANDARD_DISCLAIMER;
function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...

LEEK@QUCDN.QueensU.CA (06/09/90)

Someone asked me why one can use ECC for the whole 32-bit word but not a
single parity bit per 32-bit word...  I guess I should bring my data books
to work next time, so that when I goof off to read comp.sys.amiga I would
have all the references I need.  The following is my reply to the email:
-----------------------------------------------------------------------------

Subject: Parity bit business

The 680x0 family can do either 8-bit or 16-bit accesses (and the '020 and
above can do 32-bit accesses):

MSB                          LSB

33222222 22221111 111111
10987654 32109876 54321098 76543210
|<---->| |<---->| |<---->| |<---->|   Byte access
|<------------->| |<------------->|   16-bit words
|<------------------------------->|   32-bit long words

The thing is, the CPU can address each of the 4 bytes individually.  One
would not be able to use a single parity bit for the whole group: the parity
for an MSB access would not be the same as the parity for an LSB access.
The only way to use a single parity bit is to somehow access the whole
32-bit chunk at a time, which is messy.  One needs a parity bit for each
byte for performance reasons (see below).  The PC/AT implements parity
bits this way.
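
(A quick C sketch of the per-byte scheme, just for illustration: each byte
carries its own even-parity bit, so a byte-wide write never has to touch
its neighbours.)

/* Illustration only: even parity computed per byte, the PC/AT-style
 * scheme described above.  A real implementation lives in hardware
 * (one parity RAM bit per byte), but the arithmetic is just this.   */
#include <stdio.h>

/* Even-parity bit: 1 if the byte has an odd number of 1 bits, so the
 * 9 stored bits always contain an even number of 1s.                */
static unsigned parity_bit(unsigned char b)
{
    unsigned p = 0;
    while (b) {
        p ^= (unsigned)(b & 1);
        b >>= 1;
    }
    return p;
}

int main(void)
{
    unsigned long word = 0xDEADBEEFUL;  /* an arbitrary 32-bit long word */
    int i;

    /* Four independent parity bits for a 32-bit access; a lone byte
     * access only ever involves one of them.                         */
    for (i = 3; i >= 0; i--) {
        unsigned char byte = (unsigned char)((word >> (8 * i)) & 0xFF);
        printf("byte %d = 0x%02X, parity bit = %u\n",
               i, byte, parity_bit(byte));
    }
    return 0;
}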

This is how they do the ECC for 32-bit memory.  (I thought they had an
easier way. :( )  The memory write cycle is changed into a
read-modify-write cycle.  The read cycle lets the ECC chip read in the
39-bit word (32 bits of data and 7 ECC check bits).  The ECC chip then
merges in the byte/word to be written and computes the 7 check bits for
the whole 32-bit word.  The data and ECC check bits are passed to the
memory in the write cycle.  Hmmm...  This is not a pretty sight, as a
read-modify-write cycle imposes quite a bit of speed penalty; it is
quite a bit longer than just a write cycle.  The thing is, for 32-bit
ECC there are special ECC chips that work with the DRAM controller and
are designed to provide the necessary timing and other stuff.  There
are no VLSI-level chips for parity bits alone.  If one only wants
parity, one has to sit down and build it from discrete components (MSI
or PALs), and it is usually much easier to have 1 parity bit per byte
rather than 1 parity bit for the whole 32-bit word.
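
(A sketch of that read-modify-write sequence for a byte-sized write, again
only for illustration.  compute_checkbits() below is a stand-in for the
7-bit SEC-DED generator that the real ECC chip set implements in silicon;
it is not the real code construction.)

/* Illustration of the read-modify-write cycle described above for a
 * byte write into ECC-protected 32-bit memory.  compute_checkbits()
 * is a placeholder so the example compiles; a real controller
 * generates proper SEC-DED check bits in hardware.                  */
#include <stdio.h>

struct ecc_word {
    unsigned long data;        /* 32 data bits */
    unsigned char checkbits;   /* 7 check bits */
};

static unsigned char compute_checkbits(unsigned long data)  /* stand-in */
{
    return (unsigned char)((data ^ (data >> 8) ^ (data >> 16) ^
                            (data >> 24)) & 0x7F);
}

/* Writing one byte still costs a full read-modify-write, because the
 * check bits cover the whole 32-bit word.                            */
static void ecc_write_byte(struct ecc_word *mem, int byte_no,
                           unsigned char value)
{
    unsigned long word = mem->data;                /* READ the 39-bit word */

    word &= ~(0xFFUL << (8 * byte_no));            /* MODIFY: merge byte   */
    word |= (unsigned long)value << (8 * byte_no);

    mem->data      = word;                         /* WRITE data and       */
    mem->checkbits = compute_checkbits(word);      /* new check bits       */
}

int main(void)
{
    struct ecc_word mem = { 0x12345678UL, 0 };

    mem.checkbits = compute_checkbits(mem.data);
    ecc_write_byte(&mem, 0, 0xAB);                 /* replace byte 0       */
    printf("data = 0x%08lX, checkbits = 0x%02X\n", mem.data, mem.checkbits);
    return 0;
}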

Sorry, I guess I should have written the article at home rather than at
work.  The fact remains the same - it is more worthwhile to use ECC than
parity bit(s), due to the availability of components: there are VLSI parts
designed for 32-bit buses, versus TTL MSI/PAL discrete designs for
parity-only schemes.  (By the time one gets down to using only 1 parity
bit per 32-bit word, the amount of MSI/PAL logic is about as complicated
as if the thing had been designed with 1 parity bit per byte, and the
penalty of the write turning into an R-M-W cycle throws away the savings
in the number of RAM chips.)  For performance and design reasons, one
usually has 1 parity bit per byte (e.g. the PC/AT).  I guess the memory
controller people do not believe in parity bits - either it is ECC or
nothing at all.  (Same here for me.)

Hope this clears up the mess.

----------------------------------------------------------------------------
K. C.

I guess I should drink coffee when I work.  Hmmm that doesn't make sense as
I would be awake during working hours...   Zzzzz away.

charles@hpcvca.CV.HP.COM (Charles Brown) (06/12/90)

>> If the ECC RAM returns the correct value but too late, it is not
>> designed correctly.  Part of the task of design is to make sure there
>> is enough margin.  So you have not demonstrated your point.  What you
>> have demonstrated is that: Poorly designed ECC RAM is sometimes less
>> reliable than well designed RAM w/o ECC.  So what.

>      Way off... With modern memory parts everything is quantized and
> statistical, since the charges being moved are on the order of 20,000
> times the charge of an electron.  In theory (but very, very rarely)
> you will get no electrons willing to move and will read a one as a
> zero.
      ... [lots of irrelevant details deleted]
> 					Robert I. Eachus

If you put ECC into your RAM system, you must use faster RAM chips to
achieve the same overall system RAM speed.  This is to allow the ECC
circuitry time to make any corrections.  The effective sample time
coming out of the RAM is different for the two systems in order to get
the same effective system RAM access time.  The reliability of the
data coming out of the fast RAM chips at their spec access time will
be approximately the same as the reliability of the data coming out of
the slower RAM chips at their spec access time.  Quantization is
irrelevant (and probably not a factor anyway).
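
(To put rough numbers on that timing argument - the figures below are
assumptions for illustration only: if the ECC data path eats a fixed chunk
of the memory cycle, the DRAMs themselves must be that much faster.)

/* Illustration with assumed numbers: the DRAM access-time budget left
 * over when a fixed ECC correction delay must fit in the same cycle. */
#include <stdio.h>

int main(void)
{
    double cycle_ns = 140.0;   /* assumed target memory cycle time (ns) */
    double ecc_ns   = 30.0;    /* assumed delay through the ECC path    */

    printf("without ECC: DRAM access budget = %.0f ns\n", cycle_ns);
    printf("with ECC:    DRAM access budget = %.0f ns\n",
           cycle_ns - ecc_ns);
    return 0;
}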

But all of this arguing is silly.  As others have pointed out the main
reliability concern in the Amiga is the software.  It really makes
little sense to worry about rare problems such as bad RAM data when
the software is typically so poor.  I assume you agree with this
statement.  In any case I will discontinue bickering about ECC.  It is
just not very important.
--
	Charles Brown	charles@cv.hp.com or charles%hpcvca@hplabs.hp.com
			or hplabs!hpcvca!charles or "Hey you!"
	Not representing my employer.

pnelson@hobbes.uucp (Phil Nelson) (06/13/90)

Messages from this account are the responsibility of the sender only, and do
not represent the opinion or policy of BT Tymnet, except by coincidence, or
when explicitly so stated.

In article <1710@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>In <3649@tymix.UUCP>, pnelson@hobbes.uucp (Phil Nelson) writes:
>>
>> You may want to consider that what you have been reading is the opinion of
>>some people that memory chips are so reliable that "parity is useless". The
>>facts (if we had any) may be otherwise. If the Amiga had parity, it would be
>>easy to get good data on the reliability of the memory IN THE BOX (not in
>>some chip test lab) and IN THE FIELD (not some clean, quiet final test area).
>
>Since I am the one that used the words "parity is useless", I think I will say
>that you should refrain from placing words in my mouth that were never there. I
>did _not_ make that statement because I think that memory is too reliable.
>I said it because I see no real use for adding extra memory, at extra cost,
>thereby statistically reducing reliability, for the sole purpose of either (a)
>informing the user that a parity error has occurred, or (b) crashing the
>program or system.

 If I have misrepresented what you have said, I apologize. That was not my
intent. There have been several comments on this matter, and many people
(including, I thought, you, Mr. Phillips) have said that modern memory is
so reliable that parity is not required. Your comment stuck in my mind. In
fact, I had intended to respond to your last post, in which you repeated
this assertion. Unfortunately my Amiga has not been well for some time; she
crashed while I was writing a reply. A few days later, when I got some free
time again, your article had expired here.

 I was going to write that you have not answered my post describing a
situation (a badly designed expansion memory box) where parity might have
been useful, but have instead kept repeating the gross overgeneralization
"parity is useless". Possibly I was not polite enough in my original post;
if not, please consider that statements like "parity is useless" invite
intemperate responses, especially from people like me who have been
repeatedly burned by the poor quality control, poor design, and insufficient
diagnostic capability of many different kinds of personal computers.


>>These are good points. I think it very likely that the memory system is not
>>the greatest cause of unreliability on the Amiga. Certainly not if you
>>include software bugs. This does not prove that parity checking is useless,
>>but that other measures are needed too. The order in which to take measures
>>to improve reliability is not determined exclusively by which is the worst
>>problem; it may be reasonable to start with a problem that is not the worst,
>>if a solution is easily implemented (memory parity checking, for example).
>
>In what way do you see parity checking as 'measures to improve reliability'?
>I think you are confusing reliability with some other parameter. Parity
>checking, if it only informs you of a parity error, does not change the
>reliability of a system at all. If it is used to halt a task or a system, it
>does, in fact, reduce reliability.

No, I AM NOT CONFUSED! I am irritated, frustrated, discouraged, etc. that
practically the whole personal computer industry does not seem to grasp
the usefulness of discovering problems, both design and process, as early
as is practical.

 I understand perfectly that adding parity reduces the MTTF of a product.
How much depends on a lot of things, including what you do with the parity
error information. If all a parity error does is light an LED on the front
of the box, the MTTF should not be reduced much. I am not a fan of crashing
the machine on a parity error, unless I can turn it off.

 I see parity as a "measure to improve reliability" in just the same way as
a DVM, a scope, a final test procedure, or any number of other diagnostic
tools. Unlike many of them, it has the virtue of staying with the machine
through its life, providing (for those few manufacturers who are interested)
feedback on how the design, parts, etc. REALLY perform in the field. It
improves reliability for any user who has a memory problem that is not
obviously detectable in other ways, by allowing earlier detection and repair.
It improves confidence, which is really the same thing for many people,
by reducing the probability of undetected corruption of data.

>
>You might want to ask yourself what the benefits of parity checking are, vs.
>the cost of it.
>
>Benefits:
>
>	Information. You know you had a memory error, and have the option of
>rerunning anything that might possibly have been affected by it.
>
>  Information. You know that after running any particular program, if you were
>not informed of a parity error, any errors you may have were caused by
>something else. Note that the lack of a parity error says nothing about the
>accuracy of your results, and that the presence of a parity error likewise says
>nothing about the accuracy of your results.
>

 Your 2nd statement is untrue. The presence of parity error detection in
memory will certainly increase the confidence in any data contained in that
memory. Not as much as ECC, of course, but significantly. Confidence is not
absolute, of course. Obviously the fact that I did not have a memory parity
error does not guarantee the data; there are many other places where it
might get garbled. It does increase confidence, though, by reducing the
probability of an undetected memory error. And that most definitely does
say something about the accuracy of my results.

 You have not included confidence in the hardware (in this case the memory),
which is my whole point. What you need to understand is that I don't care
about maximum confidence in the data. If I wanted more confidence in the
data, I would be looking for bugs in the software first. I know when the
data cannot be trusted - it cannot be trusted when my machine is crashing
every few days. Even if the crash itself did not damage data directly, the
disorganisation brought on by having to recover from crashes constantly
would. I can deal with a little randomization of my data; if I couldn't, I
certainly would not be using an Amiga. I bet most other Amiga users can,
too. What I and a lot of other actual and potential Amiga users cannot deal
with easily is a flaky machine which cannot be easily fixed. What I propose
is that we all forget about trying to make each machine perfect (we are
obviously not close to that) and concentrate on attaining a reasonable level
of reliability. I propose the following test: every Amiga should be able
to run at least one month under normal usage without crashing. If it can't,
the cause of the problem (hardware or software) must be findable and
correctable by a reasonably competent diagnostician within 1 week. My
estimate of the hardware/software division of effort to most efficiently
approach that goal is 10/90. For hardware, I would start with memory parity
checking, because it is obvious, easy, and quick. For software, I would
start testing programs, to accumulate a database of interaction problems.

>Costs:
>  Parts.
>
>  Wasted time/resources. If a parity error occurred in a non-important part of
>memory (including the parity bit memory itself), you have no way of knowing
>that you didn't need to rerun a program. The mere presence of a parity error
>indication tells you nothing but that there was a parity error, but encourages
>users to rerun things, and lulls them when the little light doesn't come on.

I really doubt that most users think like this. I think most users are going
to keep running in spite of the error indication, unless the computer starts
crashing. When the machine has crashed for the 5th time in one day, and they
are really starting to get frustrated, hopefully they are going to start
thinking about what that little blinking red "error" light means.

Remembering that most users and many computer dealers have only a vague idea
of how to troubleshoot, consider the difference between Joe User calling the
computer store saying "my computer is crashing" and "what does it mean when
the PERR light keeps blinking?". The latter case is an obvious trip to the
shop; the former can be a months-long odyssey in software swapping. I can
tell you from personal experience that such an odyssey can be extremely
irritating, time consuming, and generally likely to cause people to make
intemperate overgeneralizations about "flakiness".

>> I think the cost of ECC cannot be justified on the Amiga, except for special
>>applications. The added cost of simple parity checking (not very great) might
>>easily be justified because it would help by allowing the early detection
>>and repair of machines with memory problems. It would be especially useful
>>for machines with flaky, intermittent memory.
>
>The most useful thing for machines with flaky, intermittent memory is a trip to
>the repair shop. Flaky, intermittent memory will show up in other ways, without
>having to add more flaky, intermittent memory.

Possibly you missed my earlier article, where I described the many weeks it
took to convince Pacific Cypress that they did in fact have a hardware
problem. They built the box, they tested the box, yet they assumed (no,
insisted) that the problem was software. To me, it was pretty obvious after
playing with my machine for a while that the problem was hardware; to
them, it was not. I suppose you could say that they should have known, but
we should not be designing machines to work with people as they should be,
but as they are. Consider also that I had a serial number around 50, so
there were quite a few other people out there having similar problems, yet
"no one else has this problem".

I certainly do not intend to claim that parity is a panacea. What I do
claim is that there is an obvious reliability problem in this whole PC
industry, and that the Amiga is no better than average.

Because of the obvious problems, I, for one, will not be convinced to abandon
my advocacy of measures to improve the reliability of the Amiga, in
particular by adding parity error detection, by the fact that parity
cannot guarantee my data, or by statements like "parity is useless".


--
Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508

	He who walks with wise men becomes wise,
          but the companion of fools will suffer harm.  -Proverbs 13:20