lphillips@lpami.wimsey.bc.ca (Larry Phillips) (05/27/90)
In <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

>The Amiga 3000 is capable of holding (at least supporting) much more memory
>than a Cray 1, and the size of gates in modern memory is much smaller and
>thus they are more susceptible to alpha radiation induced parity errors than
>were the gates of the Cray memory.
>
>To take an Amiga seriously as a commercial machine in the "workstation,
>large memory" market, I'd guess error correcting code will turn out to be
>vital. It would be a shame to have a big production run of the hardware
>installed and on the street, only to have parity problems give the machine
>a reputation as an unreliable machine, to be avoided in droves. Better if
>the problem is solved before the reputation is besmirched.

Right.. ECC is not parity, and vice versa. Parity checking is totally, completely, and utterly useless.

>But then, what do I know after 29 years in the field about what people who
>buy the machines look for in a large processor? My computer purchases were
>limited to a couple of million bucks worth, down in the noise level in the
>marketplace. ;-)

Geez... out-yeared me by 3. :-)

-larry

--
The raytracer of justice recurses slowly, but it renders exceedingly fine.
+-----------------------------------------------------------------------+
|  //   Larry Phillips                                                  |
| \X/   lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips  |
|       COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com         |
+-----------------------------------------------------------------------+
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/27/90)
In article <756@bilver.UUCP> alex@bilver.UUCP (Alex Matulich) writes:

>However, the fast RAM chips are replaceable by the 4 meg variety. You can
>stick 16 of them in the A3000. That's a potential for 16 parity errors
>per day!
>
>I think I would worry about that. I wouldn't want a scientific experiment
>or financial program to go off using erroneous data without knowing it.
>Three extra parity bits per byte would allow detection of up to 2 bits and
>correction of one (per byte). This is very complicated to design, however.
>Single-bit parity error detection (like one has on IBM compatibles) is
>relatively easy.

Computer folk wisdom has it (actually, I had this from someone at the NCAR site where the beast was then installed) that Seymour Cray built Cray 1 number 1 without parity checking. The error rate in that much memory was insufferable, so the machine became a sort of demo machine; when you ordered a Cray 1, you first got serial number 1 installed, on which you could build and test your code in a lossy environment, until a machine with parity could be built for you and swapped for the useless toy first delivered.

The Amiga 3000 is capable of holding (at least supporting) much more memory than a Cray 1, and the size of gates in modern memory is much smaller, and thus they are more susceptible to alpha radiation induced parity errors than were the gates of the Cray memory.

To take an Amiga seriously as a commercial machine in the "workstation, large memory" market, I'd guess error correcting code will turn out to be vital. It would be a shame to have a big production run of the hardware installed and on the street, only to have parity problems give the machine a reputation as an unreliable machine, to be avoided in droves. Better if the problem is solved before the reputation is besmirched.

But then, what do I know after 29 years in the field about what people who buy the machines look for in a large processor?
My computer purchases were limited to a couple of million bucks worth, down in the noise level in the marketplace. ;-)

Kent, the man from xanth, now zooming to the net from Zorch.
(xanthian@zorch.sf-bay.org)
lphillips@lpami.wimsey.bc.ca (Larry Phillips) (05/29/90)
In <dillon.3992@overload.UUCP>, dillon@overload.UUCP (Matthew Dillon) writes:

>>In article <3620@tymix.UUCP> pnelson@hobbes.uucp (Phil Nelson) writes:
>>In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>>
>>>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>>>completely, and utterly useless.
>>
>>Oh really? Please explain why parity checking would not have saved me much
>>
>> The advantage of parity checking is diagnostic, intermittent problems on
>
> I have a tendency to agree. ECC is cute but expensive. It takes 7
> bits of ECC to detect and correct 1 bit in a 32 bit wide word. The way
> you think about 1-bit ECC is that you need enough codes to generate the
> address of the incorrect bit, plus a no-error code, plus a parity bit.
> Unfortunately, that no-error code takes us from 5 to 6 bits, then one
> more to parity-check the ECC code itself. A 1-bit ECC can correct 1 bit
> errors and detect 2-bit errors.

In the context of the posting I replied to, that of a life supporting or Very Important Application implementation, parity is indeed useless, and ECC can be seen as coming as close to mandatory as anything can be. An exception to this might be if it is implemented as multiple 'majority rules' identical computers.

The fact that a properly designed ECC scheme can correct errors in the ECC bits themselves makes it far more desirable for reliability and recoverability, though at a greater cost. Parity schemes, on the other hand, cannot detect the failure of a parity bit itself, and thus reduce the overall reliability as a tradeoff for knowing when you had an error, even if that error is meaningless and would not have happened without the parity bit being present.
Statistically speaking, if parity is checked on a byte basis, 1/9 of all single bit errors could be safely ignored, and that takes into account ONLY the parity RAMs themselves, without taking into account the current contents of the memory itself, the importance of the application taking the hit, etc.

> A simple 1-bit parity check is sufficient to detect the problem that
> ECC would have corrected, and allow the processor to map the page out
> with its MMU. In any case, this kind of failure occurs less often than
> you think. What most people come up against is a BAD DRam (i.e. cause
> of problem is not alpha radiation), in which case it is not reliable
> anyway and you simply have to replace the chip.

Assuming that an MMU is in place, and assuming that the error was a random event caused by external forces (cosmic rays, whatever), the page may or may not require mapping out, though with parity checking only, you really don't have a lot of choice. With an ECC scheme, the system can make note of the error and keep using the memory, allowing it to map the page out when the number of errors exceeds a threshold over a predefined period of time. It will also allow reporting of single bit errors to the operator, who can make a good judgement as to the root cause, and take action as appropriate.

Parity is a heavy-handed beast, telling you little, and treating all memory as equal. Should video memory be parity checked, assuming that you can readily identify where the video is being displayed from? If so, should you crash and burn because a picture has a pixel showing a bad colour? If not, can you trust the figures your spreadsheet shows? If you don't crash and burn, should you ignore the red light? Should you panic? Either way, you could be wrong.

Hardware is getting cheaper all the time. ECC is a little more expensive than parity.
In some ways, it can be said to be cheaper, if you count the lost productivity when an error occurs that cannot be corrected, and would not matter to the application. In Very Important Applications, I would go for ECC. In other situations, I would go for no checking at all. Parity is useless.

> DRAMs these days are much more reliable than 10 years ago... even 5 years
> ago.

You rest my case. :-)

-larry

--
The raytracer of justice recurses slowly, but it renders exceedingly fine.
+-----------------------------------------------------------------------+
|  //   Larry Phillips                                                  |
| \X/   lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips  |
|       COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com         |
+-----------------------------------------------------------------------+
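[Editorial aside: Larry's 1/9 figure follows from simple counting. With byte-wide parity, one of every nine stored bits is a parity bit, so if single-bit soft errors strike stored bits uniformly at random, about 1/9 of them hit a parity bit and are "useless" errors. A minimal sketch, in Python purely for illustration (not part of the original exchange):]

```python
# With byte-wide parity, each 8 data bits are stored alongside 1 parity bit,
# so 1 of every 9 stored bits is a parity bit.  Under a uniform single-bit
# soft-error model, the fraction of errors that hit a parity bit (detected,
# yet the data itself is still good) is therefore 1/9.
DATA_BITS = 8
PARITY_BITS = 1

spurious_fraction = PARITY_BITS / (DATA_BITS + PARITY_BITS)
print(f"{spurious_fraction:.4f}")  # about 0.1111, i.e. roughly 1 error in 9
```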
pnelson@hobbes.uucp (Phil Nelson) (05/29/90)
In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:

>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>completely, and utterly useless.

Oh really? Please explain why parity checking would not have saved me much time and trouble in 1986, when I bought a flaky Pacific Cypress RAM expansion and then spent the next 2 months convincing them that the problem was not in the then-buggy Amiga software. It turned out that they needed larger bypass caps in their memory array; since the problem was intermittent, it did not show in their memory tests. If that box had parity, I and everyone else who bought that box before they were finally convinced that they had a problem and fixed it could have saved a whole lot of wasted time.

The advantage of parity checking is diagnostic: intermittent problems on complex systems can be very difficult to diagnose, particularly by end users like me, who, even if they have a certain amount of expertise, have not the time and equipment to isolate the tough ones.

>| \X/   lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips  |

Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508
The words of the wicked lie in wait for blood, but the mouth of the upright
delivers men. -Proverbs 12:6
sysop@tlvx.UUCP (SysOp) (05/29/90)
In article <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
> In article <756@bilver.UUCP> alex@bilver.UUCP (Alex Matulich) writes:
> >
> >However, the fast RAM chips are replaceable by the 4 meg variety. You can
> >stick 16 of them in the A3000. That's a potential for 16 parity errors
> >per day!

... [story about Cray 1 deleted]

> The Amiga 3000 is capable of holding (at least supporting) much more memory
> than a Cray 1, and the size of gates in modern memory is much smaller and
> thus they are more susceptible to alpha radiation induced parity errors than
> were the gates of the Cray memory.

Even "more susceptible"? Ok, why? How? If it were really that bad, then people using the A3000 now should be at least occasionally noticing weird things happening, right? But are they? What about my A1000 with 2.5 megs? (Of course, I don't have parity, so how can I tell? Sigh.)

> To take an Amiga seriously as a commercial machine in the "workstation,
> large memory" market, I'd guess error correcting code will turn out to be
....
> But then, what do I know after 29 years in the field about what people who
> buy the machines look for in a large processor? My computer purchases were
> limited to a couple of million bucks worth, down in the noise level in the
> marketplace. ;-)

It's not that I don't believe you, but... I would like some real concrete information. Explain to me this: I've used a 20 MHz AST 386 for at least 1.5 years at work, the last year or so with a total of 3 megs. While I've had problems (I AM developing software :-), no parity errors have ever appeared. Is it possible that the AST has no parity checking, or is it possible that parity errors are much more rare than some people think? Sure, I see nothing wrong with making a system more "reliable", but if it's not really doing any good, then it's a waste of time and money.
If parity is truly necessary, perhaps concrete proof is going to be needed to convince others that it's necessary. The Commodore engineers read these newsgroups, and I'm sure they've thought about this. Since there doesn't seem to be a large cry for it, and Commodore doesn't think it's necessary, just one or 2 people saying, "You need Parity or else you're not a Real Machine (TM)," isn't going to change anything.

This isn't a flame; it's just that there were already a lot of messages on this subject, and as with any subject, past a certain point you need to go beyond opinion and start with the hard cold facts. Since I don't know the rate of errors, I could learn something myself. (Earlier messages didn't convince me either way; my mind is still open to debate. Convince me!!! :-)

If it's only the denser chips that have the errors, then the question is: will such memory improve with technology such that this won't be a concern, like the less-dense chips (after some period of time)?

> Kent, the man from xanth, now zooming to the net from Zorch.
> (xanthian@zorch.sf-bay.org)

--
Gary Wolfe
uflorida!unf7!tlvx!sysop, unf7!tlvx!sysop@bikini.cis.ufl.edu
jesup@cbmvax.commodore.com (Randell Jesup) (05/29/90)
>In <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
>>The Amiga 3000 is capable of holding (at least supporting) much more memory
>>than a Cray 1, and the size of gates in modern memory is much smaller and
>>thus they are more susceptible to alpha radiation induced parity errors than
>>were the gates of the Cray memory.

I think you're making a pretty tenuous comparison here. I could be wrong, but I don't see Suns or the like using ECC (I'm not even certain all of them are using parity). I don't see that the marketplace has shown it to be important for desktop machines in the sun/whatever class, let alone the low-end sun/high-end amiga level.

>>To take an Amiga seriously as a commercial machine in the "workstation,
>>large memory" market, I'd guess error correcting code will turn out to be
>>vital. It would be a shame to have a big production run of the hardware
>>installed and on the street, only to have parity problems give the machine
>>a reputation as an unreliable machine, to be avoided in droves. Better if
>>the problem is solved before the reputation is besmirched.

I think ECC is one of our less important problems at the moment. If people care, they can drop in an ECC memory card (CPU slot for max speed, or Z-III) and put all their fast ram there. An opportunity for 3rd-party hardware companies - or it would be if anyone cared about ECC, which they (for the most part) don't.

>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>completely, and utterly useless.

Yup (or very close).

>>But then, what do I know after 29 years in the field about what people who
>>buy the machines look for in a large processor? My computer purchases were
>>limited to a couple of million bucks worth, down in the noise level in the
>>marketplace. ;-)

Amigas are "large processors", then? ;-) BTW, Commodore just sold about $10M of Amigas to the government (as reported in the WSJ, I think).
We (as part of a Sears business center deal) won a subcontract for supplying multitasking computers to the government. Apparently this is surprising, since we've only been trying to sell to the government for 6 months, and many firms don't make sales for 18 months, due to long product cycles. (Taken from the WSJ article, not anything internal.)

--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
dillon@overload.UUCP (Matthew Dillon) (05/29/90)
>In article <3620@tymix.UUCP> pnelson@hobbes.uucp (Phil Nelson) writes:
>In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>
>>Right.. ECC is not parity, and vice versa. Parity checking is totally,
>>completely, and utterly useless.
>
>Oh really? Please explain why parity checking would not have saved me much
>
>The advantage of parity checking is diagnostic, intermittent problems on

I have a tendency to agree. ECC is cute but expensive. It takes 7 bits of ECC to detect and correct 1 bit in a 32 bit wide word. The way you think about 1-bit ECC is that you need enough codes to generate the address of the incorrect bit, plus a no-error code, plus a parity bit. Unfortunately, that no-error code takes us from 5 to 6 bits, then one more to parity-check the ECC code itself. A 1-bit ECC can correct 1-bit errors and detect 2-bit errors.

2+ bit correction is MUCH more difficult (think of the # of codes required... at least double the number of bits as for 1-bit ECC, but the analogy I used above no longer holds, so it's even more!). When you get into >1 bit ECC you generally switch to burst-error correction (which requires fewer correct-codes and thus fewer bits of ECC). Unfortunately, burst error correction is useless when the medium is memory.

A simple 1-bit parity check is sufficient to detect the problem that ECC would have corrected, and allow the processor to map the page out with its MMU. In any case, this kind of failure occurs less often than you think. What most people come up against is a BAD DRam (i.e. the cause of the problem is not alpha radiation), in which case it is not reliable anyway and you simply have to replace the chip.

DRAMs these days are much more reliable than 10 years ago... even 5 years ago.

--
Matthew Dillon    uunet.uu.net!overload!dillon
891 Regal Rd.
Berkeley, Ca. 94708
USA
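[Editorial aside: Dillon's bit count can be checked against the standard Hamming bound. The r check bits must be able to name the position of any erroneous bit among the data-plus-check bits, plus one extra "no error" code, i.e. 2^r >= m + r + 1; adding one overall parity bit upgrades single-error correction to double-error detection (SECDED), giving 6 + 1 = 7 bits for a 32-bit word. A minimal sketch, in Python purely for illustration:]

```python
def min_hamming_check_bits(data_bits: int) -> int:
    """Smallest r satisfying 2**r >= data_bits + r + 1: the r check bits
    must encode the position of any single erroneous bit among the
    data_bits + r stored bits, plus one "no error" code."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

# 32 data bits need 6 Hamming check bits (5 is not quite enough, as Dillon
# notes); one more overall parity bit for double-error detection gives the
# 7 ECC bits per 32-bit word he mentions.
sec = min_hamming_check_bits(32)
secded = sec + 1
print(sec, secded)  # 6 7
```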
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/30/90)
In article <1641@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes:
>In <1990May27.101258.24470@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
>> [you already saw my part twice]
>
>Right.. ECC is not parity, and vice versa.

Sort of. Error Correcting Circuitry always (?) contains Error Detecting Circuitry, which is an extension of the parity error detection concept to more complex errors.

>Parity checking is totally, completely, and utterly useless.

I disagree! That depends on what you're trying to accomplish. Even parity checking that does nothing more than crash the machine with a "parity fault at 0Xnnnnnnnn", which at least tells you that there is a problem, beats all hollow having a bit flipped in a critical datum and receiving no warning at all, potentially until after you have made a costly (and wrong) decision based on the erroneous result.

Since parity checking is so well known a part of the state of the hardware engineering art, it is not clear to me that a company could escape unscathed from a lawsuit for consequential damages if it were omitted from the design of a machine offered for commercial use in this day and age. Most of the losers in those suits were the folks who thought it was safe to ignore Best Engineering Practice.

>>But then, what do I know after 29 years in the field [...]
>
>Geez... out-yeared me by 3. :-)

I started at 17, young in those days for a beginning programmer, over the hill today. ;-)

Kent, the man from xanth.
(xanthian@zorch.sf-bay.org)
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/30/90)
In article <321@tlvx.UUCP> sysop@tlvx.UUCP (SysOp) writes:
>kent> [...] the size of gates in modern memory is much smaller and
>kent> thus they are more susceptible to alpha radiation induced parity
>kent> errors than were the gates of the Cray memory.
>
>Even "more susceptible"? Ok, why? How? If it were really that bad, then
>people using the A3000 now should be at least occasionally noticing weird
>things happening, right? But are they? What about my A1000 with 2.5 megs?
>(Of course, I don't have parity, so how can I tell? Sigh.)

Sorry; it's amazing how susceptible I am to thinking that just because I have known something for a decade or more, it therefore is common knowledge. Herewith the (stupidly) omitted explanation:

Alpha radiation (a fast moving, stripped helium nucleus) originates within the naturally occurring radioactive impurities of the memory chip itself. By their nature (big, bumbling and slow compared to other kinds of radioactivity), alpha particles have very limited penetrating power; they do all their mischief near their point of origin. For our purposes, the thing of interest about their action is that, being extremely positively charged particles among atoms essentially in neutral balance, they have a large effect on the outer shell (conduction) electrons, pulling large numbers after them in their wake. They cause a parity error when these entrained electrons are deposited in a spot that causes a gate to shift its state from 0 to 1 or vice versa, for instance on one of the control lines.

The older memory chips had very little susceptibility to alpha radiation induced parity errors. Although the alpha radiation exists constantly at a low level in every chip, "large number of electrons" above must be considered relative to the number of electrons required to switch a gate.
The older memory chips, with larger "wiring" and component sizes, used larger switching currents, significantly larger than the amount of charge moved by one alpha particle. Since dynamic RAM means the memory is refreshed repeatedly by renewing the control charges holding the state (0 or 1) of each gate, there is not usually time for the charges carried by individual alpha particles to accumulate from several events and switch a gate before the refresh cycle sets the charge back to its nominal value.

In contrast, in denser, newer memory chips with smaller "wiring" and components, the charge delivered by a rogue alpha particle is of comparable size to the holding charge on a gate, and so the gate may be switched before a refresh cycle can correct the problem. Making the refresh cycles faster (than they are, not than the old circuits) is not an option, because most computer chips these days are heat limited, and more refreshes mean more heat.

So for an individual gate, denser memory means a larger chance of a bit being flipped _from_this_one_cause_. Still, chips are not highly radioactive, so for a single bit this is a very low probability. The problem comes when you accumulate megabytes of these bits together; the chances of all of them avoiding errors tail off rapidly as their number increases, in math similar to that the birthday paradox employs.

I'm a bit shaky on the numbers here, since I was last a hardware practitioner in 1972 and things have changed a trifle, but to the best of my understanding, with today's component sizes, speeds, and numbers of megabytes, you can expect to get in trouble somewhere between 1 and 100 megabytes. I defer to today's hardware practitioners for better data.

As to why you don't see problems in your 3 Meg AT, well, for one thing, as you mentioned, you don't have parity checking, so they could get by.
Next, most of the memory you have (or at least what I had when using a 5 Meg '386 box) goes unused by most applications, which are still stuck at the 640K limit. Again, at least in my Amiga, about 5 megabytes is loaded with software I may not use from boot to boot, but keep around because it is convenient. More, in running code, lots of the code space is never touched (use a file zapper; lots of it is huge blocks of zeros). Again, with stuff such as screen memory, if you get a bit flipped, you may never notice before you switch screens or windows in a screen and rewrite the soft parity error with good data. Similarly, would you really be likely to notice a one bit error in a sampled sound data block? Besides the above, your machine may sit idle 20 hours a day, not even powered up.

In summary, there are lots of reasons why alpha induced parity errors would not be a big enough problem to become noticeable. Yet. But like the birthday paradox, you don't have too far to go in terms of bigger applications exercising more of the machine, full time unattended operation (e.g. raytracing, doing accounts), more memory, more critical applications, and so on, before you run into Seymour Cray's problem. Parity checking is a necessity in large machines, just to be able to rely on the results the machine gives you. Error correcting circuitry is a necessity in large machines, to get the kind of uptime and throughput the machine's raw speed and memory size seem to promise.

That's probably more than you wanted to know, and please excuse any details that might not be "just so". Since I stopped doing this stuff for a living, I'm a fairly casual student of the art. More is available in IEEE pubs, Scientific American, and so on.

Kent, the man from xanth.
(xanthian@zorch.sf-bay.org)
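[Editorial aside: Kent's birthday-paradox point can be made concrete with a toy calculation. The per-bit upset rate below is a made-up illustrative number, not a measured figure from any of these posts: given any fixed per-bit probability p of an upset per day, the chance that an N-megabyte array gets through a day clean is (1-p)^(N * 8 * 2^20), which collapses quickly as N grows.]

```python
def p_at_least_one_upset(megabytes: float, p_bit_per_day: float) -> float:
    """Probability that at least one bit in the array flips in a day,
    assuming independent upsets at a fixed (hypothetical) per-bit rate."""
    bits = megabytes * 8 * 2 ** 20
    return 1.0 - (1.0 - p_bit_per_day) ** bits

# Hypothetical per-bit upset rate of 1e-9 per day, for illustration only.
RATE = 1e-9
for mb in (1, 16, 100):
    print(mb, "MB:", p_at_least_one_upset(mb, RATE))
```

With this (invented) rate, a 1 MB machine almost never sees an error in a given day, while a 100 MB machine has better-than-even odds of one, which is the shape of the "somewhere between 1 and 100 megabytes" trouble zone Kent describes.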
kevin@cbmvax.commodore.com (Kevin Klop) (05/30/90)
Please excuse this disagreement. I'm not a hardware designer, but am making what seem to me to be logical inferences and deductions. If I err, please correct me gently 8^).

In article <1990May29.204550.27961@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

[ Explanation of how alpha particles affect chips omitted in the interests of brevity ]

>The problem comes when you accumulate megabytes of these bits together;
>the chances of all of them avoiding errors tail off rapidly as their
>number increases, in math similar to that the birthday paradox employs.
>
>I'm a bit shaky on the numbers here, since I was last a hardware
>practitioner in 1972 and things have changed a trifle, but to the
>best of my understanding, with today's component sizes, speeds, and
>numbers of megabytes, you can expect to get in trouble somewhere
>between 1 and 100 megabytes. I defer to today's hardware practitioners
>for better data.
>
>As to why you don't see problems in your 3 Meg AT, well, for one thing, as
>you mentioned, you don't have parity checking

Actually, ATs _do_ have parity-checked RAM. I used to run one system with 8 megs of RAM, all of it parity checked (that's why there were 9 chips per bank rather than 8).

>, so they could get by.
>Next, most of the software you run (or at least what I ran when using a
>5 Meg '386 box) is unused by most applications, still stuck at the 640K
>limit.

True, but that's immaterial to parity-checked memory such as what's in an AT - if a bit in a memory chip flips, then the parity check on that row will reveal a problem, regardless of whether your current application is using that memory or not. And once you DO start using that memory, any bit flips prior to your usage are immaterial, as the first thing a program should do is initialize the memory it is using, and thus it won't know that a bit got flipped, assuming that there's no parity check to have discovered this first.
[ stuff about unused memory and/or machines not showing errors ]

>But like the birthday paradox, you don't have too far to go in terms of
>bigger applications exercising more of the machine, full time unattended
>operation (e.g. raytracing, doing accounts), more memory, more critical
>applications, and so on, before you run into Seymour Cray's problem.
>Parity checking is a necessity in large machines, just to be able to
>rely on the results the machine gives you. Error correcting circuitry
>is a necessity in large machines, to get the kind of uptime and
>throughput the machine's raw speed and memory size seem to promise.

I ran an 8 meg AT as a XENIX system that was using all of its memory constantly. In 4 years of operation, I never once got a memory parity error (although a second AT with a lot less memory seemed to get them regularly - but once I got one, they would show up in droves until I replaced the memory card or chip that was causing me problems).

Now, I admit that arguing from a statistical sampling of two machines can hardly be thought of as a valid sampled universe; however, it does make me wonder whether the chances are all that great of such errors happening. Yes, ERCC circuitry would make things more reliable, but I wonder if all that many applications truly require this, and if they do, whether there's a market for add-on memory that does its own ERCC.

>Kent, the man from xanth.
>(xanthian@zorch.sf-bay.org)

--
Kevin Klop		{uunet|rutgers|amiga}!cbmvax!kevin
Commodore-Amiga, Inc.

The number, 111-111-1111 has been changed. The new number is:
134-253-2452-243556-678893-3567875645434-4456789432576-385972

Disclaimer: _I_ don't know what I said, much less my employer.
jesup@cbmvax.commodore.com (Randell Jesup) (05/30/90)
In article <1990May29.204550.27961@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
>alpha particles have very limited penetrating power; they do all their
>mischief near their point of origin.
...
>The older memory chips had very little susceptibility to alpha radiation
>induced parity errors.
...
>Since dynamic RAM means the memory is refreshed repeatedly by renewing the
>control charges holding the state (0 or 1) of each gate, there is not usually
>time for the charges carried by individual alpha particles to accumulate
>from several events to switch a gate, before the refresh cycle sets the charge
>back to its nominal value.

All well and true; however, advances have been made in reducing susceptibility to alpha errors, I think perhaps enough to offset the reduction in charge storage. For example, plastic-packaged parts have fewer alpha problems, as I remember, due to lower radioactivity rates. All 1Mb and higher RAM I've seen is plastic, though there could well be some ceramic out there somewhere.

--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
LEEK@QUCDN.QueensU.CA (05/31/90)
From the articles in the ECC/parity bit thread that I have been reading so far, it seems to me that memory error is far less likely to be the cause of system reliability concern... The reliability of a system is only as good as its weakest component. I have seen my machine crash more often due to programming bugs and bad programming.

How clean is the power source that we plug our trusty machine into? Can we trust the power company to deliver 100% regulated power, free of power surges and brownouts? (Answer is NO!!) Can we trust the other appliances in the building not to produce power surges? That's the reason why some companies make a fortune selling UPSes and power conditioners.

How much do we trust our CPU not to fail? Are there some hidden bugs in the CPU or peripheral chips that would fail under some conditions? Intel got a few of these nasty bugs in their early batches of 386 CPU chips, etc. (I am sure things like this pop up once in a while.) Some companies insist on running parts outside their specified range. This might potentially cause problems when mixed with other out-of-spec designs.

The list of things that can go wrong can go on forever. My point is that the memory system is one of the less probable causes of system failure. Given the cost of ECC, it might be more worthwhile to spend that money to prevent other, more likely causes of failure.

K. C. Lee
lphillips@lpami.wimsey.bc.ca (Larry Phillips) (06/01/90)
In <dillon.4072@overload.UUCP>, dillon@overload.UUCP (Matthew Dillon) writes:

>>Parity schemes, on the other hand, cannot detect the failure of a parity bit
>>itself, and thus reduce the overall reliability as a tradeoff for knowing when
>
> A parity scheme will detect all one bit errors, even if the bit that
> error'd is the parity bit itself. The parity scheme does not know *which*
> bit err'd, or whether it was the parity bit itself that err'd, but it
> will detect any single bit error.
>
> A reasonable ECC scheme (7 bits to correct 32 bits, as I mentioned in my
> previous posting) will detect and correct all 1 bit errors where that 1
> bit is any one of the 32 bits. It will detect any single bit error in
> the ECC code itself, in which case the real data is assumed to be valid
> and no other action is taken. I believe the scheme will also detect any
> two bit errors (through all 39 bits).
>
> One should never think of an ECC scheme in terms of whether the erroneous
> bits are in the ECC part or the real-data part. Or, at least, I never
> think of it that way. You tend to produce weak algorithms when you
> consider cases that depend on the meaning of bits rather than work on
> a general algorithm that can do a better job all around.

Right. One does not need to think about which bit has failed in an ECC memory access, because the data is always assumed to be correct when it arrives at the destination. With ECC, all bits must be assumed to be equally important, since you are depending on all bits in order to make the above assumption.

My comment had to do with parity schemes, where there is no choice. Since you cannot know which bit failed, you cannot assume the data is intact at the destination. From this, you can unequivocally state that the addition of 1/8 more bits to the memory has done one thing for you, and one thing to you. The thing it has done for you is to tell you that _some_ bit did not read out correctly.
The thing it has done _to_ you is to increase the chances of a bit being read out wrong. The point is not whether you should know if a parity bit fails, but that if it did happen to be a parity bit that failed, it was a 'useless' error; one that would not have happened if you did not have parity checking. Again, on average, 1/9 of the errors will fall into this category, though you will not know which ones they are. >>you had an error, even if that error is meaningless and would not have happened >>without the parity bit being present. Statistically speaking, if parity is > > Thinking of things that way will wind you into a corner fast! Not at all. It will only get you into trouble if you try to divine which bit failed, and act upon it. We don't get ourselves in trouble just because we possess the knowledge that approximately 1/9 of all errors are spurious. :-) >>have a lot of choice. With an ECC scheme, the system can make note of the error >>and keep using the memory, allowing it to map the page out when the number of >>errors exceeds a threshold over a predefined period of time. It will also >>allow reporting of single bit errors to the operator, who can make a good >>judgement as to the root cause, and take action as appropriate. > > This is one good use of ECC... to detect failing memory. Some of the memories I have worked on have had literally thousands of memory chips (imagine 96 megs worth of 4KBit chips), and the ECC was invaluable for detecting a degenerating chip. We kept accurate records of all single bit errors, provided by the memory itself and stored for readout at PM time, and it was quite easy to discard the 'random event' failures, and to catch anything that was on its way to being a solid error. Replacing the chip before it got to be a solid problem meant a saving of many hours trying to track down double-bit errors, which were MUCH harder to isolate. -larry -- The raytracer of justice recurses slowly, but it renders exceedingly fine. 
+-----------------------------------------------------------------------+ | // Larry Phillips | | \X/ lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips | | COMPUSERVE: 76703,4322 -or- 76703.4322@compuserve.com | +-----------------------------------------------------------------------+
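[Editor's note: the parity behaviour Larry and Matt agree on, that a single parity bit detects any one-bit error, including a flip of the parity bit itself, without locating it, can be sketched in a few lines of Python. This is a minimal illustration, not code from the thread; the 8-data-bit word and function names are invented for the example.]

```python
# A minimal sketch of even parity over one byte: the stored word is
# 8 data bits plus 1 parity bit.  Any single-bit flip (data bit or
# parity bit alike) changes the overall parity, so it is detected,
# but nothing identifies *which* bit flipped.

def store(data):                       # data: 0..255
    parity = bin(data).count("1") & 1  # make total number of 1s even
    return (data << 1) | parity        # 9-bit stored word

def check(word):
    return bin(word).count("1") & 1 == 0   # True if parity still even

w = store(0b1011_0010)
assert check(w)                        # clean readout passes
for bit in range(9):                   # flip each of the 9 stored bits
    assert not check(w ^ (1 << bit))   # every single-bit error is caught
```

Note that 1 of the 9 stored bits is the parity bit itself, which is exactly Larry's point that roughly 1/9 of detected errors are "useless" ones that only exist because the parity bit was added.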
dillon@overload.UUCP (Matthew Dillon) (06/01/90)
>The fact that a properly designed ECC scheme can correct errors in the ECC bits >themselves makes it far more desirable for reliability and recoverability, >though at a greater cost. > >Parity schemes, on the other hand, cannot detect the failure of a parity bit >itself, and thus reduce the overall reliability as a tradeoff for knowing when A parity scheme will detect all one bit errors, even if the bit that error'd is the parity bit itself. The parity scheme does not know *which* bit err'd, or whether it was the parity bit itself that err'd, but it will detect any single bit error. A reasonable ECC scheme (7 bits to correct 32 bits as I mentioned in my previous posting) will detect and correct all 1 bit errors where that 1 bit is any one of the 32 bits. It will detect any single bit error in the ECC code itself in which case the real data is assumed to be valid and no other action is taken. I believe the scheme will also detect any two bit errors (through all 39 bits). One should never think of an ECC scheme in terms of whether the erroneous bits are in the ECC part or the real-data part. Or, at least, I never think of it that way. You tend to produce weak algorithms when you consider cases that depend on the meaning of bits rather than work on a general algorithm that can do a better job all around. An interesting extension to ECC for anybody interested is to consider the general-expansion case... to correct N bits of error in the data portion of the code (32 bits), and to detect and ignore one and two bit errors in the ECC itself.

    32 bits + 7 bits ECC    corrects any single bit error in the 32 bits
                            (7 = lg(32+1) + 1)
    \__________________/
            + 7 bits ECC    corrects any single bit error in the 40 bits,
                            which means this corrects any two-bit errors
                            that occur in the first 39 bits, since it will
                            correct one and the 7 bit ECC will correct the
                            other.  (7 = lg(39+1) + 1)

    And so on.
The number of bits of ECC required for each level goes up according to the log of the number of bits requiring correction. To correct, you start at the outermost level and move inward. Also, there is another term which I have not described which needs to be added to detect multi-bit errors in the outer ECC codes to keep the algorithm a general N bit detect and correct. It can get messy. >you had an error, even if that error is meaningless and would not have happened >without the parity bit being present. Statistically speaking, if parity is Thinking of things that way will wind you into a corner fast! >have a lot of choice. With an ECC scheme, the system can make note of the error >and keep using the memory, allowing it to map the page out when the number of >errors exceeds a threshold over a predefined period of time. It will also >allow reporting of single bit errors to the operator, who can make a good >judgement as to the root cause, and take action as appropriate. This is one good use of ECC... to detect failing memory. >In Very Important Applications, I would go for ECC. In other situations, I >would go for no checking at all. Parity is useless. If the machine must stay up for months at a time, ECC does get to be important. >| // Larry Phillips | >| \X/ lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips | >| COMPUSERVE: 76703,4322 -or- 76703.4322@compuserve.com | >+-----------------------------------------------------------------------+ -Matt -- Matthew Dillon uunet.uu.net!overload!dillon 891 Regal Rd. Berkeley, Ca. 94708 USA
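[Editor's note: the "7 bits protect 32" scheme Matt describes is the classic SECDED construction. The sketch below uses the textbook layout, Hamming check bits at positions 1, 2, 4, 8, 16, 32 of a 38-bit codeword plus one overall parity bit; this exact layout is an assumption, since the post does not fix one. All names are invented for the example.]

```python
# SECDED sketch: 32 data bits + 6 Hamming check bits + 1 overall
# parity bit = 39 stored bits.  Corrects any single-bit error and
# detects (without correcting) any double-bit error.

CHECKS = (1, 2, 4, 8, 16, 32)            # power-of-two positions

def encode(data):                        # data: 32-bit int
    bits, d = {}, 0
    for pos in range(1, 39):
        if pos & (pos - 1):              # non-power-of-two: data slot
            bits[pos] = (data >> d) & 1
            d += 1
        else:
            bits[pos] = 0
    for c in CHECKS:                     # check c covers positions with bit c set
        bits[c] = sum(v for p, v in bits.items() if p & c) & 1
    overall = sum(bits.values()) & 1     # 39th bit: parity over all 38
    return bits, overall

def correct(bits, overall):
    syndrome = 0                         # syndrome = position of a single error
    for c in CHECKS:
        if sum(v for p, v in bits.items() if p & c) & 1:
            syndrome |= c
    if syndrome and sum(bits.values()) & 1 != overall:
        bits[syndrome] ^= 1              # single-bit error: flip it back
    elif syndrome:
        raise ValueError("double-bit error: detected, not correctable")
    return bits                          # syndrome 0: data intact

def decode(bits):
    data, d = 0, 0
    for pos in range(1, 39):
        if pos & (pos - 1):
            data |= bits[pos] << d
            d += 1
    return data

bits, ov = encode(0xDEADBEEF)
bits[5] ^= 1                             # flip one stored bit
assert decode(correct(bits, ov)) == 0xDEADBEEF
```

An error in a check bit corrects itself the same way (the syndrome simply lands on a power-of-two position), which is Matt's point about not treating ECC bits and data bits differently.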
eachus@linus.mitre.org (Robert I. Eachus) (06/07/90)
In article <1655@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes: > In Very Important Applications, I would go for ECC. In other situations, I > would go for no checking at all. Parity is useless. >> DRAMs these days are much more reliable than 10 years ago... even 5 years >> ago. > You rest my case. :-) This is more in the nature of an agreement than a flame, but there are circumstances where ECC is LESS reliable than no checking currently...specifically when speed limits are being pushed. Modern DRAMs are fairly well protected against cosmic ray induced errors, and other transients, but if you use ECC circuitry, the overall reliability of a memory system has to include the possibility that the ECC circuitry returns the wrong value or (much more common) does not assert the correct value soon enough. On most ECC memory systems this is the most frequent cause of uncorrected error (even though it most frequently occurs when a bit has been flipped). Worse, if such an error occurs when the memory is read correctly, it is not even detected by most transient fault counters. It takes a lot of extra logic to look for signal changes on the bus immediately after the correct value is expected to be asserted. I used to work at Stratus Computer, and (due to duplexed ECC memory boards) we could detect and count both types of faults. (It's only a failure if the program sees bad data...) That is the minimum I would recommend for life critical systems. We did see both kinds of faults, and I would imagine that without very careful design, today's ECC memory does not significantly improve reliability. (Before the flames start... If a system has a transient memory failure every ten months with ECC and every three months without, that is not a significant difference. As a fault-tolerant designer, I would want to push it out to several years, preferably a century or two. You can't do that with simple ECC.) 
If you have a friend with an IBM compatible with parity, ask him when he last had a parity error. My guess is that with 256K or 1 Meg parts, it should be significantly less than 1 per Megabyte per year. -- Robert I. Eachus Amiga 3000 - The hardware makes it great, the software makes it awesome, and the price will make it ubiquitous.
pnelson@hobbes.uucp (Phil Nelson) (06/08/90)
In article <90151.123059LEEK@QUCDN.BITNET> LEEK@QUCDN.QueensU.CA writes: |From the articles in the ECC/Parity bit thread that I have been reading so far, |it seems to me that memory error is far less likely to be the cause of system |reliability concern... The reliability of a system is only as good as the |weakest component. You may want to consider that what you have been reading is the opinion of some people that memory chips are so reliable that "parity is useless". The facts (if we had any) may be otherwise. If the Amiga had parity, it would be easy to get good data on the reliability of the memory IN THE BOX (not in some chip test lab) and IN THE FIELD (not some clean, quiet final test area). | I have seen my machine crashing more often due to programming bugs and bad |programming. How clean is the power source that we plug our trusty machine into |? Can we trust the power company to deliver 100% regulated power free of power |surges and brownouts ? (Answer is NO !!) Can we trust the other appliances |in the building not to produce power surges ? That's the reason why some |companies make a fortune selling UPS and power conditioner. These are good points. I think it very likely that the memory system is not the greatest cause of unreliability on the Amiga. Certainly not if you include software bugs. This does not prove that parity checking is useless, but that other measures are needed too. The order in which to take measures to improve reliability is not determined exclusively by which is the worst problem; it may be reasonable to start with a problem that is not the worst, if a solution is easily implemented (memory parity checking, for example). | How much do we trust our CPU not to fail ? Is there some hidden bug in |the CPU or peripheral chips that would fail under some conditions ? | |Intel got a few of these nasty bugs in their early batches of 386 CPUs |chips etc.. (I am sure things like this would pop up once in a while.) 
Some |companies insist on running parts outside their specified range. This might |potentially cause problems when mixed with other out of spec designs. | |The list of things that can go wrong can go on forever. My point is that the |memory system is one of the less probable causes of system failure. Given the |cost of ECC, it might be more worthwhile to spend that money to prevent other |more likely causes of failure... I think the cost of ECC cannot be justified on the Amiga, unless for special applications. The added cost of simple parity checking (not very great) might easily be justified because it would help by allowing the early detection and repair of machines with memory problems. It would be especially useful for machines with flaky, intermittent memory. |K. C. Lee Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508 If you thought prohibition was fun, you're gonna LOVE gun control.
charles@hpcvca.CV.HP.COM (Charles Brown) (06/08/90)
> This is more in the nature of an agreement than a flame, but there > are circumstances where ECC is LESS reliable than no checking > currently...specifically when speed limits are being pushed. > Modern DRAMs are fairly well protected against cosmic ray induced > errors, and other transients, but if you use ECC circuitry, the overall > reliability of a memory system has to include the possibility that the > ECC circuitry returns the wrong value or (much more common) does not > assert the correct value soon enough. ... If the ECC RAM returns the correct value but too late, it is not designed correctly. Part of the task of design is to make sure there is enough margin. So you have not demonstrated your point. What you have demonstrated is that: Poorly designed ECC RAM is sometimes less reliable than well designed RAM w/o ECC. So what. > If you have a friend with an IBM compatible with parity ask him > when he last had a parity error. My guess is that with 256K or 1 Meg > parts, it should be significantly less than 1 per Megabyte per year. > -- > Robert I. Eachus But I agree that a well designed RAM should have few errors. The HP9000/350 with 16MB RAM that I use at work has parity. (Later models come with ECC.) It seems to be averaging one parity error every six months. For my computer uses that is good. For life critical uses that would not be good enough. -- Charles Brown charles@cv.hp.com or charles%hpcvca@hplabs.hp.com or hplabs!hpcvca!charles or "Hey you!" Not representing my employer.
lphillips@lpami.wimsey.bc.ca (Larry Phillips) (06/08/90)
In <3649@tymix.UUCP>, pnelson@hobbes.uucp (Phil Nelson) writes: > > You may want to consider that what you have been reading is the opinion of some >people that memory chips are so reliable that "parity is useless". The facts >(if we had any) may be otherwise. If the Amiga had parity, it would be easy to >get good data on the reliability of the memory IN THE BOX (not in some chip >test lab) and IN THE FIELD (not some clean, quiet final test area). Since I am the one that used the words "parity is useless", I think I will say that you should refrain from placing words in my mouth that were never there. I did _not_ make that statement because I think that memory is too reliable. I said it because I see no real use for adding extra memory, at extra cost, thereby statistically reducing reliability, for the sole purpose of either (a) informing the user that a parity error has occurred, or (b) crashing the program or system. >These are good points. I think it very likely that the memory system is not >the greatest cause of unreliability on the Amiga. Certainly not if you >include software bugs. This does not prove that parity checking is useless, >but that other measures are needed too. The order in which to take measures >to improve reliability is not determined exclusively by which is the worst >problem, it may be reasonable to start with a problem that is not the worst, >if a solution is easily implemented (memory parity checking, for example). In what way do you see parity checking as 'measures to improve reliability'? I think you are confusing reliability with some other parameter. Parity checking, if it only informs you of a parity error, does not change the reliability of a system at all. If it is used to halt a task or a system, it does, in fact, reduce reliability. You might want to ask yourself what the benefits of parity checking are, vs. the cost of it. Benefits: Information. 
You know you had a memory error, and have the option of rerunning anything that might possibly have been affected by it. Information. You know that after running any particular program, if you were not informed of a parity error, that any errors you may have, were caused by something else. Note that the lack of a parity error says nothing about the accuracy of your results, and that the presence of a parity error likewise says nothing about the accuracy of your results. Costs: Parts. Wasted time/resources. If a parity error occurred in a non-important part of memory (including the parity bit memory itself), you have no way of knowing that you didn't need to rerun a program. The mere presence of a parity error indication tells you nothing but that there was a parity error, but encourages users to rerun things, and lulls them when the little light doesn't come on. > I think the cost of ECC cannot be justified on the Amiga, unless for special >applications. The added cost of simple parity checking (not very great) might >easily be justified because it would help by allowing the early detection >and repair of machines with memory problems. It would be especially useful >for machines with flaky, intermittent memory. The most useful thing for machines with flaky, intermittent memory is a trip to the repair shop. Flaky, intermittent memory will show up in other ways, without having to add more flaky, intermittent memory. -larry -- The raytracer of justice recurses slowly, but it renders exceedingly fine. +-----------------------------------------------------------------------+ | // Larry Phillips | | \X/ lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips | | COMPUSERVE: 76703,4322 -or- 76703.4322@compuserve.com | +-----------------------------------------------------------------------+
<LEEK@QUCDN.QueensU.CA> (06/08/90)
In article <3649@tymix.UUCP>, pnelson@hobbes.uucp (Phil Nelson) says: >(if we had any) may be otherwise. If the Amiga had parity, it would be easy to >get good data on the reliability of the memory IN THE BOX (not in some chip >test lab) and IN THE FIELD (not some clean, quiet final test area). > some stuff deleted... > >These are good points. I think it very likely that the memory system is not >the greatest cause of unreliability on the Amiga. Certainly not if you >include software bugs. This does not prove that parity checking is useless, >but that other measures are needed too. The order in which to take measures >to improve reliability is not determined exclusively by which is the worst >problem, it may be reasonable to start with a problem that is not the worst, >if a solution is easily implemented (memory parity checking, for example). > > I think the cost of ECC cannot be justified on the Amiga, unless for special >applications. The added cost of simple parity checking (not very great) might >easily be justified because it would help by allowing the early detection >and repair of machines with memory problems. It would be especially useful >for machines with flaky, intermittent memory. > > >|K. C. Lee > > >Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508 > > If you thought prohibition was fun, you're gonna LOVE gun control. The problem with a parity bit (vs ECC) is that it only does single-bit error detection. If an even number of bits are in error, it doesn't know. The other things have to do with design/economics/space. I can give you an example for my particular setup. I have 4 meg of RAM for my 18MHz 020. 32 chips of 256K*4. For a parity scheme, I would need 1 parity bit per byte. The 680x0 is byte-addressable, so to do this properly one needs a parity bit for each 8-bit group. Due to design and reliability problems, I would need a separate chip for each 8-bit group. Reasoning... 
Unless the 4 parity bits are squeezed into a 256K*4 chip, one would have to have a separate scheme for addressing those 4 parity bits and multiplex/demultiplex the parity bits to allow for byte/16-bit word/32-bit long word access. Sorry, I can't do that without violating timing constraints unless faster chips are used instead of 60 ns. Since the parity bits are accessed in a different way, there is a different gate delay, which results in slightly different timing. The hardware should also latch/synchronize the data bits before doing a checksum... All this mess to save space is not funny. To complicate things a bit more, I am running the 32 chips in a 4 bank page interleave mode, so I can't use 4 1 Meg*1 chips for the whole group either. That's 1 256K*1 chip per every 2 256K*4 chips. Excluding extra hardware (and software exception handler..), that takes up 50% more space in my particular piece of memory board.. and all this trouble just to be able to warn me of a possible parity bit error??? If I (or someone at C=) were to go through the trouble to add parity, might as well go for ECC. A couple more bits per 32-bit word and you get automatic error detection and correction (and available with almost off-the-shelf parts, eg. DRAM controller and ECC chip set from National Semiconductor). ECC/memory parity bits to boost reliability are like replacing all the wiring of a stereo (with cheap speakers) with solid silver strips - the sound certainly improves, but it might be more cost effective to replace the speakers. The speaker is a mechanical system and it generates more distortion than the electronic components. Most people would agree with the above example. Now substitute the words below.. (stereo -> computer system, silver as wiring -> ECC, speakers -> programs, mechanical system -> software, distortion -> crashes) Sure, if one has plenty of $$$ after upgrading the speakers to match the amp, one can spend $$$ on superconducting wire :) to hook up the speakers and maybe a UPS for the stereo too. :) K. C. Lee (I don't know much about stereo (I only know the electronic side of it), so don't flame me for using the above example incorrectly)
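[Editor's note: K. C. Lee's point that a byte-addressable 680x0 forces one parity bit per byte lane can be made concrete with a short sketch. This is an invented illustration, not code from the post.]

```python
# Why byte addressing forces one parity bit per byte lane: each byte
# of a 32-bit word can be written independently, so each byte carries
# its own parity bit, giving 4 parity bits per word (the 1/8 memory
# overhead discussed earlier in the thread).

def byte_parity(b):
    return bin(b).count("1") & 1       # even parity for one byte

def word_parities(word):               # 32-bit word -> 4 lane parities
    return [byte_parity((word >> (8 * i)) & 0xFF) for i in range(4)]

word = 0x12AB34CD
lanes = word_parities(word)
# A byte write touches only its own lane's parity bit:
lanes[0] = byte_parity(0x7E)           # rewrite the low byte only
assert lanes[1:] == word_parities(word)[1:]   # other lanes untouched
```

With a single parity bit over the whole 32-bit word instead, every byte store would have to read the other three bytes first to recompute the parity, which is exactly the read-modify-write penalty discussed later in the thread.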
eachus@linus.mitre.org (Robert I. Eachus) (06/09/90)
In article <1410047@hpcvca.CV.HP.COM> charles@hpcvca.CV.HP.COM (Charles Brown) writes: > If the ECC RAM returns the correct value but too late, it is not > designed correctly. Part of the task of design is to make sure there > is enough margin. So you have not demonstrated your point. What you > have demonstrated is that: Poorly designed ECC RAM is sometimes less > reliable than well designed RAM w/o ECC. So what. Way off... With modern memory parts everything is quantized and statistical since the charges being moved are on the order of 20,000 times the charge of an electron. In theory (but very, very rarely) sometimes you will get no electrons willing to move and read a one as a zero. Much more likely is that the charge in a particular cell is represented by significantly fewer than the average number of electrons. This will result in a slower rise-time when the cell is read, so you must allow some slack in your design for this "jitter" in the rise time. How much? 6 Sigma? 10 Sigma? Whatever number you choose there is some statistical chance that an error will occur because the values were latched too early. EDAC is usually done without latching the values, since latching will usually add a clock cycle to the memory delay. If there is a complete error in a single bit in EDAC memory, no problem. However, a "slow" bit can result in the output of an EDAC PAL being wrong at precisely the wrong time, even though it was correct a nanosecond earlier, and will be correct a nanosecond later. (The logic paths through a PAL are often different lengths, and other effects can also add jitter to the signal.) This was the effect I was referring to when I said that these late bit errors more often occur when a bit is being corrected. Since the EDAC circuitry adds timing uncertainty to a memory system, and also slows it down, it is much more difficult to allow a 10 Sigma margin on EDAC circuitry. 
(Six sigma gives you one error per billion, or about one every 3 minutes on memory read at 5 MHz. At nine sigma, a fifty per cent increase in margin (which might be an added 3 ns of delay without EDAC, or an added 10 ns with EDAC), an error will occur every 20,000 years.) To repeat myself, a good EDAC memory today is unlikely to be significantly better than a well designed memory with no checking. But now to change the subject: if you have all accounting data entered and then verified by a separate operator, you can get the error rate down to 1 in 10,000 for keyboard input, so increasing the probability of error by one-millionth by using a machine without parity checking is in the noise, especially since, when accounting is done on PCs, usually the operator checks his or her own input. So when Bob Silverman down the hall wants to factor a 100+ digit number using several decades of machine time, he cares about self-checking and memory error rates. Accountants don't. -- Robert I. Eachus with STANDARD_DISCLAIMER; use STANDARD_DISCLAIMER; function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...
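[Editor's note: Eachus's six-sigma arithmetic checks out under the natural reading of a one-sided Gaussian tail; the short sketch below verifies it. The Gaussian model and the read rate are taken from his post; everything else is invented for the example.]

```python
import math

# The one-sided Gaussian tail probability at k sigma is
# 0.5 * erfc(k / sqrt(2)); at six sigma this is ~1e-9, i.e. "one
# error per billion" reads, as the post claims.

def tail(k):
    return 0.5 * math.erfc(k / math.sqrt(2))

p6 = tail(6.0)                   # ~9.9e-10
reads_per_sec = 5e6              # the post's 5 MHz memory read rate
seconds = 1 / (p6 * reads_per_sec)
print(f"P(6 sigma) = {p6:.2e}, one error every {seconds / 60:.1f} minutes")
```

This reproduces the post's "about one every 3 minutes" figure (the sketch gives roughly 3.4 minutes).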
LEEK@QUCDN.QueensU.CA (06/09/90)
Someone asked me a question about why I could use ECC for the whole 32-bit word and not a parity bit per 32-bit word... I guess I should bring my data books to work next time so that when I goof off to do comp.sys.amiga, I would have all the references I need. The following is my reply to the EMAIL ----------------------------------------------------------------------------- Subject: Parity bit business The 680x0 family can do either 8-bit, 16-bit (and the 020 and above can do 32 bit) access

    MSB                                  LSB
     33222222 22221111 11111100 00000000
     10987654 32109876 54321098 76543210
     |<---->| |<---->| |<---->| |<---->|   Byte access
     |<------------->| |<------------->|   16-bit words
     |<------------------------------->|   32-bit long words

The thing is, the CPU can address each of the 4 bytes individually. One would not be able to use a single parity bit for the whole group. The parity for the MSB access would not be the same as the one for the LSB. The only way to use a single parity bit is to somehow access the whole 32-bit chunk at a time. This is messy. One would need a parity bit for each byte for performance reasons (see below). The PC/AT implements parity this way. This is how they do the ECC for 32-bit memory. (I thought they had an easier way. :( ) The memory write cycle is changed into a Read-Modify-Write cycle. The read cycle lets the ECC chip read in the 39-bit word (32 bits for data, and 7 bits for ECC parity bits). The ECC now looks at the byte/word to be written and computes the 7 bit parity for the whole 32-bit word. The data and ECC parity bits are passed to the memory in the write cycle. Hmmm.. This is not a pretty sight, as a Read-Modify-Write cycle imposes quite a bit of speed penalty. It is quite a bit longer than just a write cycle. The thing is, for 32-bit ECC there are special ECC chips that work with the DRAM controller and are designed to provide the necessary timings and other stuff. There are no parity bit chips at the VLSI level for parity bits. 
If one only wants parity, one has to sit down and build it from discrete components (MSI or PALs), and it is usually much easier to have 1 parity bit per byte rather than 1 parity bit for the 32-bit word. Sorry, I guess I should have made the article at home rather than at work. The fact still remains the same - it is more worthwhile to use ECC than parity bit(s) due to availability of components. There are VLSI parts designed for 32-bit buses vs TTL MSI/PAL discrete-chip designs for parity-bit-only designs. (By the time one gets down to using only 1 parity bit per 32-bit word, the amount of MSI/PALs would be as complicated as if the thing was designed with 1 parity bit per byte. The added penalty for a write becoming a R-M-W cycle throws away the saving in number of RAM chips.) Due to performance/design, one usually has 1 parity bit per byte. (eg. PC/AT) I guess the memory controller market people do not believe in parity - either it is ECC or nothing at all. (same here for me) Hope this clears up the mess. ---------------------------------------------------------------------------- K. C. I guess I should drink coffee when I work. Hmmm that doesn't make sense as I would be awake during working hours... Zzzzz away.
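[Editor's note: the Read-Modify-Write cycle K. C. Lee describes for a byte store into word-wide ECC memory can be sketched as below. This is an invented illustration; check() is a toy stand-in for the real 7-bit ECC generator, since the cycle structure, not the code itself, is the point.]

```python
# Byte store into word-protected ECC memory: the controller must
# READ the whole word, MODIFY the one byte, recompute the check
# bits over the full word, then WRITE word + check bits back.

def check(word):                   # toy check code, NOT real SECDED
    return bin(word).count("1") & 1

memory = {0x100: (0x12345678, check(0x12345678))}   # addr -> (word, check)

def write_byte(addr, lane, value):
    word_addr = addr & ~0x3
    word, _ = memory[word_addr]                     # 1. READ word + check bits
    shift = 8 * lane
    word = (word & ~(0xFF << shift)) | (value << shift)   # 2. MODIFY one byte
    memory[word_addr] = (word, check(word))         # 3. WRITE word + new check

write_byte(0x100, 0, 0xAB)
assert memory[0x100][0] == 0x123456AB
```

A plain (non-ECC) byte write would be a single bus cycle; steps 1 and 3 are the extra traffic that makes the R-M-W cycle "quite a bit longer than just a write cycle".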
charles@hpcvca.CV.HP.COM (Charles Brown) (06/12/90)
>> If the ECC RAM returns the correct value but too late, it is not >> designed correctly. Part of the task of design is to make sure there >> is enough margin. So you have not demonstrated your point. What you >> have demonstrated is that: Poorly designed ECC RAM is sometimes less >> reliable than well designed RAM w/o ECC. So what. > Way off... With modern memory parts everything is quantized and > statistical since the charges being moved are on the order of 20,000 > times the charge of an electron. In theory (but very, very rarely) > sometimes you will get no electrons willing to move and read a one as a > zero. ... [lots of irrelevant details deleted] > Robert I. Eachus If you put ECC into your RAM system, you must use faster RAM chips to achieve the same overall system RAM speed. This is to allow the ECC circuitry time to make any corrections. The effective sample time coming out of the RAM is different for the two systems in order to get the same effective system RAM access time. The reliability of the data coming out of the fast RAM chips at their spec access time will be approximately the same as the reliability of the data coming out of the slower RAM chips at their spec access time. Quantization is irrelevant (and probably not a factor anyway). But all of this arguing is silly. As others have pointed out, the main reliability concern in the Amiga is the software. It really makes little sense to worry about rare problems such as bad RAM data when the software is typically so poor. I assume you agree with this statement. In any case I will discontinue bickering about ECC. It is just not very important. -- Charles Brown charles@cv.hp.com or charles%hpcvca@hplabs.hp.com or hplabs!hpcvca!charles or "Hey you!" Not representing my employer.
pnelson@hobbes.uucp (Phil Nelson) (06/13/90)
Messages from this account are the responsibility of the sender only, and do not represent the opinion or policy of BT Tymnet, except by coincidence, or when explicitly so stated. In article <1710@lpami.wimsey.bc.ca> lphillips@lpami.wimsey.bc.ca (Larry Phillips) writes: >In <3649@tymix.UUCP>, pnelson@hobbes.uucp (Phil Nelson) writes: >> >> You may want to consider that what you have been reading is the opinion of some >>people that memory chips are so reliable that "parity is useless". The facts >>(if we had any) may be otherwise. If the Amiga had parity, it would be easy to >>get good data on the reliability of the memory IN THE BOX (not in some chip >>test lab) and IN THE FIELD (not some clean, quiet final test area). > >Since I am the one that used the words "parity is useless", I think I will say >that you should refrain from placing words in my mouth that were never there. I >did _not_ make that statement because I think that memory is too reliable. >I said it because I see no real use for adding extra memory, at extra cost, >thereby statistically reducing reliability, for the sole purpose of either (a) >informing the user that a parity error has occurred, or (b) crashing the >program or system. If I have misrepresented what you have said, I apologize. That was not my intent. There have been several comments on this matter, and many people (including, I thought, you, Mr. Phillips) have said that modern memory is so reliable that parity is not required. Your comment stuck in my mind. In fact, I had intended to respond to your last post, in which you repeated this assertion. Unfortunately my Amiga has not been well for some time, she crashed while I was writing a reply. A few days later when I got some free time again, your article had expired here. 
I was going to write that you have not answered my post describing a situation (badly designed expansion memory box) where parity might have been useful, but instead kept repeating the gross overgeneralization "parity is useless". Possibly I was not polite enough in my original post; if not, please consider that statements like "parity is useless" invite intemperate responses, especially from people like me that have been repeatedly burned by the poor quality control, poor design, and insufficient diagnostic capability of many different kinds of personal computers. >>These are good points. I think it very likely that the memory system is not >>the greatest cause of unreliability on the Amiga. Certainly not if you >>include software bugs. This does not prove that parity checking is useless, >>but that other measures are needed too. The order in which to take measures >>to improve reliability is not determined exclusively by which is the worst >>problem, it may be reasonable to start with a problem that is not the worst, >>if a solution is easily implemented (memory parity checking, for example). > >In what way do you see parity checking as 'measures to improve reliability'? >I think you are confusing reliability with some other parameter. Parity >checking, if it only informs you of a parity error, does not change the >reliability of a system at all. If it is used to halt a task or a system, it >does, in fact, reduce reliability. No, I AM NOT CONFUSED! I am irritated, frustrated, discouraged, etc. that practically the whole personal computer industry does not seem to grasp the usefulness of discovering problems, both design and process, as early as is practical. I understand perfectly that adding parity reduces the MTTF of a product. How much depends on a lot of things, including what you do with the parity error information. If all a parity error does is light a LED on the front of the box, the MTTF should not be reduced much. 
I am not a fan of crashing the machine on a parity error, unless I can turn
it off.

I see parity as a "measure to improve reliability" in just the same way as a
DVM, a scope, a final test procedure, or any number of other diagnostic tools.
Unlike many, it has the virtue of staying with the machine through its life,
providing (for those few manufacturers who are interested) feedback on how the
design, parts, etc. REALLY perform in the field. It improves reliability for
any user who has a memory problem that is not obviously detectable in other
ways, by allowing earlier detection and repair. It improves confidence, which
is really the same thing for many people, by reducing the probability of
undetected corruption of data.

>
>You might want to ask yourself what the benefits of parity checking are, vs.
>the cost of it.
>
>Benefits:
>
> Information. You know you had a memory error, and have the option of
>rerunning anything that might possibly have been affected by it.
>
> Information. You know that after running any particular program, if you were
>not informed of a parity error, that any errors you may have were caused by
>something else. Note that the lack of a parity error says nothing about the
>accuracy of your results, and that the presence of a parity error likewise
>says nothing about the accuracy of your results.
>

Your second statement is untrue. The presence of parity error detection in
memory will certainly increase the confidence in any data contained in that
memory. Not as much as ECC, of course, but significantly. Confidence is not
absolute, of course. Obviously the fact that I did not have a memory parity
error does not guarantee the data; there are many other places where it might
get garbled. It does increase confidence, though, by reducing the probability
of an undetected memory error. And that most definitely does say something
about the accuracy of my results.
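To make the parity-vs-ECC comparison concrete: a classic single-error-
correcting code (the textbook Hamming(7,4), sketched below in modern Python,
names my own) spends three check bits per four data bits, which is exactly
why it costs so much more than one parity bit per byte, and exactly why it
can repair a flipped bit rather than merely report one.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7);
    check bits sit at the power-of-two positions 1, 2 and 4."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def hamming74_correct(code):
    """XOR together the 1-based positions of all set bits; a nonzero
    result names the single flipped position, which we repair."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos - 1]:
            syndrome ^= pos
    if syndrome:
        c[syndrome - 1] ^= 1
    return c

def hamming74_decode(code):
    """Extract the 4 data bits from positions 3, 5, 6 and 7."""
    return code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)

word = hamming74_encode(0b1011)
damaged = list(word)
damaged[5] ^= 1        # flip one arbitrary bit "in storage"
assert hamming74_decode(hamming74_correct(damaged)) == 0b1011
```

Real ECC memory systems use wider codes over whole words, but the trade is
the same: more redundant bits buy correction, one redundant bit buys only
detection.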
You have not included confidence in the hardware (in this case the memory),
which is my whole point.

What you need to understand is that I don't care about maximum confidence in
the data. If I wanted more confidence in the data, I would be looking for bugs
in the software first. I know when the data cannot be trusted - it cannot be
trusted when my machine is crashing every few days. Even if the crash itself
did not damage data directly, the disorganisation brought on by having to
recover from crashes constantly would. I can deal with a little randomization
of my data; if I couldn't, I certainly would not be using an Amiga. I bet most
other Amiga users can, too. What I and a lot of other actual and potential
Amiga users cannot deal with easily is a flaky machine that cannot be easily
fixed.

What I propose is that we all forget about trying to make each machine
perfect, since we are obviously not close to that, and concentrate on
attaining a reasonable level of reliability. I propose the following test:
every Amiga should be able to run at least one month under normal usage
without crashing. If it can't, the cause of the problem (hardware or software)
must be findable and correctable by a reasonably competent diagnostician
within one week. My estimate of the hardware/software division to most
efficiently approach that goal is 10/90. For hardware, I would start with
memory parity checking, because it is obvious, easy, and quick. For software,
I would start testing programs, to accumulate a database of interaction
problems.

>Costs:
> Parts.
>
> Wasted time/resources. If a parity error occurred in a non-important part of
>memory (including the parity bit memory itself), you have no way of knowing
>that you didn't need to rerun a program. The mere presence of a parity error
>indication tells you nothing but that there was a parity error, but
>encourages users to rerun things, and lulls them when the little light
>doesn't come on.

I really doubt that most users think like this.
I think most users are going to keep running in spite of the error indication,
unless the computer starts crashing. When the machine has crashed for the
fifth time in one day, and they are really starting to get frustrated,
hopefully they are going to start thinking about what that little blinking red
"error" light means. Remembering that most users and many computer dealers
have only a vague idea of how to troubleshoot, consider the difference between
Joe User calling the computer store saying "my computer is crashing" and "what
does it mean when the PERR light keeps blinking?". The latter case is an
obvious trip to the shop; the former can be a months-long odyssey in software
swapping. I can tell you from personal experience that such an odyssey can be
extremely irritating, time consuming, and generally likely to cause people to
make intemperate overgeneralizations about "flakiness".

>> I think the cost of ECC cannot be justified on the Amiga, unless for
>>special applications. The added cost of simple parity checking (not very
>>great) might easily be justified because it would help by allowing the early
>>detection and repair of machines with memory problems. It would be
>>especially useful for machines with flaky, intermittent memory.
>
>The most useful thing for machines with flaky, intermittent memory is a trip
>to the repair shop. Flaky, intermittent memory will show up in other ways,
>without having to add more flaky, intermittent memory.

Possibly you missed my earlier article, where I described the many weeks it
took to convince Pacific Cypress that they did in fact have a hardware
problem. They built the box, they tested the box, yet they assumed (no,
insisted) that the problem was software. To me, it was pretty obvious after
playing with my machine for a while that the problem was hardware; to them,
it was not.
I suppose you could say that they should have known, but we should not be
designing machines to work with people as they should be, but as they are.
Consider also that I had a serial number around 50, so there were quite a few
other people out there having similar problems, yet "no one else has this
problem".

I certainly do not intend to claim that parity is a panacea. What I do claim
is that there is an obvious reliability problem in this whole PC industry, and
that the Amiga is no better than average. Because of the obvious problems, I,
for one, will not be convinced to abandon my advocacy of measures to improve
the reliability of the Amiga, in particular by adding parity error detection,
by the fact that parity cannot guarantee my data, or by statements like
"parity is useless".

>| // Larry Phillips |
>| \X/ lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
>| COMPUSERVE: 76703,4322 -or- 76703.4322@compuserve.com |
--
Phil Nelson . uunet!pyramid!oliveb!tymix!hobbes!pnelson . Voice:408-922-7508
 He who walks with wise men becomes wise, but the companion of fools
 will suffer harm.  -Proverbs 13:20