jackk@shasta.stanford.edu (Jack Kouloheris) (08/04/90)
I'm a bit puzzled by the lack of any type of memory error detection/correction on many workstations and high-end PCs. These workstations are beginning to have memories that rival or exceed those of the previous generation of minicomputers, which almost always used some sort of ECC protection. Do manufacturers feel that it isn't needed any more?

A 1Mbit DRAM chip may have a typical soft error rate of .001-.005 PPM/KPOH/bit. Suppose we have a workstation with 16 Megabytes of memory (= approx 1.34 * 10^8 bits). This yields a memory system error rate of .671 errors/KPOH, a non-negligible number. Servers may have even more memory than this, and may be running continually, so some errors are bound to occur. What happens if a bit flips, and then the data is paged out or written to a file? The error is now permanent and can propagate. Why does no one worry about this?

Some SUNs have parity checking on the memory system, but what does the OS do when a parity error occurs, since correction is not possible?

Jack
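A quick check of the arithmetic in the post above (a minimal sketch; the memory size and per-bit rate are the figures quoted there, taking the pessimistic end of the range):

```python
# Soft-error-rate arithmetic from the post above.
# PPM = parts per million, KPOH = 1000 power-on hours.
bits = 16 * 2**20 * 8             # 16 MB of memory, in bits (~1.34e8)
rate_per_bit = 0.005e-6           # upper end of the .001-.005 PPM/KPOH/bit range
errors_per_kpoh = bits * rate_per_bit
print(round(errors_per_kpoh, 3))  # ~0.671 errors per 1000 power-on hours
```

At the low end of the range (.001 PPM/KPOH/bit) the same machine sees about 0.134 errors/KPOH, so the ".671" figure is a worst-case estimate.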
bobeson@saturn.ucsc.edu (Robert Ellefson) (08/04/90)
The IBM RS/6000 line has full memory ECC. They use 40 bits/word, which gives 7 bits for correction, and 1 unused bit. All busses have 8-bit parity checking. They also 'scrub' the memory, which involves periodically reading and correcting 1-bit errors before they become uncorrectable 2-bit errors. For a good reference on this, see the "RS/6000 Technology" Book. -Bob
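For readers unfamiliar with how ECC corrects single-bit errors and detects double-bit errors, here is a toy illustration. This is NOT the RS/6000's actual code (which covers 32-bit words with 7 check bits); it is the classic SEC-DED construction at nibble scale, a Hamming(7,4) code plus an overall parity bit. A scrub pass reads each word, silently corrects single-bit flips, and flags double flips as uncorrectable.

```python
# Toy SEC-DED (single-error-correct, double-error-detect) code:
# Hamming(7,4) plus an overall parity bit, on 4-bit "words".

def encode(nibble):
    """Encode a 4-bit value as [p0, c1..c7]: Hamming(7,4) + overall parity."""
    d = [(nibble >> i) & 1 for i in range(4)]        # data bits d1..d4
    p1 = d[0] ^ d[1] ^ d[3]                          # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                          # covers codeword positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                          # covers codeword positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]      # codeword positions 1..7
    p0 = 0
    for b in code:                                   # overall parity over c1..c7
        p0 ^= b
    return [p0] + code

def scrub(word):
    """One scrub pass: fix a single flipped bit in place, flag double flips."""
    p0, code = word[0], word[1:]
    syndrome = 0
    for pos, bit in enumerate(code, start=1):        # XOR of positions of 1-bits;
        if bit:                                      # 0 for a valid codeword,
            syndrome ^= pos                          # else the flipped position
    overall = p0
    for b in code:
        overall ^= b                                 # 0 if overall parity consistent
    if syndrome == 0 and overall == 0:
        return "clean"
    if overall == 1:                                 # odd number of flips: just one
        word[syndrome if syndrome else 0] ^= 1       # syndrome 0 means p0 itself
        return "corrected"
    return "uncorrectable double error"              # even flips, nonzero syndrome
```

Periodic scrubbing matters because two independent single-bit errors accumulating in the same word become uncorrectable; reading and rewriting every word on a timer keeps that window small.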
darrylo@hpnmdla.HP.COM (Darryl Okahata) (08/05/90)
In comp.arch, jackk@shasta.stanford.edu (Jack Kouloheris) writes:
> I'm a bit puzzled by the lack of any type of memory error detection/
> correction on many workstations and high-end PCs. These workstations
> are beginning to have memories that rival or exceed those of
> the previous generation of minicomputers, which almost always used
> some sort of ECC protection. Do manufacturers feel that it isn't needed
> any more ?

Just as a data point, ECC memory is available as an option on Hewlett-Packard workstations. For those HP workstations with only parity-checked memory, the system administrator can choose one of three actions upon the occurrence of a parity error:

1. Print a "Parity error" message to the console.

2. Print a "Parity error" message to the console, plus: if in user state, kill the current process (which may not always be the process which caused the error, as with a DMA card) and print an error message to the tty; if in supervisor state, panic with a "parity error" message to the console.

3. Always panic with a "parity error" message to the console.

The last one (#3 above) is the default action (with the other actions, data corruption could occur depending on where the RAM parity error occurred).

-- 
Darryl Okahata
UUCP: {hplabs!, hpcea!, hpfcla!} hpnmd!darrylo
Internet: darrylo%hpnmd@hp-sde.sde.hp.com

DISCLAIMER: this message is the author's personal opinion and does not constitute the support, opinion or policy of Hewlett-Packard or of the little green men that have been following him all day.
henry@zoo.toronto.edu (Henry Spencer) (08/05/90)
In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>I'm a bit puzzled by the lack of any type of memory error detection/
>correction on many workstations and high-end PCs. These workstations
>are beginning to have memories that rival or exceed those of
>the previous generation of minicomputers, which almost always used
>some sort of ECC protection...

DRAM chips have improved a great deal in the last decade. Thank heavens. Speed pressures have also increased a lot, and ECC in particular tends to incur speed penalties. And it's a tempting thing to leave off when timing or board space gets tight. After all, the thing still works...

>Some SUNs have parity checking on the memory system, but what does
>the OS do when a parity error occurs, since correction is not
>possible ?

Depends on the situation. A parity error in a code page is harmless -- just bring in a fresh copy from disk. A parity error in data in an ordinary user program can be dealt with by killing that program. You get into difficulties only when the error hits the kernel or some vital system daemon. If errors are rare enough, parity is adequate.

(Many people -- e.g. the imbeciles who have their kernels kill processes at random when swap space is short -- overlook the fact that some of the daemons are every bit as vital to proper operation as the kernel. Fortunately they're often not all that large, and are less likely to get hit by memory errors than elephantine user programs.)

If you want something to be concerned about, consider that while most PCs have parity, almost all PC software ignores parity errors.
-- 
The 486 is to a modern CPU as a Jules | Henry Spencer at U of Toronto Zoology
Verne reprint is to a modern SF novel. | henry@zoo.toronto.edu utzoo!henry
davec@nucleus.amd.com (Dave Christie) (08/07/90)
In article <1990Aug4.231129.1358@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>>I'm a bit puzzled by the lack of any type of memory error detection/
>>correction on many workstations and high-end PCs. These workstations
>>are beginning to have memories that rival or exceed those of
>>the previous generation of minicomputers, which almost always used
>>some sort of ECC protection...

[some valid points about current DRAM quality and the temptation to not bother with the extra hardware deleted]

>>Some SUNs have parity checking on the memory system, but what does
>>the OS do when a parity error occurs, since correction is not
>>possible ?
>
>Depends on the situation. A parity error in a code page is harmless --
>just bring in a fresh copy from disk. A parity error in data in an
>ordinary user program can be dealt with by killing that program.

Spoken like a true sysadmin :-).

>You get into difficulties only when the error hits the kernel or some vital
>system daemon. If errors are rare enough, parity is adequate.

"Rare enough" is pretty relative - one has to consider the run time of one's programs. (John McCalpin was recently talking of runtimes on the order of months!) And since most cycles are spent running user programs (hopefully!) I think they deserve a little more consideration. But the workstation market is pretty cutthroat and cost/performance is critical - fault-tolerance hardware tends to push that ratio in the wrong direction, so there's some incentive to leave it out.

When comparing current workstations with previous systems, one has to consider that those systems consisted of many more parts, with a lot more interconnections - a significant cause of failure (especially unsoldered ones); today's increased densities have improved this.
And such systems were more often used in enterprise situations, such as maintaining critical company records, rather than by single users. Certain segments of the market certainly do require more fault tolerance than one finds in unix/workstation systems, and if such systems want to penetrate those segments, they are going to have to learn a few lessons from the mainframe hardware and software world. (Gee, I can almost hear some people who think unix on a workstation is the be-all and end-all in computer systems gagging.) And of course it doesn't come for free (I've heard that the fault-tolerance aspects of the 3081/3090 were as big a project as the rest of the system!).

The RS/6000 has been mentioned: ECC on memory, with an extra bit which is used as a last resort to replace a hard failure that can't be scrubbed. This is what one would expect from a company such as IBM - fault tolerance is a way of life for all mainframe/mini manufacturers. And I bet the associated software is the larger part of the work - I wouldn't be overly surprised if it isn't all supported yet.

But all in all, the overall error rate for workstations, relative to the runtimes of most applications people are running, must be satisfactory; it doesn't seem to be a big issue. I know that's true in my environment (uP design) - a few problems now and then, but not enough to push me over the edge and demand better hardware.

---------------------------------
Dave Christie
My opinions only.
All-purpose comp.arch disclaimer: It depends.
cprice@mips.COM (Charlie Price) (08/09/90)
In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>I'm a bit puzzled by the lack of any type of memory error detection/
>correction on many workstations and high-end PCs. These workstations
>are beginning to have memories that rival or exceed those of
>the previous generation of minicomputers, which almost always used
>some sort of ECC protection. Do manufacturers feel that it isn't needed
>any more ?
>A 1Mbit DRAM chip may have a typical soft error rate of
>.001-.005 PPM/KPOH/bit. Suppose we have a workstation with
>16 Megabytes of memory ( = approx 1.34 * 10^8 bits). This
>yields a memory system error rate of .671 errors/KPOH, a non-negligible
>number. Servers may have even more memory than this, and may
>be running continually, so some errors are bound to occur. What
>happens if a bit flips, and then the data is paged out or written to
>a file ? The error is now permanent and can propagate.
>Why does no one worry about this ?
>
>Some SUNs have parity checking on the memory system, but what does
>the OS do when a parity error occurs, since correction is not
>possible ?

The answer seems to be that the user community "votes" for particular performance/reliability/cost configurations with their money, and that is what gets produced. Successful vendors of general-purpose systems build systems that have market-success-defined "acceptable" error rates and sell for an "acceptable" amount of money.

MIPS, for example, produces both systems with parity and with ECC. The "little" machines, the tower-like servers and the workstations, use parity. The tower-like machines have custom memory cards and the workstations use SIMMs. The bigger machines, the M/2000, the RC3260, and the RC6280, all use ECC with 1-bit correction and 2-bit detection on large (9U) custom boards. The caches for all these machines are parity-protected (and with a write-through cache, you just refetch from main memory when you see a cache parity error).
Parity detects most memory errors, at a moderate cost: an extra bit every now and then (typically per byte, but it could be per word) and a fairly simple parity tree to check/generate parity.

ECC is quite a bit more expensive than parity. You need several extra bits per word, which makes SIMMs less easy to use, and you need a more complicated device to generate and check ECC. With a fast memory system you probably have to use multiple ECC chips (or VERY fast ECC chips), since you use multiple memory banks to achieve high-bandwidth memory. This all adds to manufacturing cost, design cost, testing cost, software cost...

Most PCs (including the Macs I've seen) don't have, or at least don't use, parity. They silently accept occasional wrong computations rather than stop a computation that gets a transient memory error. Cost seems to be extremely important for PCs.

For some uses, real workstations among them, the acceptable level of error seems to be "occasionally" having a computation explicitly fail (system panic or process killed) rather than silently producing an erroneous result. Cost in workstations seems to be important for success. Parity is OK for this environment then (at least by demonstration). A server, or a system that needs to support more reliable computation, may include ECC to overcome alpha hits.

Real fault tolerance is yet another topic, and though there are companies that do well in that market, most of us don't want to pay for it.
-- 
Charlie Price cprice@mips.mips.com (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086
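The per-byte parity scheme described above can be sketched in a few lines (a minimal illustration; real hardware computes the same XOR reduction with a parity tree, in parallel, not a loop):

```python
# Per-byte even parity: one extra bit per byte, chosen so that the
# byte plus its parity bit always contain an even number of 1-bits.

def parity_bit(byte):
    """Even-parity bit for an 8-bit value: the XOR of all its bits."""
    p = 0
    while byte:
        p ^= byte & 1
        byte >>= 1
    return p

def check(byte, stored_parity):
    """True if the byte + stored parity bit are consistent (no error seen)."""
    return parity_bit(byte) == stored_parity
```

A single flipped bit always trips the check; two flipped bits in the same byte cancel out, which is why parity can only detect, never correct, and why even numbers of errors slip through entirely.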
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/09/90)
In article <40694@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
| >A 1Mbit DRAM chip may have a typical soft error rate of
| >.001-.005 PPM/KPOH/bit. Suppose we have a workstation with
| >16 Megabytes of memory ( = approx 1.34 * 10^8 bits). This
| >yields a memory system error rate of .671 errors/KPOH, a non-negligible
| >number. Servers may have even more memory than this, and may
| >be running continually, so some errors are bound to occur. What
| >happens if a bit flips, and then the data is paged out or written to
| >a file ? The error is now permanent and can propagate.
| >Why does no one worry about this ?

The answer is that at those error rates the chances of a two-bit error (which would slip past parity checking) are so low that it is not worth worrying about. Not that paranoids like myself don't validate their files with an external 32-bit CRC program on a regular basis.

| Most PCs (including the MACs I've seen) don't have or at least
| don't use parity.
| They silently accept occasional wrong computations rather than
| stop a computation that gets a transient memory error.
| Cost seems to be extremely important for PCs.

The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the clone machines built by other vendors. This provides adequate protection. The Mac and Amiga don't use parity (at least the older ones don't).

The term PC includes both business PCs, with minicomputer features, and machines intended primarily for games and home use, which are built as cheaply as possible for a customer base which doesn't understand or care about data security, and which is highly price conscious.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Stupidity, like virtue, is its own reward" -me
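The "external 32-bit CRC program" habit mentioned above can be sketched as follows (a minimal version using Python's standard zlib.crc32; any CRC-32 implementation would do): compute and record a checksum per file, then recompute and compare later to catch corruption that crept in via memory or I/O.

```python
# File validation with a 32-bit CRC, computed incrementally so that
# large files never have to fit in memory at once.
import zlib

def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file's contents as an unsigned 32-bit value."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            crc = zlib.crc32(block, crc)   # feed the running CRC forward
    return crc & 0xFFFFFFFF
```

Note that a CRC, like parity, only detects corruption; you still need a known-good copy (or ECC) to repair the file.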
henry@zoo.toronto.edu (Henry Spencer) (08/11/90)
In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>| Most PCs (including the MACs I've seen) don't have or at least
>| don't use parity.
> The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>clone machines built by other vendors. This provides adequate
>protection... The term PC includes both business PCs, with minicomputer
>features, and machines intended primarily for games and home use, which
>are built as cheaply as possible...

But, but, but... virtually all MSDOS software *explicitly ignores* parity errors. A friend of mine, working for a clone builder, had an interesting story to tell. They were horrified to discover that their parity circuit didn't work... after a good many of the machines were in the field and functioning fine! It hadn't been caught in the factory because there is no way that software can test the IBM PC parity system, and it hadn't been caught by the customers because all the commercial software just ignored it.

People who think their MSDOS "business PCs" are somehow "protected" against memory errors by the parity hardware are kidding themselves.
-- 
It is not possible to both understand | Henry Spencer at U of Toronto Zoology
and appreciate Intel CPUs. -D.Wolfskill| henry@zoo.toronto.edu utzoo!henry
dhinds@portia.Stanford.EDU (David Hinds) (08/11/90)
In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>> The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>>clone machines built by other vendors. This provides adequate
>>protection... The term PC includes both business PCs, with minicomputer
>>features, and machines intended primarily for games and home use, which
>>are built as cheaply as possible...
>
>But, but, but... virtually all MSDOS software *explicitly ignores*
>parity errors. A friend of mine, working for a clone builder, had
>an interesting story to tell. They were horrified to discover that
>their parity circuit didn't work... after a good many of the machines
>were in the field and functioning fine! It hadn't been caught in
>the factory because there is no way that software can test the IBMPC
>parity system, and it hadn't been caught by the customers because all
>the commercial software just ignored it.

Wait - what do you mean, the parity circuit didn't work? That it couldn't detect parity errors, or what? On the IBM PC, and most clones, I think, a parity error raises a non-maskable interrupt. Under DOS, this is not a recoverable error - i.e., a parity error hangs the system. DOS just prints some dumb message, and stops dead in its tracks. I suppose commercial software could patch the interrupt vector to try to recover from the error, but no one bothers.

As far as I know, yes, there isn't a way for software to tell if the parity system is working, but then wouldn't that be a bit much to expect on a PC?

-David Hinds
dhinds@popserver.stanford.edu
dricejb@drilex.UUCP (Craig Jackson drilex1) (08/12/90)
In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
|In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|>Someone else wrote:
|>| Most PCs (including the MACs I've seen) don't have or at least
|>| don't use parity.
|> The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
|>clone machines built by other vendors. This provides adequate
|>protection... The term PC includes both business PCs, with minicomputer
|>features, and machines intended primarily for games and home use, which
|>are built as cheaply as possible...
|
|But, but, but... virtually all MSDOS software *explicitly ignores*
|parity errors. A friend of mine, working for a clone builder, had
|an interesting story to tell. They were horrified to discover that
|their parity circuit didn't work... after a good many of the machines
|were in the field and functioning fine! It hadn't been caught in
|the factory because there is no way that software can test the IBMPC
|parity system, and it hadn't been caught by the customers because all
|the commercial software just ignored it.

While this may be a good story, I've never truly heard of software routinely disabling the parity check, or the NMIs it reports. Although I have not been associated with any mainstream applications, I know that nothing my company has delivered disables NMIs. There's really no reason to -- the 16k, 64k, and 256k chips used in most PCs just don't have that many errors. Lots of people will report that they have seen a parity error message from a PC, but only rarely.

Parity in "personal" computers was one of the innovations of IBM -- their corporate standards required it. Up until the PC came out, hardly any of the computers sold as "personal" computers (Apples, CP/M boxes) had parity. I'm not sure if even the contemporary Unix boxes (Onyxs) did.
The PCjr was the first computer IBM ever shipped without parity -- I'm sure that the angst nearly killed somebody.

|People who think their MSDOS "business PCs" are somehow "protected"
|against memory errors by the parity hardware are kidding themselves.

Admittedly, modern computer users (both businesspersons and engineers) rarely view their hardware with the skepticism that it deserves... They haven't lived through the era of "If you don't like the answers, run it again. They might change." (CDC 6400, circa 1976)

|It is not possible to both understand | Henry Spencer at U of Toronto Zoology
|and appreciate Intel CPUs. -D.Wolfskill| henry@zoo.toronto.edu utzoo!henry

On this, I agree with Henry. Anybody who claims to appreciate the 80x8x line of Intel CPUs needs education, medical attention, or both.
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}
henry@zoo.toronto.edu (Henry Spencer) (08/12/90)
In article <1990Aug10.223619.6223@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:
>>...They were horrified to discover that
>>their parity circuit didn't work...
> Wait - what do you mean, the parity circuit didn't work? That it
>couldn't detect parity errors, or what?

He wasn't specific, but the implication was that under at least some circumstances it falsely reported errors.

>On the IBM PC, and most clones,
>I think, a parity error raises a non-maskable interrupt. Under DOS, this
>is not a recoverable error - i.e., a parity error hangs the system. DOS
>just prints some dumb message, and stops dead in its tracks. I suppose
>commercial software could patch the interrupt vector to try to recover
>from the error, but no one bothers.

What he said was that, based on their experience, *everyone* bothers -- or did at the time (this wasn't recent) -- and the "recovery" consisted of ignoring the error completely. Since I avoid Intel processors :-), I can't confirm or deny this myself.

>As far as I know, yes, there isn't a
>way for software to tell if the parity system is working, but then
>wouldn't that be a bit much to expect on a PC?

Had the Japanese designed it, you can bet it would have been testable. (Another input to the parity encoder, controlled by software or even a DIP switch, would suffice.) The only way you can improve quality is if you can measure it.
-- 
It is not possible to both understand | Henry Spencer at U of Toronto Zoology
and appreciate Intel CPUs. -D.Wolfskill| henry@zoo.toronto.edu utzoo!henry
landon@Apple.COM (Landon Dyer) (08/12/90)
>Depends on the situation. A parity error in a code page is harmless --
>just bring in a fresh copy from disk.

Assuming it's fresh. Nearly all of the I/O systems I've seen on small computers lack end-to-end parity or ECC. For instance, SCSI data and commands are subject to mangling by poor termination, bad connections, transients, and firmware or hardware failure (e.g. an insane controller that wiggles a bus line at random). This (and, to be fair, other real-world catastrophes including scrambled file systems, flaky packet routers, media decay, and buggy drivers) is what causes some application writers to put checksum fields in their document formats.

Q: Let's get this straight. The data on _disk_ is checksummed within an inch of its life. The data in _memory_ is ECC'd and can't be harmed. But going from disk to memory, the data is, ah, er ...

A: Let's see what the standard sez ... [flip, flip] ... "Naked in the breeze?"
-- 
Landon Dyer (landon@apple.com) | making the merry-go-round SPIN FASTER
Apple Computer, Inc.           | so that everyone has to HOLD ON TIGHTER
NOT THE VIEWS OF APPLE COMPUTER| just to keep from being THROWN TO THE WOLVES
peter@stca77.stc.oz (Peter Jeremy) (08/13/90)
In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>clone machines built by other vendors.

Although the hardware is there, it relies on software to do anything useful, and the software generally isn't there. I have also heard (rumour time) that at least one PC has a design fault in its parity circuitry. This doesn't appear to have hindered that machine at all.

> The Mac and Amiga don't use parity (at least the older ones
>don't).

None of the A500, A1000, A2000 or A3000 have provision for parity for built-in memory. There is nothing to stop you using parity on additional RAM (you would also need to add software to handle the errors sanely).

As an additional comment: I use a Motorola Delta 1147 clone. It's basically a single-board 68030 with 8MB RAM. The RAM includes parity, but the checking is switchable. Apparently the parity checking is slow, so you have a choice of stretching the memory cycles by 1 clock to get the error reported correctly, or having the parity error reported on the following cycle. (You can also disable it totally.) I run it with delayed parity (which means any parity error causes a PANIC) and haven't had any parity errors in 18 months of operation.
-- 
Peter Jeremy (VK2PJ) peter@stca77.stc.oz.AU
Alcatel STC Australia ...!uunet!stca77.stc.oz!peter
240 Wyndham St peter%stca77.stc.oz@uunet.UU.NET
ALEXANDRIA NSW 2015
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/13/90)
In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
| But, but, but... virtually all MSDOS software *explicitly ignores*
| parity errors.

Please cite any software (at least from the top 20 best-seller list) which does this. We have 800 or so PCs here running MS-DOS, PC-DOS and at least four flavors of UNIX, and every one of them seems to see parity errors, although most just stop dead when they do. Given an error rate of 4-5 cases a year in that many systems, I think that's a better thing to do than produce wrong answers. Ever.

I have had people tell me that the Mac had better hardware because "they don't get those stupid parity errors," but I don't even try to explain, I just give their names to headhunters.

| People who think their MSDOS "business PCs" are somehow "protected"
| against memory errors by the parity hardware are kidding themselves.
| --
| It is not possible to both understand | Henry Spencer at U of Toronto Zoology
| and appreciate Intel CPUs. -D.Wolfskill| henry@zoo.toronto.edu utzoo!henry

Note from the sig that Henry makes no pretension of being unbiased in this. The PC uses an Intel processor, so if the hardware can't be faulted for not having parity, the software design must be corrupted by being run on a CPU made by Intel.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"This is your PC. This is your PC on OS/2. Any questions?"
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/14/90)
In article <1016@stca77.stc.oz> peter@stca77.stc.oz (Peter Jeremy) writes:
| I run it with delayed parity (which means any parity error causes a PANIC)
| and haven't had any parity errors in 18 months operation.

I concluded some time ago that, with memory as reliable as it is and the cost of an undetected parity error as high as it could be, while a panic is not the *best* way to handle parity errors, it is more acceptable than ignoring them.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"This is your PC. This is your PC on OS/2. Any questions?"
henry@zoo.toronto.edu (Henry Spencer) (08/15/90)
In article <2421@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >| But, but, but... virtually all MSDOS software *explicitly ignores* >| parity errors. > > Please cite any software (at least from the top 20 best seller list) >which does this... As I thought I made clear, my information is secondhand. As I probably should have made clearer, it is also a bit old. If the situation has changed, I am (a) pleased, and (b) surprised. :-) -- It is not possible to both understand | Henry Spencer at U of Toronto Zoology and appreciate Intel CPUs. -D.Wolfskill| henry@zoo.toronto.edu utzoo!henry
eli@aspasia.gang.umass.edu (Eli Brandt) (08/15/90)
In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>>| Most PCs (including the MACs I've seen) don't have or at least
>>| don't use parity.
>> The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>>clone machines built by other vendors. This provides adequate
>>protection... The term PC includes both business PCs, with minicomputer
>>features, and machines intended primarily for games and home use, which
>>are built as cheaply as possible...
>
>But, but, but... virtually all MSDOS software *explicitly ignores*
>parity errors. A friend of mine, working for a clone builder, had
>an interesting story to tell. They were horrified to discover that
>their parity circuit didn't work... after a good many of the machines
>were in the field and functioning fine! It hadn't been caught in
>the factory because there is no way that software can test the IBMPC
>parity system, and it hadn't been caught by the customers because all
>the commercial software just ignored it.
>
>People who think their MSDOS "business PCs" are somehow "protected"
>against memory errors by the parity hardware are kidding themselves.

Almost all PC hardware that I know of detects parity errors and handles them - well, "handles" them by crashing with a "Parity error" message. Better than a corrupted filesystem. The one exception that I know of is that some laptops leave off parity checking to save *weight*, of all things. How much can 11% of your DRAM weigh? It's possible that some fly-by-night cloners leave off parity checking, but I've never heard of any machines that do this.

I can personally testify that PS/2's, at least, know about parity errors. I was playing around with the DRAM refresh rate and managed to get parity errors quite definitively. A parity error triggers an NMI, which the 8086 family takes through interrupt vector 2. I don't know of any commercial software that turns off parity checking (by trapping the interrupt, presumably). Can you name any?
davecb@yunexus.YorkU.CA (David Collier-Brown) (08/15/90)
henry@zoo.toronto.edu (Henry Spencer) writes:
>>| But, but, but... virtually all MSDOS software *explicitly ignores*
>>| parity errors.

In article <2421@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>> Please cite any software (at least from the top 20 best seller list)
>>which does this...

How about Lotus 1-2-3? The **newest** version may or may not do so, if it is not required to run on 8088-series processors, but older versions certainly did. One of its competitors, that I worked on, had to block the parity error logic in order to run reliably. It would have been unacceptable to merely crash every time we had a parity error, since our competitors didn't! We also usually got them hourly on our XTs (if memory serves), Thud and Nermal. Maybe on some other machines, too... I don't remember the ATs having the problem.

Strangely enough, this blocking seemed to have no effect on the program: a recalculation after a parity error yielded the same answers as before. Subsequently it was suggested that the PC and XT chip addressing logic was having addressing errors, which were reported erroneously as parity errors. So maybe we never had parity errors at all (:-)).

--dave
-- 
David Collier-Brown, | davecb@Nexus.YorkU.CA, ...!yunexus!davecb or
72 Abitibi Ave.,     | {toronto area...}lethe!dave
Willowdale, Ontario, | "And the next 8 man-months came up like
CANADA. 416-223-8968 | thunder across the bay" --david kipling
seanf@sco.COM (Sean Fagan) (08/19/90)
In article <2421@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > I have had people tell me that the Mac had better hardware because >"they don't get those stupid parity errors," but I don't even try to >explain, I just give their names to headhunters. Wonderful stuff, PC parity. The two times I've gotten parity errors (that didn't clear up on a reboot), the chip I've had to replace was the parity chip. Yep. I'm so glad it's there. -- Sean Eric Fagan | "let's face it, finding yourself dead is one seanf@sco.COM | of life's more difficult moments." uunet!sco!seanf | -- Mark Leeper, reviewing _Ghost_ (408) 458-1422 | Any opinions expressed are my own, not my employers'.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/20/90)
In article <1990Aug18.210132.25203@sco.COM> seanf@sco.COM (Sean Fagan) writes:
| The two times I've gotten parity errors (that didn't clear up on a reboot),
| the chip I've had to replace was the parity chip.

Yes, and I got called for jury duty for the tenth time last month. Isn't it wonderful how unlikely bad things happen so much more often than unlikely good things? If I could get stuck in an elevator with a beautiful woman as often as I get behind someone in the "cash only" checkout who is trying to pay with an out-of-state third-party check, I'd be content.

In spite of all that, I'd rather have parity checking, because I have had real genuine errors in the data memory, and I want to know about it when it happens.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"This is your PC. This is your PC on OS/2. Any questions?"
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (08/20/90)
In article <14623@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes: >In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >|But, but, but... virtually all MSDOS software *explicitly ignores* >|parity errors. A friend of mine, working for a clone builder, had >|an interesting story to tell. They were horrified to discover that >|their parity circuit didn't work... after a good many of the machines >|were in the field and functioning fine! ... >While this may be a good story, I've never truly heard of software routinely >disabling the parity check, or the NMIs it reports. Ignoring interrupts used to be the norm back when polled I/O was common. Many micros ran with interrupts disabled because they could interfere with the activities of some "dumb" floppy disk controllers, etc., which depended on timing of memory accesses (e.g., the old NorthStar floppy disk controller and their BCD floating point board). ... >Parity in "personal" computers was one of the innovations of IBM--their >corporate standards required it. Up until the PC came out, hardly any >of the computers sold as "personal" computers (Apples, CP/M boxes) had >parity. I'm not sure if even the contemporary Unix boxes (Onyxs) did. >The PCjr was the first computer IBM ever shipped without parity--I'm sure >that the angst nearly killed somebody. Not so. Lots of CP/M machines had memory boards with byte parity long before the IBM PC. Note that I'm not saying people used it -- in fact, I vaguely recall at least one board which had sockets for parity RAM, but standardly came with that portion of the board unpopulated. Of course, one could argue that before the IBM PC, "hardly any" microprocessor-based computers of any kind were sold. ;-) BTW, none of the old machines I've played with has ever had a parity error (i.e., bad RAM chip), although I've seen a fair number in newer machines. 
Remember the days when companies used to actually test machines *BEFORE* shipping them...? ;-) -hankd@ecn.purdue.edu
md89mch@cc.brunel.ac.uk (Martin Howe) (08/20/90)
In article <14623@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes: >On this, I agree with Henry. Anybody who claims to appreciate the 80x8x >line of Intel CPUs needs education, medical attention, or both. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ OK, I'll take them, as long as YOU're paying ! (Never seen my .sig before ? (the cat speech)) -- - /| . . JCXZ ! MOVSB ! SGDT ! iAPX ! | "Good morning Citizens. I would \`O.O' . Martin Howe, Microelectronics| remind you that Armed Robbery ={___}= System Design MSc, Brunel U. | is illegal in Megacity One." - JD ` U ' Any unattributed opinions are mine -- Brunel U. can't afford them.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/22/90)
In article <1990Aug20.151438.27121@ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes: | BTW, none of old machines I've played with has ever had a parity error | (i.e., bad RAM chip), although I've seen a fair number in newer | machines. Remember the days when companies used to actually test | machines *BEFORE* shipping them...? ;-) You are either really lucky or had top quality machines not available to us mortals. We used to run memory test as the idle daemon in S100 systems, and right after boot and fully warm with our Intellec systems. I wrote my own memory test for the Z80, to force the M1 fetch into every byte, so I could try the worst case timing. I haven't had a parity error in a memory chip with more than four hours burn-in on any of my systems in quite a while, but I still have a ziplock full of 1702's from the "old days." Note I don't call them good. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/24/90)
In article <1990Aug20.151438.27121@ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes: >Ignoring interrupts used to be the norm back when polled I/O was >common. Many micros ran with interrupts disabled because they could >interfere with the activities of some "dumb" floppy disk controllers, >etc., which depended on timing of memory accesses (e.g., the old >NorthStar floppy disk controller and their BCD floating point board). When a feature is unused, it often doesn't actually work. At one time, Unix was the toughest diagnostic for PDP-11 MMU's... When my company built one of the first IBM PC clones, we had mysterious software crashes. It turned out that no one else was sending non-maskable interrupts (NMIs) to their 8088s. So, we got to be the people who noticed the NMI hardware bug. Recall that the 8088 has prefix instructions, which change the addressing of the following instruction. An NMI could be honored between the two, but the interrupt return would "forget" the prefixing. While we're on the subject, RISC machines with branch delay slots have a similar problem. Of course, the easy instruction decoding means that they can push some of the work into the interrupt handlers. Does anyone want to describe how their favorite machine did this? -- Don D.C.Lindsay
cprice@mips.COM (Charlie Price) (08/24/90)
In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: >... So, we got >to be the people who noticed the NMI hardware bug. Recall that the >8088 has prefix instructions, which change the addressing of the >following instruction. An NMI could be honored between the two, but >the interrupt return would "forget" the prefixing. > >While we're on the subject, RISC machines with branch delay slots >have a similar problem. Of course, the easy instruction decoding >means that they can push some of the work into the interrupt >handlers. Does anyone want to describe how their favorite machine >did this? Here is an answer for the MIPS R2000, R3000, and R6000. An exception causes a trap to kernel code and loads a couple registers: EPC - Exception Program Counter - the address at which execution should resume. Cause - various bits of information about the cause of the exception. Normally, the EPC points at the instruction that caused the exception or, in the case of interrupts, that was about to be fetched. If an exception occurs during execution of the instruction in a branch delay slot or "between" a branch and the instruction in the branch-delay slot, the Cause register has the Branch Delay (BD) bit set and the EPC register contains the address of the branch instruction. For interrupts, you don't generally care about this and no special processing is required. Only if you have to examine the instruction that caused the fault do you have to decide whether to look at the instruction pointed at by the EPC or the next instruction. This isn't *quite* the whole story. TLB misses, for instance, are special and have other hardware support so you don't have to look at the instruction to figure out the address that missed in the TLB. If the kernel has to emulate the instruction in the branch delay slot (weird FP stuff for instance) then it can't re-execute the branch instruction and will need to emulate it as well. 
This is a rare case, so the performance is not a problem. -- Charlie Price cprice@mips.mips.com (408) 720-1700 MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
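The EPC/BD logic Charlie describes reduces to a little arithmetic in the trap handler. A minimal sketch in C (the BD bit position matches the R3000 Cause register; the function names are illustrative, not from any real kernel):

```c
#include <stdint.h>

#define CAUSE_BD (1u << 31)   /* Branch Delay bit in the Cause register */

/* Address of the instruction that actually faulted: if BD is set,
   EPC points at the branch and the faulting instruction sits in the
   delay slot, one word (4 bytes) later. */
uint32_t faulting_instruction(uint32_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4 : epc;
}

/* Address at which to resume: always EPC itself.  Re-executing the
   branch re-executes the delay slot too, so no extra pipeline state
   needs to be saved or restored. */
uint32_t resume_pc(uint32_t epc, uint32_t cause)
{
    (void)cause;
    return epc;
}
```

For plain interrupts only resume_pc() matters; faulting_instruction() comes into play only when the handler must examine or emulate the instruction, exactly as Charlie notes.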
jkenton@pinocchio.encore.com (Jeff Kenton) (08/24/90)
From article <10307@pt.cs.cmu.edu>, by lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay): > > An NMI could be honored between the two, but > the interrupt return would "forget" the prefixing. > > While we're on the subject, RISC machines with branch delay slots > have a similar problem. Of course, the easy instruction decoding > means that they can push some of the work into the interrupt > handlers. Does anyone want to describe how their favorite machine > did this? On the 88000 (my current favorite machine) the instruction pipeline has three stages -- XIP, NIP, FIP. These tell you where you've been, where you are and where you're going. Restoring the proper values gets you back exactly where you are supposed to be. Not really trouble. The only problem to beware of is single stepping or other debugging, where you may be looking at a program interrupted at a delay slot. In this case "go (or step) from the pc" can be ambiguous. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - jeff kenton --- temporarily at jkenton@pinocchio.encore.com --- always at (617) 894-4508 --- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/25/90)
In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: | Recall that the | 8088 has prefix instructions, which change the addressing of the | following instruction. An NMI could be honored between the two, but | the interrupt return would "forget" the prefixing. It could have been worse, it could have remembered it. Then when servicing the interrupt the normal fetch from the CS:PC would have fetched an instruction byte from somewhere else. I would think disallowing interrupts after prefix is the best way to solve it, rather than try to hack things to save the state after the chip was designed. Was this problem only with NMI? I've run a lot of interrupts into an original XT and never seen a problem. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
colin@array.UUCP (Colin Plumb) (08/25/90)
In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: > While we're on the subject, RISC machines with branch delay slots > have a similar problem. Of course, the easy instruction decoding > means that they can push some of the work into the interrupt > handlers. Does anyone want to describe how their favorite machine > did this? The Am29000 just has two program counters, current and next. (it provides a previous, as well, but doesn't need it). Most faults leave the processor ready to continue with the next instruction; if you want to retry instead of emulating (e.g. data TLB miss), you have to back up the PC's a cycle. Exceptions: instruction-fetch errors (TLB miss, protection violation, etc.), illegal opcode (as opposed to software traps) and protection violation (a supervisor-only instruction). These retry the current instruction. I don't think there's any deep reason, it was just easier to do that way because it's detected during decode. To skip one of these instructions, you just point the current PC at a NOP somewhere and leave the next PC alone. -- -Colin
tim@proton.amd.com (Tim Olson) (08/25/90)
| In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: | >... So, we got | >to be the people who noticed the NMI hardware bug. Recall that the | >8088 has prefix instructions, which change the addressing of the | >following instruction. An NMI could be honored between the two, but | >the interrupt return would "forget" the prefixing. | > | >While we're on the subject, RISC machines with branch delay slots | >have a similar problem. Of course, the easy instruction decoding | >means that they can push some of the work into the interrupt | >handlers. Does anyone want to describe how their favorite machine | >did this? The Am29000 simply uses 2 PC buffer registers to hold the return address(es). Normally these are sequential, but if an interrupt or trap occurs between a branch and its delay slot, then PC1 points to the delay instruction and PC0 points to the branch target. The interrupt-return (IRET) instruction uses both these addresses (if required) to restart the instruction stream correctly. In article <41066@mips.mips.COM> cprice@mips.COM (Charlie Price) writes: | Here is an answer for the MIPS R2000, R3000, and R6000. | | An exception causes a trap to kernel code and loads a couple registers: | EPC - Exception Program Counter - the address at which execution | should resume. | Cause - various bits of information about the cause of the exception. | Normally, the EPC points at the instruction that caused the exception | or, in the case of interrupts, that was about to be fetched. | If an exception occurs during execution of the instruction | in a branch delay slot or "between" a branch and the | instruction in the branch-delay slot, | the Cause register has the Branch Delay (BD) bit set and the | EPC register contains the address of the branch instruction. 
Just curious -- what happens in the perverse case that someone tries a conditional branch-and-link instruction using the link register as a conditional source, i.e.: <r31 contains -1> bltzal r31, label <- interrupt here <delay operation> Does the link portion of the branch-and-link take place anyway, destroying the conditional information and preventing the branch from being restartable correctly? -- Tim Olson Advanced Micro Devices (tim@amd.com)
rwallace@vax1.tcd.ie (08/26/90)
In article <2434@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes: > In spite of all that, I'd rather have parity checking, because I have > had real genuine errors in the data memory, and I want to know about it > when it happens. If the operating system just told you about it when there was a parity error I'd agree with you, something like flashing up a message on the screen: "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ". However automatically crashing the computer is NOT acceptable behaviour: I'd much rather do without the parity checking. Consider: suppose a parity error occurs on a 640K machine. The error is probably in either an unused area of memory (e.g. at the DOS prompt) or in a section of code that isn't going to be executed on this session with the program (e.g. error handling code). (I'm talking only about transient errors here: the boot time memory check will get other kinds). So ignoring the parity error will probably have no effect. If it's in a section of code that will be executed, the machine will just crash which is what would have happened anyway. OK take the very unlikely case that it is in your data. For me that means in the source for a program I'm writing. This is no problem, I can just fix the one trashed character when the compiler barfs on the code. Much better than having the machine crash and lose several minutes' work. Or say the error is in a floating-point number in a spreadsheet. Chances are the program will crash with a floating-point error or at least produce obviously wrong results e.g. profit for 1989 was $-32198742.88888. The point is that ignoring a parity error is a pretty safe thing to do; there's very little chance of getting a misleading answer. Much better than crashing the computer, which is guaranteed to lose you whatever you had in memory. (Suppose you have a parity error while running a Speed Disk program: kiss your hard disk goodbye. 
Let's see, when did I do my last full backup?). So the PC parity protection is worse than useless. "To summarize the summary of the summary: people are a problem" Russell Wallace, Trinity College, Dublin rwallace@vax1.tcd.ie
jfc@athena.mit.edu (John F Carr) (08/27/90)
In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: >While we're on the subject, RISC machines with branch delay slots >have a similar problem. Of course, the easy instruction decoding >means that they can push some of the work into the interrupt >handlers. Does anyone want to describe how their favorite machine >did this? The IBM RT doesn't allow interrupts between a branch-and-execute and the following instruction. This seems to me the best solution (doesn't require any special logic in the interrupt handling software or in the hardware to restart after an interrupt). The new IBM RISC machine avoids problems with branch delay and interrupts by not having delay slots. Instead, instruction prefetch follows branches. I think the 68040 also does this. -- --John Carr (jfc@athena.mit.edu)
don@zl2tnm.gp.govt.nz (Don Stokes) (08/28/90)
rwallace@vax1.tcd.ie writes: > If the operating system just told you about it when there was a parity error > I'd agree with you, something like flashing up a message on the screen: > "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ". > However automatically crashing the computer is NOT acceptable behaviour: I'd > much rather do without the parity checking. Consider: suppose a parity error > occurs on a 640K machine. The error is probably in either an unused area of > memory (e.g. at the DOS prompt) or in a section of code that isn't going to be > executed on this session with the program (e.g. error handling code). Since when was PC memory not used to the last bit? Every PC I have ever had to deal with suffered severe "ram cram" (why, oh why do people insist on treating '386s as 8086s, with their measly memory maps?). A memory mix of 100K of system software, 200K of application and 300K of data isn't unreasonable or atypical. > OK take the very unlikely case that > it is in your data. Unlikely? In the mix I give, I make it about 50%. "Unlikely" isn't how I'd describe it. > For me that means in the source for a program I'm writing Ah. Now we reveal true colours. I hate to bring the real world crashing about your ears, but the *vast* majority of PC users are *not* programmers. You take an extremely self-centered view. > This is no problem, I can just fix the one trashed character when the compiler > barfs on the code. Even as a programmer, do you not fear unexpected bugs creeping into your code due to unexpected errors? I recall back in the dim darks in my Apple ][ days (no flames please, I was receiving good money for this), I ran into a problem where well tested code simply stopped working properly; it didn't give errors, just wrong answers (not nice when the incorrect answers are cheque totals to go into a general ledger). 
It turned out that there was a bug in the Apple-supplied BASIC line renumbering program, which would result in constants occasionally being "renumbered" as well as GOTO/GOSUB targets. It cost me over half a day to find the problem, fix it and fully verify that the problem had indeed been fixed. It would have been Real Nice if it had happened just before implementation, wouldn't it? Compilers often do not barf on single bit errors. Single bit errors are the difference between a '+' and '*', '.' and '/', 0 and 1, 'x' and 'y' (don't try to tell me you don't use single letter temporary variables!). The PC's I have to deal with are used for typesetting work as well as more "normal" applications such as running Lotus, word processing etc (although most users use the spreadsheet available on the VAX). > minutes' work. Or say the error is in a floating-point number in a spreadsheet. > Chances are the program will crash with a floating-point error or at least > produce obviously wrong results e.g. profit for 1989 was $-32198742.88888. ...or plus or minus $65,536, which in a ~$200,000 bottom line, is easily missed, but could represent the difference between a profit and a loss; not to mention the poor bean-counter's job when the mistake is discovered. > (Suppose you have a parity error while running a Speed Disk program: kiss your > hard disk goodbye. Let's see, when did I do my last full backup?). So the PC > parity protection is worse than useless. Implementation detail: Norton's Speed Disk writes the directory entry only after moving files; until the file pointers are finally changed, they still point to the old, valid copies of data. The machine crashes -- so what? Nothing has been lost. Now imagine an undetected single bit error while moving the BACKUP program, that causes backups to be soundlessly trashed. Now imagine the inevitable hard disk crash. Now imagine the trashed backups being your business records; your livelihood. 
Don Stokes, ZL2TNM / / Home: don@zl2tnm.gp.govt.nz Systems Programmer /GP/ Government Printing Office Work: don@gp.govt.nz __________________/ /__Wellington, New Zealand_____or:_PSI%(5301)47000028::DON
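Don's pairs check out against the ASCII table: each differs in exactly one bit, so a single flipped bit really does turn one legal token into another. A quick C sketch (the helper name is mine, purely for illustration):

```c
/* Hamming distance between two byte values: the number of bit
   positions in which they differ. */
int bit_distance(unsigned char a, unsigned char b)
{
    unsigned char x = a ^ b;   /* 1s mark the differing bits */
    int n = 0;
    while (x) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}
```

'+' (0x2B) vs '*' (0x2A), '.' (0x2E) vs '/' (0x2F), '0' (0x30) vs '1' (0x31), and 'x' (0x78) vs 'y' (0x79) all come out to distance 1.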
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/29/90)
In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes: | Consider: suppose a parity error | occurs on a 640K machine. The error is probably in either an unused area of | memory (e.g. at the DOS prompt) or in a section of code that isn't going to be | executed on this session with the program (e.g. error handling code). (I'm | talking only about transient errors here: the boot time memory check will get | other kinds). So ignoring the parity error will probably have no effect. You may be the only DOS user on the planet who has any bits unused. | OK take the very unlikely case that | it is in your data. For me that means in the source for a program I'm writing. | This is no problem, I can just fix the one trashed character when the compiler | barfs on the code. One bit changes 0 to 1, + to *, etc. Your compiler catches this? | Much better than having the machine crash and lose several | minutes' work. Or say the error is in a floating-point number in a spreadsheet. | Chances are the program will crash with a floating-point error or at least | produce obviously wrong results e.g. profit for 1989 was $-32198742.88888. | | The point is that ignoring a parity error is a pretty safe thing to do; there's | very little chance of getting a misleading answer. A little thought should show you the error of that thought. Over half the word is dedicated to less significant bits of the mantissa. An error in any of those bits will result in a percent or less change. | Much better than crashing | the computer, which is guaranteed to lose you whatever you had in memory. Exactly. No answer is better than a wrong answer. What would anyone bother to run on a computer which is so valueless that they don't care if they get a right answer, just so that you get an answer? If that's the case, why not make one up? | (Suppose you have a parity error while running a Speed Disk program: kiss your | hard disk goodbye. Let's see, when did I do my last full backup?). 
So the PC | parity protection is worse than useless. In the first place, anyone who runs a program like that without running a backup first is really careless with their data. To quote my old sig "stupidity, like virtue, is its own reward." And if you have an error and *don't* catch it, you can blindly read in all the data on the disk, corrupt it, and rewrite it wrong. Is this better? A good disk packer will only lose a portion of the data on a crash, and it will be gone, not corrupted. An ignored parity error will corrupt the data. I have worked with systems which didn't have parity, and I hated it. I don't have anything on my system I don't need, so I care about all of it. While my data doesn't represent a threat to someone's life if it's wrong, it could be a threat to my income. That's important enough for me. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
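Bill's point about the mantissa generalizes: in a floating-point word, most bits are low-order mantissa bits whose corruption changes the value by a fraction of a percent, while one flipped exponent bit changes it by orders of magnitude. A sketch using IEEE 754 doubles (a newer format than some of the hardware under discussion, but the structure of the argument is identical; the helper is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Flip one bit of an IEEE 754 double.  Bit 0 is the least
   significant mantissa bit; bits 52-62 hold the exponent. */
double flip_bit(double d, int bit)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);   /* type-pun safely via memcpy */
    u ^= (uint64_t)1 << bit;
    memcpy(&d, &u, sizeof d);
    return d;
}
```

Flipping bit 0 of 1.0 gives 1.0000000000000002, invisible in almost any printout; flipping the lowest exponent bit (bit 52) of 1.0 gives 0.5, a wrong answer that looks perfectly plausible.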
antony@george.lbl.gov (Antony A. Courtney) (08/29/90)
In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes: >In article <2434@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes: > >However automatically crashing the computer is NOT acceptable behaviour: I'd >much rather do without the parity checking. Consider: suppose a parity error >occurs on a 640K machine. The error is probably in either an unused area of >memory (e.g. at the DOS prompt) or in a section of code that isn't going to be >executed on this session with the program (e.g. error handling code). > ... >OK take the very unlikely case that >it is in your data. For me that means in the source for a program I'm writing. >This is no problem, I can just fix the one trashed character when the compiler >barfs on the code. Much better than having the machine crash and lose several >minutes' work. Well, since you seem to be using the example of a compiler environment, I notice you like to claim that the error will be "easy to catch" regardless of where it occurred. People have already pointed out loads of examples where this can cause problems, but seem to have skipped the obvious one: Stray pointer problems. Suppose it DOES corrupt your data for your compiler. It could change the address of some pointer to god-knows-where. Worse yet, it might only change one of the more low-order bits, so that it only causes rare, spurious problems for you. (i.e. it still writes into data space, just into the WRONG chunk of data space). Worse YET, suppose the error occurs not in just the execution or test phase, but in the actual COMPILE phase, so that the address of some static pointer gets corrupted BEFORE the new binary gets written out to disk. That would be loads of fun to try and track down! :-) Stray pointer problems are hard enough to track when __I__ am at fault, I don't want to even THINK what it would be like to try and find them when the hardware is flakey and doesn't tell me about it! 
:-) Obviously written by someone who hasn't had to do enough debugging.... >The point is that ignoring a parity error is a pretty safe thing to do; there's >very little chance of getting a misleading answer. main() { static int g[4]={1,2,3,4}; printf("%d %d %d %d\n",g[0],g[1],g[2],g[3]); } output: segmentation fault(core dumped) I dunno about you, but I wouldn't exactly call this easy to catch... >Russell Wallace, Trinity College, Dublin >rwallace@vax1.tcd.ie ~antony -- ******************************************************************************* Antony A. Courtney antony@george.lbl.gov Advanced Development Group ucbvax!csam.lbl.gov!antony Lawrence Berkeley Laboratory (415) 486-6692
cprice@mips.COM (Charlie Price) (08/29/90)
In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes: > >If the operating system just told you about it when there was a parity error ... >The error is probably in either an unused area of >memory (e.g. at the DOS prompt) or in a section of code that isn't going to be >executed on this session with the program (e.g. error handling code). (I'm >talking only about transient errors here: the boot time memory check will get >other kinds). So ignoring the parity error will probably have no effect. If >it's in a section of code that will be executed, the machine will just crash >which is what would have happened anyway. [some stuff deleted] >The point is that ignoring a parity error is a pretty safe thing to do; there's >very little chance of getting a misleading answer. I sense some confusion. A boot-time memory check might detect a permanent error, and this seems to be what you are talking about, but this isn't what parity is mostly for. Mostly, parity is to detect transient errors caused by alpha particles (or some such). The memory chip doesn't have a permanent problem, it just forgot the value of a bit. Parity is computed and written for every store operation. For every data item fetched (data or instruction) the parity of the data bits is computed and compared to the parity bit(s) that were also fetched from memory. If the computed and stored bits are not the same, you have a parity error. A (parity) error in a memory location is only detected at the time it is fetched, so you are probably going to want the right data, and your argument about it being unimportant is basically invalid. -- Charlie Price cprice@mips.mips.com (408) 720-1700 MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
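The store/fetch cycle Charlie describes is one extra bit of bookkeeping per byte. A minimal sketch of even parity in C (names are illustrative; some systems use odd parity, and real hardware does this with a tree of XOR gates, not a loop):

```c
/* Parity bit stored alongside a data byte.  With even parity the
   stored bit makes the total number of 1s (data plus parity) even. */
int parity_bit(unsigned char data)
{
    int p = 0;
    while (data) {
        p ^= data & 1;   /* XOR of all data bits */
        data >>= 1;
    }
    return p;
}

/* On fetch, recompute parity over the data bits and compare it with
   the stored bit; any single-bit error in data or parity shows up
   as a mismatch. */
int parity_error(unsigned char fetched, int stored_parity)
{
    return parity_bit(fetched) != stored_parity;
}
```

Note the limits this makes obvious: a two-bit error in the same byte cancels out and goes undetected, and detection happens only at fetch time, never at the moment the bit flips.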
powell@crg5.UUCP (Powell Quiring) (08/29/90)
In article <56qmo1w162w@zl2tnm.gp.govt.nz> don@zl2tnm.gp.govt.nz (Don Stokes) writes: >rwallace@vax1.tcd.ie writes: > >> If the operating system just told you about it when there was a parity error >> I'd agree with you, something like flashing up a message on the screen: >> "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ". >> However automatically crashing the computer is NOT acceptable behaviour: I'd >> much rather do without the parity checking. Consider: suppose a parity error >> occurs on a 640K machine. The error is probably in either an unused area of >> memory (e.g. at the DOS prompt) or in a section of code that isn't going to be >> executed on this session with the program (e.g. error handling code). The fact that you got a parity error indicates that the memory has been read. The only question is how much this incorrect value is going to screw you up.
peter@ficc.ferranti.com (Peter da Silva) (08/29/90)
In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > What would anyone > bother to run on a computer which is so valueless that they don't care > if they get a right answer, just so that you get an answer? Videogames. -- Peter da Silva. `-_-' +1 713 274 5180. 'U` peter@ferranti.com
hrich@emdeng.Dayton.NCR.COM (George.H.Harry.Rich) (08/29/90)
rwallace@vax1.tcd.ie writes: > hard disk goodbye. Let's see, when did I do my last full backup?). So the PC > parity protection is worse than useless. I've had three hard disks trashed by hardware problems. Every one of them went, not because systems failed to update the directory and allocation map, but because they updated the directory and allocation map from trashed memory and the trashing was undetected. Lest I create paranoia, two of the systems were prototypes, and one was an ancient, abused, non-standard, and flaky PDP-11/05. Given a properly organized disk caching system, loss of data already on disk due to system halts is a very low probability occurrence. If it happens often, you should get a different caching system. (Sales pitch deleted). My own experience is that a parity error on a good workstation is a once in a blue moon occurrence. (Sales pitch deleted). In my environment, where I'm doing a lot of changing and testing, stoppages due to software bugs or incompatibilities are much more frequent. My suggestion is that you don't commit more than half an hour's work to RAM or more than a day's work to hard disk. A day's contingency in a software schedule will take care of this. If you don't have a day's contingency, you should be making selective backups more often; you can do them during think time if you organize them properly. Regards, Harry Rich Disclaimer: The ideas expressed here are mine and not necessarily those of my employer.
tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) (08/29/90)
In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >No answer is better than a wrong answer. What would anyone >bother to run on a computer which is so valueless that they don't care >if they get a right answer, just so that you get an answer? I'm sorry, but I have to go into reality mode here. I can understand if you were running a simulation on the space shuttle you'd rather get no answer than a wrong answer. But let's say you were doing something more typical, like ... oh ... replying to a long article in news. You've been typing and researching for an hour now. I ask you this: would you rather I just blow away that entire article and crash your machine or change a single random character? Paul Chamberlain | I do NOT represent IBM tif@doorstop, sc30661@ausvm6 512/838-7008 | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
rsc@merit.edu (Richard Conto) (08/29/90)
In article <3294@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes: >I'm sorry, but I have to go into reality mode here. I can understand >if you were running a simulation on the space shuttle you'd rather >get no answer than a wrong answer. But let's say you were doing something >more typical, like ... oh ... replying to a long article in news. You've >been typing and researching for an hour now. I ask you this: would you >rather I just blow away that entire article and crash your machine or change >a single random character? There's more choices than that. If your news is running on a multitasking machine, I'd hope that the kernel would be able to terminate the task (if the parity error occurred in task-memory rather than kernel memory.) But think. A parity error MAY occur in the text being manipulated. But it can also occur in worse places. It could corrupt the datastructures in your news program, leading to an eventual core dump (but not right away.) If you're keeping track of core dumps (for whatever reason), do you want to waste time tracking down an obscure bug like that? If the memory is in user-space, the kernel should at the very least kill the task. If it doesn't check the page of memory that caused the parity error, it should (at the very least) never re-allocate that page, and log a nasty message on the operator's console. If the error occurs in kernel-space, it should try for as graceful a shutdown as it can. Which may mean printing a very nasty message on the operator's console and halting, since it can't trust its disk system anymore. --- Richard
dab@myrias.com (Danny Boulet) (08/30/90)
In article <3294@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes: >In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >>No answer is better than a wrong answer. What would anyone >>bother to run on a computer which is so valueless that they don't care >>if they get a right answer, just so that you get an answer? > >I'm sorry, but I have to go into reality mode here. I can understand >if you were running a simulation on the space shuttle you'd rather >get no answer than a wrong answer. But let's say you were doing something >more typical, like ... oh ... replying to a long article in news. You've >been typing and researching for an hour now. I ask you this: would you >rather I just blow away that entire article and crash your machine or change >a single random character? Gee. That depends. Consider the characters "1.23456e12". If the random change hits the '6' and turns the characters into "1.23452e12" then I probably don't mind. If the random change hits the exponent field and I get the characters "1.23456e92" (a one bit change) then I probably mind quite a bit. Similar effects occur with binary data (hitting the low order bit of a floating point number won't matter much but watch what happens when a high order bit or sign or exponent bit gets hit). I'd prefer that the question was: "would you prefer that I crash the machine and force you to use the backed up file produced by your editor or silently produce a wrong answer?". I'm very much in favour of answers that I trust. I know that there are an awful lot of ways that a computer can produce wrong answers. That is no excuse for failing to catch the ones that it is practical to catch. Adding an extra bit to each byte (or whatever) seems like a small price to pay for a bit more confidence in the results. 
Also, given the reliability of current memory and such, crashes due to parity errors would probably be a lot less frequent than crashes due to other random events (i.e. adding this feature probably wouldn't do much harm to the MTBF numbers for the system). One final note: a lot of small computers are used for business applications like payroll, accounting, inventory and such. This may not be as flashy as simulating the space shuttle but silent failures in these applications can be pretty devastating to the business. Unfortunately, the users of such systems are probably the least likely to appreciate the value of knowing that the computer detected an error and aborted rather than giving wrong answers.
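[Boulet's point, that the damage depends entirely on which bit gets hit, can be made concrete by flipping single bits of an IEEE 754 double. A small illustration, with bit positions chosen for effect:]

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 double representation flipped."""
    (u,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", u ^ (1 << bit)))
    return y

x = 1.23456e12
low = flip_bit(x, 0)     # lowest mantissa bit: relative error around 1e-16
high = flip_bit(x, 62)   # top exponent bit: hundreds of orders of magnitude off
print(abs(low - x) / x)
print(high)
```

A flip in the low mantissa bits is invisible at any reasonable precision; a flip in the exponent or sign bit is catastrophic, exactly as the post argues.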
limey@sparc2.hri.com (Craig Hughes) (08/30/90)
In the interest of keeping the machine up, even with a potentially fatal problem, how about having the hardware notify the OS (through some kind of exception) that there is a problem with a certain area of memory? The O/S could then dynamically remove that portion of physical memory from its virtual map after copying the data to a new page, log the error, and continue processing. The corrupted data isn't fixed, but at least the machine is still running - and that can be important sometimes. (reminds me of a military computer I've seen with a big 'combat mode' button on the front - apparently when pressed all exceptions would be ignored.......) -- Craig S. Hughes UUCP: ...bbn!hri.com!limey Horizon Research, Inc. INET: limey@hri.com 1432 Main Street, Waltham, MA 02154
karsh@trifolium.esd.sgi.com (Bruce Karsh) (08/30/90)
>I know that there are an awful lot of ways that a computer can produce >wrong answers. That is no excuse for failing to catch the ones that it >is practical to catch. Adding an extra bit to each byte (or whatever) >seems like a small price to pay for a bit more confidence in the >results. But adding the extra bit has a reliability cost too: Memory boards need more pins on their connectors. Mechanical connections are a notorious failure point. More power is used so the system runs hotter. There may need to be more reliance on fans (which are also notorious) to cool the system. The component count is increased so there are more components which can potentially fail. Parity checking circuitry which can also fail has been added. Multiple bit errors may not be detected. These may all be small effects, but with a modern, well designed memory system, parity errors are a small effect as well. > Also, given the reliability of current memory and such, crashes >due to parity errors would probably be a lot less frequent than crashes >due to other random events (i.e. adding this feature probably wouldn't >do much harm to the MTBF numbers for the system). Given the reliability of current memory and such, how probable is the event that parity protects against? I don't have the answer to this, but I have to believe that someone has studied this problem. Are memory parity errors in any way a significant contributor to computer errors? It seems to me that there are so many other sources of computer error which are so much more significant that memory parity is just silly. We don't usually put parity on floating point processors or internal CPU data paths and registers. Putting it on memory seems like a very expensive "spit in the ocean". Is there some real hard data which shows that memory is so failure-prone that parity checking is called for? If so, why is it that a single bit of parity checking is adequate?
Is the failure mode such that even-bit failures are by far the most common kind? The few memory failures that I've looked carefully at have been pretty massive, not single-bit. Has memory parity become a senseless security blanket for the insecure and uninformed? >One final note: a lot of small computers are used for business applications >like payroll, accounting, inventory and such. This may not be as flashy as >simulating the space shuttle but silent failures in these applications can >be pretty devastating to the business. Unfortunately, the users of such >systems are probably the least likely to appreciate the value of knowing that >the computer detected an error and aborted rather than giving wrong answers. True, but if the protection is from an extremely unlikely event, it makes sense to put the cost of protection into protecting against a more likely event. Or, alternatively, just leave it off entirely. You'll never make a perfectly reliable computer. You have to settle for some statistical level of reliability. I'd like to see a comparison of the probability of a memory parity error causing a business to make a significant financial mistake, versus the probability of a software error causing the mistake. Bruce Karsh karsh@sgi.com
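[Karsh's question about even-bit failures has a precise answer: a single parity bit catches every odd number of flipped bits and misses every even number. A minimal sketch, not modeled on any particular memory controller:]

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2

stored = 0b1011_0010
check = parity(stored)

single = stored ^ 0b0000_0100   # one bit flipped
double = stored ^ 0b0011_0000   # two bits flipped

print(parity(single) != check)  # True: detected
print(parity(double) != check)  # False: slips through undetected
```

This is why parity is characterized as detection only: any even-bit corruption, including a whole byte of zeros from a dead chip with even parity, is invisible to it.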
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/30/90)
In article <1990Aug29.150017@sparc2.hri.com> limey@hri.com writes: | In the interest of keeping the machine up, even with a potentially fatal | problem, how about having the hardware notify the OS (through some kind | of exception) that there is a problem with a certain area of memory? The | O/S could then dynamically remove that portion of physical | memory from its virtual map after copying the data to a new page, | log the error, and continue processing. This is how it usually works in a good O/S. If the page is instructions, it can be reloaded from disk, as can an unmodified data page. If the data page is dirty the process must be terminated. On a PC running XX-DOS, it makes more sense to take the whole system down, since there is no way to tell if the page is dirty or clean, and no memory mapping to do anything about it if you could. And no penalty, since the current task is the only task (unless extenders are running). -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
eli@smectos.gang.umass.edu (Eli Brandt) (08/31/90)
In article <19875@crg5.UUCP> powell@crg5.UUCP (Powell Quiring) writes: >In article <56qmo1w162w@zl2tnm.gp.govt.nz> don@zl2tnm.gp.govt.nz (Don Stokes) writes: >>rwallace@vax1.tcd.ie writes: >> >>> If the operating system just told you about it when there was a parity error >>> I'd agree with you, something like flashing up a message on the screen: >>> "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ". >>> However automatically crashing the computer is NOT acceptable behaviour: I'd >>> much rather do without the parity checking. Consider: suppose a parity error >>> occurs on a 640K machine. The error is probably in either an unused area of >>> memory (e.g. at the DOS prompt) or in a section of code that isn't going to be >>> executed on this session with the program (e.g. error handling code). > >The fact that you got a parity error indicates that the memory has >been read. The only question is how much this incorrect value >is going to screw you up. How 'bout when you get a parity error a little window pops up with the mangled byte and some context? That way you can fix it if it's in human-readable data and choose either to continue or reboot otherwise. I personally would always choose the latter - I don't want munged FP values, I don't want corrupted FATs written to disk, and I really don't want a 21h call changed from 25h to 35h. Of course, you could always add a few error-correcting bits, too. However, *I* wouldn't pay for the extra RAM/circuitry/design time because I've never had a genuine parity error.
chuckh@apex.UUCP (Chuck Huffington) (09/01/90)
In article <3294@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes: |In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: |>No answer is better than a wrong answer. What would anyone |>bother to run on a computer which is so valueless that they don't care |>if they get a right answer, just so that you get an answer? | |I'm sorry, but I have to go into reality mode here. I can understand |if you were running a simulation on the space shuttle you'd rather |get no answer than a wrong answer. But let's say you were doing something |more typical, like ... oh ... replying to a long article in news. You've |been typing and researching for an hour now. I ask you this: would you |rather I just blow away that entire article and crash your machine or change |a single random character? Two points: 1) How do you feel about a single random character in an ilist or in a free block map? 2) Are there really that many workstations that are ONLY used to read news? And NEVER used to do anything critical? How do you prevent a toy workstation from being used in a critical application? And along the same line, what defines critical? It would be really nice to have fault tolerant systems, but failing that, I would usually prefer to have a system crash instead of corrupting its filesystems, or silently making "innocent" errors.
rwallace@vax1.tcd.ie (09/01/90)
Organization: Computer Laboratory, Trinity College Dublin In article <2469@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes: > In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes: > | OK take the very unlikely case that > | it is in your data. For me that means in the source for a program I'm writing. > | This is no problem, I can just fix the one trashed character when the compiler > | barfs on the code. > > One bit changes 0 to 1, + to *, etc. Your compiler catches this? > > | Much better than having the machine crash and lose several > | minutes' work. Or say the error is in a floating-point number in a spreadsheet. > | Chances are the program will crash with a floating-point error or at least > | produce obviously wrong results e.g. profit for 1989 was $-32198742.88888. > | > | The point is that ignoring a parity error is a pretty safe thing to do; there's > | very little chance of getting a misleading answer. > > A little thought should show you the error of that thought. Over half > the word is dedicated to less significant bits of the mantissa. An error > in any of those bits will result in a percent or less change. OK fair point, and other people have pointed out that parity is only checked when reading/writing an individual memory word. I still think it would be MUCH more useful for the operating system to flash up a message telling you that a parity error had been detected and where it was, so I could decide for myself whether to reset the machine. (A likely response if doing stuff with very important data is to save your work under a different filename and then compare your modified version with the previous version to check if there are any differences other than in the section you modified). "To summarize the summary of the summary: people are a problem" Russell Wallace, Trinity College, Dublin rwallace@vax1.tcd.ie
leavitt@mordor.hw.stratus.com (Will Leavitt) (09/01/90)
Concerning the additional hardware needed for parity on memory, Bruce Karsh (karsh@sgi.com) writes: >But adding the extra bit has a reliability cost too: > > Memory boards need more pins on their connectors. Mechanical connections > are a notorious failure point. > > More power is used so the system runs hotter. There may need to be more > reliance on fans (which are also notorious) to cool the system. > > The component count is increased so there are more components which can > potentially fail. > > Parity checking circuitry which can also fail has been added. > > Multiple bit errors may not be detected. All true... but modern CMOS memory runs ridiculously cool, parity detects half of all multi-bit errors, and as we'll see, there are very definite reasons why memory chips fail more often. >We don't usually put parity on floating point processors or internal >CPU data paths and registers. Putting it on memory seems like a very >expensive "spit in the ocean". > >Is there some real hard data which shows that memory is so failure-prone >that parity checking is called for? If so, why is it that a single bit >of parity checking is adequate? Is the failure mode such that even-bit >failures are by far the most common kind? The few memory failures that >I've looked carefully at have been pretty massive, not single-bit. There are at least 3 DRAM failure mechanisms that aren't applicable to floating point processors, CPU data paths, and registers. 1) alpha particle flipped bits 2) DRAMs come in difficult to solder-inspect packages 3) DRAMs (like all dynamic circuits) are prone to forgetfulness One at a time: 1) alpha particle flipped bits Quoting from a Siemens Information report #6: "Alpha particles are doubly charged helium nuclei emitted in the radioactive decay of many radioactive elements (principally Uranium & Thorium). Naturally occurring alphas range in energy from about 2 to 9 MeV and are treated as classical particles.
An alpha interacts electronically with silicon creating a track of electron-hole pairs along the 25 um straight line path of the particle." A track of electron-hole pairs conducts, by the way. Data bits are stored as charge on a capacitor in the memory cell, and are read (sensed) by connecting the cell to a sense amplifier via a bit line shared by other cells. Thus if an alpha particle zips through your capacitor, it can flip a bit in memory. If it zips through the bitline while you are reading, you get wrong data, and the data gets written back wrong. (internally, DRAM reads are destructive, and are always followed by a restore). For CMOS 1 Meg parts, a typical error rate is 270 failures per 10^9 device hours. According to Siemens, bit line failures now dominate alpha sensitivity. Both of these lead to INTERMITTENT SINGLE BIT ERRORS. Now, what is uranium doing next to your chips? It can be a contaminant in the silicon or aluminum, or in the package. There is a story where Amdahl built a series of mainframes with no error correction. Their DRAM vendor packaged a batch of DRAMs in ceramic DIPs with a good dose of uranium contaminating the ceramic, and the resulting mainframes wouldn't stay up for more than a day. Of course, they failed a different way each time. Amdahls now have ECC. 2) DRAMs come in difficult to solder-inspect packages Crack open the top of your Sparcstation or Iris for this one... The most popular package for DRAMs these days is the SOJ; the leads curl underneath the chip and are impossible to inspect. The most popular packages for logic are either through hole (like pin grid arrays), or gull wing (plastic quad flat pack). Those big gate arrays are PQFPs. Both are easy to inspect. Now if not quite enough solder gets squeegeed through the solder mask when they make the board, and/or if the chip has a slightly non-coplanar lead, then instead of being soldered to the board, the lead ends up resting on a bump of solder below it.
Because of the springiness in the leads, this will work for a while, but eventually oxidation will cause intermittent contact. Now, most DRAMs used for main memory are 1 bit wide parts, so this results in INTERMITTENT SINGLE BIT ERRORS. Why are DRAMs packaged in impossible to inspect packages? Because they are denser than gull wing, and besides PARITY WILL DETECT ANY PROBLEMS ANYWAY. 3) DRAMs (like all dynamic circuits) are prone to forgetfulness DRAMs store a bit by either charging or not charging a tiny capacitor; the charge on the capacitor must be refreshed every 15ms before it dissipates. Normally this works fine, but marginal chips are prone to data retention problems, especially at high temperatures and out of spec voltage ranges. Dynamic circuits are used in many CMOS microprocessors as well, but typically refreshing is not a problem (it happens on every clock tick, for example). >Has memory parity become a senseless security blanket for the insecure and >uninformed? Probably. Pretty soon error correction will be standard on all machines with significant memory sizes. >I'd like to see a comparison of the probability of a memory parity error >causing a business to make a significant financial mistake, versus the >probability of a software error causing the mistake. True. But soft memory errors, like bad disk blocks, are a solved problem. Software errors are not. -will -- ----------------------------------------------------------------- leavitt@mordor.hw.stratus.com
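[The quoted Siemens rate can be turned into a system-level expectation. Assuming failures are independent and the per-device figure of 270 failures per 10^9 device-hours holds (both assumptions), the arithmetic looks like this:]

```python
HOURS_PER_YEAR = 24 * 365

def years_between_errors(devices: int, failures_per_1e9_dev_hours: float) -> float:
    """Mean years between single-bit soft errors for a bank of DRAMs,
    assuming independent failures at the quoted per-device rate."""
    errors_per_hour = devices * failures_per_1e9_dev_hours / 1e9
    return 1.0 / (errors_per_hour * HOURS_PER_YEAR)

print(years_between_errors(32, 270.0))    # ~13 years for 4 MB of 1 Mbit parts
print(years_between_errors(128, 270.0))   # ~3.3 years for 16 MB of 1 Mbit parts
```

The 32-device case reproduces the roughly 13-year figure that comes up later in the thread for a 4 MB system; quadrupling the memory quarters the interval.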
karsh@trifolium.esd.sgi.com (Bruce Karsh) (09/01/90)
In article <2201@lectroid.sw.stratus.com> leavitt@mordor.sw.stratus.com (Will Leavitt) writes: >All true... but modern CMOS memory runs ridiculously cool, parity >detects half of all multi-bit errors, and as we'll see, there are very >definite reasons why memory chips fail more often. CMOS devices run cool when they are switched slowly. They can consume a lot of power when they are switched rapidly. Also, CMOS memory is expensive. Large memory systems are not often CMOS. >Data bits are stored as charge on a capacitor in the memory cell, and >are read (sensed) by connecting the cell to a sense amplifier via a bit >line shared by other cells. Thus if an alpha particle zips through >your capacitor, it can flip a bit in memory. If it zips through the >bitline while you are reading, you get wrong data, and the data gets >written back wrong. (internally, DRAM reads are destructive, and are >always followed by a restore). For CMOS 1 Meg parts, a typical error >rate is 270 failures per 10^9 device hours. According to Siemens, bit >line failures now dominate alpha sensitivity. Both of these lead to >INTERMITTENT SINGLE BIT ERRORS. That works out to less than one single-bit error every 13 years of continuous operation on a system with 4 megabytes of CMOS DRAM. And in most cases, that single-bit error would not even affect the operation of the system. Surely this is a spit in the ocean. I doubt that most people would ever observe one of these in their entire computing life. Certainly there are sources of failure in most computer systems which are much higher than this. Like the electrical wall outlet! If the failure rate of 4 Meg DRAMs is really a lot higher than this, then perhaps some protection is called for. But what good is parity? It just replaces the system damage caused by the memory error with the system damage caused by a catastrophic system crash. >There is >a story where Amdahl built a series of mainframes with no error >correction.
>Their DRAM vendor packaged a batch of DRAMs in ceramic >DIPs with a good dose of uranium contaminating the ceramic, and the >resulting mainframes wouldn't stay up for more than a day. Of course, >they failed a different way each time. Amdahls now have ECC. A company sent out a bad batch of DRAMs. So what else is new? It happens all the time. How common is this failure? It sounds like a spit in the ocean to me. >2) DRAMs come in difficult to solder-inspect packages >Crack open the top of your Sparcstation or Iris for this one... The >most popular package for DRAMs these days is the SOJ; the leads curl >underneath the chip and are impossible to inspect. The most popular >packages for logic are either through hole (like pin grid arrays), or >gull wing (plastic quad flat pack). Those big gate arrays are PQFPs. >Both are easy to inspect. Now if not quite enough solder gets >squeegeed through the solder mask when they make the board, and/or if >the chip has a slightly non-coplanar lead, then instead of being >soldered to the board, the lead ends up resting on a bump of solder >below it. Because of the springiness in the leads, this will work for >a while, but eventually oxidation will cause intermittent contact. >Now, most DRAMs used for main memory are 1 bit wide parts, so this >results in INTERMITTENT SINGLE BIT ERRORS. No doubt true, but at what rate does this failure mode occur? There are a lot of high density interconnect schemes now and even more is on the way. Are you suggesting that they are so failure prone that they require error detecting logic? In most all cases, this failure would be detected during system thermal testing and it would never make it out the door. It is possible that a certain number would slip through. How common is this failure? Would a typical system ever have a failure because of this failure mode? Is it worth adding 12% to the cost and size of a memory system and making it run more slowly because of this?
Couldn't that money be spent elsewhere to more effectively improve the reliability of the system? I think your system is more likely to be hit by lightning than to have sporadic crashes due to this failure mode. Do we have any real hard numbers on how often this failure occurs? You're probably more likely to see this failure on the SIMM socket rather than on the chip leads. In that case, there could easily be more than a single bit error and the parity detection could still fail to catch the error. >3) DRAMs (like all dynamic circuits) are prone to forgetfulness >DRAMs store a bit by either charging or not charging a tiny capacitor; >the charge on the capacitor must be refreshed every 15ms before it >dissipates. Normally this works fine, but marginal chips are prone to >data retention problems, especially at high temperatures and out of >spec voltage ranges. Dynamic circuits are used in many CMOS >microprocessors as well, but typically refreshing is not a problem (it >happens on every clock tick, for example). DRAMs, when properly used, are not any more prone to forgetfulness than the other logic chips, unless the DRAM is defective. A defective memory chip will have errors. But are memory chips defective at so much of a higher rate than other chips that it is a problem? If not, then why single out memory chips for parity protection? >Probably. Pretty soon error correction will be standard on all machines >with significant memory sizes. I suspect that won't happen. Memory parity errors are a very rare failure mode. I don't think too many designers are going to add extra cost to their systems to guard against this failure. Especially not in the price-competitive computer market of today. There are just too many better places to improve reliability. >>I'd like to see a comparison of the probability of a memory parity error >>causing a business to make a significant financial mistake, versus the >>probability of a software error causing the mistake. >True.
>But soft memory errors, like bad disk blocks, are a solved problem. >Software errors are not. But if a part whose only job is to decrease the rate of undetected failures does not make a significant improvement in the rate of undetected failures, then what good is it? If someone can show me that those parity chips really do significantly decrease the rate of undetected system failures, then I'll agree that they are necessary. Even if they only make a 5% reduction in this rate, they may be an acceptable idea. Somehow I think that if they make any reduction at all, it's several places to the right of the decimal point. E.g. .0001%. Even worse though, they may actually be decreasing the overall reliability of systems. Bruce Karsh karsh@sgi.com
gd@geovision.uucp (Gord Deinstadt) (09/02/90)
rwallace@vax1.tcd.ie writes: > Consider: suppose a parity error >occurs on a 640K machine. The error is probably in either an unused area of >memory (e.g. at the DOS prompt) or in a section of code that isn't going to be >executed on this session with the program (e.g. error handling code). In all the parity-checking memories I've seen, parity is only checked when the data is fetched by the CPU or DMA controller. So it *is* in use. ECC systems do generally access memory locations that are not in use, but that's because they can usually do something useful, i.e. fix the data if only a single bit is in error. > Or say the error is in a floating-point number in a spreadsheet. >Chances are the program will crash with a floating-point error or at least >produce obviously wrong results e.g. profit for 1989 was $-32198742.88888. No doubt 32198742.88888 posters have already replied to this. What they said. -- Gord Deinstadt gdeinstadt@geovision.UUCP
Don_A_Corbitt@cup.portal.com (09/03/90)
> I think your system is more likely to be hit by lightning than to have > sporadic crashes due to this failure mode. Do we have any real hard > numbers on how often this failure occurs? > > You're probably more likely to see this failure on the SIMM socket rather > than on the chip leads. In that case, there could easily be more than > a single bit error and the parity detection could still fail to catch the > error. > > > > But if a part whose only job is to decrease the rate of undetected failures > does not make a significant improvement in the rate of undetected failures, > then what good is it? > > If someone can show me that those parity chips really do significantly > decrease the rate of undetected system failures, then I'll agree that > they are necessary. Even if they only make a 5% reduction in this > rate, they may be an acceptable idea. > > Somehow I think that if they make any reduction at all, it's several > places to the right of the decimal point. E.g. .0001%. Even worse > though, they may actually be decreasing the overall reliability of systems. > > Bruce Karsh > karsh@sgi.com Well, I have some anecdotal evidence of the benefits of parity. I've been working with IBM PCs and clones (the original subject of this discussion) since the early days. I've probably been around machines for 8 years * 4 machines (average) or 32 machine years. I've seen 5 or 6 machines that would tend to get parity errors. Each time, it was possible to fix by replacing one or more RAM chips (with one exception). These machines all passed their power-on self-test, but would fail every few minutes/hours/days. Knowing that hardware was broken, we were able to blindly swap RAMs until things worked. If we didn't have parity checking, we would suspect our software (SW developers) for bugs, pointer problems, etc. Each machine treats parity errors differently. Some show the suspected address and RAM chip, others just say "Parity Error R)eboot or I)gnore".
The one time the problem wasn't bad RAM chips was when I installed a memory expansion board improperly (vendor sent wrong docs). It used page mode RAM, but I had the page mode switch turned off. This was for the upper 4MB of an 8MB 386 machine. POST worked fine, using the RAM for a RAM disk worked fine, but OS/2 would crash with a parity error when booting. It appeared that the access pattern would change the failure mode. What's the point? RAM is an area where the end-user often gets involved. Since it is so easy to damage chips when installing them, I find it to be worthwhile to have some sanity checking on their operation. Also, most of the transistors of a given system will be in the RAM chips. Parity gives an inexpensive way to reduce the number of "silent wrong answers". --- Don_A_Corbitt@cup.portal.com Not a spokesperson for CrystalGraphics, Inc. Mail flames, post apologies. Support short .signatures, three lines max.
henry@zoo.toronto.edu (Henry Spencer) (09/04/90)
In article <68362@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes: >... perhaps some protection is called for. But what good is parity? >It just replaces the system damage caused by the memory error with the >system damage caused by a system failure caused by a catastrophic system >crash. A hard failure is usually preferable to a silently wrong answer. -- TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology OSI: handling yesterday's loads someday| henry@zoo.toronto.edu utzoo!henry
karsh@trifolium.esd.sgi.com (Bruce Karsh) (09/05/90)
In article <1990Sep4.163619.24726@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >A hard failure is usually preferable to a silently wrong answer. Given that a memory system is otherwise properly designed and tested and uses modern 4Mbit DRAM memory chips, is there any evidence that memory parity makes a measurable difference in the silent wrong answer rate? The memory failure component of the silent wrong answer rate seems to be so small as to be insignificant. If the answer is no, then isn't parity just a historical and cultural artifact from the days when parity really was necessary? Bruce Karsh karsh@sgi.com
wilker@descartes.math.purdue.edu (Clarence Wilkerson) (09/05/90)
I don't remember the exact rate but I thought that with 4 megs of memory, one expected from alpha radiation alone one error per two weeks.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/05/90)
In article <68505@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes: | Given that a memory system is otherwise properly designed and tested | and uses modern 4Mbit DRAM memory chips, is there any evidence that | memory parity makes a measurable difference in the silent wrong answer | rate? If the error rate for a 1 bit error is 1 in N, then the rate for a 2 bit error is 1 in N^2. With N on the order of some millions (or billions), you make the chance of silent error millions of times less likely. EDAC makes it possible to *correct* errors, but the rate of 2 bit errors which would be caught is small to the point of really being insignificant. Below the rate of errors caused by noise on the bus, I believe. I would expect EDAC on a 64 bit machine, however, since it is probably cheaper. Note that byte parity takes 8 bits of parity memory; EDAC on a 64 bit word also needs 8 check bits (7 for single-bit correction plus one overall parity bit for double-bit detection), so for the same memory cost you get 2 bit error detection and 1 bit error correction instead of detection alone. This assumes that you can (a) build fast EDAC as cheaply as parity, and (b) that you use ALL 64 bit data fetches. You can run a 72 bit data bus and put the EDAC in the bus masters (CPU and I/O controllers) which access the bus. You can still use parity on I/O devices without controllers, such as serial ports, if you have any such devices. This is a bus design issue and doesn't affect the theory at all, just the cost/simplicity ratio. Glossary: EDAC - error detection and correction BMD - bus master devices TLA - three letter acronym -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
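[The check-bit arithmetic can be written out. For w data bits, Hamming single-error correction needs the smallest r with 2^r >= w + r + 1, and double-error detection adds one overall parity bit. This is the standard SECDED formula, sketched here as an illustration rather than any particular machine's implementation:]

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for single-error-correct / double-error-detect:
    smallest r with 2**r >= data_bits + r + 1 (Hamming SEC),
    plus one overall parity bit for double-error detection."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for w in (8, 16, 32, 64):
    print(f"{w:2d} data bits: {secded_check_bits(w)} SECDED bits vs {w // 8} parity bit(s)")
```

For 32 data bits this gives 7 check bits, matching the RS/6000's 7 ECC bits per word mentioned at the top of the thread; for 64 data bits it gives 8, the same total as byte parity, which is why wider words make ECC relatively cheaper.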
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/05/90)
In article <13625@mentor.cc.purdue.edu> wilker@math.purdue.edu (Clarence Wilkerson) writes: | | I don't remember the exact rate but I thought that with 4 megs of | memory, one expected from alpha radiation alone one error per | two weeks. I haven't seen error rates that high in any workstation or 32 bit PC. I haven't seen a parity error in several years on nine machines with 100+ MB of memory total. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
alix@cerl.uiuc.edu (Chris Alix) (09/05/90)
In article <13625@mentor.cc.purdue.edu> (Clarence Wilkerson) writes: > I don't remember the exact rate but I thought that with 4 megs of >memory, one expected from alpha radiation alone one error per >two weeks. In article <2484@crdos1.crd.ge.COM> (Wm E Davidsen Jr) writes: > I haven't seen error rates that high in any workstation or 32 bit PC. >I haven't seen a parity error in several years on nine machines with >100+ MB of memory total. NOT A FLAME In what kind of physical environment are these machines located? I see 1 or 2 parity errors per year on two 12MB Sun 3/180's in an atypically "noisy" computer room (lots of unshielded custom hardware, video, fans switching on and off, etc.) I'd imagine that a lone workstation in a Tempest-compliant room might never see an error, but with more and more standard workstations being used for factory-floor applications, I suspect that the decision to include parity, secded, etc. is made with the worst possible physical environment in mind. -------------------------------------------------------------------------- Christopher Alix E-mail: alix@uiuc.edu University of Illinois PLATO/NovaNET: alix / s / cerl Computer-Based Education Research Lab Phone: (217) 333-7439 103 S. Mathews Urbana, IL 61820 Fax: (217) 244-0793 --------------------------------------------------------------------------
douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee) (09/06/90)
DRAM manufacturers express reliability in terms of FITs (Failures In Time). One FIT represents one error in one billion (10 ^ 9) hours of operation. Toshiba claims a FIT rate of 252 for 1 Mb DRAMs. Clearpoint, which makes add-in memory boards, claims the actual rate is 1000 FITs. The FIT rate has steadily decreased with each successive generation of DRAMs until the 4 Mb; the FIT rate for 4 Mb DRAMs is higher than for 1 Mb. From the FIT rate you can calculate the MTBF of any memory system. The MTBF in hours for one DRAM is calculated as 10 ^ 9 / FIT. The MTBF for a system is just the MTBF of each DRAM divided by the total number of DRAMs in the system. We are only looking at single bit errors here.

Assuming a FIT rate of 252:

   # of DRAMs   Memory Size   MTBF
   32           4 MB          14.1 years
   96           12 MB         4.7 years
   160          20 MB         2.8 years

Assuming a FIT rate of 1000:

   # of DRAMs   Memory Size   MTBF
   32           4 MB          3.6 years
   96           12 MB         1.2 years
   160          20 MB         260 days

For most PCs (memories < 12 MB) a single bit error should occur rarely due to soft errors. The FIT rate really only measures errors due to alpha particle radiation. There can be more soft errors caused by power supply spikes, drop outs, etc. that have not been accounted for here. This will cause the FIT rate to go up, reducing the MTBF. The thing to realize here is that parity will actually make the MTBF go down: more parts are added, so more things can fail. Parity does allow you to detect these errors, however. Error detection and correction (EDAC) has been mentioned as an alternative and is used in many workstations (e.g. Sun). One of the most popular parts is the Am29C660 and its predecessor, the Am2960. This part uses a modified Hamming code to detect and correct single bit errors and to detect double bit errors. It will in fact detect many multi-bit errors and catastrophic failures such as all 0's or all 1's. The part appends 7 bits to a 32 bit word and 8 bits to a 64 bit word (two parts are cascaded).
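Douglas's MTBF tables follow directly from the FIT definition; a small sketch (mine, not his) reproduces a couple of the entries:

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

def system_mtbf_years(fit_per_dram, n_drams):
    # One FIT = one failure per 10^9 device-hours; failure rates of
    # identical, independent parts simply add across the array.
    return 1e9 / (fit_per_dram * n_drams) / HOURS_PER_YEAR

# 252 FITs, 32 chips (4 MB of 1 Mbit parts): about 14.1 years
print(round(system_mtbf_years(252, 32), 1))
# 1000 FITs, 160 chips (20 MB): about 260 days
print(round(system_mtbf_years(1000, 160) * 365.25))
```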
For 32 bits the overhead is greater than parity, 7 vs. 4, but at 64 bits you break even. Similar parts are made by IDT, and many workstation manufacturers implement the same function in gate arrays. The advantage of this scheme is that all single bit errors are corrected. Also, during refresh cycles the EDAC can scrub memory. This is done by reading one memory location and correcting any single bit errors during each refresh cycle. By appropriately partitioning memory, the entire memory can be scrubbed in a short time, preventing the accumulation of double-bit errors. To calculate the probability of two bit errors occurring, the birthday paradox is used. This gives the probability of two single bit errors occurring in the same memory word.

Assuming 32 bit words and 252 FITs:

   # of DRAMs   Memory Size   MTBF
   39           4 MB          14,907 years
   117          12 MB         8,607 years
   195          20 MB         6,667 years

For 1000 FITs:

   # of DRAMs   Memory Size   MTBF
   39           4 MB          3,757 years
   117          12 MB         2,168 years
   195          20 MB         1,680 years

This increase is overstated since you have added extra circuitry and devices that can cause other failures to occur. The expected total system MTBF increase is 50 to 60 times that of the non-EDAC system. If scrubbing is used, then this will be even higher. What this also neglects is that many single bit errors can occur in memory locations that are not used, or are not read before they are written again. Therefore, the system may not detect all the parity errors that occur. I would expect that most 64 bit memories will have EDC circuits, especially memories using DRAMs > 1 Mb. Some PC companies have looked at EDC, but found it too expensive to justify putting in the box. I should say that I worked for Advanced Micro Devices supporting the Am29C660; I am no longer affiliated with them. I hope this answers some of the questions about memory reliability. Douglas Lee
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (09/06/90)
In article <2361@cirrusl.UUCP> douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee) writes: >DRAM manufactures express reliablity in terms of FITs (Failures In >Time). ... The FIT rate really only measures errors due to >alpha particle radiation. There can be more soft errors caused by >power supply spikes, drop outs, etc. that have not been accounted for >here. Yes. Also, note that parity/ECC may catch problems with connectors, bus drivers, fans and filters (== overheating), system environment, and so on. Further, FIT MTBF is an average. There are always machines "built on a Monday", just as with cars. It doesn't contribute much to this discussion to give anecdotal evidence of zero parity errors. Others of us have anecdotes to the contrary. For example, my workstation has had errors in its frame buffer - which isn't parity protected, because the occasional extra pixel isn't too important. I just refresh the screen. -- Don D.C.Lindsay
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (09/06/90)
In article <10397@pt.cs.cmu.edu>, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: > > Yes. Also, note that parity/ECC may catch problems with connectors, > bus drivers, fans and filters (== overheating), system environment, > and so on. ... Exactly. There is a workstation in a lab near my office that was having several parity errors per hour, until an unnamed idiot removed the extra SIMMs he'd scrounged from a different model of the same brand of machine. The diagnostics reported no problems, and the errors occurred only when the machine got hot. Parity saved days of looking for strange, new kernel bugs, which would have been the diagnosis without the parity error reports. Parity errors caused by a timing problem also figured prominently in the resolution, after years of searching, of a problem in the old 68K SGI line. Without the parity error reports, we would still be looking for a wild pointer. From reading the UNIX-on-PC-clones newsgroups, it seems to me that parity errors are the main, most universally available, and most reliable memory diagnostic on such machines, detecting all kinds of speed, heat, and compatibility problems. Vernon Schryver, vjs@sgi.com
davec@nucleus.amd.com (Dave Christie) (09/06/90)
In article <2483@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > > I would expect EDAC on a 64 bit machine, however, since it is probably >cheaper. Note that for byte parity it takes 8 bits of parity memory, but >for EDAC you can use 1+log2(N) or 7 bits, and get 2 bit error detection, >1 bit error correction, and use less memory. It's been many years since I designed EDAC for a 64-bit machine, but what I seem to remember is that using 7 bits would only allow you to correct the 64-bit data portion, not the 7 check bits themselves. To cover those you need one more bit (and you really do want those covered as well). This extra bit is worthwhile for another reason - it gives you a lot more freedom in arranging the matrix for generating (& regenerating) the check bits, which translates into speed. Eight bits would probably allow faster encoding and decoding for a 64-bit word than seven. So lower cost really isn't a factor. For the 64-bit machines around today, the money spent on EDAC is a drop in the bucket, and any performance penalty is greatly reduced either by caches or vector operations. Moreover, these machines often run looooong jobs, and they are paid for by charging the users. If a user's 3-day job bombs after 2 1/2 days because of a memory error, you really can't charge him, and have lost significant revenue. (Not to mention severely pissing off the user, who undoubtedly has things timed to finish 1 hour before his paper on the results must be submitted :-). Future 64-bit workstations will certainly have some different considerations though. --------------------------------- Dave Christie My fuzzy memories only.
tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) (09/06/90)
In article <2483@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >In article <68505@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes: > >| Given that a memory system is otherwise properly designed and tested >| and uses modern 4Mbit DRAM memory chips, is there any evidence that >| memory parity makes a measurable difference in the silent wrong answer >| rate? > > If the error rate for 1 bit error is 1 in N, then the rate for a 2 bit >error is 1 in N^2. With N in the order of some millions (or billions), >you make the chance of silent error millions of time less likely. I do not pretend to answer the original question but only to say that this answer is unfounded. According to my statistics class this is only true if the two events are independent. Perhaps the question could have been read as: ... is there any evidence that there are measurably more single bit errors than multiple bit errors? And now for a slightly biased question: Is it typical for a workstation to provide ECC and memory scrubbing like the Risc System/6000 does? I am getting at another possible selling point of this machine. Paul Chamberlain | I do NOT represent IBM tif@doorstop, sc30661@ausvm6 512/838-7008 | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) (09/07/90)
In article <1189@geovision.UUCP> gd@geovision.uucp (Gord Deinstadt) writes: >rwallace@vax1.tcd.ie writes: >>Chances are the program will crash with a floating-point error or at least >>produce obviously wrong results e.g. profit for 1989 was $-32198742.88888. >No doubt 32198742.88888 posters have already replied to this. What they said. Either he got hit by a memory error or he typed that wrong because it's obvious that the number is wrong. I'd be willing to bet $32198742.88888 that he typed it that way. :-) Paul Chamberlain | I do NOT represent IBM tif@doorstop, sc30661@ausvm6 512/838-7008 | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/07/90)
In article <1990Sep6.141040.3244@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes: | Its been many years since I designed EDAC for a 64-bit machine, but what | I seem to remember is that using 7 bits would only allow you to correct | the 64-bit data portion, not the 7 check bits themselves. To cover those | you need one more bit (and you really do want those covered as well). I just looked at some C code for Hamming code I wrote years ago, and it appears to need log2(N)+1 bits, including the EDAC bits themselves. In any case, if you can have EDAC for the price of parity, why not? -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
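For the curious, here is a toy single-error-correcting Hamming coder along the lines of the code Bill mentions (a sketch of mine in Python, not his original C). Check bits sit at power-of-two positions; the syndrome read out on decode is exactly the 1-indexed position of a flipped bit, which is what makes correction a one-step fix.

```python
def is_pow2(x):
    return x & (x - 1) == 0

def hamming_encode(data, data_bits):
    """Encode `data_bits` bits of `data` into a 1-indexed SEC codeword list."""
    r = 1
    while 2 ** r < data_bits + r + 1:   # smallest r with 2^r >= n + r + 1
        r += 1
    n = data_bits + r
    code = [0] * (n + 1)                # index 0 unused; positions are 1..n
    d = 0
    for pos in range(1, n + 1):         # data bits fill non-power-of-two slots
        if not is_pow2(pos):
            code[pos] = (data >> d) & 1
            d += 1
    for i in range(r):                  # check bit p covers positions with bit p set
        p = 2 ** i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code

def hamming_syndrome(code):
    """0 if the word is clean, else the position of the single flipped bit."""
    n = len(code) - 1
    syn, p = 0, 1
    while p <= n:
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        if parity:
            syn += p
        p *= 2
    return syn

word = hamming_encode(0b10110100, 8)
word[5] ^= 1                 # simulate an alpha-particle hit
pos = hamming_syndrome(word)
word[pos] ^= 1               # "scrub": write the corrected bit back
assert hamming_syndrome(word) == 0
```

Note this sketch is SEC only; the SECDED parts discussed here add one more overall-parity bit so that double errors are flagged rather than miscorrected.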
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/07/90)
In article <3405@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes: | I do not pretend to answer the original question but only to say that | this answer is unfounded. According to my statistics class this is only | true if the two events are independent. Perhaps the question could have | been read as: There are exceptions to every assumption, but assuming that most memory systems are based on 1 bit wide chips, alpha strikes (which seem to be the common cause of bit errors) would be limited to one bit in a word. I think the answer is that multibit errors in a word are rare. My hardware guru says that one particle should only hit one bit, even in the same chip, and that depending on the chip it can only make one state transition. That means on some chips it can change 0 to 1, but not back. The alpha hit discharges the capacitor in the cell. Sounds right to me, but I don't claim to be a hardware type. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
dhinds@portia.Stanford.EDU (David Hinds) (09/07/90)
In article <2496@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >In article <1990Sep6.141040.3244@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes: > >| Its been many years since I designed EDAC for a 64-bit machine, but what >| I seem to remember is that using 7 bits would only allow you to correct >| the 64-bit data portion, not the 7 check bits themselves. To cover those >| you need one more bit (and you really do want those covered as well). > > I just looked at some C code for Hamming code I wrote years ago, and >it appears to need log2(N)+1 bits, including the EDAC bits themselves. >In any case, if you can have EDAC for the price of parity, why not? >-- But don't you really only need one parity bit per word, if you only want to be able to detect single bit errors? Using one parity bit per byte is wasteful - which is why the EDAC looks good. Having one parity bit per 64 bit word would seem to be the more fair comparison. Using 8 parity bits per word amounts to catching most two-bit errors as well - but catching more than single bit errors is not what parity is tailored for. -David Hinds dhinds@popserver.stanford.edu
davec@nucleus.amd.com (Dave Christie) (09/07/90)
In article <1990Sep7.003451.13193@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes: > But don't you really only need one parity bit per word, if you only >want to be able to detect single bit errors? Using one parity bit >per byte is wasteful - which is why the EDAC looks good. Having one >parity bit per 64 bit word would seem to be the more fair comparison. Yes, this would work, but 1) it would take almost twice as long to check the parity (7 levels of XOR vs. 4), and parity checking tends to be a time-critical path (although sometimes you can delay it), so it's a classic real-estate/speed tradeoff 2) byte parity allows byte writes without the control complexity of doing read/modify/write. (EDAC of course requires r/m/w for partial-word writes.) ------------ Dave Christie
douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee) (09/07/90)
In <1990Sep7.003451.13193@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes: > But don't you really only need one parity bit per word, if you only >want to be able to detect single bit errors? Using one parity bit >per byte is wasteful - which is why the EDAC looks good. Having one >parity bit per 64 bit word would seem to be the more fair comparison. >Using 8 parity bits per word amounts to catching most two-bit errors >as well - but catching more than single bit errors is not what parity >is tailored for. > -David Hinds > dhinds@popserver.stanford.edu But using byte parity allows you to do things like byte writes. If you use word parity, you must do a read modify write for every byte in order to update the parity of the word. This is very inefficient. Douglas Lee
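The read-modify-write cost Douglas describes is easy to see in a sketch (illustrative only; the function and names are mine): with one parity bit per word, even a one-byte store must fetch the whole word just to recompute parity, whereas byte parity needs only the new byte and its own parity bit.

```python
def store_byte_word_parity(mem, parity, addr, byte_index, value):
    """Byte store under word-wide parity: a forced read-modify-write."""
    word = mem[addr]                                      # READ the full word
    shift = 8 * byte_index
    word = (word & ~(0xFF << shift)) | (value << shift)   # MODIFY one byte
    mem[addr] = word                                      # WRITE it back
    parity[addr] = bin(word).count("1") & 1               # recompute whole-word parity

mem = [0x11223344]
parity = [bin(mem[0]).count("1") & 1]
store_byte_word_parity(mem, parity, 0, 0, 0xFF)
print(hex(mem[0]), parity[0])
```

With byte parity the memory system would instead write 9 bits (the byte plus its parity) and never touch the other 7 bytes, which is the control-complexity point being made.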
dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) (09/07/90)
In article <1990Sep7.144514.19015@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes: >In article <1990Sep7.003451.13193@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes: : :: But don't you really only need one parity bit per word, if you only ::want to be able to detect single bit errors? Using one parity bit ::per byte is wasteful - which is why the EDAC looks good. Having one ::parity bit per 64 bit word would seem to be the more fair comparison. : :Yes, this would work, but : : 1) it would take almost twice as long to check the parity, : (7 levels of XOR vs. 4), and parity checking tends to be a : time critical path (although sometimes you can delay it), : so it's a classic realestate/speed tradeoff Beg pardon? You're telling me that all of these parity checks are done in serial??? If not, what difference does it make how many check bits there are? : : 2) byte parity allows byte writes without the control complexity : of doing read/modify/write. (EDAC of course requires r/m/w : for partial-word writes.) : This argument at least doesn't always hold when dealing with a write-back cache. : :------------ :Dave Christie -- Dan S.
davec@neutron.amd.com (Dave Christie) (09/08/90)
In article <2253@lectroid.sw.stratus.com> dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) writes: >In article <1990Sep7.144514.19015@mozart.amd.com> I write: >: >: 1) it would take almost twice as long to check the parity, >: (7 levels of XOR vs. 4), and parity checking tends to be a >: time critical path (although sometimes you can delay it), >: so it's a classic realestate/speed tradeoff > >Beg pardon? You're telling me that all of these parity checks are done >in serial??? If not, what difference does it make how many check bits >there are? Parity is typically generated and checked with a tree of 2-bit exclusive-ORs. Generating (or regenerating) 8-bit parity takes 3 levels of XOR. For an 8-byte word, you need three more levels to combine these into a single bit. Comparing the regenerated bit(s) with the stored bit(s) on a read requires one more XOR for each bit. As for generating an error signal, the 9 levels of XOR do it directly for 64-bit parity. With 8-bit parity on a 64-bit word, the eight error signals you have after 4 levels of XOR must be combined with (logically) three levels of 2-bit OR, which in any technology will be somewhat faster than 3 levels of XOR, and can often be much faster (e.g. ECL wired logic, CMOS dynamic logic). >: >: 2) byte parity allows byte writes without the control complexity >: of doing read/modify/write. (EDAC of course requires r/m/w >: for partial-word writes.) > >This argument at least doesn't always hold when dealing with a write-back >cache. Quite true. (Provided your I/O system or whatever else you have writing also does block or word writes, as Bill D. more or less pointed out earlier.) Just how significant either of these two points is depends highly on many parameters of your memory system design, such as cycle time, RAM speed, degree of pipelining, desired latency, write setup time, etc. ------------ Dave Christie
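The XOR-tree depths Dave describes can be mimicked in software by log-style folding (an illustrative sketch, mine, not from the thread): six fold steps reduce a 64-bit word to its parity, corresponding to a 6-level tree, versus a 3-level tree per byte for byte parity.

```python
def word_parity(x):
    """Parity of a 64-bit word via 6 XOR fold steps (a log-depth tree in hardware)."""
    for shift in (32, 16, 8, 4, 2, 1):
        x ^= x >> shift
    return x & 1

def byte_parities(x):
    """Eight independent parities, one per byte (3-level trees in hardware)."""
    return [word_parity((x >> (8 * i)) & 0xFF) for i in range(8)]

w = 0x0123456789ABCDEF
# The whole-word parity is the XOR of the eight byte parities, which is why
# byte parity needs extra combining levels to make a single word-error signal.
print(word_parity(w), byte_parities(w))
```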
davec@neutron.amd.com (Dave Christie) (09/08/90)
In article <2496@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >In article <1990Sep6.141040.3244@mozart.amd.com> I write: > >| Its been many years since I designed EDAC for a 64-bit machine, but what >| I seem to remember is that using 7 bits would only allow you to correct >| the 64-bit data portion, not the 7 check bits themselves. To cover those >| you need one more bit (and you really do want those covered as well). > > I just looked at some C code for Hamming code I wrote years ago, and >it appears to need log2(N)+1 bits, including the EDAC bits themselves. I stand corrected (thank you Robert Herndon) - the eighth bit gives you double error detection. (I knew it was necessary for some reason - like I said, its been a looong time...) The system I did was SECDED, which is the only EDAC scheme I've ever seen implemented, but then, I've only worked on mainframes. For a more cost sensitive workstation you may well skip the double-error stuff, but if it's only one more bit on top of 71, what the hell - it's good fodder for the sales brochures if nothing else. ---------- Dave Christie My opinions only.
daveg@near.cs.caltech.edu (Dave Gillespie) (09/08/90)
>>>>> On 8 Sep 90 01:46:08 GMT, davec@neutron.amd.com (Dave Christie) said: > The eighth bit gives you double error detection... > ... if it's only one more bit on top of 71, what the hell... I wonder, I can see single-bit errors occurring in isolation, but how likely is it to have an exactly two-bit error? Most catastrophes I can think of will nuke one bit or many. And if the only danger is two statistically independent errors occuring at once in the same word, I think a more pressing danger is that your machine might be the Ravenous Bugblatter Beast in a clever disguise. -- Dave -- Dave Gillespie 256-80 Caltech Pasadena CA USA 91125 daveg@csvax.cs.caltech.edu, ...!cit-vax!daveg
ching@brahms.amd.com (Mike Ching) (09/09/90)
In article <DAVEG.90Sep7233206@near.cs.caltech.edu> daveg@near.cs.caltech.edu (Dave Gillespie) writes: >>>>>> On 8 Sep 90 01:46:08 GMT, davec@neutron.amd.com (Dave Christie) said: > >> The eighth bit gives you double error detection... >> ... if it's only one more bit on top of 71, what the hell... > >I wonder, I can see single-bit errors occurring in isolation, but >how likely is it to have an exactly two-bit error? Most catastrophes >I can think of will nuke one bit or many. The problem is that the two errors don't have to occur simultaneously. If a soft error is not corrected (by accessing the word and writing a corrected word back), a second bit can be corrupted at a later time and result in a double bit error when the word is accessed. This is why scrubbing was incorporated in DRAM controllers. Scrubbing is a term coined for doing an RMW cycle with correction during a refresh cycle. All words in memory get accessed (and corrected if necessary) every few minutes instead of only when accessed by a program. An added benefit is that errors are corrected in the background instead of imposing a correction cycle on an access while the processor is waiting for the data. Mike Ching
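Mike's point is that scrubbing turns "two errors ever" back into "two errors within one scrub interval". A toy sketch (mine; real controllers use SECDED like the Am29C660 discussed earlier, not the triple-copy code used here for brevity) shows the shape of a scrub pass:

```python
def encode3(word):
    # Toy repetition code: store three copies (stand-in for real SECDED bits).
    return [word, word, word]

def decode3(copies):
    # Bitwise majority vote across the three copies corrects any single hit.
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)

def scrub(memory):
    # Periodic scrub pass: read, correct, and RE-WRITE each word, so a
    # later second hit can't combine into an uncorrectable pattern.
    for i, copies in enumerate(memory):
        memory[i] = encode3(decode3(copies))

memory = [encode3(w) for w in (0x5A, 0xC3, 0xFF)]
memory[1][0] ^= 0x10      # soft error: one bit flips in one copy of word 1
scrub(memory)             # error corrected in the background
print(memory[1])
```

Without the write-back in `scrub`, the flipped copy would sit in memory waiting for a second, fatal hit, which is exactly the accumulation scrubbing is meant to prevent.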
friedl@mtndew.Tustin.CA.US (Steve Friedl) (09/09/90)
[ discussions on ECC ] In article <1990Sep8.172848.4600@amd.com>, ching@brahms.amd.com (Mike Ching) writes: > The problem is that the two errors don't have to occur simultaneously. If > a soft error is not corrected (by accessing the word and writing a corrected > word back), a second bit can be corrupted at a later time and result in a > double bit error when the word is accessed. This is why scrubbing was > incorporated in DRAM controllers. On some machines, the scrubbing is done in software. The newer 3B2s all have a job running out of cron at the top of the hour that does: dd if=/dev/mem of=/dev/null This seems to serve the same purpose of provoking the single bit errors in the background so they get fixed right away. Steve -- Stephen J. Friedl, KA8CMY / I speak for me only / Tustin, CA / 3B2-kind-of-guy +1 714 544 6561 / friedl@mtndew.Tustin.CA.US / {uunet,attmail}!mtndew!friedl Steve's bright idea #44: COBOL interface library for X Windows
ddb@ns.network.com (David Dyer-Bennet) (09/25/90)
In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
:But, but, but... virtually all MSDOS software *explicitly ignores*
:parity errors.
The original article is dated 10-Aug, but no later ones in the thread
have reached this site.
This is a new story to me; what I KNOW is that I have had bad memory
chips in my ibm pc detected and reported by the parity logic. It gave
me enough information to identify the chip, and replacing that chip
cured the problem.
The lack of any way to test the parity system is extremely
unfortunate.
(Parity handling wouldn't come to the attention of the individual
program unless it went to special effort, it would be handled in
MS-DOS itself.)
--
David Dyer-Bennet, ddb@terrabit.fidonet.org
or ddb@network.com
or ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
or Fidonet 1:282/341.0, (612) 721-8967 9600hst/2400/1200/300