jensen@gt-eedsp.UUCP (P. Allen Jensen) (10/26/88)
Ok, straight from a NeXT sales rep in response to the question:

Q: Does the memory have a parity check bit?
A: "No"

The reason was that "memory is reliable enough that the added cost was not
justified."  If you have ever worked on some older equipment without parity,
your opinion may differ.  Could an expert on RAM chips respond?  Is memory
really "reliable enough"?

I was surprised to learn that the cold-start diagnostics do not check memory
for errors - they just look to see if there is any memory there.  If I had a
NeXT, I think I would have a crontab entry to check memory every day/night!

P. Allen Jensen
--
P. Allen Jensen
Georgia Tech, School of Electrical Engineering, Atlanta, GA 30332-0250
USENET: ...!{allegra,hplabs,ulysses}!gatech!gt-eedsp!jensen
INTERNET: jensen@gt-eedsp.gatech.edu
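A rough, untested sketch of the kind of user-level pattern test such a
crontab entry might invoke.  The 4 MB buffer size and the patterns are
arbitrary choices, and since the test only touches pages the kernel happens
to hand the process, it is no substitute for a real standalone diagnostic -
but it would catch a stuck data bit:

    /* memcheck.c - sketch of a user-level memory pattern test for a
     * nightly cron job.  Exit status 0 = clean, 1 = mismatch found.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define MEGS  4
    #define WORDS (MEGS * 1024L * 1024L / sizeof(unsigned long))

    int main(void)
    {
        static const unsigned long pat[] =
            { 0x00000000UL, 0xFFFFFFFFUL, 0x55555555UL, 0xAAAAAAAAUL };
        unsigned long *buf = malloc(WORDS * sizeof(unsigned long));
        unsigned long i;
        int p, errs = 0;

        if (buf == NULL) {
            fprintf(stderr, "memcheck: cannot allocate %d MB\n", MEGS);
            return 2;
        }
        for (p = 0; p < 4; p++) {
            for (i = 0; i < WORDS; i++)        /* write the pattern ...  */
                buf[i] = pat[p] ^ i;
            for (i = 0; i < WORDS; i++)        /* ... then read it back  */
                if (buf[i] != (pat[p] ^ i)) {
                    fprintf(stderr, "memcheck: mismatch at word %lu\n", i);
                    errs = 1;
                }
        }
        free(buf);
        return errs;
    }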
tim@hoptoad.uucp (Tim Maroney) (10/26/88)
In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>I was surprised to learn that the cold-start diagnostics do not
>check memory for errors - they just look to see if there is any
>memory there.

How fast can *you* check 8 Megabytes of RAM???
--
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"Because there is something in you that I respect, and that makes me desire
 to have you for my enemy."
"That's well said.  On those terms, sir, I will accept your enmity or any
 man's." - Shaw, "The Devil's Disciple"
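For a rough sense of scale (the per-access figure is only a guess at what a
simple 68030 test loop hitting DRAM might manage):

    8 Mbytes = 2M longwords; one write pass plus one verify pass = 4M accesses
    at ~1 microsecond per access  =>  ~4 seconds per pattern
    a handful of patterns         =>  well under a minute at cold start

A genuinely thorough test - one that chases pattern-sensitive and coupling
faults - takes far longer, which is presumably the real argument for skipping
it at boot.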
roy@phri.UUCP (Roy Smith) (10/26/88)
jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
> The reason was that "memory is reliable enough that the added cost was not
> justified." [...] Could an expert on RAM chips respond?  Is memory
> really "reliable enough"?

I'm hardly an expert on ram, but here goes anyway.  We've got 19 Sun-3's of
various flavors around here with a total of 84 Mbytes of ram.  We get a
parity error panic on one machine or another a couple of times a year.  Make
that, oh maybe, 1 error per 400 Mbyte-months.  In perhaps 2000 Mbyte-months
of operation, we've had one hard memory error.  That's my data.  Draw your
own conclusions.
--
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers}!phri!roy -or- phri!roy@uunet.uu.net
"The connector is the network"
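For anyone checking the arithmetic behind that figure:

    84 Mbytes x 12 months   ~=  1000 Mbyte-months per year
    at 2-3 panics per year   =>  roughly 1 error per 400 Mbyte-months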
debra@alice.UUCP (Paul De Bra) (10/26/88)
In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>Ok, straight from a NeXT sales rep in response to the question:
>Q: Does the memory have a parity check bit?
>A: "No"
>
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond?  Is memory really "reliable enough"?

NO.  (memory is NOT reliable enough)

I have seen memory go bad on ATs, Microvaxen, big Vaxen, ...  My impression
is that memory chips are not being tested well enough to be able to put them
in a machine and expect them to still work (within specifications) in a year
or so.

99.9% of the NeXT boxes may never have a problem, but I don't want to be
among the 0.1% that spends weeks on the phone discussing unidentified
problems, which cannot be reproduced, and tracking them down to a bad memory
chip, when the few extra $ for parity could have pointed out the problem
right away.

It need not even be 9 chips for each row of 8; 33 chips instead of 32 would
be adequate (though harder to get).

Paul.
--
-------------------------------------------------------------------------
|debra@research.att.com   | uunet!research!debra    | att!grumpy!debra   |
-------------------------------------------------------------------------
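A 33rd bit per 32-bit word buys exactly one thing: detection (not location
or correction) of any odd number of flipped bits in the word.  A minimal
sketch of the computation that extra chip's bit would reflect - the
fold-by-halves XOR is the standard trick:

    /* Even parity over a 32-bit word: the one extra bit a "33rd chip"
     * would store.  On readback, a recomputed parity that disagrees
     * with the stored bit flags the word as corrupt.
     */
    unsigned parity32(unsigned long w)
    {
        w &= 0xFFFFFFFFUL;      /* consider only the low 32 bits */
        w ^= w >> 16;
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return (unsigned)(w & 1UL);
    }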
pardo@june.cs.washington.edu (David Keppel) (10/27/88)
roy@phri.UUCP (Roy Smith) writes:
>[ 84Mb and only 2 parity error panics each year ]

The Eagle PCs of a while back had no parity bits -- it was their claim that
the parity bits had errors often enough that "fake" parity errors *could* be
a problem and that *real* parity errors were few enough *not* to be a
problem.

I still like the idea of getting warned when something goes wrong...

	;-D on ( My memory is perfac... purrfe... forgot how to spill )  Pardo
--
pardo@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
wtm@neoucom.UUCP (Bill Mayhew) (10/27/88)
I'm looking at the photo of the motherboard for the NeXT computer on page
164 of the Nov. 1988 issue of Byte.  The SIMMs would appear to be 8 * 1
megabit chips.  The accompanying text says that it is 100 ns memory.  There
is also a set of four 8K * 8 bit (possibly HM6164?) cache chips.  The
accompanying text says that they are rated at 45 ns.  There are also four
chips identified as "custom memory buffers" adjacent to the SIMM array.
Last of all, there is 256K of video RAM.

For the 32K of static RAM, 24K is given to the DSP chip, and 8K goes to the
disk controller.  Apparently the LSI DMA chip must have some internal
smarts.  They do mention that there is a DMA burst mode that allows 4 long
words to be fetched in 9 cycles.  The SCSI controller is a 5390.  The chip
is marked NCR in the photo, but I can't find it in my SMS/OMTI catalog.  I'm
willing to believe the quoted 4 MB/s transfer rate.

I don't know about the lack of parity or error correction hardware.
Relative to some of the other goodies included, it doesn't seem like it
would have been that hard to include .. even if it was just parity.  I'd
like to know when my box is making a mistake.  When I looked inside the Mac
II, there didn't seem to be any sort of parity or error correction there
either.  I hope that the kernel does its own sanity checking in lieu of
hardware.

--Bill
jewett@hpl-opus.HP.COM (Bob Jewett) (10/28/88)
>> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>> Could an expert on RAM chips respond?  Is memory really "reliable enough"?
>>
To which Roy Smith replies:
> I'm hardly an expert on ram, but here goes anyway. ...
> Make that, oh maybe, 1 error per 400 Mbyte-months.

I too am not a RAM expert, but here is another data point.  We have had
about 5000 megabyte-months on 16 HP9000/350 workstations (68020, 16 Meg RAM
each).  We have seen roughly the same rate of parity errors.

Whether that error rate is a problem depends a lot on the application you're
running.  If there's a one-bit error when writing out your term report, it's
probably OK.  If it's the final version of an IC design, it may cost big
bucks.

Our file server 350 is equipped with ECC RAM (39 one-Meg chips for each 4
megabytes of RAM).  There is a nightly daemon that "scrubs" the RAM -- finds
and fixes all one-bit soft errors.  The log shows two errors fixed in the
last five months.  That kind of RAM is slightly slower, but a parity error
panic on the file server is painful enough that the extra safety was
considered worthwhile.

A subtle point in the statistics is that many (maybe most) soft errors in
RAM are never noticed.  Often RAM is written but not read.

Bob Jewett    jewett@hplabs

This is not an official statement of the Hewlett-Packard Company.
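The scrubbing idea is simple enough to sketch.  This is not HP's code, just
the general shape of such a daemon's inner loop, assuming SECDED hardware
that silently corrects single-bit errors as words are read:

    /* Scrub one region of ECC memory: reading each word lets the ECC
     * hardware deliver the corrected value; writing it straight back
     * rewrites the decayed bit, so a later second soft error in the
     * same word cannot add up to an uncorrectable double-bit error.
     */
    void scrub(volatile unsigned long *base, unsigned long nwords)
    {
        unsigned long i;

        for (i = 0; i < nwords; i++)
            base[i] = base[i];      /* read (corrected), write back */
    }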
david@ms.uky.edu (David Herron -- One of the vertebrae) (10/28/88)
Lack of a parity bit is a definite minus... even with "modern reliable memory". On my 3b1, somewhere in the 3.5 megs of memory in the machine, there is one or more bad chips. When the machine boots it finds those bad chips during the memory check and maps them out. I would prefer if it were to tell me where the bad chip is so that I could replace it, but I like the fact that it's being mapped out. And as near as I can figure it's only costing me 35K of memory ... I *like* the parity in my unix pc.. -- <-- David Herron; an MMDF guy <david@ms.uky.edu> <-- ska: David le casse\*' {rutgers,uunet}!ukma!david, david@UKMA.BITNET <-- <-- Controlled anarchy -- the essence of the net.
jtn@potomac.ads.com (John T. Nelson) (10/28/88)
> In article <3569@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> I'm hardly an expert on ram, but here goes anyway.  We've got 19
> Sun-3's of various flavors around here with a total of 84 Mbytes of ram.
> We get a parity error panic on one machine or another a couple of times a
> year.  Make that, oh maybe, 1 error per 400 Mbyte-months.  In perhaps 2000
> Mbyte-months of operation, we've had one hard memory error.

It only takes once to crash a machine... and it will probably occur at the
least convenient time.  This might not sound so bad in a University
environment, but it could be disastrous elsewhere.  NeXT really should have
provided the extra bit per byte of parity checking (and they probably will,
once we've all bought these initial machines... grumble).
--
John T. Nelson                   UUCP: sun!sundc!potomac!jtn
Advanced Decision Systems        Internet: jtn@potomac.ads.com
1500 Wilson Blvd #512; Arlington, VA 22209-2401 (703) 243-1611

Shar and Enjoy!
geoff@desint.UUCP (Geoff Kuenning) (10/28/88)
CDC made that mistake, too, on the old 6000 series machines. The way I heard it, somebody "discovered" that most of the parity errors on the 3000 series were in the parity bits themselves. So dropping the parity bits would not only save money, but would cut pointless downtime. Needless to say, the result was hard-to-trace problems. I think (though I'm not sure) that they installed parity again on the 7600. I'm pretty sure that the Cray machines have ECC memory. One thing about Seymour -- he learns from his mistakes. -- Geoff Kuenning geoff@ITcorp.com uunet!desint!geoff
ejf@well.UUCP (Erik James Freed) (10/28/88)
In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>Ok, straight from a NeXT sales rep in response to the question:
>Q: Does the memory have a parity check bit?
>A: "No"
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond?  Is memory really "reliable enough"?
>I was surprised to learn that the cold-start diagnostics do not
>check memory for errors - they just look to see if there is any
>memory there.  If I had a NeXT, I think I would have a crontab
>entry to check memory every day/night!

I think it was Seymour Cray who was quoted as saying "Parity is for
farmers."

I would tend to support NeXT's decision.  Parity is supposed to allow you to
pinpoint where errors reside, but the software is rarely written so that the
information is easily available.  In general, if your system memory is
flaky, you will soon realize that something is up, and then you can run
memory tests to isolate the particular SIMM module.  (I assume that a good
memory-checking diagnostic will be available at a standalone level for the
NeXT.)  A usable (thorough) memory test takes a lot of time.  It is not
something that you want to run every boot-up.  And parity memory, in my
experience, just is not really that useful (at least not enough to justify
the PC real estate).

I now submit myself to the flames.

Erik
whh@pbhya.PacBell.COM (Wilson Heydt) (10/29/88)
In article <1807@desint.UUCP>, geoff@desint.UUCP (Geoff Kuenning) writes:
> Needless to say, the result was hard-to-trace problems.  I think
> (though I'm not sure) that they installed parity again on the 7600.  I'm
> pretty sure that the Cray machines have ECC memory.  One thing about
> Seymour -- he learns from his mistakes.

The story I heard was that Cray was told that USGov purchasing *required*
parity on computers, and that--therefore--he *had* to have it--or his
primary customers wouldn't be allowed to buy the machines.  Cray is reputed
to have grumbled and complained about how parity slowed things down and
added six inches to the height of the box to add the parity bits & circuits.

  --Hal

=========================================================================
  Hal Heydt                      |   "Hafnium plus Holmium is
  Analyst, Pacific*Bell          |    one-point-five, I think."
  415-645-7708                   |       --Dr. Jane Robinson
  {att,bellcore,sun,ames,pyramid}!pacbell!pbhya!whh
jbn@glacier.STANFORD.EDU (John B. Nagle) (10/29/88)
It's not that "the machine might crash". It's that one might get bad data and not know it. Particularly in applications with long-lived databases updated over time, any source of undetected error is intolerable. Corrupted program objects might be generated. Some early MS-DOS machines, such as the Texas Instruments TI PRO, lacked memory parity. I at one time had one of these machines. The usual symptom of memory trouble was not a system crash, but junk in newly compiled and linked executables. It's really bad when you have to compile the same program twice and compare the executables to insure that the compile and link were successful. That TI PRO became a doorstop in late 1984. The Mac II doesn't have memory parity either. A bad move by Apple. I consider a machine without memory parity unacceptable for serious work. But then, NeXT is targeting the educational environment. John Nagle
hal@gvax.cs.cornell.edu (Hal Perkins) (10/29/88)
In article <1807@desint.UUCP> geoff@desint.UUCP (Geoff Kuenning) writes:
>CDC made that mistake, too, on the old 6000 series machines.  The way
>I heard it, somebody "discovered" that most of the parity errors on the
>3000 series were in the parity bits themselves.  So dropping the parity
>bits would not only save money, but would cut pointless downtime.

The way I heard it, the parity bit was omitted on the 6000 series to save
time.  The clock would have had to be slower to generate and check parity.
Apparently they assumed that if a memory module went bad, it would be
obvious that there was a problem and the operator or field engineer could
run diagnostics.

It didn't work like that, though.  I was operating a 6400 a couple of times
when a memory module failed.  The machine would start acting weird, like it
was having a nervous breakdown.  Jobs would abort for no apparent reason and
then work just fine when they were rerun, other jobs would appear to run
correctly but when rerun would produce different answers, parts of the
operating system would abort or deadlock, etc.  We learned that these
symptoms probably meant a hardware problem, but then we'd have to tell the
engineers to rerun their last couple of days' work to be safe, since there
could have been errors in their numbers before things got bad enough to be
noticeable.

Later CDC machines as well as Crays have error-correcting memory, which is
essential in huge memories if you want to have acceptable MTBF.  Personally,
it's fine with me if a workstation-class machine doesn't have ECC, but I
would like to have parity so I know when something is wrong.  I wouldn't
want to be riding on an airplane designed on machines without any form of
error detection.

Hal Perkins
hal@cs.cornell.edu
Cornell CS
henry@utzoo.uucp (Henry Spencer) (10/29/88)
In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond?  Is memory really "reliable enough"?

I'm not really an expert on RAM chips, but I do know that the reliability of
modern RAMs is *spectacularly* better than the ones that were routinely in
use 5-10 years ago.  Parity and error correction were fully justified on the
4Kb and 16Kb chips; the 64Kbs were vastly better, the 256Kbs better yet, and
I imagine the 1Mbs are probably a further improvement.  We're talking
orders-of-magnitude improvement here.  My feeling is that parity is nice but
no longer a necessity.
--
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu
james@bigtex.cactus.org (James Van Artsdalen) (10/30/88)
In <549@gt-eedsp.UUCP>, jensen@gt-eedsp.UUCP (P. Allen Jensen) wrote:
> The reason was that "memory is reliable enough that the added cost
> was not justified."  If you have ever worked on some older equipment
> without parity, your opinion may differ.  Could an expert on RAM
> chips respond?  Is memory really "reliable enough"?

I personally haven't found parity checking to be worthwhile.  I have had
three memory system errors on machines that had parity checking, and only
one of those errors was a chip.  None of those systems reported parity
errors until well after I had discovered or deduced the problem myself, and
the Apple Lisa never reported an error.

A large number of machines in the PC market effectively don't have parity
checking.  Many clones use Phoenix's BIOS, which has this habit of disabling
NMI and hence parity error reporting.  Microsoft's symdeb debugger also
leaves NMI disabled.  Many video cards do bizarre things to NMI too.

For those not aware: the Intel 80x88 family has a design flaw that requires
external hardware to disable NMI.  Without such hardware it is not possible
to prevent the system from randomly crashing when NMIs are used.
--
James R. Van Artsdalen   james@bigtex.cactus.org   "Live Free or Die"
Home: 512-346-2444  Work: 338-8789   9505 Arboretum Blvd  Austin TX 78759
mvs@meccsd.MECC.MN.ORG (Michael V. Stein) (10/30/88)
In article <1807@desint.UUCP> geoff@desint.UUCP (Geoff Kuenning) writes:
>CDC made that mistake, too, on the old 6000 series machines.  The way
>I heard it, somebody "discovered" that most of the parity errors on the
>3000 series were in the parity bits themselves.  So dropping the parity
>bits would not only save money, but would cut pointless downtime.

I'm almost positive that old CDC machines had no form of parity bits.  I am
positive that our old Cyber 73 had no parity.

>I think
>(though I'm not sure) that they installed parity again on the 7600.

All of the later CDC machines had full SECDED (Single Error Correction,
Double Error Detection) support.  This meant that each 60-bit word had an
extra 11 bits of SECDED data associated with it.
--
Michael V. Stein - Minnesota Educational Computing Corp. - Technical Services
{bungia,uiucdcs,umn-cs}!meccts!mvs  or  mvs@mecc.MN.ORG
jim@belltec.UUCP (Mr. Jim's Own Logon) (10/31/88)
In article <1988Oct28.210152.29417@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
> >The reason was that "memory is reliable enough that the added cost
> >was not justified."  If you have ever worked on some older equipment
> >without parity, your opinion may differ.  Could an expert on RAM
> >chips respond?  Is memory really "reliable enough"?
>
> I'm not really an expert on RAM chips, but I do know that the reliability
> of modern RAMs is *spectacularly* better than the ones that were routinely
> in use 5-10 years ago.

According to the last court case I was a witness at, I'm an expert in memory
system design.  Credentials upon request.

While it is true that RAMs have become far less likely to be corrupted by
spurious alpha-particle hits, this has actually left them more susceptible
to other failures (which were more prevalent anyway).  The size of the
memory cell and the rate of leakage were what made earlier RAMs prone to bit
errors from random alpha particles.  The newer (larger) RAMs have smaller
cells and less leakage, making them more reliable.

But the main cause of memory corruption has always been noise: electrical
noise from external sources (generators, power surges, etc.) and from
internal sources (other chips, peripherals, power supply).  The less the
memory cell charge, the more susceptible it is to noise.  It is also very
true that some vendors' products are much more resilient to signal noise
than others (no, I won't name names).

In this era of RAM shortage you can bet that a company will be scatter-
buying its RAM to get as much as possible, as cheaply as possible.  It
follows then that some of the NeXT machines will be better than others; some
will never fail, some will fall out in burn-in, and those in the middle....

What should they have done?  Make it a build option.  If you want the extra
safety, you shell out the extra bucks.  If you like going to Las Vegas,
playing the lottery, and Russian roulette, you can have the base unit.

-Jim Wall
 Bell Technologies, Inc.

The above opinions are mine.  However, in this case the company would
probably go along with them.
henry@utzoo.uucp (Henry Spencer) (11/01/88)
In article <8348@alice.UUCP> debra@alice.UUCP () writes:
>NO. (memory is NOT reliable enough)
>
>I have seen memory go bad on ATs, Microvaxen, big Vaxen, ...

You're at Bell Labs CS research, right?  Do you use a Blit/5620/etc.?  If
so, unless they've changed the hardware, you're using a machine with no
parity on its memory every day.  Is it a problem?

To expand on some of my earlier comments:  There is no such thing as
perfectly reliable memory.  It's all a matter of how much you want to pay
for lower error rate.  If your memory chips are good enough, parity may be
past the point of diminishing returns.  I personally prefer it, but I don't
insist on it.

Those who are smug about their PCs having parity might want to consider
three small complications:

	1. There is no way for PC software to test the parity machinery,
	so it can go bad without notice.

	2. At least one widely-used BIOS implementation has bugs in its
	handling of parity errors.

	3. A lot of PC software essentially disables parity-error reporting.

(This is not first-hand information, but it's from a source I consider
quite reliable.)
--
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu
cramer@optilink.UUCP (Clayton Cramer) (11/02/88)
In article <7493@well.UUCP>, ejf@well.UUCP (Erik James Freed) writes:
> I think it was Seymour Cray who was quoted as saying
> "Parity is for farmers."
> I would tend to support NeXT's decision.  Parity is supposed to allow
> you to pinpoint where errors reside, but the software is rarely written
> so that the information is easily available.  In general, if your system
> memory is flaky, you will soon realize that something is up, and then you
> can run memory tests to isolate the particular SIMM module.  (I assume
> that a good memory-checking diagnostic will be available at a standalone
> level for the NeXT.)  A usable (thorough) memory test takes a lot of time.
> It is not something that you want to run every boot-up.  And parity
> memory, in my experience, just is not really that useful (at least not
> enough to justify the PC real estate).  I now submit myself to the flames.
> Erik

ARGGGH!  I have had one terribly unpleasant experience with a lack of
parity, and it makes me firm in my belief that memory needs parity.

I was writing a gas station accounting system in BASIC on a Radio Shack
Model 3 (not my choice of hardware or language, obviously), and I had to use
someone else's system for an emergency bug fix.  I loaded in my program,
edited in my changes, then saved the program back to disk.  Then I tried to
run it.  And it didn't work.  Lots of variables were undefined, and I
couldn't figure out why.

After a bit of study, it turned out that bit 13 had gone bad in 16K of RAM.
As a consequence, Q turned into A, R turned into B, S turned into C -- and
the BASIC interpreter still accepted the lines, but all the variables were
thoroughly garbled.  (The keywords survived, perhaps using only small
numbers in each word.)  Fortunately, a friend drove up from Los Angeles with
our master disks, or I would have been in deep trouble.

Parity would have caught this error, and prevented a huge loss of time and
effort.  I can't take seriously a machine without memory parity.  The PC
even has parity, and there are times it has caught memory errors.  Waiting
for those errors to be obvious before searching for them is a recipe for
frustrated users and corrupted data.
--
Clayton E. Cramer
..!ames!pyramid!kontron!optilin!cramer
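The symptom is exactly what a single stuck data bit does to ASCII text.
Within each byte (however the bad bit was numbered in that machine's memory
layout), the three letter pairs differ only in the 0x10 bit:

    'Q' = 0x51     'A' = 0x41
    'R' = 0x52     'B' = 0x42
    'S' = 0x53     'C' = 0x43

Clear that one bit and every Q, R, S in the variable names silently becomes
A, B, C, while the program text still parses.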
crum@lipari.usc.edu (Gary L. Crum) (11/02/88)
hmm... Hacking the MACH memory manager to check pages using a bit of its spare time might be fun; don't know about the kernel space though...
mitchell@cadovax.UUCP (Mitchell Lerner) (11/03/88)
In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>Ok, straight from a NeXT sales rep in response to the question:
>Q: Does the memory have a parity check bit?
>A: "No"
>
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond?  Is memory really "reliable enough"?

Well, where I work, we used to make machines that had parity checking, then
we made machines that didn't, and now we make them that do again.  When I
asked about this, I remember the answers that I got were something like
this:

Older memories would/could fail one cell at a time, and in this case it is
crucial to have parity checking logic in the hardware.  So this is why we
had parity checking in our older machines.  RAMs don't fail one cell at a
time.  If a failure occurs in a RAM, the entire bank will fail and other
hardware detects that, and so to have parity checking in a RAM-type machine
is a waste of time, money and space.

The customers, though, don't know that parity is antiquated with newer
memories, but they still ask for it.  And this is why we put it in our
newest machines; not because it could do anything useful, but as a checklist
item.  I believe that this is why IBM put parity in their PCs.  And since
IBM put it in their PCs, everybody thinks that it is necessary for all
machines.  In their PC, it certainly doesn't do anything!

This is what I heard, not necessarily is it what was said.  I may be wrong.
And it is certainly not a technical statement of Contel Business Systems.
--
Mitchell Lerner
UUCP: {ucbvax, decvax}!trwrb!cadovax!mitchell

Don't speak to him of your heartache, for he is speaking.  He feels the
touch of an ant's foot.  If a stone moves under the water he knows it.
mitchell@cadovax.UUCP (Mitchell Lerner) (11/03/88)
I apologize for the half-truths (which are often more harmful than lies) in
my last posting.  What I now understand is this:

In the old days, memory wasn't that reliable, and parity checking was
implemented to bring the system down quickly so that damage from corrupt
data was minimal.

Parity used to be implemented on buses between processors and memory, but
the logic and the technology got so refined that hardware people eventually
found out that they never had an error across these channels, so they
removed parity in that area of the system.

Today's memory is VERY reliable and (he said) it virtually never fails one
cell at a time; usually the entire bank or group of banks fails.  I suppose
that memory errors like this make failures much more obvious these days, and
the system will come down pretty quickly in the case of a memory failure.

The logic used for parity checking can introduce more errors into the system
if it should fail.  Implementing parity on a system slows the system down.
With 100ns memories and 200ns to compute parity, one cannot run a system as
fast as without parity.

When I told him that the NeXT computer was implemented without parity, he
said: "Well, I guess that guy is smarter than I give him credit for."  :-)

We build multi-user business systems that are used for on-line accounting,
order entry, billing and such.  People's businesses depend on our systems,
and from what I understand, our systems are very reliable (software and
hardware).  I talked to some of our field support people and they said that
memories just don't fail that often these days.  "We just don't see data
disasters caused by memory failing these days."

Just one man's thoughts, not the opinion of Contel Business Systems.
--
Mitchell Lerner
UUCP: {ucbvax, decvax}!trwrb!cadovax!mitchell

Don't speak to him of your heartache, for he is speaking.  He feels the
touch of an ant's foot.  If a stone moves under the water he knows it.
joel@peora.ccur.com (Joel Upchurch) (11/05/88)
Speaking of error checking:  I wonder how many of the manufacturers of
computers using memory caching use parity checking on the cache memory as
well as the main memory?  A lot of 386 machines have larger cache memories
than the original PC had as main memory.  And not only the cached data, but
the translation addresses as well.  And if the processor has loadable
microcode, how about the microcode control store?

Personally, if my life depended on it, I'd prefer having redundant computers
to having a lot of error checking in a single computer.  And if the
consequences are really disastrous, I'd have two or more different kinds of
computers running different programs written by different teams.
--
Joel Upchurch/Concurrent Computer Corp/2486 Sand Lake Rd/Orlando, FL 32809
joel@peora.ccur.com {uiucuxc,hoptoad,petsd,ucf-cs}!peora!joel (407)850-1040
ralphw@ius3.ius.cs.cmu.edu (Ralph Hyre) (11/07/88)
In article <8348@alice.UUCP> debra@alice.UUCP () writes:
>In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>>Ok, straight from a NeXT sales rep in response to the question:
>>Q: Does the memory have a parity check bit?
>>A: "No"
>>... "memory is reliable enough that the added cost was not justified."
>NO. (memory is NOT reliable enough)
>

And this is such a religious issue that I believe it should be left up to
the end user/systems integrator.  Add an 'extra' SIMM socket or two for a
bank of chips to be used for parity/ECC, and make sure it is jumper- or even
software-selectable, depending on the user's taste and wealth.

For some applications (like digitized speech), I might rather have 10M of
99.9999% reliable memory than 9M+parity (all Unix can do is panic) or
8M+ECC.

Anyone want to design a memory controller/MMU for this?
--
- Ralph W. Hyre, Jr.   Internet: ralphw@ius3.cs.cmu.edu   Phone: (412) CMU-BUGS
  Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA
"You can do what you want with my computer, but leave me alone!8-)"
alien@cpoint.UUCP (Alien Wells) (12/15/88)
Disclaimer:  I work for a company whose main business is producing
aftermarket memories.  As such, I am exposed to the memory business - but I
cannot claim to be a memory expert.

Memory reliability is extremely important in a computer.  With decreasing
cell sizes, it is becoming easier to have spurious bit errors, and the
larger memory sizes lead to increased probabilities of failures.  Even
before joining Clearpoint, I considered the lack of parity to be a major
problem with the Macintosh.  I am extremely surprised to see it repeated by
NeXT.

Some figures about memory reliability.  Prof. McEliece (Caltech), in a paper
called "The Reliability of Computer Memories" (Jan 1985 - Scientific
American), estimated the soft failure rate of a single memory cell at 1
every 1,000,000 years.  In a 1MB board with parity - this is a MBTF of 43
days.  TI estimates MBTF more optimistically (no surprise).  For their 64K
DRAMs they estimate an MBTF of 33.4 days for an 8MB system.  AMD estimated a
16MB system would have an MBTF of 13 days.

These error rates and MBTFs are for 64K DRAMs.  Since 1MB DRAMs are
considered to have twice as many errors per device, but 16 times the bits,
multiply the above times by a factor of 8 to get MBTF estimates for 1MB
chips.  Thus, the optimistic TI estimate would lead to an extrapolation of
an 8-month MBTF for soft errors for an 8MB system using 1MB memory chips.
Prof. McEliece's figures would extrapolate to 43 days for an 8MB system.  TI
estimates hard errors to be roughly 1/5 to 1/3 as likely as soft errors.

Any 'reasonable' memory or computer manufacturer will use a 72-hour burn-in
to assure that infant mortality problems are found before shipment, but I
think that the above figures are a compelling argument for a system-level
approach to handling errors in the field.

The simplest thing to do is parity checking.  However, more and more vendors
are using VLSI to incorporate Error Detection and Correction (EDC) circuitry
on their memory boards.  Standard EDC will detect 2 or more errors and
correct 1 in the word size it deals with.  The number of check bits required
is two more than the log (base 2) of the word size.  Thus, the following
chart shows the memory overhead required:

    Word Size    EDC Check Bits    8-bit Parity Bits
    ---------    --------------    -----------------
        8              5                   1
       16              6                   2
       32              7                   4
       64              8                   8

As you can see, by the time you get to 64-bit memory - there really isn't a
reasonable excuse not to use EDC.  (Of course, you could start using 16-bit
parity ... but the protection is significantly diluted.)  Even 32-bit
memories are seeing EDC used more and more often.

In conclusion - I think that NeXT is bucking the trend in moving to no
protection at all instead of moving to EDC protection for their memory.  If
the NeXT machine takes off, I expect that there will be a demand for 0MB
NeXT boxes which get populated with a 3rd-party memory board - just for the
reliability concerns.  (-: Unless the claim is that the University
Environment doesn't care about reliable operation any more than they care
about packaged software. :-)

For anyone who is interested in designing, evaluating, or purchasing
computer memories, Clearpoint publishes a 70+ page "bible" entitled "The
Designer's Guide to Add-In Memory".  This is chock full of good information,
and very light on the propaganda.  It is available at no charge by calling:

	1-800-CLEARPT

Apologies:  I thought I had sent this quite a while back, and recently found
that I had not.  I apologize if this seems dated.
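For the curious, the check-bit counts in that chart follow from the Hamming
condition 2^r >= m + r + 1 (for m data bits), plus one extra bit for the
double-error-detect part.  A small sketch that reproduces the column:

    #include <stdio.h>

    /* Check bits for single-error-correct, double-error-detect (SECDED):
     * the smallest r with 2^r >= m + r + 1 gives the Hamming bits for
     * m data bits; one more bit adds double-error detection.
     */
    int secded_bits(int m)
    {
        int r = 1;

        while ((1L << r) < m + r + 1)
            r++;
        return r + 1;
    }

    int main(void)
    {
        int w;

        for (w = 8; w <= 64; w *= 2)
            printf("%2d data bits: %d check bits (vs. %d byte-parity bits)\n",
                   w, secded_bits(w), w / 8);
        return 0;
    }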
johns@calvin.EE.CORNELL.EDU (John Sahr) (12/17/88)
In an article, Alien Wells gives interesting information about MTBF for soft
and hard errors of memory.  Also, the following information about the
performance of single error correct/double error detect is given:

In article <1429@cpoint.UUCP> alien@cpoint.UUCP (Alien Wells) writes:
>Standard EDC will detect 2 or more errors and correct 1 in the word size it
>deals with.  The number of check bits required is two more than the log
>(base 2) of the word size.  Thus, the following chart shows the memory
>overhead required:
>
>    Word Size    EDC Check Bits    8-bit Parity Bits
>    ---------    --------------    -----------------
>        8              5                   1
>       16              6                   2
>       32              7                   4
>       64              8                   8
>
>As you can see, by the time you get to 64-bit memory - there really isn't a
>reasonable excuse not to use EDC.  (Of course, you could start using 16-bit
>parity ... but the protection is significantly diluted.)  Even 32-bit
>memories are seeing EDC used more and more often.

The parity check versus EDC comparison is not quite fair, because they are
really doing two different things.  For a 64-bit word, although EDC can
detect 2 errors and correct 1, the per-byte parity check can detect up to 8
errors, one per byte (while correcting none).  So the tradeoff is not quite
so clear.

Although single error correction and 2-error detection is straightforward,
parity checking must be faster: in fact, the 2-error detect is just a global
parity check built on top of the Hamming code (single error correct).

As far as the absence of any checking on the Mac or NeXT, I think it is
defensible; single error detect per word would be nice, however.

(ps. error detection is a little hobby of mine; I've taken a few classes,
that's all)
--
John Sahr,  School of Elect. Eng.,  Upson Hall
Cornell University, Ithaca, NY 14853
ARPA: johns@calvin.ee.cornell.edu; UUCP: {rochester,cmcl2}!cornell!calvin!johns
edwardm@hpcuhc.HP.COM (Edward McClanahan) (12/17/88)
/ hpcuhc:comp.sys.next / alien@cpoint.UUCP (Alien Wells) / 6:53 pm Dec 14, 1988 /
> ...this is a MBTF of 43 days...
You used this acronym so consistently, I wasn't sure... But you must mean
MTBF.
I don't know of a single PC-class computer that uses ECC memory. The IBM PC
uses PARITY to DETECT errors. I believe that all Atari, Apple, Commodore,
Compaq, Dell, etc... "affordable" computers don't even have Parity!  One could
argue that cost is the determinant. The obvious rebuttal to this tack is the
fact that several of these manufacturers sell machines in the $10,000 range.
In the PC days of the past, memory failures may have been acceptable. No
cached/paged data needed to be flushed/posted for consistency. In fact,
crashes are no big deal on a vintage PC (unless your editor doesn't do
frequent posts). No LAN would be left in an inconsistent state.
All that is changing quickly. OS/2 and UNIX both have Virtual Memory. Many
of the high-performance PCs contain cache (albeit presently usually of the
write-through nature). Ram-disks are quite common. And finally, a large
percentage of PCs are being integrated into LANs. We all witnessed how
quickly "corruption" can infect other machines on these LANs (refer to the
email/Internet WORM reports).
If all these concerns are valid, where are all the ECC memory add-on boards?
Also, which NeXT competitors use ECC memory?
ed "I still wanna NeXT" mcclanahan
baum@Apple.COM (Allen J. Baum) (12/17/88)
>In article <1429@cpoint.UUCP> alien@cpoint.UUCP (Alien Wells) writes:
>>As you can see, by the time you get to 64-bit memory - there really isn't a
>>reasonable excuse not to use EDC.

Um, there's a slight gotcha in doing ECC on a 64-bit chunk.  It's not
possible to write a byte anymore.  You must read all 64 bits, substitute the
byte, and write the 64 bits + new ECC.  This is generally a sufficient
reason to avoid ECC in low-cost systems.

Note that if you have a cache that reads and writes 64-bit chunks to main
memory anyway, it may not be a big deal, until you have to worry about
handling uncached writes, memory-mapped I/O, ...
--
baum@apple.com  (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
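In controller terms, the byte store turns into a read-modify-write cycle.
The helper names below are made up; this is only the sequence, not any
particular controller's logic, and the byte numbering is arbitrary:

    /* A byte store under 64-bit ECC: read the containing word (the ECC
     * logic corrects it on the way in), splice in the new byte, then
     * recompute the check bits and write the whole word back.
     */
    typedef unsigned long long u64;

    extern u64      ecc_read64(unsigned long addr);        /* hypothetical */
    extern void     ecc_write64(unsigned long addr, u64 data, unsigned chk);
    extern unsigned compute_checkbits(u64 data);

    void store_byte(unsigned long addr, unsigned char b)
    {
        unsigned long waddr = addr & ~7UL;          /* containing 64-bit word */
        unsigned      shift = (unsigned)(addr & 7UL) * 8;
        u64           word  = ecc_read64(waddr);    /* 1: read-and-correct    */

        word &= ~((u64)0xFF << shift);              /* 2: splice in new byte  */
        word |= (u64)b << shift;
        ecc_write64(waddr, word, compute_checkbits(word));  /* 3: write back  */
    }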
bkliewer@iuvax.cs.indiana.edu (Bradley Dyck Kliewer) (12/17/88)
In article <680002@hpcuhc.HP.COM> edwardm@hpcuhc.HP.COM (Edward McClanahan) writes:
>If all these concerns are valid, where are all the ECC memory add-on boards?
>Also, which NeXT competitors use ECC memory?

Well, there's always the Orchid ECCell board (which I have in my IBM AT).
But it's not the hottest seller on the market, and as far as I know, they
don't have a Micro Channel or 32-bit version of the card.  It would appear
that end users don't think error correction is important (whether this is
simply over-confidence in technology is hard to say).  If I remember
correctly, there is little price difference between the Orchid and similar
(non-ECC) cards, so I don't think price is the motivating factor here,
assuming RAM prices become reasonable again, which they surely will.

Bradley Dyck Kliewer                    Hacking...
bkliewer@iuvax.cs.indiana.edu           It's not just an adventure
                                        It's my job!
jbn@glacier.STANFORD.EDU (John B. Nagle) (12/19/88)
In article <15877@iuvax.cs.indiana.edu> bkliewer@iuvax.UUCP (Bradley Dyck Kliewer) writes:
>assuming RAM prices become reasonable again, which they surely will.

I don't expect this to happen.  Now that the Japanese manufacturers have
achieved total market dominance, prices will be coordinated by the makers
(which is legal in Japan) and will fall slowly, if at all.  The era of
"forward pricing" is over in RAM.  Observation of the price trend in cars,
color TVs, and VCRs will indicate the strategy.

Yes, there will be 4 and 16Mb RAMs.  But, just as we have seen with the 1Mb
RAMs, they will not be priced so as to kill the market in smaller RAMs until
sufficient time has elapsed, say five years, that the investment in the
older technology has been repaid.  Seen in this light, the rumor that 4Mb
will be skipped and the RAM industry will go directly to 16Mb makes more
sense.  Having achieved coordination, it makes sense to wait until the 1Mb
technology is fully amortized while working out the 16Mb production process,
then introduce the new model in a controlled way.  This is how a cartelized
industry operates.

One implication of this is that we cannot rely on advances in semiconductor
technology to save us from the tendency of software to grow in size without
bound.

John Nagle
prem@andante.UUCP (Swami Devanbu) (12/28/88)
How can the Japanese zaibatsu engage in price fixing while selling in
countries (like the US) which do not allow anti-free-market practices?  If
such price fixing is indeed being conducted, can American manufacturers not
bring legal action against the Japanese manufacturers and put an end to
this?  It shouldn't really matter that they are Japanese companies, as long
as they are selling in US markets.

Prem Devanbu
AT&T Bell Laboratories, (201) 582 - 2062
{...}!allegra!prem      prem%allegra@research.att.com
izumi@violet.berkeley.edu (Izumi Ohzawa) (12/29/88)
In article <14723@andante.UUCP> prem@andante.UUCP (Swami Devanbu) writes:
>
>How can the Japanese zaibatsu engage in price fixing while selling
>in countries (like the US) which do not allow anti-free-market
>practices?  If such price fixing is indeed being conducted, can
>
>Prem Devanbu
>AT&T Bell Laboratories,

Are you talking about the pricing of DRAMs and EPROMs??

Well, SURPRISE!!  The price fixing is imposed by the US Government in the
first place.  Yeah, I wonder why the government of a country where cartels
are prohibited is doing exactly that itself.

Izumi Ohzawa
izumi@violet.berkeley.edu