roy@phri.UUCP (Roy Smith) (09/10/87)
In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
> Does anyone know about soft failure modes of DRAMs ?  How likely is it to
> find double bit errors ?  With denser and denser memory chips, one might
> expect that one day soon, background alpha particles will be able to flip
> several adjacent bits.

The way most (all?) modern memory systems are built is to have each
chip contribute a single bit to each of many words.  Thus, a typical 1
Mbyte ECC board (small by today's standards) might consist of 39 256k
chips, each chip contributing a single bit to each of the 256k 39-bit words
(32 data plus 7 ECC bits) on the board.  If several bits in a given chip
were to go bad, you would see errors in the same bit of several different
words.  If an entire chip were to die, you would see an error in the same
bit of *every* word on the board.  The memory controller would be able to
correct any of these problems.

Note that the typical-but-mythical memory board described above
has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
correct an N-bit error, this board should be able to detect and correct as
many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
so far as to pluck out any 3 RAM chips on the board without losing any
function (other than, maybe, access speed).
--
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
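The bit-per-chip layout Roy describes can be sketched in a few lines (a toy model, not any particular controller): chip i supplies bit i of every word, so a dead chip shows up as the same bit position failing in every word, which per-word ECC can then correct.

```python
NCHIPS = 39          # 32 data + 7 check bits
NWORDS = 8           # tiny "board" for illustration

# memory[w] is word w; chip i contributes bit i of each word
memory = [0b101 for _ in range(NWORDS)]

def kill_chip(mem, i):
    """Model a dead chip: bit i of every word reads back as 0."""
    return [w & ~(1 << i) for w in mem]

broken = kill_chip(memory, 2)
# Every word now differs from the original in exactly bit 2 --
# a single-bit-per-word error pattern that per-word ECC can correct.
diffs = [orig ^ bad for orig, bad in zip(memory, broken)]
assert all(d == (1 << 2) for d in diffs)
```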
neil@mitsumi.UUCP (Neil Katin) (09/11/87)
->In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
-> The way most (all?) modern memory systems are built is to have each
->chip contribute a single bit to each of many words. Thus, a typical 1
->Mbyte ECC board (small by today's standards) might consist of 39 256k
->chips, each chip contributing a single bit to each of the 256k 39-bit words
->(32 data plus 7 ECC bits) on the board. If several bits in a given chip
->were to go bad, you would see errors in the same bit of several different
->words. If an entire chip were to die, you would see an error in the same
->bit of *every* word on the board. The memory controller would be able to
->correct any of these problems.
->
-> Note that the typical-but-mythical memory board described above
->has 7 check bits per 32 bit data word. Since you need 2N+1 check bits to
->correct an N-bit error, this board should be able to detect and correct as
->many as 3 bad bits in any 32-bit word. Thus, you could, if you wanted, go
->so far as to pluck out any 3 RAM chips on the board without losing any
->function (other than, maybe, access speed).
->--
->Roy Smith, {allegra,cmcl2,philabs}!phri!roy
->System Administrator, Public Health Research Institute
->455 First Avenue, New York, NY 10016
Sorry, I don't believe that is correct.  As I understand error correcting
codes, it takes at least log2(m) bits to protect an m-bit data word from
a one-bit error.  That means you need three bits to protect a byte, and
five bits to protect a 32-bit word.

I think (it's been a while since I did the math) that seven bits
is enough to protect against two-bit errors in a 32-bit word.

The place where "2N+1" comes in is the "error distance" needed to
map an erroneous data word back to a correct one.  There is basically
a tradeoff between pure detection (distance N+1) and correction (2N+1).
In other words, you could either correct a two-bit error or detect
a four-bit error with the same number of code bits.
Neil Katin
{amiga,pyramid}!mitsumi!neil
reiter@endor.harvard.edu (Ehud Reiter) (09/11/87)
In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Note that the typical-but-mythical memory board described above
>has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
>correct an N-bit error, this board should be able to detect and correct as
>many as 3 bad bits in any 32-bit word.

No.  7 check bits will let you correct single-bit errors, and detect
double bit errors.  You would need many more check bits to detect and
correct triple bit errors.

A Hamming code which can correct 1-bit errors and detect 2-bit errors
requires ceiling(log2(N)) + 1 check bits, where N is the total number of
bits (data + check), i.e. N = 39 if there are 32 data bits and 7 check
bits.

Ehud Reiter
reiter@harvard	(ARPA,BITNET,UUCP)
reiter@harvard.harvard.EDU  (new ARPA)
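Reiter's counting argument can be checked directly with the standard Hamming bound (a quick sketch, not tied to any particular board): the smallest r with 2**r >= m + r + 1 gives single-error correction over m data bits, and one more bit adds double-error detection.

```python
def sec_check_bits(m):
    """Minimum check bits r for single-error correction over m data
    bits: 2**r must cover all m + r single-bit error positions plus
    the no-error case."""
    r = 1
    while 2**r < m + r + 1:
        r += 1
    return r

# SEC check-bit counts for common widths
for m, expected in [(8, 4), (16, 5), (32, 6), (64, 7)]:
    assert sec_check_bits(m) == expected

# 32 data bits: 6 check bits for SEC, 6 + 1 = 7 for SECDED --
# exactly the 39-bit word of the board described earlier.
assert sec_check_bits(32) + 1 == 7
```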
oconnor@sunray.steinmetz (Dennis Oconnor) (09/11/87)
In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Note that the typical-but-mythical memory board described above
>has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
>correct an N-bit error, this board should be able to detect and correct as
>many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
>so far as to pluck out any 3 RAM chips on the board without losing any
>function (other than, maybe, access speed).
>--
>Roy Smith, {allegra,cmcl2,philabs}!phri!roy
>System Administrator, Public Health Research Institute
>455 First Avenue, New York, NY 10016

Sorry, this is incorrect.  To perform just SINGLE bit error CORRECTION
you need 1+log2(word-width) bits of ECC.  That means you need 6 bits for
a 32-bit word, 5 for a 16-bit halfword, and 4 for a byte.  Which is why
you don't see ECC performed at the byte level, and DO see it performed at
the word level, even though this makes writing a byte a pain in the neck
(to write a byte into an ECC'd word, you must read out the word,
substitute in the new byte, and recompute the ECC for the new word; then
you can write it back).

To perform DOUBLE bit error CORRECTION, you need to DOUBLE the number of
check bits (for randomly-occurring bit errors; block-error correcting
codes, where all the errors are assumed to be adjacent, are different --
these are applicable to serial media like disk drives, not to memories).
Error DETECTION is another kettle of fish: for instance, a single parity
bit detects ALL situations where an odd number of errors has occurred.

A simple explanation (intuitive, not necessarily a proof) for why you
need 1+log2(word-width) bits of check code to correct a single bit error
is the following: you need to be able to locate the error to correct it,
and to locate a bit in a word of length (word-width + check-bits)
[remember, the error might be in the check bits] you need
log2(word-width + check-bits) bits of information.
If number_of_check_bits < number_of_data_bits, this is equivalent to
1+log2(word-width).  I could be SLIGHTLY wrong about this stuff: it's
been a while.
--
Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
ARPA: OCONNORDM@ge-crd.arpa
"If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"
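The read-modify-write that O'Connor describes for writing one byte into an ECC-protected 32-bit word looks like this in outline.  The ecc() function here is a stand-in (plain parity over the word), not a real SECDED encoder:

```python
def ecc(word):
    """Placeholder check code: overall parity of the word."""
    return bin(word).count("1") & 1

def write_byte(mem, check, addr, byte_no, value):
    word = mem[addr]                      # 1. read out the whole word
    shift = byte_no * 8
    word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)
    mem[addr] = word                      # 2. substitute the new byte,
    check[addr] = ecc(word)               # 3. write back with fresh ECC

mem, check = [0x11223344], [0]
check[0] = ecc(mem[0])
write_byte(mem, check, 0, 1, 0xAB)
assert mem[0] == 0x1122AB44
assert check[0] == ecc(mem[0])
```

The point is that the check bits cover the whole word, so even a one-byte store costs a full-word read plus a full-word write.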
maa@nbires.UUCP (09/12/87)
In article <7319@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>Sorry, this is incorrect.  To perform just SINGLE bit error CORRECTION
>you need 1+log2(word-width) bits of ECC bits.  That means you need
>6 bits for a 32-bit word, 5 for a 16-bit halfword, and 4 for a byte.
> <etc.>

Not strictly true!  I can remember reading (sorry, too long ago to
remember where) about a clever way to detect and correct all single bit
errors, detect all double bit errors, and correct most double bit errors
USING ONE BIT PER WORD.

The idea is that each parity bit is calculated as the mod 2 sum of all
the bits in its own word plus one bit from each word at addresses +-k,
for k = 1..n, where n is the word size:

	n+8	p7654321*
	n+7	p765432*0
	...
	n+2	p7*543210
	n+1	p*6543210
	n	X********	parity bit X calculated as mod 2 sum of *'s
	n-1	p*6543210
	n-2	p7*543210
	...
	n-7	p765432*0
	n-8	p7654321*

God only knows how this could be implemented in a real memory system
though; to do a read/write you need to check/set the parity bits on all
words +-n, which means reading all words +-2n.  Maybe some clever VLSI
hack will do it for us. 8-)

Mark

P.S.  If anyone out there knows of any references to this type of coding,
I'm interested.  I was just a dumb (smartass?) college kid when I read
(and mostly forgot) this stuff!
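A sketch of the diagonal scheme as I read the diagram above (the exact bit assignment is my assumption): the parity for word n covers all W bits of word n plus bit (W-k) of words n-k and n+k, for k = 1..W.  A single flipped data bit then disturbs three parity checks in a pattern that names both the word and the bit.

```python
W = 8  # word size in bits

def parity(mem, n):
    p = 0
    for b in range(W):                       # every bit of word n
        p ^= (mem[n] >> b) & 1
    for k in range(1, W + 1):                # one diagonal bit per neighbour
        b = W - k
        for m in (n - k, n + k):
            if 0 <= m < len(mem):
                p ^= (mem[m] >> b) & 1
    return p

mem = [0] * 32
checks = [parity(mem, n) for n in range(len(mem))]

# Flip bit b of word w: the parities at w, w-(W-b), w+(W-b) now disagree.
w, b = 16, 3
mem[w] ^= 1 << b
bad = {n for n in range(len(mem)) if parity(mem, n) != checks[n]}
assert bad == {w, w - (W - b), w + (W - b)}
# The failures are symmetric about w (naming the word) and their spacing
# W-b names the bit -- enough information to correct a single-bit error.
```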
alverson@decwrl.dec.com (Robert Alverson) (09/15/87)
In article <1215@nbires.UUCP> maa@nbires.UUCP (Mark Armbrust) writes:
>Not strictly true!  I can remember reading (sorry, too long ago to remember
>where) about a clever way to detect and correct all single bit errors, detect
>all double bit errors, and correct most double bit errors USING ONE BIT PER
>WORD.
> ... describes wonderfully convoluted ECC method.

The 1+log2(...) relation is *strictly* true.  This is a result from
information theory.  It seems to me that the scheme you mentioned lowers
the cost/bit of ECC by effectively using a larger word size.  Since the
number of check bits needed is logarithmic in the word size, you can make
the cost/bit arbitrarily low by working on more bits at once.

Note that you don't get something for nothing.  Since you are checking
over more bits, there is a greater chance of multiple bit errors that you
cannot correct.  Also, there is the mentioned hardware complexity of the
scheme you described.

Bob
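Alverson's cost/bit point, made concrete with the standard Hamming bound (a back-of-the-envelope sketch, not a claim about any particular chip): SECDED overhead per protected bit shrinks steadily as the word grows.

```python
def secded_bits(m):
    """SECDED check bits for m data bits: smallest r with
    2**r >= m + r + 1, plus one overall-parity bit."""
    r = 1
    while 2**r < m + r + 1:
        r += 1
    return r + 1

# Overhead per data bit falls as the protected word widens:
#   8 bits -> 5 checks, 32 -> 7, 128 -> 9, 512 -> 11
overheads = [secded_bits(m) / m for m in (8, 32, 128, 512)]
assert all(a > b for a, b in zip(overheads, overheads[1:]))
```

The flip side, as the post says, is that a wider word is more likely to accumulate more errors than the code can correct.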
henry@utzoo.UUCP (Henry Spencer) (09/15/87)
Clearly, what we need, urgently, is ECC on the damn memory chips.  There
have already been mutterings about this, but no commercial products as
far as I know.

This is an ideal place for ECC: wide words are available internally to
reduce the number of correction bits needed (to the extent that this is
desirable -- fewer bits mean poorer coverage against multiple errors),
modest amounts of circuitry are not hard to add, and the problem with
needing read-modify-write cycles for a partial write goes away because
dynamic RAMs have to do this *anyway*.

(Essentially all accesses to DRAMs are r-m-w cycles, because the internal
readout operation is destructive and must be followed by a writeback, and
the chip works internally with quite large words and *any* write is a
partial write, needing a read first.  It's to the credit of DRAM designers
that these grubby details are largely invisible nowadays; high time they
did the same for ECC.)
--
"There's a lot more to do in space |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
nather@ut-sally.UUCP (Ed Nather) (09/16/87)
Henry Spencer's suggestion that automatic error correction be included
right in the memory chip is a good one, but I fear it won't happen soon,
if at all.  We users are so hungry for more memory we put size at a great
premium, and the chip designers respond.  If they are given a choice of
more (uncorrected) bits vs. fewer (corrected) ones, I doubt they'd choose
the latter.

Chip real estate is expensive: yield is a non-linear function of chip
size, so tacking ECC manipulations on top of, say, a 4 Mbit memory chip
would be very costly.  Maybe some day ...
--
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU
mark@mips.UUCP (09/16/87)
In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes
> Clearly, what we need, urgently, is ECC on the damn memory
> chips.  There have already been mutterings about this, but no
> commercial products as far as I know.

Micron Technology's 256Kbit dynamic RAM has on-chip ECC.  And customers
just frigging HATE the idea.  You shoulda been there (NY Hilton, Feb 1985)
at ISSCC when Jim O'Toole of Micron Technology faced an angry mob of
non-believers and tried to explain the advantages of on-chip ECC.  Poor
guy got hooted off the platform.

The gripes against ECC are (1) it's "dishonest" because it lets mfrs sell
defective chips {this was also heard three years previously when redundant
memories were first discussed}, and (2) there's no way to tell whether a
given chip has a hard error {ECC masks it}, in which case the single-bit
ECC provides no protection against soft errors.  Note that a hard error
can occur weeks after system installation, so special RAM chip "test
modes" aren't useful here.

Big customers (the ones that DRAM mfrs seek to please!) have a Component
Qualification and Reliability group, who qualify and/or disqualify RAM
vendors.  If the head of this group doesn't want ECC RAMs, then he doesn't
qual them and that company doesn't buy them.  Sadly, the most savage
attacks on Mr. O'Toole of Micron came from the heads of Qual depts. of
immense DRAM consumers.  Most notable among them was Mr. X of Burroughs
(Unisys), who also led the battle against redundant RAMs three years
before.

DRAM mfrs therefore *perceived* that ECC RAMs were poison in the (major
customer) marketplace, so they backed away from the idea PRONTO.  In fact,
I believe (don't know for sure) that even Micron Technology gave up on ECC
and left it off their 1-Megabit DRAM.  You can call them in Boise, Idaho
to find out.
--
 -Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-720-1700 x208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
bcase@apple.UUCP (Brian Case) (09/16/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.

Micron Technology was releasing information about just such a DRAM years
ago (maybe 2?  1?  3?), at least to the trade press.  I don't know if they
ever shipped any.
baum@apple.UUCP (09/17/87)
--------
[]
>In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.

Micron Technology (Boise, Idaho) made such a chip, and may even still
sell it.  It took them a long time to get it out the door; they missed a
big window of opportunity on that.  The organization is 256Kx1.

It's not clear that duplicating the logic on each chip is a cost-effective
solution, especially considering that at the chip level, errors must be
detected and corrected before data comes off the chip.  At a system level,
this may not be necessary; you might have an extra cycle before you need
to know there is an error, and can afford lots of time to correct it
(since it's presumably an infrequent event).
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385
ccplumb@watmath.waterloo.edu (Colin Plumb) (09/17/87)
In article <9024@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>Henry Spencer's suggestion that automatic error correction be included right
>in the memory chip is a good one, but I fear it won't happen soon, if at all.
>We users are so hungry for more memory we put size at a great premium, and
>the chip designers respond.  If they are given a choice of more (uncorrected)
>bits vs. fewer (corrected) ones, I doubt they'd choose the latter.
>
>Chip real estate is expensive: yield is a non-linear function of chip size,
>so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be
>very costly.  Maybe some day ...

Au contraire!  I forget my sources (trade magazines), but prototype 4 Meg
chips *do* perform ECC.  If your ECC scheme is sophisticated enough, it
can handle multi-bit errors, and thus ignore a hard error (read: flaw in
the chip) or two.  Thus, yield goes *up*.  The only problem is that this
circuitry slows the chip down.

One of the fundamental theorems of information theory states that the
number of usable bits on a memory chip can approach, as closely as
desired, the number of good bits there.  (Actually, it's for communication
channels, but the theory applies equally to memory.)  This assumes very
sophisticated ECC and indefinitely large memory chips, but one can do a
pretty good job with 4 Megabits and reasonable timing constraints.

-Colin Plumb (ccplumb@watmath)

I'll hold the GIRAFFE while you fill the BATHTUB with brightly coloured
MACHINE TOOLS!!
elwell@tut.cis.ohio-state.edu (Clayton Elwell) (09/17/87)
henry@utzoo.UUCP (Henry Spencer) writes:
Clearly, what we need, urgently, is ECC on the damn memory chips. There
have already been mutterings about this, but no commercial products as
far as I know.
I have a data sheet from Micron Technology, from a couple of years ago,
that describes a 64K DRAM with ECC.  Anyone know if they're actually
shipping this beastie?
--
Clayton M. Elwell
The Ohio State University Department of Computer and Information Science
(614) 292-6546 UUCP: ...!cbosgd!osu-cis!tut.cis.ohio-state.edu!elwell
ARPA: elwell@ohio-state.arpa (not working well right now)
qwerty@drutx.ATT.COM (Brian Jones) (09/17/87)
In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Clearly, what we need, urgently, is ECC on the damn memory chips.  There
> have already been mutterings about this, but no commercial products as
> far as I know.
  ^^^^^^^^^^^^^^^^^^^^^^

Intel has the 8206/8207 chip set for dual-port DRAM control with SECDED
(single error correction, double error detection).
--
Brian Jones  aka  {ihnp4,allegra}!{drutx}!qwerty  @ AT&T-IS, Denver
randys@mipon3.intel.com (Randy Steck) (09/18/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.

There is certainly a trend toward making "smarter" memory chips, but ECC
is a different animal altogether that does not really lend itself to
implementation on the memory chip.  It certainly doesn't belong on the
most common organization of the memory device (x1), since the overhead of
using it is so high (in terms of silicon cost).  The cost and yield curve
in this case tends to argue that the ECC logic be included directly on a
much smarter and more configurable memory controller.  I would propose a
memory controller that is smart enough to do ECC, powerful enough to drive
the array of memory devices directly (relaxing the access time
requirements), and smart enough to work with others of its type in a
system without contention.

>and the problem
>with needing read-modify-write cycles for a partial write goes away because
>dynamic RAMs have to do this *anyway*.  (Essentially all accesses to DRAMs
>are r-m-w cycles, because the internal readout operation is destructive
>and must be followed by a writeback, ....

Unfortunately, this is not really true.  The apparent RMW cycle that is
performed by DRAMs is a characteristic of the circuitry and not of the
logical design.  In other words, the designer of the DRAM has done nothing
to sequence the refresh of the DRAM cell.  The act of reading the memory
cell is sufficient to refresh it to its fully charged state.  The
requirements of ECC would be that a cell would have to also be "flipped"
during the interval in which it is read, which would be extremely
difficult without some form of sequencing logic.  (And sequencing is
really very tough without a clock!)

>It's to the credit of DRAM designers that these grubby details are largely
>invisible nowadays; high time they did the same for ECC.)

Although I have enormous respect for my colleagues who *want* to spend
their lives looking at circuit simulations to create a DRAM, I think it is
stretching to say that they have gone to great lengths to hide the "grubby
details".  These details are an inherent part of the mechanism by which
DRAM cells are read and written.  There is no easy counterpart to the
problem for ECC.

Please notice that I am not saying that it cannot be done (Micron Tech.
already did it!), just that it is not feasible for the foreseeable future
given the alternative implementations.  Besides, do you really care where
the ECC is done as long as it is done and you don't have to bother with
it?

Randy Steck
Intel Corp.		...intelca!mipos3!omepd!mipon3!randys
pf@diab.UUCP (Per Fogelstrom) (09/18/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.  This is an ideal place for ECC: wide words are available
> [ deleted text ]

There has been an announcement about such a chip: a 1Meg x 1 bit dynamic
CMOS RAM with "row error correction" over, I believe, 256 bits.  Forgive
me if I'm wrong (can't find that da**ed paper), but I think the
manufacturer was Samsung.
jpp@slxsys.UUCP (John Pettitt) (09/19/87)
In article <686@obiwan.UUCP> mark@mips.UUCP (Mark G. Johnson) writes:
>In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes
> > Clearly, what we need, urgently, is ECC on the damn memory
> > chips.  There have already been mutterings about this, but no
> > commercial products as far as I know.
>Micron Technology's 256Kbit dynamic RAM has on-chip ECC.  And customers
>just frigging HATE the idea. . . .
>The gripes against ECC are (1) it's "dishonest" because it lets mfrs
>sell defective chips.  {This was also heard three years previously
>when redundant memories were first discussed.}  (2) There's no way to
>tell whether a given chip has a hard error {ECC masks it}, in which
>case the single-bit ECC provides no protection against soft errors.

This may be a dumb suggestion, but ... why not have an `ECC_FAULT' pin on
the RAM chip that signals that the on-chip ECC logic just found and
corrected an error?  This output would then be used to generate a 'ram
fault' signal to the OS, and with the correct software place the service
call.

This would solve both 1 and 2 above, as faulty chips would be detectable.
This solution seems so simple there must be a catch to it, but right now
I can't see it.
--
John Pettitt - G6KCQ, CIX jpettitt, Voice +44 1 398 9422, Disclaimer applies!
UUCP: {backbone}!mcvax!ukc!{ pyrltd || stc!datlog }!slxsys!jpp
Remember: Bill Gates is the world's greatest expert on Operating Systems :-)
daveb@geac.UUCP (Brown) (09/21/87)
In article <208@slxsys.UUCP> jpp@slxsys.UUCP (John Pettitt) writes:
>In article <686@obiwan.UUCP> mark@mips.UUCP (Mark G. Johnson) writes:
>>Micron Technology's 256Kbit dynamic RAM has on-chip ECC.  And customers
>>just frigging HATE the idea. . . .
>>The gripes against ECC are (1) it's "dishonest" because it lets mfrs
>>sell defective chips.  {This was also heard three years previously
>>when redundant memories were first discussed.}

It strikes me that the place I heard about ECC-equipped RAMs was in a
journal article on fault- and radiation-resistant hardware for military
and satellite usage.  Since I wouldn't want my Comsat[1] to suddenly go
wonky because an energetic particle happened to wander by, much less my
battle station (:-}), this is probably the area where the customers
wouldn't complain about dishonesty.  I suspect *those* customers would
welcome John Pettitt's ECC_FAULT pin.

 --dave
[1] Comsat is a trademark, presumably of the Comsat Company.
--
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road, Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.
henry@utzoo.UUCP (Henry Spencer) (09/22/87)
> The gripes against ECC are (1) it's "dishonest" because it lets mfrs
> sell defective chips.  {This was also heard three years previously
> when redundant memories were first discussed.}  (2) There's no way to
> tell whether a given chip has a hard error {ECC masks it}, in which
> case the single-bit ECC provides no protection against soft errors.

Mmm, I was actually thinking of ECC provided purely for post-manufacturing
errors, not as a way of covering up manufacturing defects.  And doing it
right would definitely require some way to find out what had happened on
chip, so that the software could cope appropriately.  In other words, what
we have right now in board-level implementations, but done on the chip.
(Yes, I realize there is a pin-count problem that makes it difficult to
devise a way of asking the chip for an error report.)  If the
manufacturers want to use ECC as a way of dealing with chip defects,
that's fine by me, but it's *not* what I'm asking for.

I don't have any real hope that anyone is going to do what I want, though.
(Heavens!  Change the DRAM interface to make the system-level design
simpler?!?  Much too radical.  Completely unacceptable to Marketing.)
--
"There's a lot more to do in space |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
henry@utzoo.UUCP (Henry Spencer) (09/22/87)
In a similar vein... We now have machines that do TLB loading in software (e.g. MIPSCo) and even an occasional machine that does cache loading in software (Cheriton's MMUless virtual-address cache). Has anybody thought about doing the correction (as opposed to detection) part of ECC in software? Clearly this is viable only if ECC's purpose is to handle infrequent soft errors and provide fail-soft behavior in the presence of newly-arrived hard errors; it won't work if errors are frequent or if you are trying to cover up rather than fix hard errors. Given that restriction on its domain of application, though, it seems like it might work. -- "There's a lot more to do in space | Henry Spencer @ U of Toronto Zoology than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
mikew@bigboy.UUCP (09/24/87)
In article <8638@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>...Has anybody thought
>about doing the correction (as opposed to detection) part of ECC in software?
>Clearly this is viable only if ECC's purpose is to handle infrequent soft
>errors and provide fail-soft behavior in the presence of newly-arrived hard
>errors; ... Given that restriction on its domain of
>application, though, it seems like it might work.

I was thinking about this a few days ago, and I came up with some
interesting techniques for implementing this.  For single bit errors you
could just use the ECC bits to correct them; the advantage comes if you
have multiple bit errors.

The first step is to see if the page is dirty (different than on the
paging device).  If it isn't, just page it in.  This is very likely to
work, since there are a lot of pages that never get changed (executable
code) and a lot that are infrequently changed.  If this fails and the
error was in the data space of a user process, just terminate the user
process.  If all else fails, and the error is in the code space of the
kernel, you can always generate a panic (or the equivalent on your OS).

Does anybody implement a scheme like this?  It would seem to greatly
reduce the problems caused by memory errors.
--
Mike Wexler
UUCP: wyse!mike
ATT: (408)433-1000 x 1330
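The recovery policy Mike describes can be sketched as a simple decision procedure.  The page/process objects and names here are hypothetical, not any real kernel's API:

```python
class Page:
    """Toy page descriptor (hypothetical; not a real kernel structure)."""
    def __init__(self, dirty, user_data):
        self.dirty = dirty          # differs from copy on paging device?
        self.user_data = user_data  # belongs to a user process?

def recover(page, error_bits, in_kernel_text):
    if error_bits == 1:
        return "correct-with-ecc"            # single bit: ECC fixes it
    if not page.dirty:
        return "refetch-from-backing-store"  # clean page: just page it in
    if not in_kernel_text and page.user_data:
        return "kill-owning-process"         # lose one process, not the system
    return "panic"                           # kernel code/data: give up

# Multi-bit error in a clean page: recoverable by re-reading the page.
assert recover(Page(False, False), 2, False) == "refetch-from-backing-store"
# Multi-bit error in dirty user data: terminate just that process.
assert recover(Page(True, True), 2, False) == "kill-owning-process"
# Multi-bit error in kernel text: nothing left but a panic.
assert recover(Page(True, False), 2, True) == "panic"
```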
pww@alaska.cray.com (Paul Wells) (09/26/87)
In article <8638@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> In a similar vein... We now have machines that do TLB loading in software
> (e.g. MIPSCo) and even an occasional machine that does cache loading in
> software (Cheriton's MMUless virtual-address cache).  Has anybody thought
> about doing the correction (as opposed to detection) part of ECC in software?

Depends on what you mean by "correction".  The machines I'm familiar with
do perform correction in software -- in the sense of writing the corrected
word back to fix soft errors.  However, if you mean actually decoding the
syndrome bits to determine which bit has been flipped, this seems
impractical.  What happens if the error is in the code that corrects
errors?
adam@gec-mi-at.co.uk (Adam Quantrill) (10/07/87)
In article <14617@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>
>If your ECC scheme is sophisticated enough, it can handle multi-bit errors,
>and thus ignore a hard error (read: flaw in the chip) or two.  Thus, yield
>goes *up*.  The only problem is that this circuitry slows the chip down.
>

It needn't slow down the chip that much.  If you do the ECC at chip
refresh time, the random errors will be spotted then and appropriate
action taken.  Also, this approach will minimise the chance of two
independent errors corrupting the same row, especially if that row hadn't
been accessed for yonks.

It would still be a good idea to have an extra pad on the chip to flag
hard errors so the chip can be graded to:

	-totally correct
	-correctable hard errors
	-duff

but I don't think it would be necessary to bring this out to a pin.

   -Adam.

/* If at first it don't compile, kludge, kludge again.*/
henry@utzoo.UUCP (Henry Spencer) (10/07/87)
> ... However, if you mean actually decoding the syndrome
> bits to determine which bit has been flipped, this seems impractical.  What
> happens if the error is in the code that corrects errors?

What happens if the code that runs your software-managed TLB gets a TLB
miss?  What happens if your pager gets a page fault?  The answer is the
same: you have to make sure it doesn't.  Either the software has to be
very careful (which is okay for things like paging but not for hardware
issues like error correction), or else the crucial bits of software have
to get special help.  Include a small amount of high-reliability static
RAM to hold the memory-error handler.

That is what Cheriton et al did for the cache handler in their
virtual-cache-MMUless design: the hardware has no idea how to do the
virtual->real mapping for a cache miss, so the software that does the
mapping MUST NOT cache miss, so it sits in a special bit of
supervisor-only memory that is neither mapped nor cached.
--
PS/2: Yesterday's hardware today.    |  Henry Spencer @ U of Toronto Zoology
OS/2: Yesterday's software tomorrow. | {allegra,ihnp4,decvax,utai}!utzoo!henry
henry@utzoo.UUCP (Henry Spencer) (10/11/87)
> ... However, if you mean actually decoding the syndrome
> bits to determine which bit has been flipped, this seems impractical.  What
> happens if the error is in the code that corrects errors?

Greg Noel has pointed out that I responded to one of two possible meanings
of this question; does "code" mean the error-correction software or the
extra bits on the failing memory word?  In the latter case, which may have
been what was meant and which I didn't address, the answer is simple: if
you want to look at them, which you most assuredly do, the hardware has to
provide a way to do it.  The simplest thing would be a register which
latches the extra bits when an error occurs.

Actually, there's a good chance that you will have something more
complicated than that if the hardware people have done their job right --
how do you run diagnostics on error-corrected memory without a way to
inspect those bits?
--
"Mir" means "peace", as in           |  Henry Spencer @ U of Toronto Zoology
"the war is over; we've won".        | {allegra,ihnp4,decvax,utai}!utzoo!henry
jerry@oliveb.UUCP (Jerry Aguirre) (10/12/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There

The disadvantage is that this provides less protection.  Off-chip ECC
protects against the total failure of the chip, not just the failure of a
bit or two.  If an address or output line fails, you would never know
about it with on-chip ECC.

There is also a problem with how the memory chips are going to communicate
the ECC information to the CPU.  Not only does the chip have to notify the
CPU about both uncorrected and corrected errors but, at least at the
diagnostic level, you probably want to be able to interrogate the chip
about the details of the error.  All this sounds like more I/O pins, which
are already at a premium.

On the other hand, having both would be a real win.  With each chip
handling its own ECC you could have every bit of a word wrong and still
have it corrected.  Also, it could be checking every memory location at
refresh time instead of waiting to find errors when they are accessed.
(And having multiple errors accumulate in infrequently accessed words.)
With a second level of correction, the on-chip ECC could fail silently and
thus not require any extra pins.

					Jerry Aguirre
pf@diab.UUCP (Per Fogelstrom) (10/15/87)
In article <8739@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... However, if you mean actually decoding the syndrome
>> bits to determine which bit has been flipped, this seems impractical.  What
>> happens if the error is in the code that corrects errors?
>
>Greg Noel has pointed out that I responded to one of two possible meanings
>of this question; does "code" mean the error-correction software or the
>extra bits on the failing memory word?  In the latter case, which may have
> [ removed discussion about how to take care of correction bits ]

If there is an error in the LOGIC that corrects errors, then you are in
trouble; however, there must be a way to locate such faults with software.
Anyway, fault correction logic tends to signal errors rather than miss
them.  If the error is in the syndrome bits themselves, correction will
take care of that: even the check bits are covered by the code itself!