roy@phri.UUCP (Roy Smith) (09/10/87)
In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
> Does anyone know about soft failure modes of DRAMs ? How likely is it to
> find double bit errors ? With denser and denser memory chips, one might
> expect that one day soon, background alpha particles will be able to flip
> several adjacent bits.

The way most (all?) modern memory systems are built is to have each chip contribute a single bit to each of many words. Thus, a typical 1 Mbyte ECC board (small by today's standards) might consist of 39 256k chips, each chip contributing a single bit to each of the 256k 39-bit words (32 data plus 7 ECC bits) on the board. If several bits in a given chip were to go bad, you would see errors in the same bit of several different words. If an entire chip were to die, you would see an error in the same bit of *every* word on the board. The memory controller would be able to correct any of these problems.

Note that the typical-but-mythical memory board described above has 7 check bits per 32 bit data word. Since you need 2N+1 check bits to correct an N-bit error, this board should be able to detect and correct as many as 3 bad bits in any 32-bit word. Thus, you could, if you wanted, go so far as to pluck out any 3 RAM chips on the board without losing any function (other than, maybe, access speed).
--
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
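[A minimal Python sketch of the one-bit-per-chip organization described in this post. It illustrates the layout only, not the correction capability claimed for it. The board is shrunk to 8 words to keep the demo small, and the names read_board and bad_bit_positions are hypothetical, chosen for illustration. It shows that a dead chip appears as at most one bad bit, in the same position, in every word.]

```python
WORD_BITS = 39          # 32 data bits + 7 check bits, as in the post
WORDS_PER_BOARD = 8     # assumption: tiny stand-in for the 256k words on the board

def read_board(board, dead_chip=None):
    """Read back every word; a dead chip forces its bit position to 0."""
    if dead_chip is None:
        return list(board)
    mask = ~(1 << dead_chip)
    return [word & mask for word in board]

def bad_bit_positions(written, read_back):
    """For each word, list the bit positions that differ from what was written."""
    return [
        [bit for bit in range(WORD_BITS) if ((w ^ r) >> bit) & 1]
        for w, r in zip(written, read_back)
    ]

# Write an all-ones pattern, then kill the chip that supplies bit 17.
board = [(1 << WORD_BITS) - 1] * WORDS_PER_BOARD
errors = bad_bit_positions(board, read_board(board, dead_chip=17))
print(errors)   # every word reports the single bad position 17
```

Because each word sees only one bad bit, a code that corrects single-bit errors per word is enough to ride out the loss of one whole chip in this layout.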
neil@mitsumi.UUCP (Neil Katin) (09/11/87)
->In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
-> The way most (all?) modern memory systems are built is to have each
->chip contribute a single bit to each of many words. Thus, a typical 1
->Mbyte ECC board (small by today's standards) might consist of 39 256k
->chips, each chip contributing a single bit to each of the 256k 39-bit words
->(32 data plus 7 ECC bits) on the board. If several bits in a given chip
->were to go bad, you would see errors in the same bit of several different
->words. If an entire chip were to die, you would see an error in the same
->bit of *every* word on the board. The memory controller would be able to
->correct any of these problems.
->
-> Note that the typical-but-mythical memory board described above
->has 7 check bits per 32 bit data word. Since you need 2N+1 check bits to
->correct an N-bit error, this board should be able to detect and correct as
->many as 3 bad bits in any 32-bit word. Thus, you could, if you wanted, go
->so far as to pluck out any 3 RAM chips on the board without losing any
->function (other than, maybe, access speed).
->--
->Roy Smith, {allegra,cmcl2,philabs}!phri!roy
->System Administrator, Public Health Research Institute
->455 First Avenue, New York, NY 10016
Sorry, I don't believe that is correct. As I understand error correcting
codes, it takes at least log2(m) bits to protect an m bit data word from
a one bit error. That means that you need three bits to protect a byte, and
five bits to protect a 32-bit word.
I think (i.e. it's been a while since I did the math) that seven bits
is enough to protect against two bit errors for a 32 bit word.
The place where "2N+1" comes in is the "error distance" needed to
map an erroneous data word back to a correct one. There is basically
a tradeoff between pure detection (distance N+1) and correction (2N+1).
In other words, you could either correct a two bit error or detect
a four bit error with the same number of code bits.
Neil Katin
{amiga,pyramid}!mitsumi!neil
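[A small Python sketch of the distance tradeoff described above: a code with minimum distance d can detect up to d-1 bit errors if used for detection only, or correct up to (d-1)//2. The 5-bit repetition code is an assumption picked because its distance is easy to see; the function names are hypothetical.]

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")

def min_distance(codewords):
    """Smallest pairwise distance over the whole code."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

def capabilities(d):
    """What a code of minimum distance d can do in each mode."""
    return {"detect_only": d - 1, "correct": (d - 1) // 2}

# Assumed toy code: 5-bit repetition code, two codewords at distance 5.
repetition5 = [0b00000, 0b11111]
d = min_distance(repetition5)
caps = capabilities(d)
print(d, caps)   # distance 5: detect 4 errors, or correct 2
```

With distance 5 (that is, 2N+1 for N=2) the same code either corrects two bit errors or detects four, which is the tradeoff the post describes.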
oconnor@sunray.steinmetz (Dennis Oconnor) (09/11/87)
In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
> Note that the typical-but-mythical memory board described above
>has 7 check bits per 32 bit data word. Since you need 2N+1 check bits to
>correct an N-bit error, this board should be able to detect and correct as
>many as 3 bad bits in any 32-bit word. Thus, you could, if you wanted, go
>so far as to pluck out any 3 RAM chips on the board without losing any
>function (other than, maybe, access speed).
>--
>Roy Smith, {allegra,cmcl2,philabs}!phri!roy
>System Administrator, Public Health Research Institute
>455 First Avenue, New York, NY 10016

Sorry, this is incorrect. To perform just SINGLE bit error CORRECTION you need 1+log2(word-width) ECC bits. That means you need 6 bits for a 32-bit word, 5 for a 16-bit halfword, and 4 for a byte. Which is why you don't see ECC performed at the byte level, and DO see it performed at the word level, even though this makes writing a byte a pain in the neck ( to write a byte into an ECC'd word, you must read out the word, substitute in the new byte, and recompute the ECC for the new word; then you can write it back ).

To perform DOUBLE bit error CORRECTION, you need to DOUBLE the number of check bits ( for randomly-occurring bit errors; block-error correcting codes where all the errors are assumed to be adjacent are different; these are applicable to serial media like disk drives, not to memories ).

Error DETECTION is another kettle of fish : for instance, a single parity bit detects ALL situations where an odd number of errors has occurred.

A simple explanation ( intuitive, not necessarily a proof ) for why you need 1+log2(word-width) bits of check code to correct a single bit error is the following : You need to be able to locate the error to correct it, and to locate a bit in a word of length(word-width + check-bits) [remember, the error might be in the check bits] you need log2(word-width + check-bits) bits of information. If number_of_check_bits < number_of_data_bits, this is equivalent to 1+log2(word-width).

I could be SLIGHTLY wrong about this stuff : it's been a while.
--
Dennis O'Connor oconnor@sungoddess.steinmetz.UUCP ?? ARPA: OCONNORDM@ge-crd.arpa
"If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"
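[A short Python check of the counting argument above. It uses the standard Hamming bound for single-error correction: r check bits must name every one of the m + r bit positions plus the "no error" case, so 2**r >= m + r + 1. The function name sec_check_bits is hypothetical. The SEC-DED column adds the one extra overall parity bit commonly used to also detect double errors.]

```python
def sec_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r

# Word width, check bits for SEC, check bits for SEC-DED (SEC plus one parity bit).
table = [(m, sec_check_bits(m), sec_check_bits(m) + 1) for m in (8, 16, 32, 64)]
for row in table:
    print(row)
# (8, 4, 5)
# (16, 5, 6)
# (32, 6, 7)
# (64, 7, 8)
```

The SEC column reproduces the 4, 5 and 6 bits quoted in the post for 8, 16 and 32 bit words. The 7 bits in the SEC-DED column for a 32-bit word match the 39-bit words of the board under discussion, though the thread does not say which code that board actually uses.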
henry@utzoo.UUCP (Henry Spencer) (09/15/87)
Clearly, what we need, urgently, is ECC on the damn memory chips. There have already been mutterings about this, but no commercial products as far as I know. This is an ideal place for ECC: wide words are available internally to reduce the number of correction bits needed (to the extent that this is desirable -- fewer bits mean poorer coverage against multiple errors), modest amounts of circuitry are not hard to add, and the problem with needing read-modify-write cycles for a partial write goes away because dynamic RAMs have to do this *anyway*. (Essentially all accesses to DRAMs are r-m-w cycles, because the internal readout operation is destructive and must be followed by a writeback, and the chip works internally with quite large words and *any* write is a partial write, needing a read first. It's to the credit of DRAM designers that these grubby details are largely invisible nowadays; high time they did the same for ECC.)
--
"There's a lot more to do in space   | Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
nather@ut-sally.UUCP (Ed Nather) (09/16/87)
Henry Spencer's suggestion that automatic error correction be included right in the memory chip is a good one, but I fear it won't happen soon, if at all. We users are so hungry for more memory we put size at a great premium, and the chip designers respond. If they are given a choice of more (uncorrected) bits vs. fewer (corrected) ones, I doubt they'd choose the latter.

Chip real estate is expensive: yield is a non-linear function of chip size, so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be very costly. Maybe some day ...
--
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU
bcase@apple.UUCP (Brian Case) (09/16/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips. There
>have already been mutterings about this, but no commercial products as
>far as I know.

Micron Technology was releasing information about just such a DRAM years ago (maybe 2? 1? 3?), at least to the trade press. I don't know if they ever shipped any.
baum@apple.UUCP (09/17/87)
>In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.

Micron Technology (Boise, Idaho) made such a chip, and may even still sell it. It took them a long time to get it out the door; they missed a big window of opportunity on that. The organization is 256kx1.

It's not clear that duplicating the logic on each chip is a cost effective solution, especially considering that at the chip level, errors must be detected and corrected before data comes off the chip. At a system level, this may not be necessary; you might have an extra cycle before you need to know there is an error, and can afford lots of time to correct it (since it's presumably an infrequent event).
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
scott@labtam.oz (Scott Colwell) (09/17/87)
In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Clearly, what we need, urgently, is ECC on the damn memory chips. There
> have already been mutterings about this, but no commercial products as
> far as I know.

Micron Technology of Boise, Idaho have actually done this. Parts are:

	MT41C001	1M by 1
	MT44C256	256k by 4

They are available in all the usual packages and RAS access times (Trac) and have 'real-time on-chip error correction using a modified Hamming code'.

Internally they use a 16-bit data word with 5 check bits, i.e. a (21,16) Hamming code. How they get this when the normal row size is 512 on 1M DRAMs, I don't know, but this does suggest that it does not scrub during refresh.

The pinouts are the standard pinouts for these parts and the speeds are very similar to the same specs for old NMOS DRAMs. This is a bit of a problem for us 'cause we would like to have the faster Tcac times (25ns for 100ns part) that the new generation of CMOS parts from Mitsubishi, Hitachi, TI etc offer. (Micron part Tcac 50ns for 100ns part).

As usual for DRAM manufacturers, Micron are loath to tell you the error rates for the part. (If you're listening, I'd like to see it on the data sheets, guys.)
--
Scott Colwell				ACSnet: scott@labtam.oz
Design Engineer				UUCP: ..uunet!munnari!labtam.oz!scott
Information Systems Division		ARPA: scott%labtam.oz@UUNET.UU.NET
Labtam Ltd Melbourne, Australia		PHONE: +61-3-587-1444
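[A minimal Python sketch of a plain textbook Hamming code with 16 data bits and 5 check bits, to make the arithmetic in this post concrete. It is NOT Micron's "modified Hamming code", whose details are not given in the thread. The bit layout (check bits at positions 1, 2, 4, 8, 16) and the function names are assumptions chosen for illustration.]

```python
N = 21                                   # codeword length: 16 data + 5 check bits
CHECK_POS = [1, 2, 4, 8, 16]             # 1-based positions holding check bits
DATA_POS = [p for p in range(1, N + 1) if p not in CHECK_POS]   # 16 positions

def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, N + 1):
        if (code >> (pos - 1)) & 1:
            s ^= pos
    return s

def encode(data):
    """Spread 16 data bits over DATA_POS, then set check bits to zero the syndrome."""
    code = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            code |= 1 << (pos - 1)
    s = syndrome(code)
    for p in CHECK_POS:
        if s & p:
            code |= 1 << (p - 1)
    return code

def decode(code):
    """Return (data, syndrome); a nonzero syndrome names the bit that was flipped."""
    s = syndrome(code)
    if 1 <= s <= N:
        code ^= 1 << (s - 1)             # correct the single bad bit
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (code >> (pos - 1)) & 1:
            data |= 1 << i
    return data, s

word = 0xBEEF
stored = encode(word)
damaged = stored ^ (1 << 9)              # flip the bit at position 10
print(decode(damaged))                   # data comes back intact; syndrome is 10
```

Any single flipped bit, whether in a data or a check position, produces a syndrome equal to its position and is corrected. Two flipped bits are beyond this code and would be miscorrected.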
ccplumb@watmath.waterloo.edu (Colin Plumb) (09/17/87)
In article <9024@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>Henry Spencer's suggestion that automatic error correction be included right
>in the memory chip is a good one, but I fear it won't happen soon, if at all.
>We users are so hungry for more memory we put size at a great premium, and
>the chip designers respond. If they are given a choice of more (uncorrected)
>bits vs. fewer (corrected) ones, I doubt they'd choose the latter.
>
>Chip real estate is expensive: yield is a non-linear function of chip size,
>so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be
>very costly. Maybe some day ...

Au contraire! I forget my sources (trade magazines), but prototype 4 Meg chips *do* perform ECC.

If your ECC scheme is sophisticated enough, it can handle multi-bit errors, and thus ignore a hard error (read: flaw in the chip) or two. Thus, yield goes *up*. The only problem is that this circuitry slows the chip down.

One of the fundamental theorems of information theory states that the number of usable bits on a memory chip can approach, as closely as desired, the number of good bits there. (Actually, it's for communication channels, but the theory applies equally to memory.) This assumes very sophisticated ECC and indefinitely large memory chips, but one can do a pretty good job with 4 Megabits and reasonable timing constraints.

-Colin Plumb (ccplumb@watmath)

I'll hold the GIRAFFE while you fill the BATHTUB with brightly coloured MACHINE TOOLS!!
elwell@tut.cis.ohio-state.edu (Clayton Elwell) (09/17/87)
henry@utzoo.UUCP (Henry Spencer) writes:
Clearly, what we need, urgently, is ECC on the damn memory chips. There
have already been mutterings about this, but no commercial products as
far as I know.
I have a data sheet from Micron Technology that describes a 64K DRAM
with ECC from a couple years ago. Anyone know if they're actually
shipping this beastie?
--
Clayton M. Elwell
The Ohio State University Department of Computer and Information Science
(614) 292-6546 UUCP: ...!cbosgd!osu-cis!tut.cis.ohio-state.edu!elwell
ARPA: elwell@ohio-state.arpa (not working well right now)
qwerty@drutx.ATT.COM (Brian Jones) (09/17/87)
In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Clearly, what we need, urgently, is ECC on the damn memory chips. There
> have already been mutterings about this, but no commercial products as
> far as I know.
  ^^^^^^^^^^^^^^^^^^^^^^

Intel has the 8206/8207 chip set for dual port DRAM control with DEDSEC (dual error detection, single error correction).
--
Brian Jones aka {ihnp4,allegra}!{drutx}!qwerty @ AT&T-IS, Denver
randys@mipon3.intel.com (Randy Steck) (09/18/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips. There
>have already been mutterings about this, but no commercial products as
>far as I know.

There is certainly a trend toward making "smarter" memory chips, but ECC is a different animal altogether that does not really lend itself to implementation on the memory chip. It certainly doesn't belong on the most common organization of the memory device (X1) since the overhead of using it is so high (in terms of silicon cost).

The cost and yield curve in this case tends to argue that the ECC logic be included directly on a much smarter and more configurable memory controller. I would propose a memory controller that was smart enough to do ECC, powerful enough to drive the array of memory devices directly (relaxing the access time requirements), and smart enough to work with others of its type in a system without contention.

>and the problem
>with needing read-modify-write cycles for a partial write goes away because
>dynamic RAMs have to do this *anyway*. (Essentially all accesses to DRAMs
>are r-m-w cycles, because the internal readout operation is destructive
>and must be followed by a writeback, ....

Unfortunately, this is not really true. The apparent RMW cycle that is performed by DRAMs is a characteristic of the circuitry and not of the logical design. In other words, the designer of the DRAM has done nothing to sequence the refresh of the DRAM cell. The act of reading the memory cell is sufficient to refresh it to its fully charged state. The requirements of ECC would be that a cell would have to also be "flipped" during the interval in which it is read, which would be extremely difficult without some form of sequencing logic. (And sequencing is really very tough without a clock!)

>It's to the credit of DRAM designers that these grubby details are largely
>invisible nowadays; high time they did the same for ECC.)

Although I have enormous respect for my colleagues who *want* to spend their lives looking at circuit simulations to create a DRAM, I think it is stretching to say that they have gone to great lengths to hide the "grubby details". These details are an inherent part of the mechanism by which DRAM cells are read and written. There is no easy counterpart to the problem for ECC.

Please notice that I am not saying that it cannot be done (Micron Tech. already did it!), just that it is not feasible for the foreseeable future given the alternative implementations. Besides, do you really care where the ECC is done as long as it is done and you don't have to bother with it?

Randy Steck
Intel Corp.
...intelca!mipos3!omepd!mipon3!randys
pf@diab.UUCP (Per Fogelstrom) (09/18/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips. There
>have already been mutterings about this, but no commercial products as
>far as I know. This is an ideal place for ECC: wide words are available
> [ deleted text ]

There has been an announcement about such a chip. A 1Meg * 1 bit dynamic CMOS RAM, with "row error correction" over, I believe, 256 bits. Forgive me if I'm wrong (Can't find that da**ed paper) but I think the manufacturer was "Samsung".
adam@gec-mi-at.co.uk (Adam Quantrill) (10/07/87)
In article <14617@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>
>If your ECC scheme is sophisticated enough, it can handle multi-bit errors,
>and thus ignore a hard error (read: flaw in the chip) or two. Thus, yield
>goes *up*. The only problem is that this circuitry slows the chip down.
>

It needn't slow down the chip that much. If you do the ECC at chip refresh time, the random errors will be spotted then and appropriate action taken. Also, this approach will minimise the chance of two independent errors corrupting the same row, especially if that row hadn't been accessed for yonks.

It would still be a good idea to have an extra pad on the chip to flag hard errors so the chip can be graded to:

  - totally correct
  - correctable hard errors
  - duff

but I don't think it would be necessary to bring this out to a pin.

-Adam.

/* If at first it don't compile, kludge, kludge again.*/
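[A toy Python illustration of why correcting at refresh time helps, under stated assumptions: a textbook Hamming(7,4) code stands in for whatever code a real chip would use, memory is a three-word list, and scrub is a hypothetical name for the refresh-time pass. It shows that two soft errors in one word are survivable if a scrub pass runs between them, and are miscorrected if they accumulate first.]

```python
N = 7   # assumed toy code: Hamming(7,4), corrects one bad bit per 7-bit word

def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, N + 1):
        if (code >> (pos - 1)) & 1:
            s ^= pos
    return s

def scrub(memory):
    """Visit every word, as a refresh pass would, and repair single-bit errors."""
    fixed = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            memory[addr] = word ^ (1 << (s - 1))
            fixed += 1
    return fixed

golden = [0b0000000, 0b0000111, 0b1111111]    # three valid codewords

# Case 1: a scrub pass runs between the two soft errors.
scrubbed = list(golden)
scrubbed[1] ^= 1 << 3        # first soft error
scrub(scrubbed)
scrubbed[1] ^= 1 << 5        # second soft error, later, same word
scrub(scrubbed)

# Case 2: the word sits untouched and both errors accumulate before any check.
neglected = list(golden)
neglected[1] ^= (1 << 3) | (1 << 5)
scrub(neglected)             # syndrome now points at the wrong bit

print(scrubbed == golden, neglected == golden)   # True False
```

In the second case the single-error-correcting code "repairs" the wrong bit, so the word ends up with three bad bits instead of two; that is the accumulation hazard for rarely accessed rows.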
jerry@oliveb.UUCP (Jerry Aguirre) (10/12/87)
In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips. There

The disadvantage is that this provides less protection. Off chip ECC protects against the total failure of the chip, not just the failure of a bit or two. If an address or output line fails you would never know about it with on-chip ECC.

There is also a problem with how the memory chips are going to communicate the ECC information to the CPU. Not only does the chip have to notify the CPU about both uncorrected and corrected errors but, at least at the diagnostic level, you probably want to be able to interrogate the chip about the details of the error. All this sounds like more IO pins which are already at a premium.

On the other hand having both would be a real win. With each chip handling its own ECC you could have every bit of a word wrong and still have it corrected. Also it could be checking every memory location at refresh time instead of waiting to find errors when they are accessed. (And having multiple errors accumulate in infrequently accessed words.) With a second level of correction the on-chip ECC could fail silently and thus not require any extra pins.

Jerry Aguirre
aglew%mycroft@gswd-vms.Gould.COM (Andy Glew) (10/15/87)
/* Written 11:08 pm Oct 14, 1987 by jerry@oliveb.uu in mycroft:fa.unix-wizards */
>Jerry Aguirre <jerry@oliveb.uucp>:
>>Henry Spencer:
>>Clearly, what we need, urgently, is ECC on the damn memory chips. There
>
>The disadvantage is that this provides less protection. Off chip ECC
>protects against the total failure of the chip, not just the failure of
>a bit or two. If an address or output line fails you would never know
>about it with on-chip ECC.

Maybe the place to put ECC is where the data is used - on the CPU chip, at the disk controller, and so on. This way you can detect and correct faults both at the memory chip, and in the interconnection.

The trade-off is the number of wires in the interconnection, against the error rate due to the interconnection: wiring faults, EMI, etc. I suspect that the tradeoff lies with ECC on memory right now, but it may well move if interconnection costs fall (but error rates increase). Note also that interconnection complexity may decrease, if ECC is on chip at either end of the memory/cpu highway.

Andy "Krazy" Glew. Gould CSD-Urbana. USEnet: ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801 ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people who let themselves be affected by silly things, so: my opinions are my own, and not the opinions of my employer, or any other organisation with which I am affiliated. I indicate my employer only so that other people may account for any possible bias I may have towards my employer's products or it is as