[comp.unix.wizards] Double-bit errors and ECC memory

roy@phri.UUCP (Roy Smith) (09/10/87)

In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
> Does anyone know about soft failure modes of DRAMs ? How likely is it to
> find double bit errors ? With denser and denser memory chips, one might
> expect that one day soon, background alpha particles will be able to flip
> several adjacent bits.

	The way most (all?) modern memory systems are built is to have each
chip contribute a single bit to each of many words.  Thus, a typical 1
Mbyte ECC board (small by today's standards) might consist of 39 256k
chips, each chip contributing a single bit to each of the 256k 39-bit words
(32 data plus 7 ECC bits) on the board.  If several bits in a given chip
were to go bad, you would see errors in the same bit of several different
words.  If an entire chip were to die, you would see an error in the same
bit of *every* word on the board.  The memory controller would be able to
correct any of these problems.
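
A minimal sketch of that layout in Python, to make the failure pattern
concrete.  The names (store, load, kill_chip) and the tiny 8-word array are
made up for illustration, and a dead chip is assumed to read back as all
zeros:

```python
# Toy model of a bit-sliced board: chip k stores bit k of every word.
WORD_BITS = 39      # 32 data + 7 check bits, as in the post
NUM_WORDS = 8       # tiny stand-in for 256k, to keep the demo readable


def store(words):
    """Split each word into per-chip bit columns: chips[k][addr] is bit k."""
    return [[(w >> k) & 1 for w in words] for k in range(WORD_BITS)]


def load(chips, addr):
    """Reassemble the word at addr from one bit per chip."""
    return sum(chips[k][addr] << k for k in range(WORD_BITS))


def kill_chip(chips, k):
    """Model a dead chip as one whose output is stuck at 0 (an assumption)."""
    chips[k] = [0] * len(chips[k])


words = [(1 << WORD_BITS) - 1] * NUM_WORDS   # all-ones test pattern
chips = store(words)
kill_chip(chips, 5)

# Count how many bits differ in each word after the failure.
bad_bits_per_word = [bin(load(chips, a) ^ words[a]).count("1")
                     for a in range(NUM_WORDS)]
print(bad_bits_per_word)   # [1, 1, 1, 1, 1, 1, 1, 1]
```

Every word ends up with exactly one bad bit, which is the case a
single-error-correcting code is meant to handle.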

	Note that the typical-but-mythical memory board described above
has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
correct an N-bit error, this board should be able to detect and correct as
many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
so far as to pluck out any 3 RAM chips on the board without losing any
function (other than, maybe, access speed).
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

neil@mitsumi.UUCP (Neil Katin) (09/11/87)

->In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
->	The way most (all?) modern memory systems are built is to have each
->chip contribute a single bit to each of many words.  Thus, a typical 1
->Mbyte ECC board (small by today's standards) might consist of 39 256k
->chips, each chip contributing a single bit to each of the 256k 39-bit words
->(32 data plus 7 ECC bits) on the board.  If several bits in a given chip
->were to go bad, you would see errors in the same bit of several different
->words.  If an entire chip were to die, you would see an error in the same
->bit of *every* word on the board.  The memory controller would be able to
->correct any of these problems.
->
->	Note that the typical-but-mythical memory board described above
->has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
->correct an N-bit error, this board should be able to detect and correct as
->many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
->so far as to pluck out any 3 RAM chips on the board without losing any
->function (other than, maybe, access speed).
->-- 
->Roy Smith, {allegra,cmcl2,philabs}!phri!roy
->System Administrator, Public Health Research Institute
->455 First Avenue, New York, NY 10016

Sorry, I don't believe that is correct.  As I understand error correcting
codes, it takes at least log2(m) bits to protect an m bit data word from
a one bit error.  That means that you need three bits to protect a byte, and
five bits to protect a 32-bit word.

I think (i.e. it's been a while since I did the math) that seven bits
is enough to protect against two bit errors for a 32 bit word.

The place where "2N+1" comes in is the "error distance" needed to
map an erroneous data word back to a correct one.  There is basically
a tradeoff between pure detection (distance N+1) and correction (2N+1).
In other words, you could either correct a two bit error or detect
a four bit error with the same number of code bits.
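
The distance tradeoff can be sketched in a few lines of Python.  The helper
names are hypothetical, and the five-fold repetition code is only a stand-in
chosen because its minimum distance (5) is easy to see:

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def min_distance(codewords):
    """Smallest distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capabilities(d):
    """(errors detectable, errors correctable) for minimum distance d.

    Detection of N errors needs d >= N + 1; correction needs d >= 2N + 1.
    """
    return d - 1, (d - 1) // 2


# Five-fold repetition code: two codewords at distance 5.
code = [0b00000, 0b11111]
d = min_distance(code)
print(d, capabilities(d))   # 5 (4, 2)
```

With distance 5 the same code can be used to detect up to four errors or
to correct up to two, but not both at once.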

	Neil Katin
	{amiga,pyramid}!mitsumi!neil

oconnor@sunray.steinmetz (Dennis Oconnor) (09/11/87)

In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Note that the typical-but-mythical memory board described above
>has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
>correct an N-bit error, this board should be able to detect and correct as
>many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
>so far as to pluck out any 3 RAM chips on the board without losing any
>function (other than, maybe, access speed).
>-- 
>Roy Smith, {allegra,cmcl2,philabs}!phri!roy
>System Administrator, Public Health Research Institute
>455 First Avenue, New York, NY 10016

Sorry, this is incorrect. To perform just SINGLE bit error CORRECTION
you need 1+log2(word-width) bits of ECC bits. That means you need
6 bits for a 32-bit word, 5 for a 16-bit halfword, and 4 for a byte.
Which is why you don't see ECC performed at the byte level, and DO
see it performed at the word level, even though this makes writing
a byte a pain in the neck ( to write a byte into an ECC'd word, you
must read out the word, substitute in the new byte, and recompute
the ECC for the new word; then you can write it back ). To perform
DOUBLE bit error CORRECTION, you need to DOUBLE the number of check
bits ( for randomly-occurring bit errors; block-error correcting
codes where all the errors are assumed to be adjacent are different,
these are applicable to serial media like disk drives, not to memories ).
Error DETECTION is another kettle of fish : for instance, a single
parity bit detects ALL situations where an odd number of errors has
occurred. 
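
The read-modify-write sequence for a byte write might look like the
following Python sketch.  CheckedMemory and its methods are invented for
illustration, and a single parity bit stands in for the real check bits
(so this version can only detect an error on the read, not correct it):

```python
def parity(word):
    """1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class CheckedMemory:
    """32-bit words, each stored alongside a check value (parity here)."""

    def __init__(self, size):
        self.cells = [(0, parity(0))] * size

    def write_word(self, addr, word):
        word &= 0xFFFFFFFF
        self.cells[addr] = (word, parity(word))   # recompute the check

    def read_word(self, addr):
        word, check = self.cells[addr]
        if parity(word) != check:
            raise ValueError("check mismatch at address %d" % addr)
        return word

    def write_byte(self, addr, lane, value):
        word = self.read_word(addr)                        # 1. read
        shift = 8 * lane
        word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)  # 2. modify
        self.write_word(addr, word)                        # 3. re-check, write


mem = CheckedMemory(4)
mem.write_word(0, 0x11223344)
mem.write_byte(0, 1, 0xAB)
print(hex(mem.read_word(0)))   # 0x1122ab44
```

The point of the sketch is only that the check value covers the whole
word, so a one-byte store cannot skip the read.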

A simple explanation ( intuitive, not necessarily a proof ) for why
you need 1+log2(word-width) bits of check code to correct a
single bit error is the following : You need to be able to locate
the error to correct it, and to locate a bit in a word of
length(word-width + check-bits) [remember, the error might be in
the check bits] you need log2(word-width + check-bits) bits of
information. If number_of_check_bits < number_of_data_bits,
this is equivalent to 1+log2(word-width).
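
The counting argument can be checked numerically.  This is a sketch, with
function names made up here: it finds the smallest number of check bits r
whose 2**r syndromes cover every bit position (data plus check) plus the
no-error case, and adds one overall parity bit for the usual
single-correct/double-detect arrangement (assumed, not confirmed, to be
what the 7-bit boards use):

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


for width in (8, 16, 32):
    print(width, sec_check_bits(width), secded_check_bits(width))
# 8 4 5
# 16 5 6
# 32 6 7
```

The 4, 5 and 6 agree with the figures above, and 6 + 1 = 7 accounts for a
39-bit word on a 32-bit board.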

I could be SLIGHTLY wrong about this stuff : it's been a while.

--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

henry@utzoo.UUCP (Henry Spencer) (09/15/87)

Clearly, what we need, urgently, is ECC on the damn memory chips.  There
have already been mutterings about this, but no commercial products as
far as I know.  This is an ideal place for ECC:  wide words are available
internally to reduce the number of correction bits needed (to the extent
that this is desirable -- fewer bits mean poorer coverage against multiple
errors), modest amounts of circuitry are not hard to add, and the problem
with needing read-modify-write cycles for a partial write goes away because
dynamic RAMs have to do this *anyway*.  (Essentially all accesses to DRAMs
are r-m-w cycles, because the internal readout operation is destructive
and must be followed by a writeback, and the chip works internally with
quite large words and *any* write is a partial write, needing a read first.
It's to the credit of DRAM designers that these grubby details are largely
invisible nowadays; high time they did the same for ECC.)
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

nather@ut-sally.UUCP (Ed Nather) (09/16/87)

Henry Spencer's suggestion that automatic error correction be included right
in the memory chip is a good one, but I fear it won't happen soon, if at all.
We users are so hungry for more memory we put size at a great premium, and
the chip designers respond. If they are given a choice of more (uncorrected)
bits vs. fewer (corrected) ones, I doubt they'd choose the latter.  

Chip real estate is expensive: yield is a non-linear function of chip size,
so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be
very costly.  Maybe some day ...

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU

bcase@apple.UUCP (Brian Case) (09/16/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.

Micron Technology was releasing information about just such a DRAM years
ago (maybe 2? 1? 3?), at least to the trade press.  I don't know if they
ever shipped any.

baum@apple.UUCP (09/17/87)

>In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.

Micron Technology (Boise, Idaho) made such a chip, and may even
still sell it. It took them a long time to get it out the door; they
missed a big window of opportunity on that. The organization is
256kx1.

It's not clear that duplicating the logic on each chip is a
cost effective solution, especially considering that at the chip
level, errors must be detected and corrected before data comes off
the chip. At a system level, this may not be necessary; you might
have an extra cycle before you need to know there is an error, and can
afford lots of time to correct it (since it's presumably an infrequent
event).

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

scott@labtam.oz (Scott Colwell) (09/17/87)

In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Clearly, what we need, urgently, is ECC on the damn memory chips.  There
> have already been mutterings about this, but no commercial products as
> far as I know.

	Micron Technology of Boise Idaho have actually done this.

	parts are :-	MT41C001	1M by 1
			MT44C256	256k by 4
	They are available in all the usual packages and ras access times
(Trac) and have 'real-time on-chip error correction using a modified Hamming
code'.

	Internally they use a 16bit data word with 5 check bits and (16,21)
Hamming code. How they get this when the normal row size is 512 on 1M DRAMs,
I don't know but this does suggest that it does not scrub during refresh.
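
	For reference, a textbook Hamming code with 16 data bits and 5 check
bits (usually written (21,16)) can be sketched in Python as below.  This is
the classic construction with check bits at positions 1, 2, 4, 8 and 16;
whether Micron's 'modified Hamming code' is laid out this way is not known,
so treat the bit placement and function names as assumptions:

```python
CODE_BITS = 21
PARITY_POSITIONS = (1, 2, 4, 8, 16)
DATA_POSITIONS = [p for p in range(1, CODE_BITS + 1)
                  if p not in PARITY_POSITIONS]      # 16 positions


def encode(data16):
    """Return a list of bits indexed 1..21 (index 0 is unused)."""
    code = [0] * (CODE_BITS + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data16 >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in DATA_POSITIONS if pos & p) % 2
    return code


def syndrome(code):
    """XOR of the indices of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, CODE_BITS + 1):
        if code[pos]:
            s ^= pos
    return s


def decode(code):
    """Correct at most one flipped bit, then extract the 16 data bits."""
    code = list(code)
    s = syndrome(code)
    if 1 <= s <= CODE_BITS:
        code[s] ^= 1          # the syndrome names the bad position
    return sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))


word = encode(0xBEEF)
word[11] ^= 1                 # inject a single-bit error
print(hex(decode(word)))      # 0xbeef
```

A single flipped bit anywhere in the 21, check bits included, is repaired;
two flipped bits are not, and may be silently miscorrected.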
	The pinouts are the standard pinouts for these parts and the speeds
are very similar to the same specs for old NMOS DRAMS. This is a bit of a
problem for us 'cause we would like to have the faster Tcac times (25ns for
100ns part) that the new generation of CMOS parts from Mitsubishi, Hitachi,
TI etc offer. (Micron part Tcac 50ns for 100ns part).

	As usual for DRAM manufacturers Micron are loath to tell you the
error rates for the part. (If you're listening, I'd like to see it on the
data sheets guys.)

-- 
Scott Colwell			ACSnet:	scott@labtam.oz
Design Engineer			UUCP:	..uunet!munnari!labtam.oz!scott
Information Systems Division	ARPA:	scott%labtam.oz@UUNET.UU.NET
Labtam Ltd Melbourne, Australia PHONE:	+61-3-587-1444

ccplumb@watmath.waterloo.edu (Colin Plumb) (09/17/87)

In article <9024@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>Henry Spencer's suggestion that automatic error correction be included right
>in the memory chip is a good one, but I fear it won't happen soon, if at all.
>We users are so hungry for more memory we put size at a great premium, and
>the chip designers respond. If they are given a choice of more (uncorrected)
>bits vs. fewer (corrected) ones, I doubt they'd choose the latter.  
>
>Chip real estate is expensive: yield is a non-linear function of chip size,
>so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be
>very costly.  Maybe some day ...

Au contraire!  I forget my sources (trade magazines), but prototype 4 Meg
chips *do* perform ECC.

If your ECC scheme is sophisticated enough, it can handle multi-bit errors,
and thus ignore a hard error (read: flaw in the chip) or two.  Thus, yield
goes *up*.  The only problem is that this circuitry slows the chip down.

One of the fundamental theorems of information theory states that the
number of usable bits on a memory chip can approach, as closely as desired,
the number of good bits there.  (Actually, it's for communication channels,
but the theory applies equally to memory.)  This assumes very sophisticated
ECC and indefinitely large memory chips, but one can do a pretty good job
with 4 Megabits and reasonable timing constraints.

	-Colin Plumb (ccplumb@watmath)

I'll hold the GIRAFFE while you fill the BATHTUB with brightly coloured
MACHINE TOOLS!!

elwell@tut.cis.ohio-state.edu (Clayton Elwell) (09/17/87)

henry@utzoo.UUCP (Henry Spencer) writes:

    Clearly, what we need, urgently, is ECC on the damn memory chips.  There
    have already been mutterings about this, but no commercial products as
    far as I know.

I have a data sheet from Micron Technology that describes a 64K DRAM
with ECC from a couple years ago.  Anyone know if they're actually
shipping this beastie?

-- 
							      Clayton M. Elwell
       The Ohio State University Department of Computer and Information Science
       (614) 292-6546	 UUCP: ...!cbosgd!osu-cis!tut.cis.ohio-state.edu!elwell
		      ARPA: elwell@ohio-state.arpa (not working well right now)

qwerty@drutx.ATT.COM (Brian Jones) (09/17/87)

In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Clearly, what we need, urgently, is ECC on the damn memory chips.  There
> have already been mutterings about this, but no commercial products as
> far as I know.                               ^^^^^^^^^^^^^^^^^^^^^^

Intel has the 8206/8207 chip set for dual port DRAM control with DEDSEC
(dual error detection, single error correction).
-- 

Brian Jones  aka  {ihnp4,allegra}!{drutx}!qwerty  @  AT&T-IS, Denver

randys@mipon3.intel.com (Randy Steck) (09/18/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.

There is certainly a trend toward making "smarter" memory chips, but ECC is
a different animal altogether that does not really lend itself to
implementation on the memory chip.  It certainly doesn't belong on the most
common organization of the memory device (X1) since the overhead of using
it is so high (in terms of silicon cost).  The cost and yield curve in this
case tends to argue that the ECC logic be included directly on a much
smarter and more configurable memory controller.  I would propose a memory
controller that was smart enough to do ECC, powerful enough to drive the
array of memory devices directly (relaxing the access time requirements),
and smart enough to work with others of its type in a system without
contention.

>and the problem
>with needing read-modify-write cycles for a partial write goes away because
>dynamic RAMs have to do this *anyway*.  (Essentially all accesses to DRAMs
>are r-m-w cycles, because the internal readout operation is destructive
>and must be followed by a writeback, ....

Unfortunately, this is not really true.  The apparent RMW cycle that is
performed by DRAMs is a characteristic of the circuitry and not of the
logical design.  In other words, the designer of the DRAM has done nothing
to sequence the refresh of the DRAM cell.  The act of reading the memory
cell is sufficient to refresh it to its fully charged state.  The
requirements of ECC would be that a cell would have to also be "flipped"
during the interval in which it is read, which would be extremely difficult
without some form of sequencing logic.  (And sequencing is really very tough
without a clock!)

>It's to the credit of DRAM designers that these grubby details are largely
>invisible nowadays; high time they did the same for ECC.)

Although I have enormous respect for my colleagues who *want* to spend
their lives looking at circuit simulations to create a DRAM, I think it is
stretching to say that they have gone to great lengths to hide the "grubby
details".  These details are an inherent part of the mechanism by which
DRAM cells are read and written.  There is no easy counterpart to the
problem for ECC.

Please notice that I am not saying that it cannot be done (Micron Tech.
already did it!), just that it is not feasible for the foreseeable future
given the alternative implementations.  Besides, do you really care where
the ECC is done as long as it is done and you don't have to bother with
it?

Randy Steck
Intel Corp.     ...intelca!mipos3!omepd!mipon3!randys

pf@diab.UUCP (Per Fogelstrom) (09/18/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.  This is an ideal place for ECC:  wide words are available
>  [ deleted text ]

There has been an announcement about such a chip. A 1Meg * 1 bit dynamic
CMOS RAM, with "row error correction" over, I believe, 256 bits. Forgive
me if I'm wrong (can't find that da**ed paper) but I think the
manufacturer was "Samsung".

adam@gec-mi-at.co.uk (Adam Quantrill) (10/07/87)

In article <14617@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>
>If your ECC scheme is sophisticated enough, it can handle multi-bit errors,
>and thus ignore a hard error (read: flaw in the chip) or two.  Thus, yield
>goes *up*.  The only problem is that this circuitry slows the chip down.
>
It needn't slow down the chip that much. If you do the ECC at chip refresh time,
the random errors will be spotted then and appropriate action taken. Also, this
approach will minimise the chance of two independent errors corrupting the same
row, especially if that row hadn't been accessed for yonks.
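
A toy Python model of scrubbing at refresh time, assuming a Hamming(7,4)
code per word purely to keep it small (a real part would use wider words,
and the function names here are invented). It shows why correcting between
hits matters: two soft errors in the same word are survivable only if a
scrub pass lands between them:

```python
DATA_POSITIONS = (3, 5, 6, 7)
PARITY_POSITIONS = (1, 2, 4)


def encode(nibble):
    """Hamming(7,4) codeword as a list of bits indexed 1..7."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        code[p] = sum(code[pos] for pos in DATA_POSITIONS if pos & p) % 2
    return code


def syndrome(code):
    """XOR of the indices of set bits; nonzero points at a single bad bit."""
    s = 0
    for pos in range(1, 8):
        if code[pos]:
            s ^= pos
    return s


def scrub(memory):
    """Refresh-time pass: fix any single-bit error in place, count fixes."""
    fixed = 0
    for code in memory:
        s = syndrome(code)
        if s:
            code[s] ^= 1
            fixed += 1
    return fixed


def read(code):
    """Normal access: correct a copy, then extract the 4 data bits."""
    copy = list(code)
    s = syndrome(copy)
    if s:
        copy[s] ^= 1
    return sum(copy[pos] << i for i, pos in enumerate(DATA_POSITIONS))


def run(scrub_between_hits):
    memory = [encode(0b1011)]
    memory[0][3] ^= 1              # first soft error
    if scrub_between_hits:
        scrub(memory)
    memory[0][6] ^= 1              # second soft error, later, same word
    return read(memory[0])


print(run(True) == 0b1011)    # True
print(run(False) == 0b1011)   # False
```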

It would still be a good idea to have an extra pad on the chip to flag hard
errors so the chip can be graded to:

-totally correct
-correctable hard errors
-duff

but I don't think it would be necessary to bring this out to a pin.
       -Adam.

/* If at first it don't compile, kludge, kludge again.*/

jerry@oliveb.UUCP (Jerry Aguirre) (10/12/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There

The disadvantage is that this provides less protection.  Off chip ECC
protects against the total failure of the chip, not just the failure of
a bit or two.  If an address or output line fails you would never know
about it with on-chip ECC.

There is also a problem with how the memory chips are going to
communicate the ECC information to the CPU.  Not only does the chip have
to notify the CPU about both uncorrected and corrected errors but, at
least at the diagnostic level, you probably want to be able to
interrogate the chip about the details of the error.  All this sounds
like more IO pins which are already at a premium.

On the other hand having both would be a real win.  With each chip
handling its own ECC you could have every bit of a word wrong and still
have it corrected.  Also it could be checking every memory location at
refresh time instead of waiting to find errors when they are accessed.
(And having multiple errors accumulate in infrequently accessed words.)
With a second level of correction the on-chip ECC could fail silently
and thus not require any extra pins.

				Jerry Aguirre

aglew%mycroft@gswd-vms.Gould.COM (Andy Glew) (10/15/87)

/* Written 11:08 pm  Oct 14, 1987 by jerry@oliveb.uu in mycroft:fa.unix-wizards */
>Jerry Aguirre <jerry@oliveb.uucp>:
>>Henry Spencer:
>>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>
>The disadvantage is that this provides less protection.  Off chip ECC
>protects against the total failure of the chip, not just the failure of
>a bit or two.  If an address or output line fails you would never know
>about it with on-chip ECC.

Maybe the place to put ECC is where the data is used - on the CPU
chip, at the disk controller, and so on. This way you can detect and
correct faults both at the memory chip, and in the interconnection.
    The trade-off is the number of wires in the interconnection,
against the error rate due to the interconnection: wiring faults, EMI,
etc.

I suspect that the tradeoff lies with ECC on memory right now, but it
may well move if interconnection costs fall (but error rates increase).
Note also that interconnection complexity may decrease, if ECC is on 
chip at either end of the memory/cpu highway.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or it is as