[comp.arch] Double-bit errors and ECC memory

roy@phri.UUCP (Roy Smith) (09/10/87)

In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
> Does anyone know about soft failure modes of DRAMs ? How likely is it to
> find double bit errors ? With denser and denser memory chips, one might
> expect that one day soon, background alpha particles will be able to flip
> several adjacent bits.

	The way most (all?) modern memory systems are built is to have each
chip contribute a single bit to each of many words.  Thus, a typical 1
Mbyte ECC board (small by today's standards) might consist of 39 256k
chips, each chip contributing a single bit to each of the 256k 39-bit words
(32 data plus 7 ECC bits) on the board.  If several bits in a given chip
were to go bad, you would see errors in the same bit of several different
words.  If an entire chip were to die, you would see an error in the same
bit of *every* word on the board.  The memory controller would be able to
correct any of these problems.

	Note that the typical-but-mythical memory board described above
has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
correct an N-bit error, this board should be able to detect and correct as
many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
so far as to pluck out any 3 RAM chips on the board without losing any
function (other than, maybe, access speed).
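To see why the bit-per-chip layout matters, here is a tiny sketch (Python
used purely as a modern illustration; the sizes are the ones above).  A
totally dead chip corrupts exactly one bit position in every word, so each
word is still within reach of per-word single-bit correction:

```python
# Each of 39 "chips" supplies one bit position of every 39-bit word,
# so one dead chip corrupts at most one bit per word.
N_WORDS = 8          # tiny memory for illustration (really 256k words)
memory = [0b101 for _ in range(N_WORDS)]

def kill_chip(mem, chip):
    """Model a totally dead chip as bit `chip` stuck at 1 in every word."""
    return [w | (1 << chip) for w in mem]

corrupted = kill_chip(memory, chip=5)
# Every word differs from the original in at most one bit position,
# which a per-word single-error-correcting code can repair:
for good, bad in zip(memory, corrupted):
    assert bin(good ^ bad).count("1") <= 1
```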
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

neil@mitsumi.UUCP (Neil Katin) (09/11/87)

->In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
->	The way most (all?) modern memory systems are built is to have each
->chip contribute a single bit to each of many words.  Thus, a typical 1
->Mbyte ECC board (small by today's standards) might consist of 39 256k
->chips, each chip contributing a single bit to each of the 256k 39-bit words
->(32 data plus 7 ECC bits) on the board.  If several bits in a given chip
->were to go bad, you would see errors in the same bit of several different
->words.  If an entire chip were to die, you would see an error in the same
->bit of *every* word on the board.  The memory controller would be able to
->correct any of these problems.
->
->	Note that the typical-but-mythical memory board described above
->has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
->correct an N-bit error, this board should be able to detect and correct as
->many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
->so far as to pluck out any 3 RAM chips on the board without losing any
->function (other than, maybe, access speed).
->-- 
->Roy Smith, {allegra,cmcl2,philabs}!phri!roy
->System Administrator, Public Health Research Institute
->455 First Avenue, New York, NY 10016

Sorry, I don't believe that is correct.  As I understand error correcting
codes, it takes at least log2(m) bits to protect an m-bit data word from
a one bit error.  That means that you need three bits to protect a byte,
and five bits to protect a 32-bit word.

I think (it's been a while since I did the math) that seven bits
are enough to protect against two bit errors for a 32 bit word.

The place where "2N+1" comes in is the "error distance" needed to
map an erroneous data word back to a correct one.  There is basically
a tradeoff between pure detection (distance N+1) and correction
(distance 2N+1).  In other words, with the same number of code bits
you could either correct a two bit error or detect a four bit error.

	Neil Katin
	{amiga,pyramid}!mitsumi!neil

reiter@endor.harvard.edu (Ehud Reiter) (09/11/87)

In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Note that the typical-but-mythical memory board described above
>has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
>correct an N-bit error, this board should be able to detect and correct as
>many as 3 bad bits in any 32-bit word.

No.  7 check bits will let you correct single-bit errors, and detect double
bit errors.  You would need many more check bits to detect and correct
triple bit errors.

A Hamming code which can correct 1-bit errors and detect 2-bit errors
requires
		ceiling(log2(N)) + 1
check bits, where N is the total number of bits (data + check), i.e. N = 39
if there are 32 data bits and 7 check bits.
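The formula is easy to check mechanically.  A sketch of the standard
counting argument (my illustration, not tied to any particular board):

```python
def secded_check_bits(data_bits):
    """Check bits for single-error-correct, double-error-detect (SEC-DED):
    smallest r with 2**r >= data_bits + r + 1 (enough syndromes to point
    at any of the data+check bits, or "no error"), plus one overall
    parity bit for double-error detection."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

assert secded_check_bits(32) == 7   # the 39-bit word discussed above
assert secded_check_bits(64) == 8
assert secded_check_bits(8) == 5
```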

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

oconnor@sunray.steinmetz (Dennis Oconnor) (09/11/87)

In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Note that the typical-but-mythical memory board described above
>has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
>correct an N-bit error, this board should be able to detect and correct as
>many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
>so far as to pluck out any 3 RAM chips on the board without loosing any
>function (other than, maybe, access speed).
>-- 
>Roy Smith, {allegra,cmcl2,philabs}!phri!roy
>System Administrator, Public Health Research Institute
>455 First Avenue, New York, NY 10016

Sorry, this is incorrect. To perform just SINGLE bit error CORRECTION
you need 1+log2(word-width) ECC bits. That means you need
6 bits for a 32-bit word, 5 for a 16-bit halfword, and 4 for a byte.
Which is why you don't see ECC performed at the byte level, and DO
see it performed at the word level, even though this makes writing
a byte a pain in the neck ( to write a byte into an ECC'd word, you
must read out the word, substitute in the new byte, and recompute
the ECC for the new word; then you can write it back ). To perform
DOUBLE bit error CORRECTION, you need to DOUBLE the number of check
bits ( for randomly-occurring bit errors; block-error correcting
codes where all the errors are assumed to be adjacent are different;
these are applicable to serial media like disk drives, not to memories ).
Error DETECTION is another kettle of fish : for instance, a single
parity bit detects ALL situations where an odd number of errors has
occurred. 
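The byte-write sequence described above looks like this as a sketch (my
illustration; the one-parity-bit `ecc_encode`/`ecc_decode` are toy
stand-ins for a real Hamming code, just to make the shape concrete):

```python
def ecc_encode(data):
    """Toy stand-in for a real encoder: append one parity bit."""
    return (data << 1) | (bin(data).count("1") & 1)

def ecc_decode(word):
    """Check the toy parity and strip it off."""
    assert bin(word >> 1).count("1") & 1 == (word & 1)
    return word >> 1

def write_byte(memory, word_addr, byte_index, value):
    # To write one byte into an ECC'd word you must: read out the word
    # (through the checker), substitute in the new byte, recompute the
    # ECC for the new word, then write it back.
    data = ecc_decode(memory[word_addr])
    shift = 8 * byte_index
    data = (data & ~(0xFF << shift)) | ((value & 0xFF) << shift)
    memory[word_addr] = ecc_encode(data)

mem = [ecc_encode(0x11223344)]
write_byte(mem, 0, 0, 0xAB)
assert ecc_decode(mem[0]) == 0x112233AB
```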

A simple explanation ( intuitive, not necessarily a proof ) for why
you need 1+log2(word-width) bits of check code to correct a
single bit error is the following : You need to be able to locate
the error to correct it, and to locate a bit in a word of
length(word-width + check-bits) [remember, the error might be in
the check bits] you need log2(word-width + check-bits) bits of
information. If number_of_check_bits < number_of_data_bits,
this is equivalent to 1+log2(word-width).

I could be SLIGHTLY wrong about this stuff : it's been a while.

--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

maa@nbires.UUCP (09/12/87)

In article <7319@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>Sorry, this is incorrect. To perform just SINGLE bit error CORRECTION
>you need 1+log2(word-width) bits of ECC bits. That means you need
>6 bits for a 32-bit word, 5 for a 16-bit halfword, and 4 for a byte.
> <etc.>

Not strictly true!  I can remember reading (sorry, too long ago to remember
where) about a clever way to detect and correct all single bit errors, detect
all double bit errors, and correct most double bit errors USING ONE BIT PER
WORD.

The idea is that each parity bit is calculated as the mod 2 sum of all the bits
in the word plus one bit from each word at addresses +- n where n is the word
size:

n+8  p7654321*
n+7  p765432*0
...
n+2  p7*543210
n+1  p*6543210
n    X********		parity bit X calculated as mod 2 sum of *'s
n-1  p*6543210
n-2  p7*543210
...
n-7  p765432*0
n-8  p7654321*

God only knows how this could be implemented in a real memory system though;
to do a read/write you need to check/set the parity bits on all words +-n which
means reading all words +-2n.
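My reconstruction of the scheme as a sketch (the exact bit/distance
pairing is my reading of the diagram above and may be off; note how a
single flipped bit shows up in three parity checks, which is what lets
the scheme localize it):

```python
WORD = 8  # data bits per word, as in the diagram

def parity_bit(mem, n):
    """Mod 2 sum of all bits of word n, plus one bit from each word at
    addresses n +- k: per the diagonal pattern, the word at distance k
    contributes bit (WORD - k)."""
    total = bin(mem[n]).count("1")
    for k in range(1, WORD + 1):
        for addr in (n - k, n + k):
            if 0 <= addr < len(mem):
                total += (mem[addr] >> (WORD - k)) & 1
    return total & 1

mem = [0b10110010] * 20
base = [parity_bit(mem, i) for i in range(len(mem))]
mem[10] ^= 1 << 5                      # flip bit 5 of word 10
after = [parity_bit(mem, i) for i in range(len(mem))]
changed = [i for i in range(len(mem)) if base[i] != after[i]]
assert changed == [7, 10, 13]          # word 10 and words 10 +- (8-5)
```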

Maybe some clever VLSI hack will do it for us.  8-)

Mark

P.S.  If anyone out there knows of any references to this type of coding, I'm
interested.  I was just a dumb (smartass?) college kid when I read (and mostly
forgot) this stuff!

alverson@decwrl.dec.com (Robert Alverson) (09/15/87)

In article <1215@nbires.UUCP> maa@nbires.UUCP (Mark Armbrust) writes:
>Not strictly true!  I can remember reading (sorry, too long ago to remember
>where) about a clever way to detect and correct all single bit errors, detect
>all double bit errors, and correct most double bit errors USING ONE BIT PER
>WORD.
... describes wonderfully convoluted ECC method.

The 1+log2(...) relation is *strictly* true.  This is a result from
information theory.  It seems to me that the scheme you mentioned lowers
the cost/bit of ECC by effectively using a larger word size.  Since the
number of check bits needed is logarithmic to the word size, you can
make the cost/bit arbitrarily low by working on more bits at once.  Note
that you don't get something for nothing.  Since you are checking over
more bits, there is a greater chance of multiple bit errors that you
cannot correct.  Also there is the mentioned hardware complexity of
the scheme you described.
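The logarithmic cost/bit is easy to see numerically (a sketch using the
usual SEC-DED counting argument, nothing chip-specific):

```python
def secded_bits(m):
    """SEC-DED check bits for an m-bit data word: smallest r with
    2**r >= m + r + 1, plus one overall parity bit."""
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r + 1

# Check-bit overhead per data bit shrinks as the word gets wider:
overhead = [secded_bits(m) / m for m in (8, 32, 128, 512)]
assert overhead == sorted(overhead, reverse=True)
```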

Bob

henry@utzoo.UUCP (Henry Spencer) (09/15/87)

Clearly, what we need, urgently, is ECC on the damn memory chips.  There
have already been mutterings about this, but no commercial products as
far as I know.  This is an ideal place for ECC:  wide words are available
internally to reduce the number of correction bits needed (to the extent
that this is desirable -- fewer bits mean poorer coverage against multiple
errors), modest amounts of circuitry are not hard to add, and the problem
with needing read-modify-write cycles for a partial write goes away because
dynamic RAMs have to do this *anyway*.  (Essentially all accesses to DRAMs
are r-m-w cycles, because the internal readout operation is destructive
and must be followed by a writeback, and the chip works internally with
quite large words and *any* write is a partial write, needing a read first.
It's to the credit of DRAM designers that these grubby details are largely
invisible nowadays; high time they did the same for ECC.)
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

nather@ut-sally.UUCP (Ed Nather) (09/16/87)

Henry Spencer's suggestion that automatic error correction be included right
in the memory chip is a good one, but I fear it won't happen soon, if at all.
We users are so hungry for more memory we put size at a great premium, and
the chip designers respond. If they are given a choice of more (uncorrected)
bits vs. fewer (corrected) ones, I doubt they'd choose the latter.  

Chip real estate is expensive: yield is a non-linear function of chip size,
so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be
very costly.  Maybe some day ...

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU

mark@mips.UUCP (09/16/87)

In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes

	> Clearly, what we need, urgently, is ECC on the damn memory
	> chips.  There have already been mutterings about this, but no
	> commercial products as far as I know.

Micron Technology's 256Kbit dynamic RAM has on-chip ECC.  And customers
just frigging HATE the idea.  You shoulda been there (NY Hilton, Feb 1985)
at ISSCC when Jim O'Toole of Micron Technology faced an angry mob of
non-believers and tried to explain the advantages of on-chip ECC.  Poor
guy got hooted off the platform.

The gripes against ECC are (1) it's "dishonest" because it lets mfrs
sell defective chips.  {This was also heard three years previously
when redundant memories were first discussed.}  (2) There's no way to
tell whether a given chip has a hard error {ECC masks it}, in which
case the single-bit ECC provides no protection against soft errors.
Note that a hard error can occur weeks after system installation
so special RAM chip "test modes" aren't useful here.

Big customers (the ones that DRAM mfrs seek to please!) have a
Component Qualification and Reliability group, who qualify and/or
disqualify RAM vendors.  If the head of this group doesn't want
ECC RAMs, then he doesn't qual them and that company doesn't buy
them.  Sadly, the most savage attacks on Mr. O'Toole of Micron
came from the heads of Qual depts. of immense DRAM consumers.
Most notable among them was Mr. X of Burroughs (Unisys) who also
led the battle against redundant RAMs three years before.

DRAM mfrs therefore *perceived* that ECC RAMs were poison in the
(major customer) marketplace, so they backed away from the idea
PRONTO.  In fact, I believe (don't know for sure) that even Micron
Technology gave up on ECC and left it off their 1-Megabit DRAM.
You can call them in Boise, Idaho to find out.
-- 
-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-720-1700 x208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@apple.UUCP (Brian Case) (09/16/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.

Micron Technology was releasing information about just such a DRAM years
ago (maybe 2? 1? 3?), at least to the trade press.  I don't know if they
ever shipped any.

baum@apple.UUCP (09/17/87)

--------
[]
>In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.

Micron Technology (Boise, Idaho) made such a chip, and may even still
sell it. It took them a long time to get it out the door; they
missed a big window of opportunity on that. The organization is
256Kx1.

It's not clear that duplicating the logic on each chip is a
cost-effective solution, especially considering that at the chip
level, errors must be detected and corrected before data comes off
the chip. At a system level, this may not be necessary; you might
have an extra cycle before you need to know there is an error, and can
afford lots of time to correct it (since it's presumably an infrequent
event).

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

ccplumb@watmath.waterloo.edu (Colin Plumb) (09/17/87)

In article <9024@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>Henry Spencer's suggestion that automatic error correction be included right
>in the memory chip is a good one, but I fear it won't happen soon, if at all.
>We users are so hungry for more memory we put size at a great premium, and
>the chip designers respond. If they are given a choice of more (uncorrected)
>bits vs. fewer (corrected) ones, I doubt they'd choose the latter.  
>
>Chip real estate is expensive: yield is a non-linear function of chip size,
>so tacking ECC manipulations on top of, say, a 4 Mbit memory chip would be
>very costly.  Maybe some day ...

Au contraire!  I forget my sources (trade magazines), but prototype 4 Meg
chips *do* perform ECC.

If your ECC scheme is sophisticated enough, it can handle multi-bit errors,
and thus ignore a hard error (read: flaw in the chip) or two.  Thus, yield
goes *up*.  The only problem is that this circuitry slows the chip down.

One of the fundamental theorems of information theory states that the
number of usable bits on a memory chip can approach, as closely as desired,
the number of good bits there.  (Actually, it's for communication channels,
but the theory applies equally to memory.)  This assumes very sophisticated
ECC and indefinitely large memory chips, but one can do a pretty good job
with 4 Megabits and reasonable timing constraints.

	-Colin Plumb (ccplumb@watmath)

I'll hold the GIRAFFE while you fill the BATHTUB with brightly coloured
MACHINE TOOLS!!

elwell@tut.cis.ohio-state.edu (Clayton Elwell) (09/17/87)

henry@utzoo.UUCP (Henry Spencer) writes:

    Clearly, what we need, urgently, is ECC on the damn memory chips.  There
    have already been mutterings about this, but no commercial products as
    far as I know.

I have a data sheet from Micron Technology that describes a 64K DRAM
with ECC from a couple of years ago.  Anyone know if they're actually
shipping this beastie?

-- 
							      Clayton M. Elwell
       The Ohio State University Department of Computer and Information Science
       (614) 292-6546	 UUCP: ...!cbosgd!osu-cis!tut.cis.ohio-state.edu!elwell
		      ARPA: elwell@ohio-state.arpa (not working well right now)

qwerty@drutx.ATT.COM (Brian Jones) (09/17/87)

In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Clearly, what we need, urgently, is ECC on the damn memory chips.  There
> have already been mutterings about this, but no commercial products as
> far as I know.                               ^^^^^^^^^^^^^^^^^^^^^^

Intel has the 8206/8207 chip set for dual port DRAM control with DEDSEC
(dual error detection, single error correction).
-- 

Brian Jones  aka  {ihnp4,allegra}!{drutx}!qwerty  @  AT&T-IS, Denver

randys@mipon3.intel.com (Randy Steck) (09/18/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.

There is certainly a trend toward making "smarter" memory chips, but ECC is
a different animal altogether that does not really lend itself to
implementation on the memory chip.  It certainly doesn't belong on the most
common organization of the memory device (X1) since the overhead of using
it is so high (in terms of silicon cost).  The cost and yield curve in this
case tends to argue that the ECC logic be included directly on a much
smarter and more configurable memory controller.  I would propose a memory
controller that was smart enough to do ECC, powerful enough to drive the
array of memory devices directly (relaxing the access time requirements),
and smart enough to work with others of its type in a system without
contention.

>and the problem
>with needing read-modify-write cycles for a partial write goes away because
>dynamic RAMs have to do this *anyway*.  (Essentially all accesses to DRAMs
>are r-m-w cycles, because the internal readout operation is destructive
>and must be followed by a writeback, ....

Unfortunately, this is not really true.  The apparent RMW cycle that is
performed by DRAMs is a characteristic of the circuitry and not of the
logical design.  In other words, the designer of the DRAM has done nothing
to sequence the refresh of the DRAM cell.  The act of reading the memory
cell is sufficient to refresh it to its fully charged state.  The
requirements of ECC would be that a cell would have to also be "flipped"
during the interval in which it is read, which would be extremely difficult
without some form of sequencing logic.  (And sequencing is really very tough
without a clock!)

>It's to the credit of DRAM designers that these grubby details are largely
>invisible nowadays; high time they did the same for ECC.)

Although I have enormous respect for my colleagues who *want* to spend
their lives looking at circuit simulations to create a DRAM, I think it is
stretching to say that they have gone to great lengths to hide the "grubby
details".  These details are an inherent part of the mechanism by which
DRAM cells are read and written.  There is no easy counterpart to the
problem for ECC.

Please notice that I am not saying that it cannot be done (Micron Tech.
already did it!), just that it is not feasible for the foreseeable future
given the alternative implementations.  Besides, do you really care where
the ECC is done as long as it is done and you don't have to bother with
it?

Randy Steck
Intel Corp.     ...intelca!mipos3!omepd!mipon3!randys

pf@diab.UUCP (Per Fogelstrom) (09/18/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There
>have already been mutterings about this, but no commercial products as
>far as I know.  This is an ideal place for ECC:  wide words are available
>  [ deleted text ]

There has been an announcement about such a chip: a 1Meg x 1 bit dynamic
CMOS RAM, with "row error correction" over, I believe, 256 bits. Forgive
me if I'm wrong (can't find that da**ed paper) but I think the
manufacturer was "Samsung".

jpp@slxsys.UUCP (John Pettitt) (09/19/87)

In article <686@obiwan.UUCP> mark@mips.UUCP (Mark G. Johnson) writes:
>In article <8587@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes
>	> Clearly, what we need, urgently, is ECC on the damn memory
>	> chips.  There have already been mutterings about this, but no
>	> commercial products as far as I know.
>Micron Technology's 256Kbit dynamic RAM has on-chip ECC.  And customers
>just frigging HATE the idea.  . . . 
 
>The gripes against ECC are (1) it's "dishonest" because it lets mfrs
>sell defective chips.  {This was also heard three years previously
>when redundant memories were first discussed.}  (2) There's no way to
>tell whether a given chip has a hard error {ECC masks it}, in which
>case the single-bit ECC provides no protection against soft errors.

This may be a dumb suggestion, but ......

Why not have an `ECC_FAULT' pin on the RAM chip that signals that
the on-chip ECC logic just found and corrected an error ?
This output would then be used to generate a 'ram fault'
signal to the os, and with the correct software place
the service call . . .

This would solve both 1 and 2 above as faulty chips would
be detectable.   This solution seems so simple there must 
be a catch to it but right now I can't see it.


-- 
John Pettitt - G6KCQ, CIX jpettitt, Voice +44 1 398 9422, Disclaimer applies ! 
UUCP:   {backbone}!mcvax!ukc!{ pyrltd || stc!datlog }!slxsys!jpp 
Remember: Bill Gates is the worlds greatest expert on Operating Systems :-)

daveb@geac.UUCP (Brown) (09/21/87)

In article <208@slxsys.UUCP> jpp@slxsys.UUCP (John Pettitt) writes:
>In article <686@obiwan.UUCP> mark@mips.UUCP (Mark G. Johnson) writes:
>>Micron Technology's 256Kbit dynamic RAM has on-chip ECC.  And customers
>>just frigging HATE the idea.  . . . 
>>The gripes against ECC are (1) it's "dishonest" because it lets mfrs
>>sell defective chips.  {This was also heard three years previously
>>when redundant memories were first discussed.}  

  It strikes me that the place I heard about ECC-equipped RAMs was in
a journal article on fault and radiation-resistant hardware for
military and satellite usage.  Since I wouldn't want my Comsat[1] to
suddenly go wonky because an energetic particle happened to wander by, 
much less my battle station (:-}), this is probably the area where the
customers wouldn't complain about dishonesty.

  I suspect *those* customers would welcome John Pettitt's ECC_FAULT pin.

--dave

[1] Comsat is a trademark, presumably of the Comsat Company.
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

henry@utzoo.UUCP (Henry Spencer) (09/22/87)

> The gripes against ECC are (1) it's "dishonest" because it lets mfrs
> sell defective chips.  {This was also heard three years previously
> when redundant memories were first discussed.}  (2) There's no way to
> tell whether a given chip has a hard error {ECC masks it}, in which
> case the single-bit ECC provides no protection against soft errors.

Mmm, I was actually thinking of ECC provided purely for post-manufacturing
errors, not as a way of covering up manufacturing defects.  And doing it
right would definitely require some way to find out what had happened on
chip, so that the software could cope appropriately.  In other words, what
we have right now in board-level implementations, but done on the chip.
(Yes, I realize there is a pin-count problem that makes it difficult to
devise a way of asking the chip for an error report.)

If the manufacturers want to use ECC as a way of dealing with chip defects,
that's fine by me, but it's *not* what I'm asking for.

I don't have any real hope that anyone is going to do what I want, though.
(Heavens!  Change the DRAM interface to make the system-level design
simpler?!?  Much too radical.  Completely unacceptable to Marketing.)
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (09/22/87)

In a similar vein...  We now have machines that do TLB loading in software
(e.g. MIPSCo) and even an occasional machine that does cache loading in
software (Cheriton's MMUless virtual-address cache).  Has anybody thought
about doing the correction (as opposed to detection) part of ECC in software?
Clearly this is viable only if ECC's purpose is to handle infrequent soft
errors and provide fail-soft behavior in the presence of newly-arrived hard
errors; it won't work if errors are frequent or if you are trying to cover
up rather than fix hard errors.  Given that restriction on its domain of
application, though, it seems like it might work.
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

mikew@bigboy.UUCP (09/24/87)

In article <8638@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>...Has anybody thought
>about doing the correction (as opposed to detection) part of ECC in software?
>Clearly this is viable only if ECC's purpose is to handle infrequent soft
>errors and provide fail-soft behavior in the presence of newly-arrived hard
>errors; ...  Given that restriction on its domain of
>application, though, it seems like it might work.
I was thinking about this a few days ago, and I came up with some interesting
techniques for implementing this.  For single bit errors you could just use
the ECC bits to correct them; the advantage comes if you have multiple bit
errors.  The first step is to see if the page is dirty (different from the
copy on the paging device).  If it isn't, just page it in.  This is very
likely to work, since there are a lot of pages that never get changed
(executable code) and a lot that are infrequently changed.  If this fails and
the error was in the data space of a user process, just terminate the user
process.  If all else fails, and the error is in the code space of the
kernel, you can always generate a panic (or the equivalent on your OS).  Does
anybody implement a scheme like this?  It would seem to greatly reduce the
problems caused by memory errors.
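In pseudo-kernel terms the policy is something like this (a sketch only;
the page structure and action names are made up, not any real OS API):

```python
class Page:
    """Minimal stand-in for a kernel page descriptor."""
    def __init__(self, dirty, owner):
        self.dirty, self.owner = dirty, owner

def handle_multibit_error(page, in_kernel_code, actions):
    # The recovery ladder described above, mildest remedy first.
    if not page.dirty:
        actions.append("page_in")            # clean: re-read from disk
    elif not in_kernel_code:
        actions.append(f"kill {page.owner}")  # lose one user process
    else:
        actions.append("panic")              # no safe recovery left

log = []
handle_multibit_error(Page(dirty=False, owner="ed"), False, log)
handle_multibit_error(Page(dirty=True, owner="ed"), False, log)
assert log == ["page_in", "kill ed"]
```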


-- 
Mike Wexler               UUCP: wyse!mike
ATT: (408)433-1000 x 1330

pww@alaska.cray.com (Paul Wells) (09/26/87)

In article <8638@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> In a similar vein...  We now have machines that do TLB loading in software
> (e.g. MIPSCo) and even an occasional machine that does cache loading in
> software (Cheriton's MMUless virtual-address cache).  Has anybody thought
> about doing the correction (as opposed to detection) part of ECC in software?

Depends on what you mean by "correction".  The machines I'm familiar with do
perform correction in software -- in the sense of writing the corrected word
back to fix soft errors.  However, if you mean actually decoding the syndrome 
bits to determine which bit has been flipped, this seems impractical.  What
happens if the error is in the code that corrects errors?

adam@gec-mi-at.co.uk (Adam Quantrill) (10/07/87)

In article <14617@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>
>If your ECC scheme is sophisticated enough, it can handle multi-bit errors,
>and thus ignore a hard error (read: flaw in the chip) or two.  Thus, yield
>goes *up*.  The only problem is that this circuitry slows the chip down.
>
It needn't slow down the chip that much. If you do the ECC at chip refresh time,
the random errors will be spotted then and appropriate action taken. Also, this
approach will minimise the chance of two independent errors corrupting the same
row, especially if that row hadn't been accessed for yonks.
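The scrub-at-refresh idea, as a sketch (my illustration; a toy
triple-copy code stands in for real row ECC so the corrector fits in a
few lines):

```python
def encode(row):
    """Toy code: store three copies of the row."""
    return (row, row, row)

def correct(trip):
    """Bitwise majority vote over the three copies."""
    a, b, c = trip
    val = (a & b) | (a & c) | (b & c)
    return val, trip != (val, val, val)

def refresh_and_scrub(rows):
    # Each refresh pass runs every row through the corrector, so a
    # single soft error is repaired before a second, independent error
    # can land in the same row.
    for i, trip in enumerate(rows):
        val, had_error = correct(trip)
        if had_error:
            rows[i] = encode(val)

mem = [encode(0b1010) for _ in range(4)]
mem[2] = (0b1010, 0b1110, 0b1010)   # soft error in one copy of row 2
refresh_and_scrub(mem)
assert mem[2] == (0b1010, 0b1010, 0b1010)
```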

It would still be a good idea to have an extra pad on the chip to flag hard
errors so the chip can be graded to:

-totally correct
-correctable hard errors
-duff

but I don't think it would be necessary to bring this out to a pin.
       -Adam.

/* If at first it don't compile, kludge, kludge again.*/

henry@utzoo.UUCP (Henry Spencer) (10/07/87)

> ... However, if you mean actually decoding the syndrome 
> bits to determine which bit has been flipped, this seems impractical.  What
> happens if the error is in the code that corrects errors?

What happens if the code that runs your software-managed TLB gets a TLB miss?
What happens if your pager gets a page fault?  The answer is the same:  you
have to make sure it doesn't.  Either the software has to be very careful
(which is okay for things like paging but not for hardware issues like error
correction), or else the crucial bits of software have to get special help.
Include a small amount of high-reliability static RAM to hold the memory-error
handler.  That is what Cheriton et al did for the cache handler in their
virtual-cache-MMUless design:  the hardware has no idea how to do the
virtual->real mapping for a cache miss, so the software that does the mapping
MUST NOT cache miss, so it sits in a special bit of supervisor-only memory
that is neither mapped nor cached.
-- 
PS/2: Yesterday's hardware today.    |  Henry Spencer @ U of Toronto Zoology
OS/2: Yesterday's software tomorrow. | {allegra,ihnp4,decvax,utai}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (10/11/87)

> ... However, if you mean actually decoding the syndrome 
> bits to determine which bit has been flipped, this seems impractical.  What
> happens if the error is in the code that corrects errors?

Greg Noel has pointed out that I responded to one of two possible meanings
of this question; does "code" mean the error-correction software or the
extra bits on the failing memory word?  In the latter case, which may have
been what was meant and which I didn't address, the answer is simple:  if
you want to look at them, which you most assuredly do, the hardware has to
provide a way to do it.  The simplest thing would be a register which latches
the extra bits when an error occurs.  Actually, there's a good chance that
you will have something more complicated than that if the hardware people
have done their job right -- how do you run diagnostics on error-corrected
memory without a way to inspect those bits?
-- 
"Mir" means "peace", as in           |  Henry Spencer @ U of Toronto Zoology
"the war is over; we've won".        | {allegra,ihnp4,decvax,utai}!utzoo!henry

jerry@oliveb.UUCP (Jerry Aguirre) (10/12/87)

In article <8587@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Clearly, what we need, urgently, is ECC on the damn memory chips.  There

The disadvantage is that this provides less protection.  Off chip ECC
protects against the total failure of the chip, not just the failure of
a bit or two.  If an address or output line fails you would never know
about it with on-chip ECC.

There is also a problem with how the memory chips are going to
communicate the ECC information to the CPU.  Not only does the chip have
to notify the CPU about both uncorrected and corrected errors but, at
least at the diagnostic level, you probably want to be able to
interrogate the chip about the details of the error.  All this sounds
like more IO pins which are already at a premium.

On the other hand having both would be a real win.  With each chip
handling its own ECC you could have every bit of a word wrong and still
have it corrected.  Also it could be checking every memory location at
refresh time instead of waiting to find errors when they are accessed.
(And having multiple errors accumulate in infrequently accessed words.)
With a second level of correction the on-chip ECC could fail silently
and thus not require any extra pins.

				Jerry Aguirre

pf@diab.UUCP (Per Fogelstrom) (10/15/87)

In article <8739@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... However, if you mean actually decoding the syndrome 
>> bits to determine which bit has been flipped, this seems impractical.  What
>> happens if the error is in the code that corrects errors?
>
>Greg Noel has pointed out that I responded to one of two possible meanings
>of this question; does "code" mean the error-correction software or the
>extra bits on the failing memory word?  In the latter case, which may have
> [ removed discussion about how to take care of correction bits ]

If there is an error in the LOGIC that corrects errors, then you are in
trouble; however, there must be a way to locate such faults with software.
Anyway, fault-correction logic tends to signal errors rather than miss them.

If the error is in the syndrome bits themselves, correction will take care
of that.  Even the check bits are covered by the code itself !