[comp.arch] ECC

aglew@dwarfs.csg.uiuc.edu (Andy Glew) (05/07/90)

ECC on memory has several performance costs:
(1) the time to compute the ECC on a read - can it be overlapped with
    processor execution, or can you handle a significantly delayed trap
    saying "the memory location you read several cycles ago, and have
    since copied into registers, was bad"?
(2) the hardware to do the ECC computation - does it add load delays, or could
    it be used for something else?
(3) partial writes become (non-atomic) read-modify-write operations...

Note that (3) isn't necessary if you already have all of the old cache line
data in your cache - well, you do an RMW in cache, and recompute the ECCs,
but at least you don't have to do the RMW across the bus.  This is equivalent
to a write-allocate policy, which is implicit in many of the snoopy policies
that obtain a cache line exclusively on the first write.
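
To make point (3) concrete, here is a minimal Python sketch of why a byte
write to ECC-protected memory turns into a read-modify-write.  Everything
in it is an assumption for illustration: the 32-bit word, the `EccMemory`
class, and the Hamming-style `secded_check` function are hypothetical
stand-ins, not any particular machine's design.

```python
def data_positions(n):
    """1-based codeword positions holding data bits (every non-power-of-two)."""
    pos, p = [], 1
    while len(pos) < n:
        if p & (p - 1):          # nonzero only when p is not a power of two
            pos.append(p)
        p += 1
    return pos

POS = data_positions(32)         # assumed 32-bit data word

def secded_check(word):
    """Hamming check bits (XOR of positions of set data bits) plus overall parity."""
    syndrome, parity = 0, 0
    for i, p in enumerate(POS):
        if (word >> i) & 1:
            syndrome ^= p
            parity ^= 1
    parity ^= bin(syndrome).count("1") & 1   # parity also covers the check bits
    return (syndrome << 1) | parity

class EccMemory:
    """Toy word-wide memory; the check bits cover the whole 32-bit word."""
    def __init__(self, nwords):
        self.data = [0] * nwords
        self.check = [secded_check(0)] * nwords
        self.reads = 0           # counts memory read cycles
        self.writes = 0          # counts memory write cycles

    def write_word(self, addr, word):
        self.data[addr] = word & 0xFFFFFFFF
        self.check[addr] = secded_check(self.data[addr])
        self.writes += 1

    def read_word(self, addr):
        self.reads += 1
        word = self.data[addr]
        if secded_check(word) != self.check[addr]:
            raise ValueError("ECC mismatch at word %d" % addr)
        return word

    def write_byte(self, addr, byte_index, value):
        # The check bits depend on all 32 bits, so the old word must be
        # read and merged before new check bits can be computed.
        old = self.read_word(addr)
        shift = 8 * byte_index
        merged = (old & ~(0xFF << shift)) | ((value & 0xFF) << shift)
        self.write_word(addr, merged)

mem = EccMemory(4)
mem.write_word(0, 0x11223344)
mem.write_byte(0, 1, 0xAA)       # costs one read plus one write
```

If the old word is already sitting in a cache line, the read half of
`write_byte` never has to go to memory, which is the saving described above.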

But sometimes first-write exclusivity isn't desirable, and/or you want to be
able to do partial writes.
    Q: does anyone know of "sometimes" ECC systems?  I.e., memory
systems where there is a valid bit associated with the ECC, so that a
partial write where the ECC cannot be totally computed just clears the
valid bit (perhaps downgrading to some simpler form of parity check,
which may be embedded within the ECC).  The next time that the whole
cache line is written or read, then a new full-line width ECC could be
computed.  A "scrubber" process in the OS could go through all of
memory updating the full ECC to valid during idle time.
    The obvious counterargument is that if you can afford not to have ECC
at all times, then you can afford not to have ECC at all.  Or, if the
frequency of partial writes is low enough that the (un)reliability of
not having ECC on those memory locations is acceptable, then the
performance cost of doing RMWs at those times is probably acceptable
as well. I'm not proposing or evaluating, just wondering out loud if
it has already been done somewhere.
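
The scheme being asked about might look roughly like the Python sketch
below.  This is purely hypothetical: the four-word line, the `Line` class,
the per-word parity fallback, and the `line_check` code are all assumptions
made up to illustrate the idea, not a description of any existing system.

```python
def parity(word):
    """Single parity bit over one word (the simpler fallback check)."""
    return bin(word).count("1") & 1

def line_check(words):
    """Hamming-style check value over a whole line of 32-bit words:
    XOR of the codeword positions of all set data bits."""
    check, pos = 0, 0
    for w in words:
        for i in range(32):
            pos += 1
            while (pos & (pos - 1)) == 0:    # skip power-of-two positions
                pos += 1
            if (w >> i) & 1:
                check ^= pos
    return check

class Line:
    """One memory line with a full-line ECC and a valid bit for that ECC."""
    def __init__(self, nwords=4):
        self.words = [0] * nwords
        self.parity = [0] * nwords
        self.ecc = line_check(self.words)
        self.ecc_valid = True

    def write_line(self, words):
        # Full-line write: everything needed for the ECC is at hand.
        self.words = list(words)
        self.parity = [parity(w) for w in self.words]
        self.ecc = line_check(self.words)
        self.ecc_valid = True

    def write_word(self, index, word):
        # Partial write: no read of the rest of the line, so the full
        # ECC cannot be computed; fall back to per-word parity.
        self.words[index] = word
        self.parity[index] = parity(word)
        self.ecc_valid = False

    def protection(self):
        return "ecc" if self.ecc_valid else "parity"

def scrub(lines):
    """Idle-time pass: recompute the full ECC wherever it was invalidated."""
    fixed = 0
    for line in lines:
        if not line.ecc_valid:
            if any(parity(w) != p for w, p in zip(line.words, line.parity)):
                raise ValueError("parity error found while scrubbing")
            line.ecc = line_check(line.words)
            line.ecc_valid = True
            fixed += 1
    return fixed

lines = [Line() for _ in range(3)]
lines[1].write_word(2, 0xDEADBEEF)   # line 1 drops to parity-only
repaired = scrub(lines)              # scrubber restores the full ECC
```

Between the partial write and the scrub, line 1 can detect a single-bit
error but not correct it, which is the reliability gap the counterargument
above is about.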
--
Andy Glew, aglew@uiuc.edu

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (05/08/90)

In article <AGLEW.90May7112551@dwarfs.csg.uiuc.edu> aglew@dwarfs.csg.uiuc.edu (Andy Glew) writes:
>    Q: does anyone know of "sometimes" ECC systems?  

In our central office telephone switches, we have ECC that is sometimes
switched off. Specifically, we switch off ECC when we are running in
"Lock-step Sync" on a matched processor pair, with every cycle compared
for errors. (The ECC is always written into memory; only the check is
suppressed.)


Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh185
Me? Represent other people? Don't make them laugh so hard.

rh@craycos.com (Robert Herndon) (06/15/90)

  Saying that 'ECC' can't correct multiple bit errors is too
strong.  Traditional SECDED won't help you, but there are
S2ECD2ED codes that will correct any errors in a single two
bit block of a machine word, S4ECD4ED codes that will correct
any errors in a single four bit block, etc., while recognizing
double block errors.
  ECC is generally only for people who worry about large words
(or perhaps cache lines), say, 64 bits or larger.  There is
some additional cost over simple SECDED, as SECDED for a 64
bit word requires 8 check bits, S2ECD2ED for a 64 bit word
requires 10 or 12 check bits, and S4ECD4ED for a 64 bit word
requires 16 check bits.
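
The 8-check-bit figure for SECDED on a 64-bit word can be reproduced with
the usual Hamming bound; a small Python sketch follows.  The function name
`secded_check_bits` is made up for illustration, and the sketch only covers
plain SECDED.  The block-code counts above are taken as stated and are not
derived here.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC code extended to SECDED.

    Single-error correction needs the smallest r with
    2**r >= data_bits + r + 1 (one syndrome value per bit position,
    plus one for "no error").  One extra overall parity bit adds
    double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for width in (8, 16, 32, 64):
    bits = secded_check_bits(width)
    print("%3d data bits: %d check bits (%.1f%% overhead)"
          % (width, bits, 100.0 * bits / width))
```

The relative overhead shrinks as the word gets wider, which is one reason
wide words or whole cache lines are the more natural unit for ECC.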
   Block-error correcting codes are probably not that well known or
popular; this may have some effect on how often they are used.  IBM has published (and
patented) some stuff on this, but I don't have the references
handy.  If you're interested, e-mail me and I can probably dig
them up.
  Even so, what percentage of machines "out there" really use
SECDED codes?  What percentage of memory chip production do
these products consume?

Robert Herndon				Cray Computer Corp.
rh@craycos.com				1110 Bayfield Dr.
719/540-4240				Colorado Springs, CO

stevew@wyse.wyse.com (Steve Wilson x2580 dept303) (06/16/90)

In article <1990Jun14.220128.7904@craycos.com> rh@craycos.com (Robert Herndon) writes:
>ECC is generally only for people who worry about large words
>(or perhaps cache lines), say, 64 bits or larger.  

I haven't found this to be the case. I've seen machines that
used ECC for 8-bit words (e.g. an ECC memory card for the IBM
PC) and LOTS of commercial machines that use ECC on 32-bit
memory organizations.  You'll typically see ECC on mini/super-micros
built for commercial use.

> stuff deleted....

>  Even so, what percentage of machines "out there" really use
>SECDED codes?  What percentage of memory chip production do
>these products consume?
>
>Robert Herndon				Cray Computer Corp.

Back in the bad old days, say circa 1978-1979, when the 64k
memory chips were just coming out, there was a story running
around Burroughs that Burroughs alone consumed 10% of the
world's memory production.  All of their machines at that
time had ECC.  Maybe that gives a ballpark answer to
your question.

Steve Wilson