[comp.unix.wizards] Memory ECC

JOSH@ibm.COM (Joshua Knight) (09/21/87)

In article <2891@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
 > In article <797@spar.SPAR.SLB.COM> hunt@spar.UUCP (Neil Hunt) writes:
 > > Does anyone know about soft failure modes of DRAMs ? How likely is it to
 > > find double bit errors ? With denser and denser memory chips, one might
 > > expect that one day soon, background alpha particles will be able to flip
 > > several adjacent bits.
 >
 >     The way most (all?) modern memory systems are built is to have each
 > chip contribute a single bit to each of many words.  Thus, a typical 1
 > Mbyte ECC board (small by today's standards) might consist of 39 256k
 > chips, each chip contributing a single bit to each of the 256k 39-bit words
 > (32 data plus 7 ECC bits) on the board.  If several bits in a given chip
 > were to go bad, you would see errors in the same bit of several different
 > words.  If an entire chip were to die, you would see an error in the same
 > bit of *every* word on the board.  The memory controller would be able to
 > correct any of these problems.
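
[The one-bit-per-chip organization quoted above can be illustrated with a
toy model.  This is only a sketch: the names are invented here, the board
is shrunk to 8 words for the simulation, and a dead chip is modeled, as a
worst-case assumption, as returning every one of its bits inverted.]

```python
DATA_BITS = 32
CHECK_BITS = 7
WORD_BITS = DATA_BITS + CHECK_BITS      # 39 chips, one per bit position
WORDS_PER_CHIP = 256 * 1024             # a "256k" chip holds 256k one-bit cells


def data_capacity_bytes():
    """Data capacity of the board: 256k words of 32 data bits each."""
    return WORDS_PER_CHIP * DATA_BITS // 8


def worst_case_bad_bits(dead_chips, words=8):
    """Largest number of wrong bits seen in any single word.

    chips[j][i] is the bit that chip j contributes to word i, so a
    failed chip can damage at most one bit position of each word.
    """
    # Arbitrary test pattern; the content does not affect the result.
    good = [[(i >> (j % 3)) & 1 for i in range(words)]
            for j in range(WORD_BITS)]
    # Assumption: a dead chip reads back every bit inverted.
    bad = [[1 - b for b in chip] if j in dead_chips else chip
           for j, chip in enumerate(good)]
    return max(
        sum(good[j][i] != bad[j][i] for j in range(WORD_BITS))
        for i in range(words)
    )


if __name__ == "__main__":
    print(data_capacity_bytes())          # bytes of data on the board
    print(worst_case_bad_bits({5}))       # one dead chip
    print(worst_case_bad_bits({5, 20}))   # two dead chips
```

[One dead chip costs each word a single bit; two dead chips cost each word
two bits, and so on.  How many bad bits per word the controller can then
correct depends on the code in use.]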

Quoting from "The IBM 3090 System:  An Overview" by S.G. Tucker (IBM
Systems Journal, Vol 25, No 1, 1986, pp. 4-19):

   "Both the central and expanded storage have error-correcting
    codes.  The central storage has a single-error-correcting,
    double-error-detecting code on each double word of data.  The
    code is designed to detect all four-bit errors on a single
    card.  The correcting code is passed to the caches on a fetch
    operation so that it can cover transmission errors as well as
    storage array errors.  The expanded storage is even more fault-
    tolerant.  Each quad-word of the expanded storage has a double-
    error-correcting, triple-error-detecting code.  Again, a four-
    bit error is always detected if caused by a single-card-level
    failure."

In the context of the article a word is 32 bits, so the ECC is done on
64 bit quanta in the main memory and 128 bit quanta in the expanded
storage.  I think you can buy a 3090 with 1.25GB of main and expanded
storage; however, it ain't cheap.  I don't have at my fingertips how
much memory is on each card.

Stu also describes how some combinations of "stuck at" faults and soft
faults can be corrected that are not correctable with the ECC alone.

 >     Note that the typical-but-mythical memory board described above
 > has 7 check bits per 32 bit data word.  Since you need 2N+1 check bits to
 > correct an N-bit error, this board should be able to detect and correct as
 > many as 3 bad bits in any 32-bit word.  Thus, you could, if you wanted, go
 > so far as to pluck out any 3 RAM chips on the board without losing any
 > function (other than, maybe, access speed).

Quoting from "Error-Correcting Codes for Semiconductor Memory
Applications:  A State-of-the-Art Review" by C.L. Chen and M.Y. Hsiao
(IBM J R&D Vol 28, No 2, March 1984, pp 124-134):

   "The maximum number of data bits k of a SEC-DED code must
    satisfy k less than or equal to 2**(r-1) - r."

Where SEC-DED means single-error-correcting-double-error-detecting and r
is the number of check bits.  For a particular class of double-error-
correcting-triple-error-detecting codes, the number of check bits
required for 2**m data bits is 2m+3.  All the articles in this issue
of the IBM J R&D are concerned with error correction/detection.

Any opinions expressed or errors committed are mine alone.

			Josh Knight
			IBM T.J. Watson Research Center
josh@ibm.com, josh@yktvmh.BITNET

daveb@geac.UUCP (Brown) (09/22/87)

  For further references to actual work, try the August 1984(!) issue
of IEEE Computer.
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.