[comp.bugs.4bsd] mcr0: errors

haynes@ucscc.UCSC.EDU.ucsc.edu (99700000) (12/04/87)

In article <9609@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>[I overrode the followup-to header because I know various people
>do not get comp.sys.dec, particularly those in ARPAland.]
>
>In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>>... I have noticed a large number of soft ecc errors appearing in my system
>>log.  It looks a lot like there may be a bad chip on one of my 6 memory
>>boards.  I can isolate which board (probably) by board swapping, but
>>how can I determine which chip?
>
>Not even board swapping is necessary; the address and syndrome values
>tell which board and which chip, although you will need a table to
>decode syndrome numbers.
>
>>The memory boards all pass the software diagnostic tests from Digital.

It appears to me that there is a problem specific to 750s: if a memory
bit gets munged in an area that is read-only (e.g. part of the kernel
code), the ECC corrects the error in the data sent to the cpu, so the
system keeps running OK, but nothing ever writes the corrected data
back into memory.  I get this mcr0: soft ecc situation every now and
then.  When the bad address is down in the low end of memory where
kernel code lives, the errors invariably persist until a reboot and
then invariably go away.  Apparently on other models the memory
controller itself writes the corrected data back to memory.

I've been intending to play with some code in machdep.c to fix this,
though a hard part of playing with it is that you have to wait for a
soft ecc error to occur.  What I think should be done on a soft ecc
error is to hang on to the address, read the word at that address into
a variable, then write the variable back to memory at that address,
which will store the corrected value in memory.  So I believe the
problem is really a matter of cheapness in the memory controller
hardware, combined with a failure to account for this in the software
that handles soft ecc error reports.
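
To make the read-then-write-back idea concrete, here is a rough sketch
in C.  It is NOT code from machdep.c and has not been near a real 750.
The memory is simulated, and every name in it (sim_mem, sim_read,
sim_write, sim_flip_bit, scrub_word) is made up for illustration.  The
good[] array is only a stand-in for whatever the check bits allow the
controller to reconstruct.  In a real kernel the write would presumably
also need a writable mapping of the failing address, which this sketch
does not attempt.

```c
#include <stddef.h>

#define SIM_WORDS 8

/*
 * Simulated ECC memory (hypothetical, for illustration only).
 * raw[]  is what the memory cells actually hold (may have a flipped bit).
 * good[] stands in for the value the ECC logic can reconstruct.
 */
struct sim_mem {
	unsigned long raw[SIM_WORDS];
	unsigned long good[SIM_WORDS];
};

/*
 * Read through the ECC path.  The caller gets corrected data, but the
 * cells are left untouched, which is the behaviour described above.
 * If soft_err is non-NULL it is set to 1 when a correction was needed.
 */
unsigned long
sim_read(const struct sim_mem *m, size_t i, int *soft_err)
{
	if (soft_err != NULL)
		*soft_err = (m->raw[i] != m->good[i]);
	return m->good[i];
}

/* An ordinary write stores fresh data (and, implicitly, fresh check bits). */
void
sim_write(struct sim_mem *m, size_t i, unsigned long v)
{
	m->raw[i] = v;
	m->good[i] = v;
}

/* Flip one bit in the cells only, to imitate a soft single-bit error. */
void
sim_flip_bit(struct sim_mem *m, size_t i, unsigned bit)
{
	m->raw[i] ^= 1UL << bit;
}

/*
 * The proposed fix: on a soft ecc report, read the word (getting the
 * corrected value) and write it straight back to the same location.
 */
void
scrub_word(struct sim_mem *m, size_t i)
{
	unsigned long v = sim_read(m, i, NULL);

	sim_write(m, i, v);
}
```

In the simulation, a word with a flipped bit reports a soft error on
every read until scrub_word() is called on it, after which reads come
back clean.  That matches the "lasts until reboot" symptom, since on
the real machine a reboot reloads the kernel text and so rewrites the
word.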

Will appreciate hearing from anybody who can confirm or deny my
analysis, and especially from anybody who has written code to fix it.
haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes