[comp.sys.dec] mcr0: errors

ane@hal.UUCP (Aydin "Bif" Edguer) (12/02/87)

Hello,
	I am currently running UNIX 4.3 BSD on a VAX-11/750.  Recently
I have noticed a large number of soft ecc errors appearing in my system
log.  It looks a lot like there may be a bad chip on one of my 6 memory
boards.  I can isolate which board (probably) by board swapping, but
how can I determine which chip?  The memory boards all pass the software
diagnostic tests from Digital.  I am including some of the log entries
in the hopes that this will help.

Aydin Edguer					Case Western Reserve University
!{cbosgd,decvax,sun}!mandrill.cwru.EDU!hal!ane		Cleveland, OH
Work : (216) 368 3195	0900-1700 EST M-F
----------------------------------------------------------
Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4
Nov 28 22:10:02 hal vmunix: mcr0: soft ecc addr 1222 syn 4
Nov 28 15:10:02 hal vmunix: mcr0: soft ecc addr 1228 syn 4
Nov 28 14:58:00 hal vmunix: mcr0: soft ecc addr 1229 syn 4
Nov 28 20:15:35 hal vmunix: mcr0: soft ecc addr 1230 syn 4
Nov 28 19:29:12 hal vmunix: mcr0: soft ecc addr 1231 syn 4
Nov 28 12:17:47 hal vmunix: mcr0: soft ecc addr 1233 syn 4
Nov 28 09:00:01 hal vmunix: mcr0: soft ecc addr 1238 syn 4
Nov 28 09:30:01 hal vmunix: mcr0: soft ecc addr 1239 syn 4
Nov 28 09:10:01 hal vmunix: mcr0: soft ecc addr 123a syn 4
Nov 28 09:16:25 hal vmunix: mcr0: soft ecc addr 123b syn 4
Nov 28 18:45:51 hal vmunix: mcr0: soft ecc addr 123e syn 4
Nov 28 19:15:51 hal vmunix: mcr0: soft ecc addr 123f syn 4
Nov 28 05:40:41 hal vmunix: mcr0: soft ecc addr 124f syn 4
Nov 28 22:16:28 hal vmunix: mcr0: soft ecc addr 126a syn 4
Nov 28 17:24:11 hal vmunix: mcr0: soft ecc addr 12b7 syn 4
Nov 28 19:45:54 hal vmunix: mcr0: soft ecc addr 12d8 syn 4

ables@hi3.aca.mcc.com.UUCP (King Ables) (12/04/87)

in article <192@hal.UUCP>, ane@hal.UUCP (Aydin "Bif" Edguer) says:
> Keywords: syslog mcr0 ecc errors

chris@mimsy.UUCP (Chris Torek) (12/04/87)

[I overrode the followup-to header because I know various people
do not get comp.sys.dec, particularly those in ARPAland.]

In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>... I have noticed a large number of soft ecc errors appearing in my system
>log.  It looks a lot like there may be a bad chip on one of my 6 memory
>boards.  I can isolate which board (probably) by board swapping, but
>how can I determine which chip?

Not even board swapping is necessary; the address and syndrome values
tell which board and which chip, although you will need a table to
decode syndrome numbers.

>The memory boards all pass the software diagnostic tests from Digital.

Memory diagnostics are notoriously unreliable.  There are too many
ways for the chips to fail to test them all, so diagnostics usually
look only for `serious' trouble.

>Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4

Drag in your DEC FieldServicePerson and tell (her,him) that if you
were running VMS it would have printed

	%FOO-W-BAR, Some very long message that somewhere mentions
	something about memory chip corrected errors without giving
	anyone any clue as to what that means even though it takes
	several thousand characters to say it,*
		ADDR=00120D04

The exact meaning of all those bits depends on the memory controller
and memory boards in your system; typically the first few bits specify
an array number, the middle bits the address within the array, and the
last 7 or 8 bits the failing chip.  (This is why the address may vary,
but not the syndrome number.)
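
As a rough illustration of the point above, the log lines from the
original article can be tallied by syndrome.  This is only a sketch in
Python: the regular expression is written against the message format
quoted in this thread, the helper name summarize is made up for the
example, and both fields are assumed to be printed in hex.  It does not
say which array or chip is at fault, since that depends on the
controller and still needs the syndrome table.

```python
import re
from collections import defaultdict

# Pattern for the console message quoted in this thread (hex fields assumed).
ECC_LINE = re.compile(r"mcr(\d+): soft ecc addr ([0-9a-f]+) syn ([0-9a-f]+)")

def summarize(lines):
    """Group reported addresses by (controller, syndrome).

    Returns {(controller, syndrome): (count, lowest_addr, highest_addr)}.
    """
    groups = defaultdict(list)
    for line in lines:
        m = ECC_LINE.search(line)
        if m:
            ctrl = int(m.group(1))
            addr = int(m.group(2), 16)
            syn = int(m.group(3), 16)
            groups[(ctrl, syn)].append(addr)
    return {key: (len(a), min(a), max(a)) for key, a in groups.items()}

# Three of the entries from the original article.
sample = [
    "Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4",
    "Nov 28 22:10:02 hal vmunix: mcr0: soft ecc addr 1222 syn 4",
    "Nov 28 19:45:54 hal vmunix: mcr0: soft ecc addr 12d8 syn 4",
]
for (ctrl, syn), (count, lo, hi) in summarize(sample).items():
    print(f"mcr{ctrl} syn {syn:x}: {count} errors, addr {lo:x}..{hi:x}")
```

All three sample lines fall under a single (mcr0, syndrome 4) key,
which is the pattern described above: the address varies, the syndrome
does not.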

-----
*Just out of curiosity, I would like to know the actual message
format, and the description under the %FOO-W-BAR key in the VMS
manuals (which I suspect is something like this: `VMS has detected
and corrected a minor hardware fault; call your Field Service
Engineer'---i.e., utterly undescriptive).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

haynes@ucscc.UCSC.EDU.ucsc.edu (99700000) (12/04/87)

In article <9609@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>[I overrode the followup-to header because I know various people
>do not get comp.sys.dec, particularly those in ARPAland.]
>
>In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>>... I have noticed a large number of soft ecc errors appearing in my system
>>log.  It looks a lot like there may be a bad chip on one of my 6 memory
>>boards.  I can isolate which board (probably) by board swapping, but
>>how can I determine which chip?
>
>Not even board swapping is necessary; the address and syndrome values
>tell which board and which chip, although you will need a table to
>decode syndrome numbers.
>
>>The memory boards all pass the software diagnostic tests from Digital.

It appears to me that there is a problem specific to 750s: if a memory
bit gets munged in an area that is read-only (e.g. part of the kernel
code), the ECC corrects the error in the data sent to the CPU, so the
system keeps running OK, but nothing ever writes the corrected data
back into memory.  I get this mcr0: soft ecc situation every now and
then.  When the address is in the low end of memory, where kernel code
lives, the trouble invariably lasts until a reboot and then invariably
goes away.  Apparently on other models the memory controller itself
writes the corrected data back to memory.

I've been intending to play with some code in machdep.c to fix this,
though a hard part of playing with it is that you have to wait for a
soft ecc error to occur.  What I think should be done on a soft ECC
error is to hang on to the address, read the word at that address into
a variable, then write the variable back to memory at that address, which
will store the corrected value in memory.  So I believe the problem
is really a matter of cheapness in the memory controller hardware
and failure to account for this in the software that handles soft
ecc error reports.
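
The read-then-write-back idea can be shown with a toy model.  This is a
sketch in Python, not kernel code: ToyEccMemory, flip_bit and scrub are
names invented for the example, and the "good" dictionary merely stands
in for what real ECC check bits would let the controller reconstruct.
A real fix in machdep.c would also have to cope with physical addresses
and write-protected pages, which this ignores.

```python
class ToyEccMemory:
    """Toy model: reads are corrected on the fly, the stored cell stays bad."""

    def __init__(self):
        self.good = {}        # addr -> intended value (stand-in for ECC recovery)
        self.stored = {}      # addr -> bits actually held in the array
        self.soft_errors = 0  # count of corrected-error reports

    def write(self, addr, value):
        # A write stores fresh, consistent data in the array.
        self.good[addr] = value
        self.stored[addr] = value

    def flip_bit(self, addr, bit):
        # Simulate a transient single-bit upset in the array.
        self.stored[addr] ^= 1 << bit

    def read(self, addr):
        # The controller corrects the data it hands out and reports the
        # error, but does not repair the stored cell.
        if self.stored[addr] != self.good[addr]:
            self.soft_errors += 1
        return self.good[addr]


def scrub(mem, addr):
    """Read the (corrected) word and write it straight back."""
    mem.write(addr, mem.read(addr))


mem = ToyEccMemory()
mem.write(0x120D, 0xDEADBEEF)
mem.flip_bit(0x120D, 2)
mem.read(0x120D)              # reported once
mem.read(0x120D)              # reported again: nothing repaired the cell
before = mem.soft_errors
scrub(mem, 0x120D)            # the scrub's own read reports one last time
mem.read(0x120D)              # clean now, no new report
print(before, mem.soft_errors)
```

In this model every read of the damaged word is reported until the
scrub runs, after which the count stops growing.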

I would appreciate hearing from anybody who can confirm or deny my
analysis, and especially from anybody who has written code to fix it.
haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes

klb@philabs.Philips.Com (Ken Bourque) (12/07/87)

In article <9609@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>
>>Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4
>
>Drag in your DEC FieldServicePerson and tell (her,him) that if you
>were running VMS it would have printed
>
>	%FOO-W-BAR, Some very long message that somewhere mentions
>	something about memory chip corrected errors without giving
>	anyone any clue as to what that means even though it takes
>	several thousand characters to say it,*
>		ADDR=00120D04
>
>-----
>*Just out of curiosity, I would like to know the actual message
>format, and the description under the %FOO-W-BAR key in the VMS
>manuals (which I suspect is something like this: `VMS has detected
>and corrected a minor hardware fault; call your Field Service
>Engineer'---i.e., utterly undescriptive).


Here is a VMS error log entry for a memory error.  I suggest that in the
future you get your facts before distorting them.

 V A X / V M S        SYSTEM ERROR REPORT      COMPILED  7-DEC-1987 09:56
                                                                PAGE   1.

 **************************** ENTRY    1043. ****************************
 ERROR SEQUENCE 62183.                             LOGGED ON SID 02006278

 CORRECTED MEMORY ERROR  25-SEP-1987 08:05:50.44
                         KA750    REV# 120.   UCODE REV# 98.

 CONTROLLER AT SLOT INDEX #0.

       CSR0            20266229
                                       ERROR SYNDROME = 29
                                       CORRECTED ERROR, BIT #9.
                                       ARRAY BANK #1. IN ERROR
                                       ARRAY #2. IN ERROR
                                       CORRECTED ERROR FLAG
       CSR1            10000000
                                       ENABLE REPORTING CORRECTED ERRORS
       CSR2            0100AAAA
                                       MEMORY SIZE = 8192.K
                                       MEMORY BASE ADDRESS = 0.K
                                       CONTROLLER IS L0016




-- 
Ken Bourque    klb@philabs.philips.com    ...!{uunet,ihnp4,decvax}!philabs!klb

cetron@utah-cs.UUCP (Edward J Cetron) (01/21/88)

	when you find it, please let me know... I've had field service
replace:

	1. all cpu boards
	2. all memory boards and memory controllers
	3. backplane
	4. power supply

	so what's left???

I'm at a loss....

-ed
cetron@cs.utah.edu