ane@hal.UUCP (Aydin "Bif" Edguer) (12/02/87)
Hello,

I am currently running UNIX 4.3 BSD on a VAX-11/750. Recently I have noticed a large number of soft ecc errors appearing in my system log. It looks a lot like there may be a bad chip on one of my 6 memory boards. I can isolate which board (probably) by board swapping, but how can I determine which chip? The memory boards all pass the software diagnostic tests from Digital.

I am including some of the log entries in the hopes that this will help.

Aydin Edguer
Case Western Reserve University
!{cbosgd,decvax,sun}!mandrill.cwru.EDU!hal!ane
Cleveland, OH
Work : (216) 368 3195 0900-1700 EST M-F
----------------------------------------------------------
Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4
Nov 28 22:10:02 hal vmunix: mcr0: soft ecc addr 1222 syn 4
Nov 28 15:10:02 hal vmunix: mcr0: soft ecc addr 1228 syn 4
Nov 28 14:58:00 hal vmunix: mcr0: soft ecc addr 1229 syn 4
Nov 28 20:15:35 hal vmunix: mcr0: soft ecc addr 1230 syn 4
Nov 28 19:29:12 hal vmunix: mcr0: soft ecc addr 1231 syn 4
Nov 28 12:17:47 hal vmunix: mcr0: soft ecc addr 1233 syn 4
Nov 28 09:00:01 hal vmunix: mcr0: soft ecc addr 1238 syn 4
Nov 28 09:30:01 hal vmunix: mcr0: soft ecc addr 1239 syn 4
Nov 28 09:10:01 hal vmunix: mcr0: soft ecc addr 123a syn 4
Nov 28 09:16:25 hal vmunix: mcr0: soft ecc addr 123b syn 4
Nov 28 18:45:51 hal vmunix: mcr0: soft ecc addr 123e syn 4
Nov 28 19:15:51 hal vmunix: mcr0: soft ecc addr 123f syn 4
Nov 28 05:40:41 hal vmunix: mcr0: soft ecc addr 124f syn 4
Nov 28 22:16:28 hal vmunix: mcr0: soft ecc addr 126a syn 4
Nov 28 17:24:11 hal vmunix: mcr0: soft ecc addr 12b7 syn 4
Nov 28 19:45:54 hal vmunix: mcr0: soft ecc addr 12d8 syn 4
chris@mimsy.UUCP (Chris Torek) (12/04/87)
[I overrode the followup-to header because I know various people do not get comp.sys.dec, particularly those in ARPAland.]

In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>... I have noticed a large number of soft ecc errors appearing in my system
>log. It looks a lot like there may be a bad chip on one of my 6 memory
>boards. I can isolate which board (probably) by board swapping, but
>how can I determine which chip?

Not even board swapping is necessary; the address and syndrome values tell which board and which chip, although you will need a table to decode syndrome numbers.

>The memory boards all pass the software diagnostic tests from Digital.

Memory diagnostics are notoriously unreliable. There are too many ways for the chips to fail to test them all, so diagnostics usually look only for `serious' trouble.

>Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4

Drag in your DEC FieldServicePerson and tell (her,him) that if you were running VMS it would have printed

	%FOO-W-BAR, Some very long message that somewhere mentions
	something about memory chip corrected errors without giving
	anyone any clue as to what that means even though it takes
	several thousand characters to say it,*
	ADDR=00120D04

The exact meaning of all those bits depends on the memory controller and memory boards in your system; typically the first few bits specify an array number, the middle bits the address within the array, and the last 7 or 8 bits the failing chip. (This is why the address may vary, but not the syndrome number.)

-----
*Just out of curiosity, I would like to know the actual message format, and the description under the %FOO-W-BAR key in the VMS manuals (which I suspect is something like this: `VMS has detected and corrected a minor hardware fault; call your Field Service Engineer'---i.e., utterly undescriptive).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain: chris@mimsy.umd.edu
Path: uunet!mimsy!chris
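As a rough illustration of the split Chris describes, the sketch below separates his illustrative ADDR=00120D04 into the two values the 4.3BSD log prints. The 8-bit width of the low field, the struct, and the function names are assumptions made for this example only; how the upper bits divide into array number and offset within the array depends on the controller and is not attempted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, inferred only from the example value in this thread:
 * low 8 bits = syndrome, remaining upper bits = address as logged by BSD. */
struct ecc_report {
    uint32_t addr;      /* value printed after "addr" in the BSD log line */
    uint32_t syndrome;  /* value printed after "syn" in the BSD log line */
};

struct ecc_report split_addr_word(uint32_t word)
{
    struct ecc_report r;
    r.syndrome = word & 0xffu;  /* assumed 8-bit syndrome field */
    r.addr = word >> 8;         /* everything above the syndrome */
    return r;
}

/* Usage: the illustrative VMS-style value splits into the BSD log values. */
void split_addr_word_example(void)
{
    struct ecc_report r = split_addr_word(0x00120D04u);
    assert(r.addr == 0x120du);
    assert(r.syndrome == 0x4u);
}
```

With the value from the post, the low byte gives syndrome 4 and the remaining bits give 120d, which is the pair shown in the quoted log line.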
haynes@ucscc.UCSC.EDU.ucsc.edu (99700000) (12/04/87)
In article <9609@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>[I overrode the followup-to header because I know various people
>do not get comp.sys.dec, particularly those in ARPAland.]
>
>In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>>... I have noticed a large number of soft ecc errors appearing in my system
>>log. It looks a lot like there may be a bad chip on one of my 6 memory
>>boards. I can isolate which board (probably) by board swapping, but
>>how can I determine which chip?
>
>Not even board swapping is necessary; the address and syndrome values
>tell which board and which chip, although you will need a table to
>decode syndrome numbers.
>
>>The memory boards all pass the software diagnostic tests from Digital.

It appears to me that there is a problem specific to 750s. If a memory bit gets munged in an area that is read-only (e.g. part of the kernel code), the ECC corrects the error in the data sent to the cpu, so the system keeps running OK; but nothing ever writes the corrected data back into memory. I have this mcr0: soft ecc situation every now and then, and the trouble invariably lasts until a reboot, and then invariably goes away. (By invariably lasts I mean invariably if it is down in the low end of memory where kernel code lives.) Apparently on other models the memory controller itself writes the corrected data back to memory.

I've been intending to play with some code in machdep.c to fix this; tho a hard part of playing with it is that you have to wait for a soft ecc error to occur. What I think should be done on a soft ECC error is to hang on to the address, read the word at that address into a variable, then write the variable back to memory at that address, which will store the corrected value in memory.

So I believe the problem is really a matter of cheapness in the memory controller hardware and failure to account for this in the software that handles soft ecc error reports.

I would appreciate hearing from anybody who can confirm or deny my analysis, and especially from anybody who has written code to fix it.

haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes
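The read-then-rewrite idea above can be sketched in a few lines of C. This is only an illustration, not the code in machdep.c: scrub_longword is a hypothetical name, and it assumes the caller already holds a writable, longword-aligned pointer to the failing location. For read-only kernel text that would need extra work (for example a temporary writable mapping) which is not shown, and translating the logged addr value into such a pointer is also left out.

```c
#include <assert.h>
#include <stdint.h>

/* Read one longword and store the same value back.
 * On a controller that corrects single-bit errors only on the read path,
 * the load returns corrected data and the store puts that data (and
 * freshly generated check bits) back into the memory array.
 * volatile keeps the compiler from optimizing the load/store pair away. */
void scrub_longword(volatile uint32_t *p)
{
    uint32_t corrected = *p;  /* read: ECC hands back corrected data */
    *p = corrected;           /* write: corrected data returns to memory */
}

/* Usage: on ordinary memory the contents are simply left unchanged. */
void scrub_longword_example(void)
{
    uint32_t word = 0x12345678u;
    scrub_longword(&word);
    assert(word == 0x12345678u);
}
```

In a kernel the load and store would presumably have to run with interrupts blocked, so that nothing else stores to the same longword between the two steps.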
klb@philabs.Philips.Com (Ken Bourque) (12/07/87)
In article <9609@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>
>>Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4
>
>Drag in your DEC FieldServicePerson and tell (her,him) that if you
>were running VMS it would have printed
>
>	%FOO-W-BAR, Some very long message that somewhere mentions
>	something about memory chip corrected errors without giving
>	anyone any clue as to what that means even though it takes
>	several thousand characters to say it,*
>	ADDR=00120D04
>
>-----
>*Just out of curiosity, I would like to know the actual message
>format, and the description under the %FOO-W-BAR key in the VMS
>manuals (which I suspect is something like this: `VMS has detected
>and corrected a minor hardware fault; call your Field Service
>Engineer'---i.e., utterly undescriptive).

Here is a VMS error log entry for a memory error. I suggest that in the future you get your facts before distorting them.

V A X / V M S    SYSTEM ERROR REPORT    COMPILED 7-DEC-1987 09:56
                                                          PAGE 1.
**************************** ENTRY 1043. ****************************
ERROR SEQUENCE 62183.            LOGGED ON SID 02006278

CORRECTED MEMORY ERROR           25-SEP-1987 08:05:50.44
KA750 REV# 120. UCODE REV# 98.

CONTROLLER AT SLOT INDEX #0.

CSR0    20266229
                ERROR SYNDROME = 29
                CORRECTED ERROR, BIT #9.
                ARRAY BANK #1. IN ERROR
                ARRAY #2. IN ERROR
                CORRECTED ERROR FLAG
CSR1    10000000
                ENABLE REPORTING CORRECTED ERRORS
CSR2    0100AAAA
                MEMORY SIZE = 8192.K
                MEMORY BASE ADDRESS = 0.K
                CONTROLLER IS L0016
-- 
Ken Bourque    klb@philabs.philips.com    ...!{uunet,ihnp4,decvax}!philabs!klb
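For comparison with the BSD log line, here is a minimal sketch that picks the CSR0 value from the report above apart. The macro, struct, and function names are made up for this example. The 7-bit syndrome field agrees with the report (20266229 gives syndrome 29); the address field in bits 9 through 23 and the corrected-error flag at 0x20000000 are assumptions that should be checked against the controller documentation or the kernel's memory controller header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CSR0 field layout; only the syndrome position is confirmed
 * by the error report quoted in this thread. */
#define CSR0_SYNDROME(csr)   ((csr) & 0x7fu)               /* low 7 bits */
#define CSR0_ADDR(csr)       (((csr) >> 9) & 0x7fffu)      /* assumed bits 9..23 */
#define CSR0_CORRECTED(csr)  (((csr) & 0x20000000u) != 0)  /* assumed flag bit */

struct csr0_fields {
    uint32_t syndrome;  /* which bit the ECC logic says was wrong */
    uint32_t addr;      /* address field, in the controller's own units */
    int corrected;      /* nonzero if the corrected-error flag is set */
};

struct csr0_fields decode_csr0(uint32_t csr)
{
    struct csr0_fields f;
    f.syndrome = CSR0_SYNDROME(csr);
    f.addr = CSR0_ADDR(csr);
    f.corrected = CSR0_CORRECTED(csr);
    return f;
}

/* Usage: decode the CSR0 value from the VMS report. */
void decode_csr0_example(void)
{
    struct csr0_fields f = decode_csr0(0x20266229u);
    assert(f.syndrome == 0x29u);  /* matches ERROR SYNDROME = 29 */
    assert(f.corrected);          /* matches CORRECTED ERROR FLAG */
}
```

Under these assumptions the same three pieces of information (syndrome, address field, corrected flag) are available on either system; the array and bank numbers in the VMS report are a further decoding of the address that is not attempted here.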