root@LBL-CSAM.arpa (09/24/86)
HELP, please!

I am running a VAX 751, with 2.0 MB of TRENDATA memory. I recently
moved from 4.1 to 4.2 OS. Ever since, I get the following error message
at about 10 minute intervals:

    mcr0: soft ecc addr xxx syn yy

(xxx and yy are NOT constant)

I also get the following when we boot:

    WARNING: should run interleaved swap with >= 2MB

My questions:

1) How do I "run interleaved"?

2) Is the boot message an indication of why I am getting the other
messages?

3) If I go back to 4.1, I don't see the "ecc" message (or the other
one, for that matter). Is there really something wrong with my memory
boards?

4) I have discovered that the "ecc" message is (likely) from
/usr/sys/vax/machdep.c and I have found several

    #if TRENDATA
    ...
    #endif

lines. But when I defined TRENDATA as an "optional" in my kernel
configuration file (and reboot), the same error messages continue
to come out. Am I missing some "bugfix" code for TRENDATA memory
on a 750? (Looks like most of the TRENDATA mods are for 780 machines.)

5) Besides risking the filling of my disk from /usr/adm/messages, is
there any other danger in ignoring the error messages?

I'd much appreciate any help. Please reply directly to me at:

    trwspf!expert@lbl-csam
or
    {decvax,ucbvax}!trwrb!trwspf!expert
chris@umcp-cs.UUCP (Chris Torek) (10/07/86)
(Since I have seen no summary of replies, and since I can answer
most of these, I shall ignore the `reply by mail' request.)

In article <4072@brl-smoke.ARPA> vader!root@LBL-CSAM.arpa (RADIX System)
writes:
>... I get the following error message at about 10 minute intervals:
>
>	mcr0: soft ecc addr xxx syn yy
>
>I also get the following when we boot:
>
>	WARNING: should run interleaved swap with >= 2MB
>
>1) How do I "run interleaved"?

This refers to swap/paging partitions. If you have two or more disc
drives, you should set up swap areas on at least two. See `Building
Systems with Config'. Multiple swap areas are supposed to be faster.
Whether they are in fact faster is a function of many variables.

>2) Is the boot message an indication of why I am getting the other
>messages?

No.

>3) If I go back to 4.1, I don't see the "ecc" message (or the other
>one, for that matter). Is there really something wrong with my memory
>boards?

Yes. 4.1 had less support for 750s, and presumably did not catch
750 ECC errors.

>4) I have discovered that the "ecc" message is (likely) from
>/usr/sys/vax/machdep.c

It is indeed.

>and I have found several
>	#if TRENDATA
>	...
>	#endif
>lines. But when I defined TRENDATA as an "optional" in my kernel
>configuration file (and reboot), the same error messages continue
>to come out. Am I missing some "bugfix" code for TRENDATA memory
>on a 750? (Looks like most of the TRENDATA mods are for 780 machines.)

The Trendata tables are for specific boards, probably for 780s.
Whether they apply to yours is questionable. In any case, Trendata
should have provided you with, or be able to provide you with,
decoding tables. If Trendata understands only VMS format errors,
just concatenate `xxx' and `yy' and pad with zeroes on the left:

	mcr0: soft ecc addr 54f90 syn e3

means the same as VMS's

	?VMS-W-WARNINGMESSAGE, ridiculously long error string that
	lets you know something is wrong, but is no more help than
	`soft ecc addr ...' when it comes to figuring out just what,
	but fortunately you can look it up in some manual, which
	will of course just tell you to call Field Service,
	ERR ADDR=054F90E3

>5) Besides risking the filling of my disk from /usr/adm/messages, is
>there any other danger in ignoring the error messages?

Yes. If another few chips fail, you will no longer get soft
(correctable) errors; you will get crashes.

Incidentally, just because you see the messages only once every ten
minutes does not mean the ECC correction is infrequent. The code
in /sys/vax/machdep.c disables ECC reporting after each error, then
re-enables it ten minutes later. This is controlled by the variable
`memintvl', which is in seconds:

	% su
	Password:
	# adb -w /vmunix /dev/kmem
	memintvl/W 1
	_memintvl: _memintvl: 258 = 1
	$q
	#

will re-enable reporting after one second. Stand back from the
console, and have plenty of paper handy! Rebooting will restore
the ten minute interval; or you can use adb again to change it
back.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@mimsy.umd.edu
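
The concatenation rule from the reply (address digits, then the syndrome digits, zero-padded on the left to eight digits) can be sketched in C as below. This is only an illustration: the function name `vms_err_addr' and its interface are hypothetical and appear in neither 4.2BSD nor VMS. It assumes the syndrome is at most two hex digits and the address at most six, as in the example message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of 4.2BSD or VMS): build the 8-digit
 * value that follows "ERR ADDR=" from the two hex fields of a
 * "mcrN: soft ecc addr xxx syn yy" console message.
 *
 * Assumption: the syndrome occupies the low two hex digits and the
 * address the remaining six.  Returns 0 on success, -1 on bad input.
 */
int vms_err_addr(const char *addr_hex, const char *syn_hex, char out[9])
{
    char *end;
    unsigned long addr, syn;

    assert(addr_hex != NULL && syn_hex != NULL && out != NULL);

    /* Reject empty or over-long fields before parsing. */
    if (strlen(addr_hex) == 0 || strlen(addr_hex) > 6)
        return -1;
    if (strlen(syn_hex) == 0 || strlen(syn_hex) > 2)
        return -1;

    addr = strtoul(addr_hex, &end, 16);
    if (*end != '\0' || addr > 0xffffffUL)
        return -1;
    syn = strtoul(syn_hex, &end, 16);
    if (*end != '\0' || syn > 0xffUL)
        return -1;

    /* Address in the upper digits, syndrome in the low two digits. */
    snprintf(out, 9, "%08lX", (addr << 8) | syn);
    return 0;
}
```

Called with "54f90" and "e3", the buffer receives 054F90E3, matching the ERR ADDR value in the example above.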