[net.unix] 4.2 "soft ecc" errors

root@LBL-CSAM.arpa (09/24/86)

HELP, please!

I am running a VAX 751, with 2.0 MB of TRENDATA memory.  I recently
moved from 4.1 to 4.2 OS.  Ever since, I get the following error
message at about 10 minute intervals:

	mcr0: soft ecc addr xxx syn yy

(xxx and yy are NOT constant)

I also get the following when we boot:

	WARNING: should run interleaved swap with >= 2MB

My questions:

1) How do I "run interleaved"?

2) Is the boot message an indication of why I am getting the other
messages?

3) If I go back to 4.1, I don't see the "ecc" message (or the other
one, for that matter).  Is there really something wrong with my memory
boards?

4) I have discovered that the "ecc" message is (likely) from
/usr/sys/vax/machdep.c and I have found several
	#if TRENDATA
	...
	#endif
lines.  But when I defined TRENDATA as an "optional" in my kernel
configuration file (and reboot), the same error messages continue
to come out.  Am I missing some "bugfix" code for TRENDATA memory
on a 750?  (Looks like most of the TRENDATA mods are for 780 machines.)

5) Besides risking the filling of my disk from /usr/adm/messages, is
there any other danger in ignoring the error messages?

I'd much appreciate any help.  Please reply directly to me at:

	trwspf!expert@lbl-csam
		or
	{decvax,ucbvax}!trwrb!trwspf!expert

chris@umcp-cs.UUCP (Chris Torek) (10/07/86)

(Since I have seen no summary of replies, and since I can answer
most of these, I shall ignore the `reply by mail' request.)

In article <4072@brl-smoke.ARPA> vader!root@LBL-CSAM.arpa (RADIX System) writes:
>... I get the following error message at about 10 minute intervals:
>
>	mcr0: soft ecc addr xxx syn yy
>
>I also get the following when we boot:
>
>	WARNING: should run interleaved swap with >= 2MB
>
>1) How do I "run interleaved"?

This refers to swap/paging partitions.  If you have two or more
disc drives, you should set up swap areas on at least two.  See
`Building Systems with Config'.  Using multiple swap areas is
supposed to be faster.  Whether it is in fact faster depends on
many variables.

>2) Is the boot message an indication of why I am getting the other
>messages?

No.

>3) If I go back to 4.1, I don't see the "ecc" message (or the other
>one, for that matter).  Is there really something wrong with my memory
>boards?

Yes.  4.1 had less support for 750s, and presumably did not catch
750 ECC errors.

>4) I have discovered that the "ecc" message is (likely) from
>/usr/sys/vax/machdep.c

It is indeed.

>and I have found several
>	#if TRENDATA
>	...
>	#endif
>lines.  But when I defined TRENDATA as an "optional" in my kernel
>configuration file (and reboot), the same error messages continue
>to come out.  Am I missing some "bugfix" code for TRENDATA memory
>on a 750?  (Looks like most of the TRENDATA mods are for 780 machines.)

The Trendata tables are for specific boards, probably for 780s.
Whether they apply to yours is questionable.  In any case, Trendata
should have provided you with, or be able to provide you with,
decoding tables.  If Trendata understands only VMS format errors,
just concatenate `xxx' and `yy' and pad with zeroes on the left:

	mcr0: soft ecc addr 54f90 syn e3

means the same as VMS's

	?VMS-W-WARNINGMESSAGE, ridiculously long error string that
	lets you know something is wrong, but is no more help than
	`soft ecc addr ...' when it comes to figuring out just
	what, but fortunately you can look it up in some manual,
	which will of course just tell you to call Field Service,
		ERR ADDR=054F90E3
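
The concatenation described above can be sketched in C.  This is only
an illustration: the helper name `vms_err_addr' is hypothetical, and
the sketch assumes that the syndrome fits in two hex digits (so that
concatenating the strings is the same as shifting the address left by
8 bits) and that the VMS field is 8 hex digits wide, as in the example.

```c
#include <stdio.h>

/*
 * Hypothetical helper: turn a 4.2BSD console line such as
 *     mcr0: soft ecc addr 54f90 syn e3
 * into the 8-digit string VMS would show after "ERR ADDR=".
 * Assumes the syndrome is at most two hex digits.
 * Returns 0 on success, -1 if the line does not match.
 */
int vms_err_addr(const char *line, char *out, size_t outlen)
{
    unsigned long addr;
    unsigned int syn;

    /* %*d skips the controller number without storing it. */
    if (sscanf(line, " mcr%*d: soft ecc addr %lx syn %x", &addr, &syn) != 2)
        return -1;

    /* Address followed by syndrome, zero-padded on the left. */
    snprintf(out, outlen, "%08lX", (addr << 8) | (syn & 0xff));
    return 0;
}
```

Called on the sample line above, the helper fills `out' with 054F90E3.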

>5) Besides risking the filling of my disk from /usr/adm/messages, is
>there any other danger in ignoring the error messages?

Yes.  If another few chips fail, you will no longer get soft
(correctable) errors; you will get crashes.

Incidentally, just because you see the messages only once every
ten minutes does not mean the ECC correction is infrequent.  The
code in /sys/vax/machdep.c disables ECC reporting after each
error, then re-enables it ten minutes later.  This is controlled
by the variable `memintvl', which is in seconds:

	% su
	Password:
	# adb -w /vmunix /dev/kmem
	memintvl/W 1
	_memintvl:
	_memintvl:	258	=		1
	$q
	#

will re-enable reporting after one second.  Stand back from the
console, and have plenty of paper handy!

Rebooting will restore the ten minute interval; or you can use adb
again to change it back.
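
To see why one message every ten minutes says little about the real
error rate, here is a toy model of the behaviour described above.  It
is NOT the code from machdep.c; the structure and function names are
made up for illustration, and the only thing borrowed from the kernel
is the idea of a `memintvl' holdoff in seconds.  (The 258 that adb
printed is hex, i.e. 600 seconds.)

```c
#include <stdio.h>

/* Illustrative model only; names here are hypothetical. */
struct ecc_model {
    long memintvl;     /* seconds reporting stays off after a report */
    long next_enable;  /* time at which reporting is on again */
    long corrected;    /* every soft error the hardware corrected */
    long reported;     /* errors that produced a console message */
};

/* One corrected error arrives at time `now' (in seconds). */
void ecc_event(struct ecc_model *m, long now)
{
    m->corrected++;
    if (now < m->next_enable)
        return;                 /* reporting is off: nothing is logged */
    m->reported++;
    m->next_enable = now + m->memintvl;
}

/* Feed one error every `period' seconds for `duration' seconds. */
long count_reports(long memintvl, long period, long duration)
{
    struct ecc_model m;
    long t;

    m.memintvl = memintvl;
    m.next_enable = 0;
    m.corrected = 0;
    m.reported = 0;
    for (t = 0; t < duration; t += period)
        ecc_event(&m, t);
    return m.reported;
}
```

In this model, an hour of one error per second yields 6 messages with
a holdoff of 600 and 3600 messages with a holdoff of 1.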
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu