[net.unix-wizards] Memory problems on Vax 750s running 4.1bsd

jim (04/04/83)

I have performed a few experiments with memory known to be bad, and
have discovered the following unsettling facts about 4.1bsd on a 750.

Neither single-bit (correctable) nor multi-bit (uncorrectable) errors
are caught by the kernel!  Single-bit errors will be corrected, but you
won't be notified that anything is wrong until the memory deteriorates
to the point of uncorrectable errors.  Multi-bit errors will not be
caught.  If they happen in user space, you will get strange,
unpredictable behavior, usually core-dumps from ordinary programs like
'ls' and 'cat'.  If the errors happen in kernel space, and you are
lucky, you will get a crash, but there won't be any indication of why
you got a crash.  If you are unlucky, some kernel code or data will
become corrupted, resulting in all kinds of strange behavior.

I was sent the following fix by Peter Collinson.  If you apply this
fix, single-bit errors will be caught, but multi-bit errors will still
be ignored (at least on my machine).  If I find a fix for multi-bit
errors, I will let you know.  If you have a 750, I strongly urge you to
put this fix in now, before your machine goes nuts.

From microsof!decvax!harpo!lime!ukc!pc Thu Mar 31 16:15:29 1983
Date: Thu Mar 31 18:29:01 1983
To: lime!orion!ariel!vax135!floyd!harpo!decvax!microsof!uw-beave!jim
Subject: Re: I have memory problems on my 750

You probably have lots of answers to this by now, but there is a bug in
the define statements for the memory controller on 4.1BSD for the 750.
The appropriate lines in mem.h should read:

#if VAX750
					/* FIXES FROM JOHN SHEMELD - UKC */
#define	M750_ICRD	0		/* Fix: inhibit crd interrupts, in [1] */
#define	M750_UNCORR	0xc0000000	/* uncorrectable error, in [0] */
#define	M750_CORERR	0x20000000	/* Fix: correctable error, in [0] */

#define	M750_INH(mcr)	((mcr)->mc_reg[1] = M750_ICRD)
#define	M750_ENA(mcr)	((mcr)->mc_reg[0] = (M750_UNCORR|M750_CORERR), \
			    (mcr)->mc_reg[1] = 0x10000000) /* Fix */
#define	M750_ERR(mcr)	((mcr)->mc_reg[0] & (M750_UNCORR|M750_CORERR))

#define	M750_SYN(mcr)	((mcr)->mc_reg[0] & 0x3f)
#define	M750_ADDR(mcr)	(((mcr)->mc_reg[0] >> 8) & 0x7fff)
#endif

Of course, if you were a member of the European UNIX User group, you
would have already got these fixes.

I am
	Peter Collinson
	lime!ukc!pc
	or
	phillabs!mcvax!ukc!pc
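
To show how the macros in the fix fit together, here is a minimal
user-space sketch in C.  It is not the 4.1bsd kernel code.  Only the
M750_* definitions are copied from the message above; the struct mcr
layout, the mem_status enum, and the classify() and report() helpers
are assumptions made up for illustration.  The sketch also assumes that
any bit of M750_UNCORR being set means an uncorrectable error.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the controller's register block. */
struct mcr {
	uint32_t mc_reg[3];
};

/* Definitions copied from the fix above. */
#define	M750_UNCORR	0xc0000000	/* uncorrectable error, in [0] */
#define	M750_CORERR	0x20000000	/* correctable error, in [0] */
#define	M750_ERR(mcr)	((mcr)->mc_reg[0] & (M750_UNCORR|M750_CORERR))
#define	M750_SYN(mcr)	((mcr)->mc_reg[0] & 0x3f)
#define	M750_ADDR(mcr)	(((mcr)->mc_reg[0] >> 8) & 0x7fff)

/* Hypothetical result codes, not part of mem.h. */
enum mem_status { MEM_OK, MEM_CORRECTED, MEM_UNCORRECTABLE };

/* Decide what kind of error, if any, register 0 is reporting. */
enum mem_status
classify(const struct mcr *m)
{
	uint32_t err = M750_ERR(m);

	if (err & M750_UNCORR)		/* assumption: either bit counts */
		return MEM_UNCORRECTABLE;
	if (err & M750_CORERR)
		return MEM_CORRECTED;
	return MEM_OK;
}

/* Print the error kind with the syndrome and address fields. */
void
report(const struct mcr *m)
{
	enum mem_status s = classify(m);

	if (s == MEM_OK)
		return;
	printf("%s memory error: syndrome %x, addr field %x\n",
	    s == MEM_CORRECTED ? "corrected" : "uncorrectable",
	    (unsigned)M750_SYN(m), (unsigned)M750_ADDR(m));
}
```

Calling report() on a struct mcr whose mc_reg[0] has M750_CORERR set
prints a "corrected" line, which is the kind of notification the
unpatched kernel never gives for single-bit errors.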