[net.unix-wizards] 4BSD 750 memory traps

dmr (04/10/83)

A recent article from Jim Rees, uw-beaver!jim (uw-beave.458), containing
a report from Peter Collinson, ukc!pc, appears to be partially
in error.  It refers to purported bugs in 4.1BSD in handling memory
traps on the 11/750.

The article makes two suggestions.  First, it says that the macros
M750_INH and M750_ENA in mem.h, which respectively disable and enable
warning traps for single-bit memory errors, are reversed in sense,
so that corrected ECC errors pass unnoticed.  According to the
Vax Hardware Handbook (1982-1983), this is not so.  The relevant bit
is bit 28 of Memory CSR1, and the discussion clearly says that
setting the bit to 1 disables the trap.  This is what the standard
distribution does for INH, and ENA clears the bit.  Perhaps the writer
was confused by the diagram on the same page, which calls the bit
"Enable reporting correctable errors."  However, the text of the
discussion, on page 118 of the cited manual, is unambiguous.
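
The sense described in the handbook text can be sketched in C as
follows.  This is only an illustration: the argument csr1 is a plain
value standing in for Memory CSR1, and the names CSR1_INHIBIT_WARN,
inhibit_warn, enable_warn, and warn_enabled are invented here; they
are not the definitions in mem.h.

```c
#include <stdint.h>

/* Bit 28 of Memory CSR1.  Per the handbook text, 1 disables the
 * warning trap for single-bit errors.  (Hypothetical name.) */
#define CSR1_INHIBIT_WARN (UINT32_C(1) << 28)

/* Inhibit the trap: set bit 28, leave the other bits alone. */
static uint32_t inhibit_warn(uint32_t csr1)
{
    return csr1 | CSR1_INHIBIT_WARN;
}

/* Enable the trap: clear bit 28, leave the other bits alone. */
static uint32_t enable_warn(uint32_t csr1)
{
    return csr1 & ~CSR1_INHIBIT_WARN;
}

/* Nonzero when the trap is enabled, i.e. when bit 28 is 0. */
static int warn_enabled(uint32_t csr1)
{
    return (csr1 & CSR1_INHIBIT_WARN) == 0;
}
```

Read this way, "inhibit" and "set the bit" go together, which is the
sense the post attributes to the standard distribution.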

The article's other suggestion appears to be correct, and the bug
it reports would indeed prevent reporting corrected memory errors:
the distributed code checks the wrong bit in memory CSR0, and will
print "soft ecc" only when two uncorrectable errors have occurred.
M750_CORERR should be 0x20000000 in mem.h.
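
A minimal sketch of the corrected check, assuming only the value
given above.  The argument csr0 is a plain value standing in for
memory CSR0, and soft_ecc_logged is a name made up for this example.

```c
#include <stdint.h>

/* Corrected-error bit in memory CSR0, with the value given above
 * (this is bit 29). */
#define M750_CORERR UINT32_C(0x20000000)

/* Nonzero when CSR0 records a corrected (soft) ECC error.
 * Hypothetical helper; the distributed code is organized differently. */
static int soft_ecc_logged(uint32_t csr0)
{
    return (csr0 & M750_CORERR) != 0;
}
```

With this mask the "soft ecc" message would depend on bit 29 alone,
whatever the other bits of CSR0 hold.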


Another article, from Tucker Withington (vaxine.111), addresses
a related problem.  He suggests that it is wrong to check only
the fault code when deciding to ignore a 750 TB parity error.
I do not doubt the worth of his suggestion, but I would like to verify
it.  Unfortunately, the meanings of the fault codes stored in the
machine-check frame are undocumented in both the Architecture Handbook
(at least the 1981 edition) and the Hardware Handbook (1982-1983).
Where can one find this information?

I do have a quibble about the suggestion: it makes sure that
(mcesr & 0xf) == 4.  In particular this test succeeds only when
bit 0 is 0, and bit 0 is variously described as "prefetch reference"
and "XB (Execution buffer) reference."  It looks as if this bit
is only advisory; indeed in looking over the logs it is often set
when an ignorable TB parity error occurs.  So perhaps the test
should be (mcesr & 0xe) == 4, which ignores bit 0.
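
The difference between the two tests can be sketched as below.  This
assumes that the fault code occupies the low four bits of mcesr, as
the suggested test implies; the macro and function names are invented
for the example.

```c
#include <stdint.h>

#define TBPAR_CODE   4u               /* code tested in the suggestion */
#define MASK_STRICT  0xfu             /* all four low bits must match */
#define MASK_RELAXED (0xfu & ~0x1u)   /* same, but bit 0 is ignored */

/* The suggested test: succeeds only when bit 0 is 0. */
static int tbpar_strict(uint32_t mcesr)
{
    return (mcesr & MASK_STRICT) == TBPAR_CODE;
}

/* The relaxed test: bit 0 may be either 0 or 1. */
static int tbpar_relaxed(uint32_t mcesr)
{
    return (mcesr & MASK_RELAXED) == TBPAR_CODE;
}
```

For a low nibble of 5 (code 4 with bit 0 set) the strict test fails
and the relaxed one succeeds; for a low nibble of 4 both succeed.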

			Dennis Ritchie