dmr (04/10/83)
A recent article from Jim Rees, uw-beaver!jim (uw-beave.458), containing a report from Peter Collinson, ukc!pc, appears to be partially in error. It refers to purported bugs in 4.1BSD in handling memory traps on the 11/750.

The article makes two suggestions. First, it says that the macros M750_INH and M750_ENA in mem.h, which respectively disable and enable warning traps for single-bit memory errors, are reversed in sense, so that corrected ECC errors pass unnoticed. According to the VAX Hardware Handbook (1982-1983), this is not so. The relevant bit is bit 28 of Memory CSR1, and the discussion clearly says that setting the bit to 1 disables the trap. This is what the standard distribution does for INH, and ENA clears the bit. Perhaps the writer was confused by the diagram on the same page, which calls the bit "Enable reporting correctable errors." However, the discussion is unambiguous. It is on page 118 of the cited manual.

The article's other suggestion appears to be correct, and the bug it reports would indeed prevent reporting corrected memory errors: the distributed code checks the wrong bit in memory CSR0, and will print "soft ecc" only when two uncorrectable errors have occurred. M750_CORERR should be 0x20000000 in mem.h.

Another article, from Tucker Withington (vaxine.111), addresses a related problem. He suggests that it is wrong to check only the fault code when deciding to ignore a 750 TB parity error. I do not doubt the worth of his suggestion, but I would like to verify it. Unfortunately, the meanings of the fault codes stored in the machine-check frame are undocumented in both the Architecture Handbook (at least the 1981 edition) and the Hardware Handbook (1982-1983). Where can one find this information?

I do have a quibble about the suggestion: it makes sure that (mcesr & 0xf) == 4. In particular, this test succeeds only when bit 0 is 0, and bit 0 is variously described as "prefetch reference" and "XB (Execution buffer) reference."
It looks as if this bit is only advisory; indeed, in looking over the logs, it is often set when an ignorable TB parity error occurs. So perhaps the test should be (mcesr & 0xfe) == 4.

Dennis Ritchie
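
For anyone who wants to try the bit tests discussed above, here is a minimal sketch in C. It is not taken from the 4.1BSD sources: every name except M750_CORERR is hypothetical, the registers are modelled as plain 32-bit values, and the fault-code value 4 is simply the one quoted in the test above. The third mask, 0xe, does not appear in either article; it is included only for comparison, as the mask that differs from 0xf in bit 0 alone.

```c
#include <stdint.h>

/* All names below except M750_CORERR are hypothetical, for illustration. */

/* Memory CSR1 bit 28: per the handbook discussion, 1 disables the trap. */
#define CSR1_INHIBIT_BIT ((uint32_t)1 << 28)

/* The value the note says M750_CORERR should have in mem.h (a CSR0 bit). */
#define M750_CORERR ((uint32_t)0x20000000)

/* Fault-code value compared against in the TB parity test quoted above. */
#define TBPAR_CODE 4

/* Inhibit the corrected-error warning trap: set bit 28. */
static uint32_t csr1_inhibit(uint32_t csr1)
{
    return csr1 | CSR1_INHIBIT_BIT;
}

/* Enable the corrected-error warning trap: clear bit 28. */
static uint32_t csr1_enable(uint32_t csr1)
{
    return csr1 & ~CSR1_INHIBIT_BIT;
}

/* Nonzero when the corrected-error bit is set in a CSR0 value. */
static int csr0_corrected(uint32_t csr0)
{
    return (csr0 & M750_CORERR) != 0;
}

/* The test as suggested in the other article: bits 0-3 must equal 4. */
static int tbpar_mask_f(uint32_t mcesr)
{
    return (mcesr & 0xf) == TBPAR_CODE;
}

/* The variant proposed in this note: ignores bit 0, examines bits 1-7. */
static int tbpar_mask_fe(uint32_t mcesr)
{
    return (mcesr & 0xfe) == TBPAR_CODE;
}

/* Comparison only (an assumption, not from either article): bits 1-3. */
static int tbpar_mask_e(uint32_t mcesr)
{
    return (mcesr & 0xe) == TBPAR_CODE;
}
```

With these definitions, a value of 0x5 (bit 0 set) fails the 0xf test but passes the other two. A value of 0x14 passes the 0xf and 0xe tests but fails the 0xfe test, because 0xfe also examines bits 4 through 7. Whether those upper bits can be set in practice depends on the undocumented register layout, so the sketch only shows how the masks differ and does not settle which is right.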