[fa.info-vax] 785 CPU bug -- BEWARE

bruce@THINK.COM (Bruce Nemnich) (10/18/85)

We recently converted this machine to a VAX-11/785 after fighting a
cpu bug for weeks.  The symptom was that a running Unix would panic
with a "trap type 8" (Segmentation Fault) in the kernel on a virtual
address of either 0x3ffffffc or 0x5ffffffc.  Once thus crashed, it
would usually not reboot for hours, bombing immediately after BOOT's
kernel size and "start at 0x1150" line with an Interrupt Stack Invalid
crash to the console.

Sometimes frobbing with the hardware (reseating various things) would
make the wedged state go away and allow a successful boot, after which
it would run happily until the next segmentation fault panic, often
days later.

The Interrupt Stack Invalid fault was actually a manifestation of the
same Segmentation Fault; it happened too soon in the initialization
sequence to be recovered from gracefully, so it would recursively trap
until it fell off the kernel stack and eventually the interrupt stack.

Accumulated data from crash dumps revealed three separate instructions
on which it took the trap, but all had one thing in common: they were
trying to use a virtual address computed as a -4 offset from the
contents of a register whose value was 0x80000000; i.e., it was trying
to reference the top longword of the per-process kernel stack.  As it
turns out, this is a very frequent thing to do, since the first thing the
page fault and system call handlers do is check the mode bits of the old
PSL, which is normally the top longword on the KS.
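As a rough illustration of why that address comes up so often, here is
a minimal sketch of the check described above. The field position
follows the documented VAX PSL layout (current mode in bits 25:24,
kernel = 0, user = 3); the function and constant names are made up for
this example and are not taken from the kernel source.

```python
# Illustrative names only; not the actual kernel definitions.
PSL_CURMOD = 0x03000000          # current-mode field of the PSL, bits 25:24
KERNEL, USER = 0, 3              # VAX access-mode encodings

KSTACK_TOP = 0x80000000          # register value seen in the crash dumps
SAVED_PSL_ADDR = KSTACK_TOP - 4  # top longword of the kernel stack

def current_mode(psl):
    """Extract the current-mode field from a saved PSL value."""
    return (psl & PSL_CURMOD) >> 24

def came_from_user(psl):
    """True if the saved PSL says the trap was taken from user mode."""
    return current_mode(psl) == USER

print(hex(SAVED_PSL_ADDR))       # 0x7ffffffc
```

Every handler that starts with a test like this has to form the
address 0x80000000-4 first, which is the computation that fails.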

Apparently there is some marginal timing in the relevant data paths on
the 785.  In our case, it doesn't do the computation of 0x80000000-4
correctly: it loses bit 29 or 30 (presumably because the carry chain
isn't making it all the way in time, since it seems to handle all
other virtual address computations correctly).  So it then tries to
look up virtual address 0x5ffffffc or 0x3ffffffc instead of 0x7ffffffc
and loses.
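The arithmetic can be checked with a short sketch. It only models the
result as "correct answer with one bit cleared", which is an
assumption made for illustration; it says nothing about how the 785
actually goes wrong internally.

```python
MASK32 = 0xFFFFFFFF

def drop_bit(addr, bit):
    """Return addr with the given bit forced to zero (32-bit value)."""
    return addr & ~(1 << bit) & MASK32

correct = (0x80000000 - 4) & MASK32   # 0x7ffffffc, the intended address

for bit in (29, 30):
    print(f"bit {bit} lost: {drop_bit(correct, bit):#010x}")
# bit 29 lost: 0x5ffffffc
# bit 30 lost: 0x3ffffffc
```

Both results match the faulting addresses reported in the panics.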

DEC is aware of the problem and is working on an ECO.  Indeed, they
say this is one of several related marginal timing failure modes which
they attribute to a problem of specs-vs-reality on some of the
chips they are using.  Any particular board set might be more or less
likely to exhibit the behavior, but the only way they can guarantee to
make the problem go away is to do an ECO.  By the way, our problem is
very temperature sensitive: it will fail consistently at or above 73
degrees, and will only fail every now and then (MTBF approx 2 days)
below 70.

Until the ECO is ready, I have circumvented the failure we see by a
one-line modification of locore.s to start the KSP at 0x7ffffffc
rather than at 0x80000000, so it never has to make the address
computation which has to carry through so many bits!  DEC recommends
running with SET CLOCK SLOW to avoid the marginal timing problems, and
I will probably do that once I fix the interval clock to keep time
correctly with SET CLOCK SLOW.  [Incidentally, our time-of-day as kept
by Unix is currently running about 2% fast, which is
ridiculous...anyone seen this in a 785?]
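To see why moving the KSP helps, here is a small sketch that counts
how many bit positions generate a borrow when 4 is subtracted from
each starting value. It models a plain ripple-borrow subtraction,
which is an assumption for illustration only and not a description of
the 785's real adder; borrow_span is a hypothetical helper name.

```python
def borrow_span(base, offset=4, width=32):
    """Count the bit positions that produce a borrow in base - offset."""
    borrow = 0
    span = 0
    for i in range(width):
        a = (base >> i) & 1
        b = (offset >> i) & 1
        if a - b - borrow < 0:
            borrow = 1
            span += 1        # a borrow propagates out of bit i
        else:
            borrow = 0
    return span

print(borrow_span(0x80000000))   # 29: borrow ripples from bit 2 up to bit 31
print(borrow_span(0x7ffffffc))   # 0: bit 2 is already set, nothing propagates
```

With the stack starting at 0x7ffffffc, the -4 offset yields 0x7ffffff8
without any long carry, so the marginal path is never exercised by
this particular reference.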

--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
--bruce@think.com, ihnp4!think!bruce; +1 617 876 1111
