bruce@THINK.COM (Bruce Nemnich) (10/18/85)
We recently converted this machine to a VAX-11/785 after fighting a CPU bug for weeks. The symptom was that a running Unix would panic with a "trap type 8" (Segmentation Fault) in the kernel on a virtual address of either 0x3ffffffc or 0x5ffffffc. Once thus crashed, it would usually not reboot for hours, bombing immediately after BOOT's kernel size and "start at 0x1150" line with an Interrupt Stack Invalid crash to the console. Sometimes frobbing with the hardware (reseating various things) would make the wedged state go away and allow a successful boot, after which it would run happily until the next segmentation fault panic, often days later.

The Interrupt Stack Invalid fault was actually a manifestation of the same Segmentation Fault; it happened too soon in the initialization sequence to be recovered from gracefully, so it would recursively trap until it fell off the kernel stack and eventually the interrupt stack.

Accumulated data from crash dumps revealed three separate instructions on which it took the trap, but all had one thing in common: they were trying to use a virtual address computed as a -4 offset from the contents of a register whose value was 0x80000000; i.e., they were trying to reference the top longword of the per-process kernel stack. As it turns out, this is a very frequent thing to do, since the first thing the page fault and system call handlers do is check the mode bits of the old PSL, which is normally the top longword on the KS.

Apparently there is some marginal timing in the relevant data paths on the 785. In our case, it doesn't do the computation of 0x80000000-4 correctly: it loses bit 29 or 30 (presumably because the carry chain isn't making it all the way in time, since it seems to handle all other virtual address computations correctly). So it then tries to look up virtual address 0x5ffffffc (or 0x3ffffffc) instead of 0x7ffffffc and loses. DEC is aware of the problem and is working on an ECO.
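To make the arithmetic concrete, here is a small sketch of the faulting address computation. It only models the numbers reported above; the helper names (`offset_address`, `drop_bit`) are made up for illustration, and nothing here claims to reproduce how the 785 data path actually works.

```python
MASK32 = 0xFFFFFFFF  # VAX virtual addresses are 32 bits wide


def offset_address(base, offset):
    """32-bit wrapping base + offset, i.e. what a correct computation yields."""
    return (base + offset) & MASK32


def drop_bit(value, bit):
    """Return value with one bit cleared, modelling a bit lost in the result."""
    return value & ~(1 << bit) & MASK32


ksp_top = 0x80000000                    # register contents seen in the crash dumps
correct = offset_address(ksp_top, -4)   # top longword of the kernel stack

print(f"correct address: {correct:#010x}")
for bit in (29, 30):
    print(f"bit {bit} lost:     {drop_bit(correct, bit):#010x}")
```

Losing bit 29 of the correct result gives 0x5ffffffc and losing bit 30 gives 0x3ffffffc, which matches the two addresses in the panic messages.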
Indeed, DEC says this is one of several related marginal timing failure modes, which it attributes to a problem of specs-vs-reality on some of the chips they are using. Any particular board set might be more or less likely to exhibit the behavior, but the only way they can guarantee to make the problem go away is to do an ECO.

By the way, our problem is very temperature sensitive: it will fail consistently at or above 73 degrees, and will only fail every now and then (MTBF approx 2 days) below 70.

Until the ECO is ready, I have circumvented the failure we see by a one-line modification of locore.s to start the KSP at 0x7ffffffc rather than at 0x80000000, so it never has to make the address computation which has to carry through so many bits! DEC recommends running with SET CLOCK SLOW to avoid the marginal timing problems, and I will probably do that once I fix the interval clock to keep time correctly with SET CLOCK SLOW.

[Incidentally, our time-of-day as kept by Unix is currently running about 2% fast, which is ridiculous...anyone seen this in a 785?]

--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
--bruce@think.com, ihnp4!think!bruce; +1 617 876 1111
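As a rough illustration of why moving the initial KSP helps, the sketch below counts how many result bits differ from the base when 4 is subtracted, which is one simple way to gauge how far the borrow has to ripple. The function name `bits_changed_by_sub4` is hypothetical, and the bit count is only a stand-in for carry-chain length; the real 785 adder may well be organized differently.

```python
MASK32 = 0xFFFFFFFF


def bits_changed_by_sub4(addr):
    """Count the bits that differ between addr and (addr - 4), modulo 2**32."""
    result = (addr - 4) & MASK32
    return bin(addr ^ result).count("1")


# Original initial KSP versus the locore.s workaround value.
for ksp in (0x80000000, 0x7FFFFFFC):
    print(f"{ksp:#010x} - 4 changes {bits_changed_by_sub4(ksp)} bit(s)")
```

With the KSP at 0x80000000 the subtraction flips every bit from bit 2 up through bit 31 (30 bits); starting at 0x7ffffffc it flips only bit 2.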