bruce@think.ARPA (Bruce Nemnich) (10/18/85)
We recently converted this machine to a VAX-11/785 after fighting a cpu bug for weeks. The symptom was that a running Unix would panic with a "trap type 8" (Segmentation Fault) in the kernel on a virtual address of either 0x3ffffffc or 0x5ffffffc. Once thus crashed, it would usually not reboot for hours, bombing immediately after BOOT's kernel size and "start at 0x1150" line with an Interrupt Stack Invalid crash to the console. Sometimes frobbing with the hardware (reseating various things) would make the wedged state go away and allow a successful boot, after which it would run happily until the next segmentation fault panic, often days later.

The Interrupt Stack Invalid fault was actually a manifestation of the same Segmentation Fault; it happened too soon in the initialization sequence to be recovered from gracefully, so it would recursively trap until it fell off the kernel stack and eventually the interrupt stack.

Accumulated data from crash dumps revealed three separate instructions on which it took the trap, but all had one thing in common: they were trying to use a virtual address computed as a -4 offset from the contents of a register whose value was 0x80000000; i.e., they were trying to reference the top longword of the per-process kernel stack. As it turns out, this is a very frequent thing to do, since the first thing the page fault and system call handlers do is check the mode bits of the old PSL, which is normally the top longword on the KS.

Apparently there is some marginal timing in the relevant data paths on the 785. In our case, it doesn't do the computation of 0x80000000-4 correctly: it loses bit 29 or 30 (presumably because the carry chain isn't making it all the way in time, since it seems to handle all other virtual address computations correctly). So it then tries to look up virtual address 0x5ffffffc (or 0x3ffffffc) instead of 0x7ffffffc and loses.

DEC is aware of the problem and is working on an ECO.
Indeed, they say this is one of several related marginal timing failure modes, which they attribute to a specs-vs-reality problem with some of the chips they are using. Any particular board set might be more or less likely to exhibit the behavior, but the only way they can guarantee to make the problem go away is to do an ECO.

By the way, our problem is very temperature sensitive: it will fail consistently at or above 73 degrees, and will only fail every now and then (MTBF approx 2 days) below 70.

Until the ECO is ready, I have circumvented the failure we see by a one-line modification of locore.s to start the KSP at 0x7ffffffc rather than at 0x80000000, so it never has to make the address computation which has to carry through so many bits! DEC recommends running with SET CLOCK SLOW to avoid the marginal timing problems, and I will probably do that once I fix the interval clock to keep time correctly with SET CLOCK SLOW. [Incidentally, our time-of-day as kept by Unix is currently running about 2% fast, which is ridiculous...anyone seen this in a 785?]

--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
--bruce@think.com, ihnp4!think!bruce; +1 617 876 1111
chris@umcp-cs.UUCP (Chris Torek) (10/22/85)
On our 785, this bug takes the form of computing the value 0x7dfffffc on `extzv $0,$4,-4(r0),r0' instructions when r0 has the value 0x80000000.

Here is what I did to keep it from crashing our machine. I also added options "AVOID_785_CPU_BUG" to the conf file, of course.

RCS file: RCS/kern_exit.c,v
retrieving revision 1.1
retrieving revision 1.2
diff -c2 -r1.1 -r1.2
*** /tmp/,RCSt1008733	Tue Oct 22 12:52:03 1985
--- /tmp/,RCSt2008733	Tue Oct 22 12:52:05 1985
***************
*** 181,184
  {
      struct rusage ru, *rup;
  
      if ((u.u_ar0[PS] & PSL_ALLCC) != PSL_ALLCC) {
--- 181,186 -----
  {
      struct rusage ru, *rup;
+ #ifdef AVOID_785_CPU_BUG
+     int psl;
  
      psl = u.u_ar0[PS];
***************
*** 182,185
      struct rusage ru, *rup;
  
      if ((u.u_ar0[PS] & PSL_ALLCC) != PSL_ALLCC) {
          u.u_error = wait1(0, (struct rusage *)0);
--- 184,193 -----
      int psl;
  
+     psl = u.u_ar0[PS];
+     if ((psl & PSL_ALLCC) != PSL_ALLCC) {
+         u.u_error = wait1(0, (struct rusage *)0);
+         return;
+     }
+ #else
      if ((u.u_ar0[PS] & PSL_ALLCC) != PSL_ALLCC) {
          u.u_error = wait1(0, (struct rusage *)0);
***************
*** 186,189
          return;
      }
      rup = (struct rusage *)u.u_ar0[R1];
      u.u_error = wait1(u.u_ar0[R0], &ru);
--- 194,198 -----
          return;
      }
+ #endif
      rup = (struct rusage *)u.u_ar0[R1];
      u.u_error = wait1(u.u_ar0[R0], &ru);

RCS file: RCS/kern_xxx.c,v
retrieving revision 1.2
retrieving revision 1.3
diff -c2 -r1.2 -r1.3
*** /tmp/,RCSt1008740	Tue Oct 22 12:52:11 1985
--- /tmp/,RCSt2008740	Tue Oct 22 12:52:13 1985
***************
*** 283,286
      struct rusage ru;
      struct vtimes *vtp, avt;
  
      if ((u.u_ar0[PS] & PSL_ALLCC) != PSL_ALLCC) {
--- 283,288 -----
      struct rusage ru;
      struct vtimes *vtp, avt;
+ #ifdef AVOID_785_CPU_BUG
+     int psl;
  
      psl = u.u_ar0[PS];
***************
*** 284,287
      struct vtimes *vtp, avt;
  
      if ((u.u_ar0[PS] & PSL_ALLCC) != PSL_ALLCC) {
          u.u_error = wait1(0, (struct rusage *)0);
--- 286,295 -----
      int psl;
  
+     psl = u.u_ar0[PS];
+     if ((psl & PSL_ALLCC) != PSL_ALLCC) {
+         u.u_error = wait1(0, (struct rusage *)0);
+         return;
+     }
+ #else
      if ((u.u_ar0[PS] & PSL_ALLCC) != PSL_ALLCC) {
          u.u_error = wait1(0, (struct rusage *)0);
***************
*** 288,291
          return;
      }
      vtp = (struct vtimes *)u.u_ar0[R1];
      u.u_error = wait1(u.u_ar0[R0], &ru);
--- 296,300 -----
          return;
      }
+ #endif
      vtp = (struct vtimes *)u.u_ar0[R1];
      u.u_error = wait1(u.u_ar0[R0], &ru);

-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu