cbush@RAND-UNIX.ARPA (08/07/85)
Need help with persistent system crashes, at least one a day, on our new VAX 11/785 running 4.2BSD UNIX. We always get the same panic messages:

    ...
    Aug 6 12:10
    trap type 9, code = 80001400, pc = 80001400
    panic: Protection fault
    syncing disks... 3 3 2 2 2 2 1 done
    ...

My reading of the above, and of the kernel stack frames, is that the system was attempting to execute the instruction at location 80001400 while in user mode!! How can that be? I must be missing something. The kernel stack says we're really in the kernel routine "copyout", having gotten there from various places, not always the same (so not a wild branch). In the last crash "gettimeofday" was calling copyout.

If the system screwed up and made an invalid context switch to user mode, why crash at 80001400? Why not at the instruction before, or at the first kernel instruction after the switch? Is the address validation only done at page boundaries?

I don't know if it's hardware or software. We have seen these same trap 9's at the same address on two of our four VAX 11/780's, but very infrequently: less than one a month on heavily loaded systems. At this point we're stymied. We are now looking at kernel code differences on the various machines and changing the hardware configuration in an attempt to narrow it down.

A PORTION OF THE KERNEL STACK, with my incomplete comments (first VAX dump I've looked at)

----adb command *(scb-4)$c gives ---- (the adb crash man pages seem screwed up?)

sbr 8002bc64 slr 4b68 --- system pte's look good.
p0br 808aae00 p0lr 74
p1br 800ab200 p1lr 1fffea
_boot() from 80021f3a
_boot(0,0) from 80021f3a
_panic(80043083) from 8000cf76
_trap() from 800224ec
_Xtransflt() from 80001035
_syscall() from 800227fc
_Xsyscall(7fffe6f0,0) from 80001054
?(7fffe734) from 214d
?(1,7fffebd8,7fffebe0) from 47c
?() from 37

---STACK FRAME created by call to trap
7ffffefc: 0          number of args (really 5)
7fffff00: 2fff0000   mask/psw
7fffff04: 7fffff8c   ap
7fffff08: 7fffff74   fp
7fffff0c: 80001035   pc
7fffff10: 8          r0
7fffff14: 7fffff6c   r1
7fffff18: 0          r2
7fffff1c: 7fffed8c   r3
7fffff20: 0          r4
7fffff24: 0          r5
7fffff28: 0          r6
7fffff2c: 0          r7
7fffff30: 8003f9cc   r8
7fffff34: 8          r9
7fffff38: 7fffe6e8   r10
7fffff3c: 7fffed84   r11
7fffff40: 0          ??
7fffff44: 7fffe6d0   trap arg0 sp (unused, I think)
7fffff48: 9          arg1 trap#
7fffff4c: 80001400   arg2 code
7fffff50: 80001400   arg3 pc
7fffff54: c00000     arg4 psl previous=USER???; current=kernel; is=0
7fffff58: 8000b3b3   pc in gettimeofday, inst after jsb _Copyout
7fffff5c: 7fffff6c
7fffff60: 7fffe6f0
7fffff64: 8
7fffff68: 0
7fffff6c: 1d566af3
7fffff70: 1fbd0

---STACK FRAME created by call to gettimeofday
7fffff74: 0
7fffff78: 28000000
7fffff7c: 7fffffe8
7fffff80: 7fffffa4
7fffff84: 800227fc
7fffff88: 80000000
7fffff8c: 0
7fffff90: 4
7fffff94: 0
7fffff98: 3
7fffff9c: 9c400
7fffffa0: 216e

Any hints would be greatly appreciated.
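The psl value c00000 saved at 7fffff54 can be decoded field by field. The following is only a sketch in Python, assuming the usual VAX PSL layout (IS in bit 26, current mode in bits 25:24, previous mode in bits 23:22, IPL in bits 20:16). The names decode_psl and MODES are hypothetical and do not come from the kernel sources. It agrees with the reading in the dump comments. Note that a previous mode of user is what a system call entered from user mode would normally leave in the PSL, so by itself it may not point to a bad context switch.

```python
# Sketch: decode the mode-related fields of a VAX processor status
# longword (PSL). Bit positions are assumed from the standard PSL layout;
# MODES and decode_psl are illustrative names, not kernel identifiers.
MODES = {0: "kernel", 1: "executive", 2: "supervisor", 3: "user"}


def decode_psl(psl):
    """Return the interrupt-stack flag, access modes, and IPL of a PSL."""
    return {
        "interrupt_stack": (psl >> 26) & 0x1,       # bit 26
        "current_mode": MODES[(psl >> 24) & 0x3],   # bits 25:24
        "previous_mode": MODES[(psl >> 22) & 0x3],  # bits 23:22
        "ipl": (psl >> 16) & 0x1F,                  # bits 20:16
    }


# arg4 of the trap frame in the dump above
fields = decode_psl(0x00C00000)
for name, value in fields.items():
    print(name, "=", value)
```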
mangler@cit-vax (System Mangler) (08/08/85)
You have run into an old microcode problem of the 780 that, under obscure (to me) circumstances, can cause a protection fault trap when a probe instruction lying within 8 bytes of the end of the page gets executed. I think (guess) that it's some weird interaction involving the 8-byte prefetch buffer and a page fault on the probed page.

Look again at the panic message, and notice that the pc is evenly divisible by 512. When it happened on our 780 (right after adding -DVAX750 to our kernel), the pc pointed to the instruction

    movl 8(sp),r3

which is certainly innocent enough. It just happened to be the first instruction on the new page. I don't know if it's important that the instruction landed right on a page boundary.

I never did find a real solution to it - the local DEC office had never heard of the problem - so after a couple of weeks of daily panics, I kludged it by inserting a dozen pad bytes in front of _Copyout. (The easiest way is ".space 12".)

You said this happened on a 785. Gosh, DEC wasn't kidding when they said the 785 was "bug-for-bug" compatible...

Don Speck
speck@cit-vax.arpa
ihnp4!cithep!cit-vax!speck
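The page arithmetic behind this can be checked by hand. The following is a rough sketch in Python, assuming the VAX page size of 512 bytes. The helper names and the probe address 800013fa are hypothetical, chosen only to illustrate an instruction sitting in the last 8 bytes of the page before 80001400; the real address of the probe in _Copyout is not given in the thread.

```python
# Sketch: page-boundary arithmetic for the panic pc and the 12-byte pad.
# PAGE_SIZE is the VAX page size; everything else here is illustrative.
PAGE_SIZE = 512


def page_offset(addr):
    """Byte offset of addr within its 512-byte page."""
    return addr % PAGE_SIZE


def in_page_tail(addr, window=8):
    """True if addr lies within the last `window` bytes of its page."""
    return PAGE_SIZE - page_offset(addr) <= window


panic_pc = 0x80001400
print("pc offset in page:", page_offset(panic_pc))  # 0 means first byte of a page

# Hypothetical probe address just before the page boundary (an assumption,
# not taken from the dump), and the same instruction after ".space 12".
probe_addr = 0x800013FA
padded_addr = probe_addr + 12
print("before pad, in tail:", in_page_tail(probe_addr))
print("after pad, in tail: ", in_page_tail(padded_addr))
```

Under these assumptions, a dozen pad bytes are enough to move an instruction out of the last 8 bytes of one page and onto the next, which would be consistent with the workaround making the panics stop.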