[net.unix-wizards] 4.2BSD panic trap 9 problem on VAX 11/785

cbush@RAND-UNIX.ARPA (08/07/85)

Need help with persistent system crashes, at least once a day, on our
new VAX 11/785 running 4.2BSD UNIX.  We always get the same panic messages:

	...
	Aug  6 12:10
	trap type 9, code = 80001400, pc = 80001400
	panic: Protection fault
	syncing disks... 3 3 2 2 2 2 1 done
	...

My reading of the above, and of the kernel stack frames, is that the
system was attempting to execute the instruction at location
80001400 while in user mode!!

How can that be?  I must be missing something.  The kernel stack says we're
really in the kernel routine "copyout", having gotten there from various
places, not always the same (so not a wild branch).  In the last crash
"gettimeofday" was calling copyout.  If the system screwed up and made an
invalid context switch to user mode, why crash at 80001400?  Why not at the
instruction before, or at the first kernel instruction after the switch?
Is the address validation only done at page boundaries?

I don't know if it's hardware or software.  We have seen these same trap 9's
at the same address on two of our four VAX 11/780's, but very infrequently:
less than once a month on heavily loaded systems.

At this point we're stymied.  We are now looking at kernel code differences
between the various machines and changing the hardware configuration in an
attempt to narrow it down.

A PORTION OF THE KERNEL STACK, with my incomplete comments (this is the first
VAX dump I've looked at)

----adb command *(scb-4)$c gives ---- (adb crash man pages seem screwed up?)
sbr 8002bc64 slr 4b68   --- system pte's look good.
p0br 808aae00 p0lr 74 p1br 800ab200 p1lr 1fffea
_boot()	from 80021f3a
_boot(0,0) from	80021f3a
_panic(80043083) from 8000cf76
_trap()	from 800224ec
_Xtransflt() from 80001035
_syscall() from	800227fc
_Xsyscall(7fffe6f0,0) from 80001054
?(7fffe734) from 214d
?(1,7fffebd8,7fffebe0) from 47c
?() from 37

---STACK FRAME created by call to trap
7ffffefc:       0         number of args (really 5)
7fffff00:       2fff0000  mask/psw
7fffff04:       7fffff8c  ap
7fffff08:       7fffff74  fp
7fffff0c:       80001035  pc
7fffff10:       8         r0
7fffff14:       7fffff6c  r1
7fffff18:       0         r2
7fffff1c:       7fffed8c  r3
7fffff20:       0         r4
7fffff24:       0         r5
7fffff28:       0         r6
7fffff2c:       0         r7
7fffff30:       8003f9cc  r8
7fffff34:       8         r9
7fffff38:       7fffe6e8  r10
7fffff3c:       7fffed84  r11
7fffff40:       0         ??
7fffff44:       7fffe6d0 trap arg0  sp unused i think
7fffff48:       9             arg1  trap#
7fffff4c:       80001400      arg2  code
7fffff50:       80001400      arg3  pc
7fffff54:       c00000        arg4  psl previous=USER???; current=kernel; is=0
7fffff58:       8000b3b3            pc in gettimeofday inst after jsb _Copyout
7fffff5c:	7fffff6c
7fffff60:	7fffe6f0
7fffff64:	8
7fffff68:	0
7fffff6c:	1d566af3
7fffff70:	1fbd0
---STACK FRAME created by call to gettimeofday
7fffff74:	0
7fffff78:	28000000
7fffff7c:	7fffffe8
7fffff80:	7fffffa4
7fffff84:	800227fc
7fffff88:	80000000
7fffff8c:	0
7fffff90:	4
7fffff94:	0
7fffff98:	3
7fffff9c:	9c400
7fffffa0:	216e

Any hints would be greatly appreciated.

mangler@cit-vax (System Mangler) (08/08/85)

    You have run into an old microcode problem of the 780 that,
under circumstances that are obscure to me, can cause a protection
fault trap when a probe instruction lying within 8 bytes of the end
of a page gets executed.  I think (guess) that it's some weird
interaction involving the 8-byte prefetch buffer and a page
fault on the probed page.

    Look again at the panic message, and notice that the pc is
evenly divisible by 512.  When it happened on our 780 (right
after adding -DVAX750 to our kernel), the pc pointed to the
instruction
	movl	8(sp),r3
which is certainly innocent enough.  It just happened to be
the first instruction on the new page.  I don't know if it's
important that the instruction landed right on a page boundary.
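
    The page arithmetic is easy to check mechanically.  Here is a rough
sketch in C, assuming the 512-byte VAX page and reading "within 8 bytes
of the end of the page" as "in the last 8 bytes of the page" (that
reading is my assumption, and the function names are made up for
illustration):

```c
#include <stdint.h>

#define VAX_PAGE_SIZE   512u  /* bytes per VAX page                       */
#define PREFETCH_WINDOW 8u    /* assumed: the last 8 bytes of a page      */

/* Byte offset of an address within its 512-byte page. */
uint32_t page_offset(uint32_t addr)
{
    return addr & (VAX_PAGE_SIZE - 1u);
}

/* Nonzero if the address is the first byte of a page. */
int starts_page(uint32_t addr)
{
    return page_offset(addr) == 0;
}

/* Nonzero if the address falls in the last PREFETCH_WINDOW bytes of a page. */
int near_page_end(uint32_t addr)
{
    return page_offset(addr) >= VAX_PAGE_SIZE - PREFETCH_WINDOW;
}
```

    With the pc from the panic message, starts_page(0x80001400) is true,
since 0x1400 is exactly ten pages of 512 bytes.  An instruction at, say,
0x800013f8 (a made-up address, not taken from the dump) would count as
near the page end, while one 12 bytes further along would not.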

    I never did find a real solution to it - the local DEC
office had never heard of the problem - so after a couple of
weeks of daily panics, I kludged it by inserting a dozen pad
bytes in front of _Copyout.  (Easiest way is ".space 12").

    You said this happened on a 785.  Gosh, DEC wasn't kidding
when they said the 785 was "bug-for-bug" compatible...

	Don Speck   speck@cit-vax.arpa	ihnp4!cithep!cit-vax!speck