[net.unix-wizards] Solved 4.2BSD panic trap 9 problem on VAX 11/785

cbush@RAND-UNIX.ARPA (08/09/85)

 "WARNING probe instructions may be hazardous to the health of your system."

>>>THANKS:

Thanks to all for quickly solving our problem.  No new crashes YET!!
Got 6 replies which all hit it on the nose or very close to it.

Restating the problem and solution may fill in the story for some.  I'm
somewhat surprised that at the very least a warning comment is not included
in locore.s if the bug has been known for so long.


>>>PROBLEM:

Need help with persistent, at least once a day, system crashes on our
new VAX 11/785 running 4.2BSD UNIX.  Always get the same panic messages:

	trap type 9, code = 80001400, pc = 80001400
	panic: Protection fault

My reading of the above, and of the kernel stack frames, is that the
system was attempting to execute the instruction at location 80001400
while in user mode!!
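
The faulting address can be picked apart with a little arithmetic.  The
sketch below is only an illustration: it assumes the usual VAX layout
(512-byte pages, system space starting at 80000000), and the function
names are made up here, not taken from the kernel sources.  Under those
assumptions 80001400 comes out as byte 0 of system page 10, i.e. the very
first byte of a page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NBPG        512u         /* VAX page size in bytes (assumed, as in 4.2BSD) */
#define SYSTEM_BASE 0x80000000u  /* start of VAX system space (assumed) */

/* Nonzero if the address has bit 31 set, i.e. lies in the system half. */
int in_system_space(uint32_t addr)
{
    return (addr & SYSTEM_BASE) != 0;
}

/* Byte offset of the address within its page. */
uint32_t page_offset(uint32_t addr)
{
    return addr % NBPG;
}

/* Page number counted from the base of system space. */
uint32_t system_page(uint32_t addr)
{
    assert(in_system_space(addr));
    return (addr - SYSTEM_BASE) / NBPG;
}

/* Print a one-line summary of a faulting pc (illustrative helper). */
void describe_pc(uint32_t pc)
{
    printf("pc %08lx: %s space, page offset %lu\n",
           (unsigned long)pc,
           in_system_space(pc) ? "system" : "user",
           (unsigned long)page_offset(pc));
}
```

Calling describe_pc(0x80001400u) reports a system-space address at page
offset 0.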

>>>SOLUTION: ( Courtesy of Chris Torek <chris@maryland> )

    Sounds like you've managed to invoke the 780/785 CPU bug with prefetch.
    The probe instruction works by changing the CPU microstate to user mode.
    If a prefetch crosses a page boundary you can get a trap.

    The "fix" is to insert enough noops to push the probe ... in locore.s ...
    down past the start of a new page.  If that works, let us all know . .

I'm still puzzled by the difference in rate of occurrence on the 785
(~once/day) as compared to the 780 (~once a month) but am leaving it as an
exercise for now.

ggs@ulysses.UUCP (Griff Smith) (08/12/85)

> 
>     The "fix" is to insert enough noops to push the probe ... in locore.s ...
>     down past the start of a new page.  If that works, let us all know . .
> 
No!  The fix is to insert a ".space n" directive immediately before
the function in locore.s that is causing the problem.  Noops take
time!  The probe is in the middle of a commonly called system function,
don't give it lead shoes.
-- 

Griff Smith	AT&T Bell Laboratories, Murray Hill
Phone:		(201) 582-7736
Internet:	ggs@ulysses.uucp
UUCP:		ulysses!ggs  ( {allegra|ihnp4}!ulysses!ggs )
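
For picking the "n" in ".space n", the arithmetic is simple enough to
sketch in C.  This is an illustration only: the page size of 512 bytes is
the VAX value, but the probe address used in the note after the code is
hypothetical, since the thread never gives the real one, and the function
names are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define NBPG 512u   /* VAX page size in bytes (assumed, as in 4.2BSD) */

/*
 * Smallest number of padding bytes that moves an instruction currently
 * at `addr` to the first byte of the next page.  An instruction that
 * already sits at the start of a page needs none.
 */
uint32_t pad_to_next_page(uint32_t addr)
{
    uint32_t off = addr % NBPG;

    return off == 0 ? 0 : NBPG - off;
}

/* Address of the instruction once `pad` bytes are inserted ahead of it. */
uint32_t shifted(uint32_t addr, uint32_t pad)
{
    assert(pad <= NBPG);
    return addr + pad;
}
```

With a made-up probe at 800013f9, pad_to_next_page() gives 7, and the
shifted address is 80001400.  Any n at least that large (and less than a
page) also clears the boundary, at the cost of a few wasted bytes.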

chris@umcp-cs.UUCP (Chris Torek) (08/13/85)

>No!  The fix is to insert a ".space n" directive immediately before
>the function in locore.s that is causing the problem.

Yes, this is better, but is also more difficult to apply.  It's
much harder to push down, e.g., the probew in vax/trap.c this way.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

ggs@ulysses.UUCP (Griff Smith) (08/15/85)

> >No!  The fix is to insert a ".space n" directive immediately before
> >the function in locore.s that is causing the problem.
> 
> Yes, this is better, but is also more difficult to apply.  It's
				   ^^^^^^^^^^^^^^^^^^^^^^^?
> much harder to push down, e.g., the probew in vax/trap.c this way.
>

Huh?  The code in question is

	.
	.
	clrl	12(r2)
	ret

_Copyin:	.globl	_Copyin		# <<<massaged for jsb by asm.sed>>>
	movl	12(sp),r0		# copy length
	blss	ersb
	movl	4(sp),r1		# copy user address
	cmpl	$NBPG,r0		# probing one page or less ?
	bgeq	cishort			# yes
ciloop:
	prober	$3,$NBPG,(r1)		# bytes accessible ?
	beql	ersb			# no
	addl2	$NBPG,r1		# incr user address ptr
	acbl	$NBPG+1,$-NBPG,r0,ciloop	# reduce count and loop
cishort:
	prober	$3,r0,(r1)		# bytes accessible ?
	beql	ersb			# no
	.
	.

Change it to the following:

	.
	.
	clrl	12(r2)
	ret

	.space 50			# kludge to move probers to next page

_Copyin:	.globl	_Copyin		# <<<massaged for jsb by asm.sed>>>
	movl	12(sp),r0		# copy length
	blss	ersb
	movl	4(sp),r1		# copy user address
	cmpl	$NBPG,r0		# probing one page or less ?
	bgeq	cishort			# yes
ciloop:
	prober	$3,$NBPG,(r1)		# bytes accessible ?
	beql	ersb			# no
	addl2	$NBPG,r1		# incr user address ptr
	acbl	$NBPG+1,$-NBPG,r0,ciloop	# reduce count and loop
cishort:
	prober	$3,r0,(r1)		# bytes accessible ?
	beql	ersb			# no
	.
	.

There is nothing difficult about this.  References to probes in C code
are another problem; you probably WILL need to pad with nop's.  So what?
Nothing says you have to use the same solution in both languages.  When
casting spells in assembly, use the tools that are available.
-- 

Griff Smith	AT&T Bell Laboratories, Murray Hill
Phone:		(201) 582-7736
Internet:	ggs@ulysses.uucp
UUCP:		ulysses!ggs  ( {allegra|ihnp4}!ulysses!ggs )
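
One thing to watch with padding: it shifts everything that follows, so it
may be worth checking that no other probe ends up near a page boundary
afterward.  A rough sketch of such a check in C follows.  The addresses in
the note after the code are hypothetical, the function names are invented
here, and "span" (instruction length plus however far the CPU prefetches
beyond it) is an assumption, since nothing in this thread states the real
figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBPG 512u   /* VAX page size in bytes (assumed, as in 4.2BSD) */

/*
 * Nonzero if the bytes [addr, addr + span) cross a page boundary.
 * `span` stands in for instruction length plus prefetch distance;
 * the right value is an assumption of this sketch.
 */
int crosses_page(uint32_t addr, uint32_t span)
{
    assert(span > 0 && span <= NBPG);
    return addr / NBPG != (addr + span - 1) / NBPG;
}

/*
 * Count the probes that would straddle a boundary after `pad` bytes
 * are inserted ahead of all of them.
 */
size_t count_at_risk(const uint32_t *probes, size_t n,
                     uint32_t pad, uint32_t span)
{
    size_t i, bad = 0;

    for (i = 0; i < n; i++)
        if (crosses_page(probes[i] + pad, span))
            bad++;
    return bad;
}
```

With made-up probes at 800013f0 and 800013fc and a span of 12, one probe
is at risk with no padding, none with 50 bytes of padding, and one again
with 10 bytes, because the first probe has then been pushed up against the
boundary.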

chris@umcp-cs.UUCP (Chris Torek) (08/15/85)

>>>No!  The fix is to insert a ".space n" directive immediately before
>>>the function in locore.s that is causing the problem.
 
>> Yes, this is better, but is also more difficult to apply.  It's
>				    ^^^^^^^^^^^^^^^^^^^^^^^?

>There is nothing difficult about this.  References to probes in C code
>are another problem; you probably WILL need to pad with nop's.  So what?
>Nothing says you have to use the same solution in both languages.

That is what I meant.  I considered different phrasing, but thought
that was the shortest that conveyed what I meant.  I guess it did,
but just barely.  So here's the long version:

	Yes, it's better in this case to use a .space directive to
	push the probe down, as that saves CPU time (as you pointed
	out).  However, if you encounter the same problem later
	with one of the probe instructions which is embedded within
	the C code, it will be much more difficult (though not
	impossible) to use .space or other magic to move the probe
	instruction.  Using nop's may be inefficient, but it is
	easy to implement in every case in which a solution to the
	probe bug must be applied; therefore I present it as the
	general solution.

How's that? :-)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland