[net.unix-wizards] system crashes trap type 8

tb@qucis.UUCP (Tom Bradshaw) (08/16/83)

system crashes with an error message ...

trap type 8, code=8c20004, pc=800137c6

The pc always points into the dzxint routine. The parameter passed
to dzxint at the time of the crash has the wrong value, so this
parameter, which is a pointer into the tty structure, points to
invalid data and the system crashes.
This can happen twice a day, generally during peak times.

questions are ...
1) In the trap message, what does "code=8c20004" mean?
2) If anyone has experienced similar problems, please reply with any
   helpful information.

hardware ... VAX 11/780 with 5 dz11
                             1 dup11
                             2 dr11c
                             1 uda50
                             1 interlan ethernet board
                             1 smv15 controller for ampex drives
                             1 optronics scanner
             running UNIX 4.1 BSD

v.wales%ucla-locus@sri-unix.UUCP (09/16/83)

From: Rich Wales <v.wales@ucla-locus>

	Date: 16 Aug 83 16:02:56-PDT (Tue)
	To: Unix-Wizards@brl-vgr
	From: decvax!linus!utzoo!utcsrgv!qucis!tb@ucb-vax
	Subject: system crashes trap type 8

	system crashes with an error message ...

	trap type 8, code=8c20004, pc=800137c6

	The pc always points to the dzxint routine. The parameter
	passed to the dzxint routine, at the time of the crash, has the
	wrong value.  Therefore this parameter, which is a pointer into
	the tty structure is pointing to invalid information causing
	the system to crash.  This can happen twice a day, generally
	during peak times.

	questions are ...
	1) In the trap message what does "code=8c20004" mean.
	2) If anyone has experienced similar problems please reply any
	   helpful information.

Sorry this reply is so late -- but (hopefully) late is better than
never . . . .

I once encountered something very similar to your problem.  I was
testing some mods to the DZ pseudo-DMA routine in "sys/locore.s" in our
4.1BSD kernel.  I made a mistake in my usage of registers (I think I
forgot to save and restore some register); as a consequence, the system
promptly screamed "trap type 8" as soon as it tried to access a DZ --
followed, I believe, by about half a page of "trap type 2" until it
finally dropped dead.  I fixed the problem by being very sure I saved
and restored -- via the "pushr" and "popr" instructions -- any
registers I played with (something which you tend to forget about if
you do all your programming in C).

In the on-line manual ("man 8 crash"), there is a list of trap type
codes.  Trap types 8 and 9, for example, indicate attempts by the
kernel to access invalid addresses.  The "code" value is the address
the CPU was trying to access.  For more details, study the kernel code
that produced the error message (in "sys/trap.c"); you will see that,
with only a couple of exceptions, any trap in kernel mode will result
in a panic like the one you saw.
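
As an illustration of how the "code" value might be read, here is a minimal
sketch in C.  It assumes, as stated above, that "code" is the virtual address
the CPU was trying to access, and it classifies a 32-bit VAX virtual address
by its top two bits.  The names vax_region_of and vax_region_name are
hypothetical helpers for this example and are not part of the 4.1BSD sources.

```c
/* Hypothetical helpers, not from the 4.1BSD sources. */
enum vax_region { VAX_P0, VAX_P1, VAX_SYSTEM, VAX_RESERVED };

/* The top two bits of a 32-bit VAX virtual address select the region:
 * 0 = P0 (program), 1 = P1 (control), 2 = system, 3 = reserved. */
static enum vax_region vax_region_of(unsigned long va)
{
	return (enum vax_region)((va >> 30) & 3);
}

/* Printable name for a region, for use in a diagnostic message. */
static const char *vax_region_name(enum vax_region r)
{
	switch (r) {
	case VAX_P0:     return "P0 (program region)";
	case VAX_P1:     return "P1 (control region)";
	case VAX_SYSTEM: return "system region";
	default:         return "reserved";
	}
}
```

Under these assumptions, vax_region_of(0x800137c6) (the reported pc) gives
VAX_SYSTEM, as expected for kernel text, while vax_region_of(0x08c20004) (the
reported code) gives VAX_P0.  A faulting address outside system space would be
consistent with a garbage pointer rather than a legitimate kernel data address.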

What you should try to do is look at the machine language at the point
of the crash, using the "i" (disassemble) command in "adb /vmunix".
Then figure out which register or memory location could be at fault (I
would guess that you probably smashed a register by accident).  From
there, I'm afraid you're on your own.

Recall also, by the way, that "dzxint" is NOT the "real" DZ transmitter
interrupt routine in 4.1BSD -- "dzdma" in "sys/locore.s" is.  "dzdma"
calls "dzxint" (indirectly via the "p_fcn" value in a "struct pdma";
see "h/pdma.h") when it needs more bytes to stuff down a DZ line.
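
The indirect call can be pictured with the following C sketch.  It is a
simplified stand-in, not the real "h/pdma.h": only p_fcn is named above, and
the p_arg field, fake_dzxint, and pdma_refill are assumptions made for this
illustration.  The point is that the handler receives whatever value the
caller hands it, so a clobbered register or a wrong table entry in the
assembly routine arrives in "dzxint" as a bad pointer.

```c
/* Simplified stand-in for "struct pdma".  Only p_fcn is named in the
 * post; p_arg is an assumed field carrying the handler's argument. */
struct pdma {
	void	*p_arg;			/* assumed: argument handed to p_fcn */
	void	(*p_fcn)(void *);	/* handler called when more bytes are needed */
};

static void *last_arg;			/* records what the handler received */

/* Stand-in for dzxint: it just remembers its argument. */
static void fake_dzxint(void *arg)
{
	last_arg = arg;
}

/* Stand-in for what the pseudo-DMA routine does when the buffer is
 * empty: call the handler indirectly with the value in the entry. */
static void pdma_refill(struct pdma *dp)
{
	(*dp->p_fcn)(dp->p_arg);
}
```

In this sketch nothing between the table entry and the handler validates
p_arg, which mirrors why a register mistake in the assembly code would show
up only later, inside the C routine.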

If you haven't been playing around with your "sys/locore.s", I must
confess perplexity and confusion as to what is causing your problem.

-- Rich <wales@UCLA-LOCUS>

wls@astrovax.UUCP (09/24/83)

  We have seen this problem ourselves: a panic with trap type 8 at an address
in the interrupt handler for dz3, followed by a series of trap type 2's.
This is very frustrating, as there is no core dump; the system is just too sick.
We solved this (or rather, shoved it under the rug) by removing all terminals
from dz3, which was possible because we had just acquired a DH11 emulator from
Emulex and had surplus ports.  We have not touched the DZ driver.
  However, the fun continues.  Periodically we get a trap type 8 followed by
a string of trap type 2's, again with no core dump.  This time the trap type
8 is in ubasetup() and the trap type 2's come from the REI instruction
of the console output interrupt routine (we have a VAX 11/750).  Clearly the
stack is messed up.  This usually occurs when running some device that does
unbuffered ("raw") DMA.  We also periodically get a "dup iodone" panic, again
followed by a string of trap type 2's from the console output interrupt
routine, and also provoked by raw DMA (usually from the Versatec).  These
bugs are frustrating: they occur too infrequently to justify a major effort
(though when the "dup iodone" bug does occur, it often recurs several times
that day), and they produce no core dump, which makes them hard to track down.
The only thing I can think of doing is to insert some code that catches these
panics in their tracks: rather than letting them print (the system does not
seem to be able to print at that point), hang in a tight loop so that I can
examine things from the console.
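
A rough sketch of such a catch-and-hang hook, in C, might look like the
following.  Every name here (panic_record, panic_rec, panic_release,
panic_catch) is hypothetical and does not exist in 4.1BSD; a real kernel
version would presumably also need to block interrupts before spinning.

```c
/* Hypothetical debugging hook; none of these names exist in 4.1BSD. */
struct panic_record {
	const char *msg;	/* panic string, saved instead of printed */
	int	    count;	/* number of times the hook was entered */
};

static struct panic_record panic_rec;

/* Cleared by default; patching it to nonzero from the console or a
 * debugger lets execution continue past the loop. */
static volatile int panic_release;

/* Save the panic string where it can be examined, then spin. */
static void panic_catch(const char *msg)
{
	panic_rec.msg = msg;
	panic_rec.count++;
	while (!panic_release)
		;		/* tight loop: machine state stays frozen */
}
```

The idea is that the saved record sits at a fixed, known location, so it can
be inspected from the console without relying on the kernel's own printing.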
  Are there known bugs in the panic routine for which there have been
fixes reported? Our copy of 4.1 BSD dates from June 1982 and I have only
been reading Usenet since April this year.
  William L. Sebok {allegra,cbosgd,decvax,ihnp4,kpno}!astrovax!wls

swatt@ittvax.UUCP (Alan S. Watt) (09/26/83)

The first thing to check when you get "trap type 8" coming from some
device interrupt routine in a long-stable kernel is the boot-time
configuration information printed out.  Look for some device which
has dropped a bit in the interrupt vector setting and is now assigned
to the same interrupt vector as some other device.

As the meaning of minor device numbers passed to each interrupt routine
varies from device to device, you can get havoc if you call the wrong
interrupt routine.  I have seen this happen with Interlan Ethernet
boards, where the vector assignment switch is prone to being brushed by
cables on insertion and removal.

Unfortunately, the 4.1bsd autoconfigure routines will not detect two
devices which both claim to interrupt on the same (or overlapping)
vectors.  The last device to claim a vector will get it.
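
A check of the kind that is missing could be sketched in C as follows.  This
is not the 4.1bsd autoconfigure code; NVEC, vec_owner, and claim_vectors are
hypothetical names, and vectors are treated as abstract slot numbers for the
purpose of the illustration.

```c
#define NVEC 128	/* assumed number of vector slots; illustrative only */

static const char *vec_owner[NVEC];	/* device claiming each slot, or 0 */

/* Claim nvec consecutive vector slots starting at 'first' for 'name'.
 * Returns 0 on success.  Returns -1, claiming nothing, if the range is
 * out of bounds or any slot in it already has an owner. */
static int claim_vectors(const char *name, int first, int nvec)
{
	int i;

	if (first < 0 || nvec <= 0 || first > NVEC - nvec)
		return -1;
	for (i = first; i < first + nvec; i++)
		if (vec_owner[i] != 0)
			return -1;	/* same or overlapping vector */
	for (i = first; i < first + nvec; i++)
		vec_owner[i] = name;
	return 0;
}
```

With a table like this, a second device claiming a slot that is already owned
would be rejected and could be reported at boot time, instead of silently
replacing the first device's handler.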

Also the interrupt routines for most Unibus drivers I have seen are
much too trusting of the unit argument they get.  For example, the code
in dz.c (4.1bsd) now reads:

	/*ARGSUSED*/
	dzrint(dz)
		int dz;
	{
		register struct tty *tp;
		register int c;
		register struct device *dzaddr;
		register struct tty *tp0;
		register int unit;
		int overrun = 0;
	 
		if ((dzact & (1<<dz)) == 0)
			return;
		unit = dz * 8;
		dzaddr = dzpdma[unit].p_addr;
		tp0 = &dz_tty[unit];
		while ((c = dzaddr->dzrbuf) < 0) {	/* char present */

			<...>

Clearly, if the 'dz' unit argument passed is out of range for legal DZ
unit numbers, there are numerous opportunities to tromp on random
memory.  If you're lucky, you will get a segmentation violation and a
crash.  If you're not lucky, you will twiddle bits in some essentially
random location.  A check similar to that in "dzopen" would not be at
all out of place.  The problem with the DZ driver is more serious,
however, because transmitter interrupts are first vectored to the
machine language pseudo-DMA routine, and range checks would have to be
performed there as well.
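
A minimal sketch of such a range check, in C, is shown below.  It is not the
check from "dzopen"; NDZ and dz_unit_ok are hypothetical names, and the value
5 for NDZ is only taken from the hardware list in the original report.

```c
#define NDZ 5		/* assumed number of configured DZ11s */

static int dzact;	/* bit n set when DZ unit n is active, as in the excerpt */

/* Hypothetical guard: nonzero only if 'dz' is a legal, active unit.
 * The bounds test comes first so that the shift below is never done
 * with a negative or oversized count. */
static int dz_unit_ok(int dz)
{
	if (dz < 0 || dz >= NDZ)
		return 0;
	return (dzact & (1 << dz)) != 0;
}
```

An interrupt routine could then begin with "if (!dz_unit_ok(dz)) return;"
before computing any index from its argument.  Note that the existing test
on dzact alone does not give this protection, since shifting by an
out-of-range 'dz' has no reliable result.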

	- Alan S. Watt
	{decvax,purdue,lbl-csam,research,allegra}!ittvax!swatt