[comp.periphs] Tower 32/600, Trap Type 2

jsloan@wright.EDU (John Sloan) (08/24/87)

We're hoping that there are some other NCR Tower users out in netland
that may have seen this problem before. We're getting frequent system
crashes of one of our two Tower 32/600s. The error is Trap Type 2,
"Unexpected trap in kernel space". Trap Type 2 appears to be a
segmentation fault.

The system is configured with two 141MB winchesters, streaming tape,
5.25" flex disk, 4MB RAM, two HPSIO terminal multiplexors (8 terminal
ports each), and an Excelan ethernet board with on-board TCP/IP. It
runs UNIX System V, and our kernel configuration is shown below. Nope,
we don't have software support nor do we have source for these systems.

A sample error log is shown below, but although all of the crashes are
Trap Type 2 as shown, this is the only one that made it into the error
log. This has been happening about once every three hours on the
average for the past two weeks, while under medium to heavy load: 8 to
16 users, some logged in directly, some rlogged in, all running vi, cc,
etc. in a pretty typical student software development environment.

About half of the traps display an "fcode" of 165 and the rest display
"115". We suspect this "function code" is a displacement into a jump
table, and so might offer a clue as to which kernel routine was being
called when the failure occurred.
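
Since the log does not say which base the fcode is printed in, one
cheap check is to look at both values under several readings. The
sketch below is only an illustration: the helper name radix_views is
hypothetical, and reading the codes as hex (as eaddr and the program
counter appear to be) is an assumption, not something the log confirms.

```python
def radix_views(text):
    """Interpret a printed code as octal, decimal and hex.

    Which base the kernel really uses for fcode is unknown; a base is
    skipped if the digits are not valid in it.
    """
    views = {}
    for name, base in (("octal", 8), ("decimal", 10), ("hex", 16)):
        try:
            views[name] = int(text, base)
        except ValueError:
            pass
    return views


for code in ("165", "115"):
    for name, value in radix_views(code).items():
        # Print the bit pattern so shared or differing bits stand out.
        print(f"fcode {code} read as {name:7}: {value:5d} = {value:012b}")

# Bits that differ between the two codes, ASSUMING both are hex.
diff = int("165", 16) ^ int("115", 16)
print(f"hex readings differ in bits {diff:012b}")
```

If the codes turn out to be bit fields rather than table offsets, the
bits the two values share may be more telling than the numbers
themselves.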

We've perused the vendor-supplied manuals, the SVID, Bach's book, etc.
We've disassembled the kernel and parts of the C library and are trying
to determine the (undocumented as far as we can tell) low level kernel
service routine calling conventions, which may give us a clue into what
kernel routines (if we're on the right track at all) 165 and 115 are.

We thought perhaps the fcode was the first argument of the
(undocumented) "syscall" system call interface, but perusing all the
available header files, as well as our System V and Berkeley source for
the VAX and the Ultrix-32 header files, turned up nothing. Okay, admittedly
these are all long shots, although at least syscall is documented under
Berkeley, and Ultrix has a SysV emulation that we've successfully used
to prototype the same programs we're running on these Towers.  Now
we're using adb to find where the "trap" instruction vector really
points, and then reading that code to see if we can figure out what
these function codes mean.

We suspect a problem with the 141MB system disk, because we were
getting Trap Type 2's from an identical Tower, but the crashes went
away after the system disk failed hard and had to be replaced.  We're
following the obvious strategy of replacing the disk and seeing what
happens, but these 141MB disks seem in short supply. Fortunately,
NCR seems quite willing to replace the disk, so as soon as one is
available we will do so. I've tried to take off enough panels to see
who actually manufactures the winchesters, but can't find any label
without actually uncabling and pulling the disk module.

The system seems to fail more frequently under load (8 to 16 users).
This could be due to heavy activity on the disk, or to users hitting a
real software problem in the system. Our students are doing stuff with
SysV semaphores and message queues, so a problem in the kernel (or with
our configuration of it) is also plausible. Since the error is a
segmentation fault, it may be that the kernel is simply not robust
enough to recover from bogus pointers etc. passed to the semaphore and
message queue routines. In a student environment this is a real hazard,
and we are going to move to another method of teaching the same material
on these systems.

The system performs gracefully even with 16 users, so if the problem
were just load, we would expect to see some degradation before it
failed. We're used to UNIX boxes slowing down under heavy load, maybe
even overflowing some kernel table, but not just crashing.

I know this is not a lot of information, but we're wondering if anyone
has had similar problems. Or, are there known problems with these
disks? We've been told that the 141MB models have been in the field a
short time, at least in the Towers.

Below is an error log entry from one of the traps.

--------

Date/Time of Kernel Incident: Thu Aug 13 06:12:12 1987
Log sequence no:    2    Process name: KERN    Pid: ffff
Log month/day hr:min:sec: Aug 13 05:57:45   Error type: SOFTWARE
Error code 1: panic unexpected trap in kernel space,    Error code 2: 0000

TRAP INTERRUPT -- Thu Aug 13 05:57:46 1987
       Trap type = 2
       Trap fcode = 165
       Trap eaddr = f50000
       Trap ireg = 3014
       Program counter = e11802

--------
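
For anyone who wants to compare entries like the one above across
several crashes, here is a minimal sketch that pulls the "name = value"
lines into a dictionary. The function name parse_trap and the choice of
which fields to treat as hex are assumptions; the log itself does not
state a radix, so type and fcode are left as strings.

```python
import re

SAMPLE = """\
TRAP INTERRUPT -- Thu Aug 13 05:57:46 1987
       Trap type = 2
       Trap fcode = 165
       Trap eaddr = f50000
       Trap ireg = 3014
       Program counter = e11802
"""

# Fields ASSUMED to be printed in hex; the log does not say.
HEX_FIELDS = {"Trap eaddr", "Trap ireg", "Program counter"}


def parse_trap(text):
    """Return the 'name = value' lines of a trap entry as a dict."""
    fields = {}
    for match in re.finditer(r"^\s*(.+?)\s*=\s*(\S+)\s*$", text, re.MULTILINE):
        name, raw = match.group(1), match.group(2)
        # Values of unknown radix (type, fcode) stay as strings.
        fields[name] = int(raw, 16) if name in HEX_FIELDS else raw
    return fields


trap = parse_trap(SAMPLE)
for name, value in trap.items():
    print(name, "=", hex(value) if isinstance(value, int) else value)

# Under the hex assumption, check whether the fault address sits on a
# 64KB boundary (low 16 bits all zero).
print("eaddr on 64KB boundary:", trap["Trap eaddr"] % 0x10000 == 0)
```

With more than one logged trap, the same dictionary form makes it easy
to see whether eaddr and the program counter repeat from crash to
crash.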

Our kernel configuration is shown below. Increasing the map tables to
support semaphores and message queues was obviously necessary, because
we're using these features in the courses that run on these systems. The
other changes were mostly done via not-so-educated guesswork. It's quite
possible that in our ignorance we've done something really brain damaged
in this configuration.

--------

*****************************
*dev	vect	addr	prio*
*****************************
hpm	0	0	0
hp 	700	761200	2
hp	704	761300	0
nec	1534	172520	0
ex	4	0	7
xtty	0	0	0
xm	0	0	0
xso	0	0	0
wd	460	172220	1
tp	3	0	7
*at10
**********************
*name	dev	minor*
**********************
root	wd	04
*swap device minor swplo nswap (Note: This nswap is a placeholder only.)
swap	wd	74	1	20000
pipe	wd	04
dump	tp	0
************
*parm	val*
************
power	1
mesg	1
msgmap	1024
msgmni	100
msgtql	256
msgseg	4096
msgssz	128
sema	1
semmap	1024
semmni	256
semmns	128
semmnu	256
semmsl	32
semopm	16
semume	16
shmem	1
buffers	256
inodes	300
files	300
mounts	8
swapmap	100
calls	100
procs	200
texts	50
clists	660
klns	1
* monflg = 0 if no debug monitor in firmware
monflg	0
dmanpb	8
* hashbuf must be a power of 2
hashbuf	256
maxproc	35

---------
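
As a rough sanity check on the parameters above, the sketch below
verifies the one constraint the configuration file itself states
(hashbuf must be a power of 2) and works out how much memory the
message pool would take if it is msgseg segments of msgssz bytes each,
which is our understanding of those tunables. The helper names are
hypothetical, and treating the values as decimal is an assumption.

```python
CONFIG = """\
msgseg  4096
msgssz  128
* hashbuf must be a power of 2
hashbuf 256
"""


def parse_parms(text):
    """Read 'name value' lines, skipping blanks and '*' comments."""
    parms = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("*"):
            continue
        name, value = line.split()[:2]
        parms[name] = int(value)  # ASSUMES decimal values
    return parms


def is_power_of_two(n):
    """True when n is positive and has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


parms = parse_parms(CONFIG)
print("hashbuf power of 2:", is_power_of_two(parms["hashbuf"]))

# Message pool size, ASSUMING msgseg segments of msgssz bytes each.
pool_bytes = parms["msgseg"] * parms["msgssz"]
ram_bytes = 4 * 1024 * 1024  # the 4MB RAM from the hardware description
print(f"message pool: {pool_bytes // 1024} KB "
      f"({100 * pool_bytes / ram_bytes:.1f}% of RAM)")
```

If that reading of msgseg and msgssz is right, the pool alone is a
noticeable share of a 4MB machine, which may be worth keeping in mind
when judging the guessed values.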

We're working on reliably reproducing the problem, but so far with
little success. Thanks in advance for any suggestions.

-- John

-- 
John Sloan  CSNET: jsloan@CS.Wright.EDU UUCP: ...!cbosgd!wright!jsloan
Computer Science Department, Wright State University, Dayton OH, 45435        
+1 513 873 2491   belong(opinions,jsloan). belong(opinions,_):-!,fail.
The only thing that depreciates faster than a computer is fresh fruit.