jsloan@wright.EDU (John Sloan) (08/24/87)
We're hoping that some other NCR Tower users out in netland have seen this problem before. One of our two Tower 32/600s is crashing frequently. The error is Trap Type 2, "Unexpected trap in kernel space". Trap Type 2 appears to be a segmentation fault.

The system is configured with two 140MB winchesters, streaming tape, 5.25" flex disk, 4MB RAM, two HPSIO terminal multiplexors (8 terminal ports each), and an Excelan ethernet board with on-board TCP/IP. It runs UNIX System V, and our kernel configuration is shown below. Nope, we don't have software support, nor do we have source for these systems.

A sample error log is shown below. All of the crashes are Trap Type 2 as shown, but this is the only one that made it into the error log. This has been happening about once every three hours on average for the past two weeks, under medium to heavy load: 8 to 16 users, some logged in directly, some rlogged in, all running vi, cc, etc. in a pretty typical student software development environment.

About half of the traps display an "fcode" of 165 and the rest display 115. It seems likely that this "function code" is a displacement into a jump table, and so might offer a clue as to which kernel routine was being called when the failure occurred. We've perused the vendor-supplied manuals, the SVID, Bach's book, etc. We've disassembled the kernel and parts of the C library and are trying to determine the (undocumented, as far as we can tell) low-level kernel service routine calling conventions, which may tell us which kernel routines (if we're on the right track at all) 165 and 115 are. We thought perhaps the fcode was the first argument of the (undocumented) "syscall" system call interface, but perusing all the available header files, as well as our System V and Berkeley source for the VAX and the Ultrix-32 header files, has turned up nothing to confirm this.
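One cheap check on the jump-table guess is to see what 165 and 115 look like in each base the kernel might be printing them in, and which bits separate them. The sketch below does only that arithmetic. The helper name `interpretations` is made up for illustration, and which base the trap handler actually prints in is not known from the log.

```python
# Sketch only: the base used by the trap message is an assumption, so try all three.

def interpretations(text):
    """Read a digit string as decimal, octal and hex; None where the digits are invalid."""
    result = {}
    for name, base in (("decimal", 10), ("octal", 8), ("hex", 16)):
        try:
            result[name] = int(text, base)
        except ValueError:
            result[name] = None
    return result


FCODES = ("165", "115")  # the two fcode values reported in the post

if __name__ == "__main__":
    first = interpretations(FCODES[0])
    second = interpretations(FCODES[1])
    for name in ("decimal", "octal", "hex"):
        a, b = first[name], second[name]
        if a is None or b is None:
            continue
        # XOR shows exactly which bits differ between the two codes.
        print(f"{name:8s} {a:5d} {b:5d}  differing bits: {a ^ b:#06x}")
```

If the values are printed in hex, the two codes differ only in bits 4 through 6 (0x165 XOR 0x115 is 0x070) and share everything else. That would be worth keeping in mind as an alternative to the table-offset reading, since a packed status word would behave the same way, but that is speculation.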
Okay, admittedly these are all long shots, although at least syscall is documented under Berkeley, and Ultrix has a SysV emulation that we've successfully used to prototype the same programs we're running on these Towers. Now we're trying to use adb to figure out where the "trap" instruction vector really points, and then looking at that code to see if we can figure out what these function codes mean.

We suspect a problem with the 141MB system disk, because we were getting Trap Type 2's from an identical Tower, and those crashes went away after its system disk failed hard and had to be replaced. We're following the obvious strategy of replacing the disk and seeing what happens, but these 141MB disks seem to be in short supply. Fortunately, NCR seems quite willing to replace the disk, so as soon as one is available we will do so. I've tried to take off enough panels to see who actually manufactures the winchesters, but can't find any label without actually uncabling and pulling the disk module.

The system seems to fail more frequently under load (8 to 16 users). This could be due to heavy activity on the disk, or to users hitting a real software problem in the system. Our students are doing stuff with SysV semaphores and message queues, so a problem in the kernel (or with our configuration of it) is also plausible. Since the error is a segmentation fault, it may be that the kernel is simply not robust enough to recover from bogus pointers etc. passed to the semaphore and message queue routines. In a student environment this is a real hazard, and we are going to move to another method of teaching the same material on these systems.

The system performs gracefully even with 16 users, so if the problem were just load, we would expect to see some degradation before it failed. We're used to UNIX boxes slowing down under heavy load, maybe even overflowing some kernel table, but not just crashing.
I know this is not a lot of information, but we're wondering if anyone has had similar problems. Or, are there known problems with these disks? We've been told that the 141MB models have been in the field a short time, at least in the Towers.

Below is an error log entry from one of the traps.

--------
Date/Time of Kernel Incident: Thu Aug 13 06:12:12 1987
Log sequence no: 2
Process name: KERN    Pid: ffff
Log month/day hr:min:sec: Aug 13 05:57:45
Error type: SOFTWARE
Error code 1: panic unexpected trap in kernel space,
Error code 2: 0000
TRAP INTERRUPT -- Thu Aug 13 05:57:46 1987
Trap type = 2
Trap fcode = 165
Trap eaddr = f50000
Trap ireg = 3014
Program counter = e11802
--------

Our kernel configuration is shown below. Increasing the map tables to support semaphores and message queues was obviously necessary, because we're using these features in the courses that run on these systems. The other changes were mostly done via not so educated guesswork. It's quite possible that in our ignorance we've done something really brain damaged in this configuration.

--------
*****************************
*dev    vect    addr    prio*
*****************************
hpm     0       0       0
hp      700     761200  2
hp      704     761300  0
nec     1534    172520  0
ex      4       0       7
xtty    0       0       0
xm      0       0       0
xso     0       0       0
wd      460     172220  1
tp      3       0       7
*at10
**********************
*name   dev     minor*
**********************
root    wd      04
*swap   device  minor   swplo   nswap   (Note: This nswap is a placeholder only.)
swap    wd      74      1       20000
pipe    wd      04
dump    tp      0
************
*parm   val*
************
power   1
mesg    1
msgmap  1024
msgmni  100
msgtql  256
msgseg  4096
msgssz  128
sema    1
semmap  1024
semmni  256
semmns  128
semmnu  256
semmsl  32
semopm  16
semume  16
shmem   1
buffers 256
inodes  300
files   300
mounts  8
swapmap 100
calls   100
procs   200
texts   50
clists  660
klns    1
* monflg = 0 if no debug monitor in firmware
monflg  0
dmanpb  8
* hashbuf must be a power of 2
hashbuf 256
maxproc 35
---------

We're working on reliably reproducing the problem, but so far with little success.
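Since the configuration was admittedly tuned by guesswork, a few of the parm/val settings above can be cross-checked mechanically. The sketch below is not an NCR tool. It assumes the parameters mean what they do in stock System V (msgssz as bytes per message segment, semmns as the system-wide semaphore count, semmni as the number of semaphore sets), which may not hold for the Tower kernel. The only rule taken from the config file itself is the comment that hashbuf must be a power of 2. The function names are made up for illustration.

```python
# Values copied from the parm/val section of the posted configuration.
PARMS_TEXT = """
msgmap 1024
msgmni 100
msgtql 256
msgseg 4096
msgssz 128
semmap 1024
semmni 256
semmns 128
semmnu 256
semmsl 32
* hashbuf must be a power of 2
hashbuf 256
procs 200
maxproc 35
"""


def parse_parms(text):
    """Parse 'name value' lines, skipping blank lines and '*' comment lines."""
    parms = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue
        name, value = line.split()
        parms[name] = int(value)
    return parms


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; relies on n & (n - 1) clearing the lowest set bit."""
    return n > 0 and (n & (n - 1)) == 0


def check(parms):
    """Return a list of findings; every rule except hashbuf is an assumption."""
    findings = []
    if not is_power_of_two(parms["hashbuf"]):
        findings.append("hashbuf is not a power of 2")
    # Assumption: message text pool = number of segments * bytes per segment.
    pool = parms["msgseg"] * parms["msgssz"]
    findings.append("message pool: %d bytes" % pool)
    # Assumption: every semaphore set needs at least one semaphore.
    if parms["semmns"] < parms["semmni"]:
        findings.append(
            "semmns (%d) < semmni (%d)" % (parms["semmns"], parms["semmni"])
        )
    return findings


PARMS = parse_parms(PARMS_TEXT)

if __name__ == "__main__":
    for finding in check(PARMS):
        print(finding)
```

Under those assumptions the message pool comes to 512KB on a 4MB machine, and there are fewer semaphores than semaphore sets. Neither is shown to cause the trap, but both are the sort of thing worth ruling out before blaming the disk.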
Thanks in advance for any suggestions.

-- John

-- 
John Sloan  CSNET: jsloan@CS.Wright.EDU  UUCP: ...!cbosgd!wright!jsloan
Computer Science Department, Wright State University, Dayton OH, 45435
+1 513 873 2491  belong(opinions,jsloan). belong(opinions,_):-!,fail.
The only thing that depreciates faster than a computer is fresh fruit.