[comp.unix.xenix] SUMMARY: General Protection Trap

root@blender.UUCP (Herb Peyerl) (09/10/89)
Several weeks ago I asked the net about a problem I had been having
regarding a "General Protection Trap" kernel panic.  I had intended
to follow up on majority advice next time the problem cropped up. 
However, the problem seems to have disappeared.  It hasn't happened
for over two weeks now and the system's been busier than ever.  It's
a strange one but here's a summary of the help I've received.  
	A big thanx to Chip Rosenthal, Jim Morton, and Ragnar Paulson
for offering help.  It is very much appreciated.

**************************************

From:	chip@vector.dallas.tx.us (Chip Rosenthal)

The key thing is the IP value, so you can find out where the thing was
when it crashed (via adb).  Hopefully, it is within a device driver, so
you can then easily go off and do some finger pointing.  Unfortunately,
if it's outside a device driver it's pretty tough to run this one down...

If you get any good suggestions, you might want to summarize to the net.

**************************************

From: ubc-cs!uunet!applix!jim (Jim Morton)

Sounds like a memory problem (Is this machine a Wyse?). Try turning the
machine's speed down or switching to 1 wait state if possible. Note: the
diagnostics from most vendors will NOT catch these problems. Wyse machines,
and machines from other vendors that use banks of 8 memory chips instead of 9,
have "non-parity memory", which means that when you get a memory failure, the
machine crashes.

***************************************

From:	Ragnar Paulson <wilma!ragnar@uunet.UU.NET>

Herb,
    In your posting you ask:
> 
>   [Inclusion of my posting removed for brevity]
> 
Yep.  In particular you should record the cs:ip values.
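
As a small illustration of what those two numbers hold, the sketch below
splits a cs:ip pair such as 18:3567 (the example value used further down)
and decodes the selector fields as they are laid out in 286/386 protected
mode: the low two bits are the requested privilege level, bit 2 picks the
GDT or LDT, and the remaining bits are the descriptor index.  The function
name is made up for illustration, and the sketch assumes the panic message
prints both values in hex.

```python
def decode_cs_ip(text):
    """Split a 'cs:ip' string (both fields assumed to be hex) into parts."""
    cs_text, ip_text = text.strip().split(":")
    cs = int(cs_text, 16)
    ip = int(ip_text, 16)
    return {
        "cs": cs,
        "ip": ip,
        "rpl": cs & 0x3,                       # requested privilege level
        "table": "LDT" if cs & 0x4 else "GDT",  # table indicator bit
        "index": cs >> 3,                      # descriptor table index
    }

info = decode_cs_ip("18:3567")
print(info)
```

For 18:3567 this reports GDT entry 3 at privilege level 0, which is what
one would expect for a trap taken in kernel code.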

A general protection trap is caused by a kernel write to an invalid
location.  Often this location is 0:0.  It is my contention
that user software, no matter how badly written, should not cause
a kernel panic.  So this problem is a bug in some kernel routine.

Almost always the bug is in one of your add on peripherals, such as
the AT-Vantage.  A panic can also be caused by a hardware failure in
such a peripheral.  Then the hardware may cause the controlling software
to get an invalid pointer.  I suppose really robust software could even
handle this, but that often proves to be very difficult.  Seeing as
you have not recently added new software, it may be that you have a 
hardware problem.

If you have a development system, you can use adb to find out where the
panic is occurring.  For example, if the cs:ip value is 18:3567
then use adb as follows:

	adb /xenix
	* $x
	* 18:3567?ia
		siointr+45	mov	[bx], 0

	* $q

Adb prompts with "*".  After typing in the address?ia, adb will
print out the routine address and instruction that the address corresponds
to.  In this example, it is the interrupt service routine of the "sio" driver.
Of course any kernel I have is different from the one you have.
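
The "routine+offset" form that adb prints is just the nearest symbol at or
below the address, plus the distance from it.  A minimal sketch of that
lookup follows, assuming a sorted symbol table is available.  The table
contents here are invented for the example (including the name siostart
and every address) and are not taken from any real kernel.

```python
import bisect

# Hypothetical symbol table: (start offset, name), sorted by offset.
# These addresses are invented for the example.
SYMBOLS = [
    (0x1000, "copyseg"),
    (0x3522, "siointr"),
    (0x3600, "siostart"),
]

def nearest_symbol(offset, symbols=SYMBOLS):
    """Return (name, displacement) for the last symbol at or below offset."""
    starts = [start for start, _ in symbols]
    pos = bisect.bisect_right(starts, offset) - 1
    if pos < 0:
        return None  # offset lies below the first known symbol
    start, name = symbols[pos]
    return name, offset - start

name, disp = nearest_symbol(0x3567)
print("%s+%x" % (name, disp))  # prints "siointr+45" with the table above
```

With a real kernel the table would come from the kernel's own symbol
information, so the routine names and displacements will differ.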

If you don't have the development system, or the above exercise yields
something useless like copyseg(), then you will just have to remove
peripherals one at a time until the problem goes away.  Once you have
identified the source of the problem, then call the manufacturer and
get them to fix it.

-- 
UUCP: herb@blender.UUCP   ||  ...calgary!xenlink!blender!{herb||root}
ICBM: 51 03 N / 114 05 W
"The other day, I...... No wait... That wasn't me!" <Steven Wright>