[comp.unix.xenix] What do panic messages mean?

root@conexch.UUCP (Larry Dighera) (05/04/88)

Does anyone have any idea what is causing these panics?   
Is the format of the panic message documented anywhere so that one
can interpret them?  I realize that the third through fifth lines are
register dumps of the 80286, and the last line apparently refers to
an out of bounds address (?).  But, how does one glean any meaning 
from this?  Is the cause hardware or software related?  Should I be 
looking at the accounting file to see what was running at the time of
the panic?

Fri Apr 8 13:41:40
TRAP 000D in SYSTEM
ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
bp=03A4, fl=0206, uds=0018, es=0020
pc=0038:9AA1,  ksp=0388
panic: general protection trap

Thu Apr 28 17:32:11
TRAP 000D in SYSTEM
ax=1B00, bx=0098, cx=0000, dx=0000, si=0026, di=6726
bp=03A4, fl=0206, uds=0018, es=000C
pc=0038:9AA1,  ksp=0388
panic: general protection trap

Mon May 2 9:56:00
TRAP 000D in SYSTEM
ax=0005, bx=BD7A, cx=0000, dx=00A1, si=0002, di=0594
bp=039A, fl=0202, uds=0018, es=004F
pc=0030:A4C5,  ksp=0376
panic: general protection trap

System vitals:
        SCO Xenix 2.2.0
        Atronics AT compatible running at 10 MHz one wait state
        1 MB RAM on system board
        3 MB RAM on American Micronics "Elephant" board
        American Micronics 8 port "LAMB" card
        Archive Tape system
        Monochrome/graphics card
        WD HD/Floppy controller
        Maxtor XT1085 & XT1140 Hard drives
        Teac 1.2 MB and 360 K floppies
        200 W. power supply

Please respond via E-Mail to: conexch!root
I will post a summary of useful responses.

Thanks in advance.
-- 
USPS: The Consultants' Exchange, PO Box 12100, Santa Ana, CA  92712
TELE: (714) 842-6348: BBS (N81); (714) 842-5851: Xenix guest account (E71)
UUCP: conexch Any ACU 2400 17148425851 ogin:-""-ogin:-""-ogin: nuucp
UUCP: ...!ucbvax!ucivax!icnvax!conexch!root || ...!trwrb!ucla-an!conexch!root

dave@micropen (David F. Carlson) (05/06/88)

In article <308@conexch.UUCP>, root@conexch.UUCP (Larry Dighera) writes:
> 
> Does anyone have any idea what is causing these panics?   
> Is the format of the panic message documented anywhere so that one
> can interpret them?  I realize that the third through fifth lines are
> register dumps of the 80286, and the last line apparently refers to
> an out of bounds address (?).  But, how does one glean any meaning 
> from this?  Is the cause hardware or software related?  Should I be 
> looking at the accounting file to see what was running at the time of
> the panic?
> 
> Fri Apr 8 13:41:40
> TRAP 000D in SYSTEM
> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
> bp=03A4, fl=0206, uds=0018, es=0020
> pc=0038:9AA1,  ksp=0388
> panic: general protection trap
> Thanks in advance.
> Larry Dighera

Welcome to the Intel 80286!  We Microport users know *exactly* what that
error message means.  In the Intel '286 reference manual there are several
types of faults: TSS, protection, etc.  It seems that the stack region for
any process is 64K (ie., one segment).  But so is the kernel stack!  When
the kernel stack pointer rolls over its segment it causes the above panic.
There is really no efficient means to correct this as kernel stack is used
very frequently and to have multiple stack segments would require nasty
segment loads, etc.  In fact, even the Microsoft huge model compilers don't
allow multiple stack segments.  However, the UNIX (Xenix) kernel is large
and each process that runs will occupy area on the kernel stack.  In addition,
interrupt handlers also use the kernel stack and can cause very large (albeit
short term) stack usage.  Bottom line is that there are many circumstances
of a kernel stack requiring more than 64K to live and no practical way for
the '286 architecture to provide it.  Now just wait until the naive DOS
losers get suckered into OS/2 with exactly the same '286 limitations and 
no 386 version in sight!  (This architecture issue is exactly why I would
never recommend less than a 32 bit address space for UNIX:  buy a '386.)

-- 
David F. Carlson, Micropen, Inc.
...!{ames|harvard|rutgers|topaz|...}!rochester!ur-valhalla!micropen!dave

"The faster I go, the behinder I get." --Lewis Carroll

jim@applix.UUCP (Jim Morton) (05/10/88)

In article <308@conexch.UUCP>, root@conexch.UUCP (Larry Dighera) writes:
> 
> Is the format of the panic message documented anywhere so that one
> can interpret them?  I realize that the third through fifth lines are
> 
> TRAP 000D in SYSTEM
> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
> bp=03A4, fl=0206, uds=0018, es=0020
> pc=0038:9AA1,  ksp=0388
> panic: general protection trap
> 

There have been a number of postings on this subject, so here's a little
help. First, do a "nm /xenix | sort >/tmp/foo". (If you don't have the
development system, this won't work - you need nm(CP)). Then, take the
PC address given in your panic message (In the above case, 38:9AA1) and
find the routine in the kernel (by looking at /tmp/foo) that is located
at the next lower address. This is the routine that crashed the kernel.
If you're lucky, the name of this routine will give you a clue as to why
the system crashed - if the routine was, for example, "_ttioctl" it would
point towards a problem doing an ioctl() call on a serial port line.

You have to use the first "pc=" value given, if there is a second one
printed it probably points to the trap routine itself. If you REALLY
want to hack further, you can find the /usr/sys/* module the routine
is located in and adb(CP) around in it to see what's going on. That way
the register values (ax= bx=) may show you why the routine crashed.
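
The register lines can also be picked apart mechanically.  What follows is
only a minimal Python sketch, assuming the dump layout shown in the quoted
message.  The helper names parse_panic and decode_selector are hypothetical,
and the selector split (index, table bit, RPL) is the standard 80286
protected-mode layout rather than anything Xenix-specific.

```python
import re

# Two exception vectors from the 80286 architecture (subset only).
TRAP_NAMES = {
    0x07: "processor extension not available",
    0x0D: "general protection",
}


def parse_panic(text):
    """Return (trap_number, registers) from a Xenix-style panic dump."""
    trap = int(re.search(r"TRAP\s+([0-9A-Fa-f]+)", text).group(1), 16)
    regs = {}
    # Plain registers look like "ax=5600"; pc looks like "pc=0038:9AA1".
    pattern = r"(\w+)=([0-9A-Fa-f]{4})(?::([0-9A-Fa-f]{4}))?"
    for name, first, second in re.findall(pattern, text):
        if second:
            regs[name] = (int(first, 16), int(second, 16))  # (selector, offset)
        else:
            regs[name] = int(first, 16)
    return trap, regs


def decode_selector(sel):
    """Split a protected-mode selector into index, table and RPL."""
    return {
        "index": sel >> 3,                       # descriptor table entry
        "table": "LDT" if sel & 4 else "GDT",    # bit 2 selects the table
        "rpl": sel & 3,                          # requested privilege level
    }


DUMP = """TRAP 000D in SYSTEM
ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
bp=03A4, fl=0206, uds=0018, es=0020
pc=0038:9AA1,  ksp=0388
panic: general protection trap
"""

trap, regs = parse_panic(DUMP)
print(hex(trap), TRAP_NAMES.get(trap, "unknown"))
print(decode_selector(regs["pc"][0]))
```

For the dump above this reports trap 0xd (general protection) and decodes
the pc selector 0038 as GDT index 7 at RPL 0.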

--
Jim Morton, APPLiX Inc., Westboro, MA
UUCP: ...harvard!m2c!applix!jim
      jim@applix.m2c.org

root@conexch.UUCP (Larry Dighera) (05/17/88)

In article <694@applix.UUCP> jim@applix.UUCP (Jim Morton) writes:
>In article <308@conexch.UUCP>, root@conexch.UUCP (Larry Dighera) writes:
>> 
>> Is the format of the panic message documented anywhere so that one
>> can interpret them?  I realize that the third through fifth lines are
>> register dumps.
>> 
>> TRAP 000D in SYSTEM
>> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
>> bp=03A4, fl=0206, uds=0018, es=0020
>> pc=0038:9AA1,  ksp=0388
>> panic: general protection trap
>>
[two panic messages deleted]
>
>First, do a "nm /xenix | sort >/tmp/foo". [...]  Then, take the
>PC address given in your panic message (In the above case, 38:9AA1) and
>find the routine in the kernel (by looking at /tmp/foo) that is located
>at the next lower address. This is the routine that crashed the kernel.
>If you're lucky, the name of this routine will give you a clue as to why
>the system crashed - if the routine was, for example, "_ttioctl" it would
>point towards a problem doing an ioctl() call on a serial port line.
>
>You have to use the first "pc=" value given, if there is a second one
>printed it probably points to the trap routine itself. If you REALLY
>want to hack further, you can find the /usr/sys/* module the routine
>is located in and adb(CP) around in it to see what's going on. That way
>the register values (ax= bx=) may show you why the routine crashed.
>

Jim: 

Thank you for sharing your technique for determining what routine the 
kernel was executing at the time of the panic.  Your posting is most 
helpful.  Here is the relevant output from: 
        nm -vp /xenix | grep '0038:9' | sort
(nm complained about too many symbols to sort without the -p)

0038:98a2  T _nbmap
0038:9914  T _notincore
0038:9947  T _exrd
0038:9a59  S FIO_TEXT           <-- Segment Name
0038:9a59  T _getf
0038:9a91  T _closef            << --- The routine that contains 38:9aa1
0038:9c62  T _isinfile          <<< --- The routine that crashed the kernel?
0038:9cc2  T _openi
0038:9d4c  T _access
0038:9e2b  T _owner
0038:9ed7  T _suser
0038:9ef5  T _ufalloc
0038:9f2c  T _falloc
0038:9faf  S SYS4_TEXT          <-- Segment Name
0038:9faf  T _gtime
0038:9fc3  T _ftime

Is my notation, to the right of the nm output, a correct interpretation of
the method you have described of locating the routine that crashed the
kernel?  
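
The "next lower address" rule can be checked mechanically.  This is a small
Python sketch, assuming nm prints "seg:off  type  name" as in the listing
above; load_symbols and routine_at are hypothetical helper names, and only
a few lines of the listing are repeated here.

```python
import bisect

# A few lines copied from the nm listing above (annotations removed).
NM_OUTPUT = """\
0038:9947  T _exrd
0038:9a59  S FIO_TEXT
0038:9a59  T _getf
0038:9a91  T _closef
0038:9c62  T _isinfile
0038:9cc2  T _openi
"""


def load_symbols(nm_text):
    """Parse 'seg:off  type  name' lines into a sorted list of tuples."""
    syms = []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[1] != "T":  # keep text (code) symbols only
            continue
        seg, off = (int(x, 16) for x in parts[0].split(":"))
        syms.append((seg, off, parts[2]))
    return sorted(syms)


def routine_at(syms, seg, off):
    """Name of the last symbol at or below seg:off in the same segment."""
    i = bisect.bisect_right(syms, (seg, off, "\uffff")) - 1
    if i < 0 or syms[i][0] != seg:
        return None
    return syms[i][2]


SYMS = load_symbols(NM_OUTPUT)
print(routine_at(SYMS, 0x0038, 0x9AA1))
```

With this listing, 0038:9aa1 resolves to _closef, since 9a91 <= 9aa1 < 9c62;
_isinfile is simply the next routine up.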

I'd examine the routine with adb, but adb's syntax is a little too obscure
for me at this point.  If my interpretation of your intent is correct,
would you agree that the kernel was doing file i/o at the time of this
panic?  (presuming so, I'll continue)

So, now that I have a method of locating the kernel routine that was
running at the time of the panic let's consider the causes of kernel
panics.

Quoting David F. Carlson's follow up message on this subject:
<Welcome to the Intel 80286!  We Microport users know *exactly* what that
<error message means.  In the Intel '286 reference manual there are several
<types of faults: TSS, protection, etc.  It seems that the stack region for
<any process is 64K (ie., one segment).  But so is the kernel stack!  When
<the kernel stack pointer rolls over its segment it causes the above panic.

Is this the only possible cause?

<There is really no efficient means to correct this as kernel stack is used
<very frequently and to have multiple stack segments would require nasty
<segment loads, etc.  In fact, even the Microsoft huge model compilers don't
<allow multiple stack segments.  However, the UNIX (Xenix) kernel is large
<and each process that runs will occupy area on the kernel stack.  In addition,
<interrupt handlers also use the kernel stack and can cause very large (albeit
<short term) stack usage.  Bottom line is that there are many circumstances
<of a kernel stack requiring more than 64K to live and no practical way for
<the '286 architecture to provide it.  Now just wait until the naive DOS
<losers get suckered into OS/2 with exactly the same '286 limitations and 
<no 386 version in sight!  (This architecture issue is exactly why I would
<never recommend less than a 32 bit address space for UNIX:  buy a '386.)

If this is true, the '286 kernel has an inherent limitation that can 
be a source of panics.  Failing hardware (RAM, CPU, ...), as well as
disruptive events (stray alpha particle, power glitch, ...) come to mind
as plausible causes too.  So, although I can tell what routine the kernel
was executing at the time of the panic, the complete origin is still 
somewhat obscure.

Fortunately, the panic messages usually end up in /lost+found after re-booting.
This is the result of fsck putting the /usr/adm/messages file there (named by
its inode number, of course).  Keeping a log of the PC value of each
panic should shed some more light on their cause.  If the PC
value is the same in all panics, I would suspect only one cause.  If the
PC values differ, I would suspect either multiple causes or 
random events.

In my case, of three panics documented, in two cases the PC value was
as above, and the last was different (the kernel was apparently
in the _siowrite routine of the AMI LAMB 8 port driver).  From this 
limited data I would infer that indeed there are multiple causes for
the panics I am experiencing.
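
Such a log is easy to tally.  A minimal Python sketch follows, using the
three pc values from the panics posted earlier in this thread; the function
name tally_by_pc is hypothetical.

```python
from collections import Counter

# (date, pc) pairs copied from the three panic messages in this thread.
PANICS = [
    ("Apr 8", "0038:9AA1"),
    ("Apr 28", "0038:9AA1"),
    ("May 2", "0030:A4C5"),
]


def tally_by_pc(panics):
    """Count how many panics were reported at each pc value."""
    return Counter(pc.upper() for _, pc in panics)


counts = tally_by_pc(PANICS)
for pc, n in counts.most_common():
    print(pc, n)
```

A pc that keeps recurring points at one repeatable cause; a long tail of
one-off values looks more like random events.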

Am I on the right track?  

Larry Dighera

-- 
USPS: The Consultants' Exchange, PO Box 12100, Santa Ana, CA  92712
TELE: (714) 842-6348: BBS (N81); (714) 842-5851: Xenix guest account (E71)
UUCP: conexch Any ACU 2400 17148425851 ogin:-""-ogin:-""-ogin: nuucp
UUCP: ...!uunet!turnkey!conexch!root || ...!trwrb!ucla-an!conexch!root

Eric_D_Davis@cup.portal.com (06/01/88)

Yes....But have you ever encountered a "Triple panic trap"??
I have (toot toot..) The lady at SCO asked me where I was touching the
computer..in a jovial sort of way...

Eric Davis
Mytec Inc.

jfh@rpp386.UUCP (John F. Haugh II) (06/04/88)

In article <6132@cup.portal.com> Eric_D_Davis@cup.portal.com writes:
>Yes....But have you ever encountered a "Triple panic trap"??
>I have (toot toot..) The lady at SCO asked me where I was touching the
>computer..in a jovial sort of way...
>--
>Eric Davis

i went off and did a strings on /xenix.  i didn't find anything labeled
`[Tt]riple' anything, and here are the only `trap' things:

Reschedule trap
DNA trap in kernel mode

now, what i want to know is what is a DNA trap?  does this mean the
machine has been genetically mutated?

- john.

chapman@sco.COM (brian chapman) (06/15/88)

In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
< In article <6132@cup.portal.com> Eric_D_Davis@cup.portal.com writes:
< >
< >The lady at SCO asked me where I was touching the computer
< 
< i went off and did a strings on /xenix.  i didn't find anything labeled
< `[Tt]riple' anything, and here are the only `trap' things:
<
< Reschedule trap
< DNA trap in kernel mode
< 
< now, what i want to know is what is a DNA trap?  does this mean the
< machine has been genetically mutated?

On the chance that this is a serious question I will attempt
to answer it.
DNA is Device Not Available (floating point device, that is)
Meaning the kernel is executing *87 instructions.  Or trying to.
-- 
Brian Chapman	 uunet!sco!chapman

haugj@pigs.UUCP (The Beach Bum) (06/18/88)

In article <446@sysco>, chapman@sco.COM (brian chapman) writes:
] In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
] [ DNA trap in kernel mode
] [ 
] [ now, what i want to know is what is a DNA trap?  does this mean the
] [ machine has been genetically mutated?
] 
] On the chance that this is a serious question I will attempt
] to answer it.
] DNA is Device Not Available (floating point device, that is)
] Meaning the kernel is executing *87 instructions.  Or trying to.
] -- 
] Brian Chapman	 uunet!sco!chapman

OK - i thought it was a typo version of DMA trap.  now that i know
its True Meaning(tm), how come i don't see it when i am executing
80387 instructions?  i mean, i use floating point code, but never
see the message.  does this only apply if the kernel thought there
was an 80387 and then later found out it had disappeared?  you've
made me curious now!

- john.
-- 
 The Beach Bum                                 Big "D" Home for Wayward Hackers
 UUCP: ...!killer!rpp386!jfh                          jfh@rpp386.uucp :SMAILERS

 "You are in a twisty little maze of UUCP connections, all alike" -- fortune

spectrix!clewis (Chris Lewis) (06/24/88)

In article <170@pigs.UUCP> haugj@pigs.UUCP (The Beach Bum) writes:
|In article <446@sysco>, chapman@sco.COM (brian chapman) writes:
|] In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
|] [ DNA trap in kernel mode
|] [ 
|] [ now, what i want to know is what is a DNA trap?  does this mean the
|] [ machine has been genetically mutated?
|] 
|] On the chance that this is a serious question I will attempt
|] to answer it.
|] DNA is Device Not Available (floating point device, that is)
|] Meaning the kernel is executing *87 instructions.  Or trying to.
|] -- 
|] Brian Chapman	 uunet!sco!chapman
|
|OK - i thought it was a typo version of DMA trap.  now that i know
|its True Meaning(tm), how come i don't see it when i am executing
|80387 instructions?  i mean, i use floating point code, but never
|see the message.  does this only apply if the kernel thought there
|was an 80387 and then later found out it had disappeared?  you've
|made me curious now!

My suspicion is that "DNA trap in kernel mode" would only occur
if the kernel itself tried to issue a FPU (or some other co-processor)
instruction when no FPU (or other co-processor) was present.
Though it is possible that this would occur if the ROM BIOS was
so stupid as to tell the kernel during boot that the FPU was present,
but it wasn't.

Usually, UNIX kernels are designed to not have FPU instructions of their
own.  During system boot, the 386 kernel (this applies to most kernels
all the way back to PDP-11 V7 or V6) determines whether the FP hardware
is present.  If the FP hardware is not present, then the kernel makes
arrangements so that if a user tries to execute a FPU instruction, the
kernel will catch the exception, emulate the instruction in *software*,
and then continue the user's code.  So, the user is never really aware 
whether there's a true FPU present except that software emulation is slower.
Therefore, the compiler *always* emits true FPU instructions.

[Just consider all the grief that wouldn't happen if DOS did the emulation
of nonexistent instructions in software too!]

However, if the kernel does detect that an FP instruction was issued
from itself, obviously something is wrong (the kernel should never
depend on optional co-processors) and it panics.  Sort of like having
a page fault while in kernel mode (unless your kernel can page itself
that is).
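
The policy described above can be written down as a toy decision table.
This Python sketch is only a model of the description in this posting, not
kernel code.  The function name fp_instruction_outcome is hypothetical, and
the case of a kernel FP instruction with an FPU present is an assumption,
since the text does not cover it.

```python
def fp_instruction_outcome(in_kernel_mode, fpu_present):
    """Toy model: what happens when a floating point instruction is issued."""
    if fpu_present:
        # Assumption: with real hardware present no trap is raised at all.
        return "hardware"
    if in_kernel_mode:
        # No emulator is arranged for the kernel's own code.
        return "panic: DNA trap in kernel mode"
    # User code: the kernel catches the exception and emulates in software.
    return "emulate"


for kernel in (False, True):
    for fpu in (False, True):
        print(kernel, fpu, fp_instruction_outcome(kernel, fpu))
```

A BIOS that wrongly reports an FPU would correspond to the boot code
believing fpu_present is true while the hardware behaves as if it is false,
which this simple model does not try to capture.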

ISC 386/ix used to (and I suspect Microport and Bell Tech still do)
crash if the ROM BIOS was so stupid as to tell the kernel during boot
that the FPU was present but it really wasn't (which was a bug in
several earlier 386 ROM BIOSes).

Not all systems work this way (eg: Spectrix programs are linked with
different libraries depending on whether a real hardware FPU is present),
but 386 UNIXes do.
-- 
Chris Lewis, Spectrix Microsystems Inc, Phone: (416)-474-1955
UUCP: {uunet!mnetor, utcsri!utzoo, lsuc, yunexus}!spectrix!clewis
Moderator of the Ferret Mailing List (ferret-list,ferret-request@spectrix)

allbery@ncoast.UUCP (Brandon S. Allbery) (06/26/88)

As quoted from <170@pigs.UUCP> by haugj@pigs.UUCP (The Beach Bum):
+---------------
| In article <446@sysco>, chapman@sco.COM (brian chapman) writes:
| ] In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
| ] [ now, what i want to know is what is a DNA trap?  does this mean the
| ] [ machine has been genetically mutated?
| ] 
| ] DNA is Device Not Available (floating point device, that is)
| ] Meaning the kernel is executing *87 instructions.  Or trying to.
| 
| its True Meaning(tm), how come i don't see it when i am executing
| 80387 instructions?  i mean, i use floating point code, but never
+---------------

Here's how it works on the 386 boxes I've used:

The user-mode trap table is set up so that if an 80387 instruction is seen
and the 80387 is not installed, the instruction traps to an emulator.
However, there is no such arrangement in the kernel -- so if the kernel
tries to run an 80387 instruction and there's no 80387 chip the system panics.
This presumably is because the emulator can't easily be made to work in
kernel mode; also, the kernel shouldn't need to do floating point unless
some kind of special device driver is used -- in which case you probably
want the 80387 anyway, for speed.  (The system'd get pretty slow if the
kernel were spending large amounts of its time in the emulator called from
kernel mode.)
-- 
Brandon S. Allbery			  | "Given its constituency, the only
uunet!marque,sun!mandrill}!ncoast!allbery | thing I expect to be "open" about
Delphi: ALLBERY	       MCI Mail: BALLBERY | [the Open Software Foundation] is
comp.sources.misc: ncoast!sources-misc    | its mouth."  --John Gilmore