root@conexch.UUCP (Larry Dighera) (05/04/88)
Does anyone have any idea what is causing these panics? Is the format of the panic message documented anywhere so that one can interpret them? I realize that the third through fifth lines are register dumps of the 80286, and the last line apparently refers to an out of bounds address (?). But, how does one glean any meaning from this? Is the cause hardware or software related? Should I be looking at the accounting file to see what was running at the time of the panic?

Fri Apr 8 13:41:40
TRAP 000D in SYSTEM
ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
bp=03A4, fl=0206, uds=0018, es=0020
pc=0038:9AA1, ksp=0388
panic: general protection trap

Thu Apr 28 17:32:11
TRAP 000D in SYSTEM
ax=1B00, bx=0098, cx=0000, dx=0000, si=0026, di=6726
bp=03A4, fl=0206, uds=0018, es=000C
pc=0038:9AA1, ksp=0388
panic: general protection trap

Mon May 2 9:56:00
TRAP 000D in SYSTEM
ax=0005, bx=BD7A, cx=0000, dx=00A1, si=0002, di=0594
bp=039A, fl=0202, uds=0018, es=004F
pc=0030:A4C5, ksp=0376
panic: general protection trap

System vitals:
SCO Xenix 2.2.0
Atronics AT compatible running at 10 MHz one wait state
1 MB RAM on system board
3 MB RAM on American Micronics "Elephant" board
American Micronics 8 port "LAMB" card
Archive Tape system
Monochrome/graphics card
WD HD/Floppy controller
Maxtor XT1085 & XT1140 Hard drives
Teac 1.2 MB and 360 K floppies
200 W. power supply

Please respond via E-Mail to: cnexch!root
I will post a summary of useful responses.

Thanks in advance.
Larry Dighera
--
USPS: The Consultants' Exchange, PO Box 12100, Santa Ana, CA 92712
TELE: (714) 842-6348: BBS (N81); (714) 842-5851: Xenix guest account (E71)
UUCP: conexch Any ACU 2400 17148425851 ogin:-""-ogin:-""-ogin: nuucp
UUCP: ...!ucbvax!ucivax!icnvax!conexch!root || ...!trwrb!ucla-an!conexch!root
dave@micropen (David F. Carlson) (05/06/88)
In article <308@conexch.UUCP>, root@conexch.UUCP (Larry Dighera) writes:
>
> Does anyone have any idea what is causing these panics?
> Is the format of the panic message documented anywhere so that one
> can interpret them? I realize that the third through fifth lines are
> register dumps of the 80286, and the last line apparently refers to
> an out of bounds address (?). But, how does one glean any meaning
> from this? Is the cause hardware or software related? Should I be
> looking at the accounting file to see what was running at the time of
> the panic?
>
> Fri Apr 8 13:41:40
> TRAP 000D in SYSTEM
> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
> bp=03A4, fl=0206, uds=0018, es=0020
> pc=0038:9AA1, ksp=0388
> panic: general protection trap
> Thanks in advance.
> Larry Dighera

Welcome to the Intel 80286! We Microport users know *exactly* what that error message means. In the Intel '286 reference manual there are several types of faults: TSS, protection, etc. It seems that the stack region for any process is 64K (i.e., one segment). But so is the kernel stack! When the kernel stack pointer rolls over its segment it causes the above panic.

There is really no efficient means to correct this, as the kernel stack is used very frequently and to have multiple stack segments would require nasty segment loads, etc. In fact, even the Microsoft huge model compilers don't allow multiple stack segments. However, the UNIX (Xenix) kernel is large and each process that runs will occupy area on the kernel stack. In addition, interrupt handlers also use the kernel stack and can cause very large (albeit short term) stack usage. Bottom line is that there are many circumstances of a kernel stack requiring more than 64K to live and no practical way for the '286 architecture to provide it.

Now just wait until the naive DOS losers get suckered into OS/2 with exactly the same '286 limitations and no 386 version in sight! (This architecture issue is exactly why I would never recommend less than a 32 bit address space for UNIX: buy a '386.)
--
David F. Carlson, Micropen, Inc.
...!{ames|harvard|rutgers|topaz|...}!rochester!ur-valhalla!micropen!dave
"The faster I go, the behinder I get." --Lewis Carroll
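For readers trying to decode the first line and the segment registers of a dump like the one quoted above, here is a minimal Python sketch. The vector names follow the 80286 protected-mode exception numbering, and the selector layout (descriptor index in the upper 13 bits, table indicator in bit 2, requested privilege level in bits 0-1) is the standard protected-mode format. The function names are made up for illustration, and the meaning Xenix attaches to any particular descriptor slot is not known from this thread.

```python
# Exception vectors defined for the 80286 in protected mode.
I286_VECTORS = {
    0x00: "divide error",
    0x01: "single step",
    0x02: "NMI",
    0x03: "breakpoint",
    0x04: "INTO overflow",
    0x05: "BOUND range exceeded",
    0x06: "invalid opcode",
    0x07: "processor extension not available",
    0x08: "double fault",
    0x09: "processor extension segment overrun",
    0x0A: "invalid TSS",
    0x0B: "segment not present",
    0x0C: "stack fault",
    0x0D: "general protection",
}


def describe_trap(vector):
    """Return the architectural name of a trap vector (hypothetical helper)."""
    return I286_VECTORS.get(vector, "unknown/reserved")


def decode_selector(selector):
    """Split a protected-mode selector into (index, table, rpl)."""
    index = selector >> 3                       # descriptor table slot
    table = "LDT" if selector & 0x4 else "GDT"  # table indicator bit
    rpl = selector & 0x3                        # requested privilege level
    return index, table, rpl


if __name__ == "__main__":
    print("TRAP 000D:", describe_trap(0x000D))
    # Selector values taken from the three panics posted in this thread.
    for name, value in [("pc", 0x0038), ("pc", 0x0030), ("uds", 0x0018),
                        ("es", 0x0020), ("es", 0x000C), ("es", 0x004F)]:
        print(name, "%04X" % value, decode_selector(value))
```

Note that the table lists a stack fault (0x0C) separately from general protection (0x0D); every panic posted in this thread reports 0x0D.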
jim@applix.UUCP (Jim Morton) (05/10/88)
In article <308@conexch.UUCP>, root@conexch.UUCP (Larry Dighera) writes:
>
> Is the format of the panic message documented anywhere so that one
> can interpret them? I realize that the third through fifth lines are
>
> TRAP 000D in SYSTEM
> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
> bp=03A4, fl=0206, uds=0018, es=0020
> pc=0038:9AA1, ksp=0388
> panic: general protection trap
>

There have been a number of postings on this subject, so here's a little help.

First, do a "nm /xenix | sort >/tmp/foo". (If you don't have the development system, this won't work - you need nm(CP).) Then, take the PC address given in your panic message (in the above case, 38:9AA1) and find the routine in the kernel (by looking at /tmp/foo) that is located at the next lower address. This is the routine that crashed the kernel. If you're lucky, the name of this routine will give you a clue as to why the system crashed - if the routine was, for example, "_ttioctl" it would point towards a problem doing an ioctl() call on a serial port line.

You have to use the first "pc=" value given; if there is a second one printed, it probably points to the trap routine itself. If you REALLY want to hack further, you can find the /usr/sys/* module the routine is located in and adb(CP) around in it to see what's going on. That way the register values (ax= bx=) may show you why the routine crashed.
--
Jim Morton, APPLiX Inc., Westboro, MA
UUCP: ...harvard!m2c!applix!jim jim@applix.m2c.org
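The "next lower address" lookup described above can be sketched in a few lines of Python. This is only an illustration: the sample symbol lines are copied from the nm output Larry Dighera posts in his follow-up in this thread, the `segment:offset type name` layout is assumed from that listing, and the function names are hypothetical.

```python
import bisect

# A few lines in the "seg:off type name" layout, copied from the nm output
# posted later in this thread.
NM_SAMPLE = """\
0038:9a59 S FIO_TEXT
0038:9a59 T _getf
0038:9a91 T _closef
0038:9c62 T _isinfile
0038:9cc2 T _openi
"""


def parse_addr(text):
    """Turn 'ssss:oooo' into a (segment, offset) tuple of integers."""
    segment, offset = text.split(":")
    return int(segment, 16), int(offset, 16)


def load_symbols(nm_text):
    """Collect (address, name) pairs for text symbols, sorted by address."""
    symbols = []
    for line in nm_text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[1] != "T":
            continue  # skip segment names and anything that is not code
        symbols.append((parse_addr(fields[0]), fields[2]))
    symbols.sort()
    return symbols


def containing_routine(symbols, pc):
    """Return the symbol at the next lower (or equal) address in pc's segment."""
    target = parse_addr(pc)
    addresses = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addresses, target) - 1
    if i < 0 or addresses[i][0] != target[0]:
        return None  # nothing at or below pc in that segment
    return symbols[i][1]


if __name__ == "__main__":
    table = load_symbols(NM_SAMPLE)
    print(containing_routine(table, "0038:9AA1"))  # prints _closef
```

With the sample table, 0038:9AA1 falls between `_closef` (9a91) and `_isinfile` (9c62), so the routine at the next lower address is `_closef`.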
root@conexch.UUCP (Larry Dighera) (05/17/88)
In article <694@applix.UUCP> jim@applix.UUCP (Jim Morton) writes:
>In article <308@conexch.UUCP>, root@conexch.UUCP (Larry Dighera) writes:
>>
>> Is the format of the panic message documented anywhere so that one
>> can interpret them? I realize that the third through fifth lines are
>> register dumps.
>>
>> TRAP 000D in SYSTEM
>> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
>> bp=03A4, fl=0206, uds=0018, es=0020
>> pc=0038:9AA1, ksp=0388
>> panic: general protection trap
>> [two panic messages deleted]
>
>First, do a "nm /xenix | sort >/tmp/foo". [...] Then, take the
>PC address given in your panic message (In the above case, 38:9AA1) and
>find the routine in the kernel (by looking at /tmp/foo) that is located
>at the next lower address. This is the routine that crashed the kernel.
>If you're lucky, the name of this routine will give you a clue as to why
>the system crashed - if the routine was, for example, "_ttioctl" it would
>point towards a problem doing an ioctl() call on a serial port line.
>
>You have to use the first "pc=" value given, if there is a second one
>printed it probably points to the trap routine itself. If you REALLY
>want to hack further, you can find the /usr/sys/* module the routine
>is located in and adb(CP) around in it to see what's going on. That way
>the register values (ax= bx=) may show you why the routine crashed.
>

Jim:

Thank you for sharing your technique for determining what routine the kernel was executing at the time of the panic. Your posting is most helpful.

Here is the relevant output from:

nm -vp /xenix | grep '0038:9' | sort

(nm complained about too many symbols to sort without the -p)

0038:98a2 T _nbmap
0038:9914 T _notincore
0038:9947 T _exrd
0038:9a59 S FIO_TEXT <-- Segment Name
0038:9a59 T _getf
0038:9a91 T _closef << --- The routine that contains 38:9aa1
0038:9c62 T _isinfile <<< --- The routine that crashed the kernel?
0038:9cc2 T _openi
0038:9d4c T _access
0038:9e2b T _owner
0038:9ed7 T _suser
0038:9ef5 T _ufalloc
0038:9f2c T _falloc
0038:9faf S SYS4_TEXT <-- Segment Name
0038:9faf T _gtime
0038:9fc3 T _ftime

Is my notation, to the right of the nm output, a correct interpretation of the method you have described of locating the routine that crashed the kernel? I'd examine the routine with adb, but adb's syntax is a little too obscure for me at this point.

If my interpretation of your intent is correct, would you agree that the kernel was doing file i/o at the time of this panic? (presuming so, I'll continue)

So, now that I have a method of locating the kernel routine that was running at the time of the panic, let's consider the causes of kernel panics. Quoting David F. Carlson's follow up message on this subject:

<Welcome to the Intel 80286! We Microport users know *exactly* what that
<error message means. In the Intel '286 reference manual their are several
<types of faults: TSS, protection, etc. It seems that the stack region for
<any process is 64K (ie., one segment). But so is the kernel stack! When
<the kernel stack pointer rolls over its segment it causes the above panic.

Is this the only possible cause?

<There is really no efficient means to correct this as kernel stack is used
<very frequently and to have multiple stack segments would require nasty
<segment loads, etc. In fact, even the Microsoft huge model compilers don't
<allow multiple stack segments. However, the UNIX (Xenix) kernel is large
<and each process that runs will occupy area on the kernel stack. In addition,
<interrupt handlers also use the kernel stack and can cause very large (albeit
<short term) stack usage. Bottom line is that there are many circumstances
<of a kernel stack requiring more than 64K to live and no practical way for
<the '286 architecture to provide it. Now just wait until the naive DOS
<losers get suckered into OS/2 with exactly the same '286 limitations and
<no 386 version in sight! (This architecture issue is exactly why I would
<never recommend less than a 32 bit address space for UNIX: buy a '386.)

If this is true, the '286 kernel has an inherent limitation that can be a source of panics. Failing hardware (RAM, CPU, ...), as well as disruptive events (stray alpha particle, power glitch, ...) come to mind as plausible causes too. So, although I can tell what routine the kernel was executing at the time of the panic, the complete origin is still somewhat obscure.

Fortunately, the panic messages usually end up in /lost+found after re-booting. This is the result of fsck putting the /usr/adm/messages file there (named by its inode number, of course). Keeping a log of the PC value of each panic should shed some more light on their cause. If the PC value is the same in all panics, I would suspect only one cause. If the PC value is inconsistent, I would suspect either multiple causes or random events.

In my case, of three panics documented, in two cases the PC value was as above, and the last was different (the kernel was apparently in the _siowrite routine of the AMI LAMB 8 port driver). From this limited data I would infer that indeed there are multiple causes for the panics I am experiencing.

Am I on the right track?

Larry Dighera
--
USPS: The Consultants' Exchange, PO Box 12100, Santa Ana, CA 92712
TELE: (714) 842-6348: BBS (N81); (714) 842-5851: Xenix guest account (E71)
UUCP: conexch Any ACU 2400 17148425851 ogin:-""-ogin:-""-ogin: nuucp
UUCP: ...!uunet!turnkey!conexch!root || ...!trwrb!ucla-an!conexch!root
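The bookkeeping Larry describes, logging the PC of each panic and checking whether it repeats, might be sketched as follows. The log text is the three panics from the first post of this thread; the function names are hypothetical, and splitting the log on lines that start with "TRAP" is an assumption based on those three dumps only. Following Jim Morton's advice, only the first "pc=" value of each panic is counted.

```python
import re
from collections import Counter

# The three panics from the original posting, as they would appear in a log.
PANIC_LOG = """\
Fri Apr 8 13:41:40
TRAP 000D in SYSTEM
ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
bp=03A4, fl=0206, uds=0018, es=0020
pc=0038:9AA1, ksp=0388
panic: general protection trap
Thu Apr 28 17:32:11
TRAP 000D in SYSTEM
ax=1B00, bx=0098, cx=0000, dx=0000, si=0026, di=6726
bp=03A4, fl=0206, uds=0018, es=000C
pc=0038:9AA1, ksp=0388
panic: general protection trap
Mon May 2 9:56:00
TRAP 000D in SYSTEM
ax=0005, bx=BD7A, cx=0000, dx=00A1, si=0002, di=0594
bp=039A, fl=0202, uds=0018, es=004F
pc=0030:A4C5, ksp=0376
panic: general protection trap
"""

PC_RE = re.compile(r"pc=([0-9A-Fa-f]{4}:[0-9A-Fa-f]{4})")


def first_pcs(log_text):
    """Return the first pc= value of every panic found in the log."""
    pcs = []
    # Everything before the first "TRAP" line is not part of a dump.
    for chunk in re.split(r"^TRAP ", log_text, flags=re.M)[1:]:
        match = PC_RE.search(chunk)
        if match:
            pcs.append(match.group(1).upper())
    return pcs


def tally_pcs(log_text):
    """Count how often each faulting PC occurs across all panics."""
    return Counter(first_pcs(log_text))


if __name__ == "__main__":
    for pc, count in tally_pcs(PANIC_LOG).most_common():
        print(pc, count)
```

For the three posted panics this reports 0038:9AA1 twice and 0030:A4C5 once, which matches the two-versus-one split described above.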
Eric_D_Davis@cup.portal.com (06/01/88)
Yes....But have you ever encountered a "Triple panic trap"?? I have (toot toot..) The lady at SCO asked me where I was touching the computer..in a jovial sort of way... Eric Davis Mytec Inc.
jfh@rpp386.UUCP (John F. Haugh II) (06/04/88)
In article <6132@cup.portal.com> Eric_D_Davis@cup.portal.com writes:
>Yes....But have you ever encountered a "Triple panic trap"??
>I have (toot toot..) The lady at SCO asked me where I was touching the
>computer..in a jovial sort of way...
>--
>Eric Davis

i went off and did a strings on /xenix. i didn't find anything labeled `[Tt]riple' anything, and here are the only `trap' things:

Reschedule trap
DNA trap in kernel mode

now, what i want to know is what is a DNA trap? does this mean the machine has been genetically mutated?

- john.
chapman@sco.COM (brian chapman) (06/15/88)
In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
< In article <6132@cup.portal.com> Eric_D_Davis@cup.portal.com writes:
< >
< >The lady at SCO asked me where I was touching the computer
<
< i went off and did a strings on /xenix. i didn't find anything labeled
< `[Tt]riple' anything, and here are the only `trap' things:
<
< Reschedule trap
< DNA trap in kernel mode
<
< now, what i want to know is what is a DNA trap? does this mean the
< machine has been genetically mutated?

On the chance that this is a serious question I will attempt to answer it.

DNA is Device Not Available (floating point device, that is). Meaning the kernel is executing *87 instructions. Or trying to.
--
Brian Chapman uunet!sco!chapman
haugj@pigs.UUCP (The Beach Bum) (06/18/88)
In article <446@sysco>, chapman@sco.COM (brian chapman) writes:
] In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
] [ DNA trap in kernel mode
] [
] [ now, what i want to know is what is a DNA trap? does this mean the
] [ machine has been genetically mutated?
]
] On the chance that this is a serious question I will attempt
] to answer it.
] DNA is Device Not Availible (floating point device, that is)
] Meaning the kernel is executing *87 instructions. Or trying to.
] --
] Brian Chapman uunet!sco!chapman

OK - i thought it was a typo version of DMA trap. now that i know its True Meaning(tm), how come i don't see it when i am executing 80387 instructions? i mean, i use floating point code, but never see the message. does this only apply if the kernel thought there was an 80387 and then later found out it had disappeared? you've made me curious now!

- john.
--
The Beach Bum                          Big "D" Home for Wayward Hackers
UUCP: ...!killer!rpp386!jfh            jfh@rpp386.uucp :SMAILERS
"You are in a twisty little maze of UUCP connections, all alike" -- fortune
clewis@spectrix (Chris Lewis) (06/24/88)
In article <170@pigs.UUCP> haugj@pigs.UUCP (The Beach Bum) writes:
|In article <446@sysco>, chapman@sco.COM (brian chapman) writes:
|] In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
|] [ DNA trap in kernel mode
|] [
|] [ now, what i want to know is what is a DNA trap? does this mean the
|] [ machine has been genetically mutated?
|]
|] On the chance that this is a serious question I will attempt
|] to answer it.
|] DNA is Device Not Availible (floating point device, that is)
|] Meaning the kernel is executing *87 instructions. Or trying to.
|] --
|] Brian Chapman uunet!sco!chapman
|
|OK - i thought it was a typo version of DMA trap. now that i know
|its True Meaning(tm), how come i don't see it when i am executing
|80387 instructions? i mean, i use floating point code, but never
|see the message. does this only apply if the kernel thought there
|was an 80387 and then later found out it had disappeared? you've
|made me curious now!

My suspicion is that "DNA trap in kernel mode" would only occur if the kernel itself tried to issue an FPU (or some other co-processor) instruction when no FPU (or other co-processor) was present. It is also possible that this would occur if the ROM BIOS was so stupid as to tell the kernel during boot that the FPU was present when it wasn't.

Usually, UNIX kernels are designed to not have FPU instructions of their own. During system boot, the 386 kernel (this applies to most kernels all the way back to PDP-11 V7 or V6) determines whether the FP hardware is present. If the FP hardware is not present, then the kernel makes arrangements so that if a user tries to execute an FPU instruction, the kernel will catch the exception, emulate the instruction in *software*, and then continue the user's code. So, the user is never really aware whether there's a true FPU present except that software emulation is slower. Therefore, the compiler *always* emits true FPU instructions. [Just consider all the grief that wouldn't happen if DOS did the emulation of nonexistent instructions in software too!]

However, if the kernel does detect that an FP instruction was issued from itself, obviously something is wrong (the kernel should never depend on optional co-processors) and it panics. Sort of like having a page fault while in kernel mode (unless your kernel can page itself, that is).

ISC 386/ix used to (and I suspect Microport and Bell Tech still do) crash if the ROM BIOS was so stupid as to tell the kernel during boot that the FPU was present but it really wasn't (which was a bug in several earlier 386 ROM BIOSes).

Not all systems work this way (e.g., Spectrix programs are linked with different libraries depending on whether a real hardware FPU is present), but 386 UNIXes do.
--
Chris Lewis, Spectrix Microsystems Inc, Phone: (416)-474-1955
UUCP: {uunet!mnetor, utcsri!utzoo, lsuc, yunexus}!spectrix!clewis
Moderator of the Ferret Mailing List (ferret-list,ferret-request@spectrix)
allbery@ncoast.UUCP (Brandon S. Allbery) (06/26/88)
As quoted from <170@pigs.UUCP> by haugj@pigs.UUCP (The Beach Bum):
+---------------
| In article <446@sysco>, chapman@sco.COM (brian chapman) writes:
| ] In article <2375@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
| ] [ now, what i want to know is what is a DNA trap? does this mean the
| ] [ machine has been genetically mutated?
| ]
| ] DNA is Device Not Availible (floating point device, that is)
| ] Meaning the kernel is executing *87 instructions. Or trying to.
|
| its True Meaning(tm), how come i don't see it when i am executing
| 80387 instructions? i mean, i use floating point code, but never
+---------------

Here's how it works on the 386 boxes I've used:

The user-mode trap table is set up so that if an 80387 instruction is seen and the 80387 is not installed, the instruction traps to an emulator. However, there is no such arrangement in the kernel -- so if the kernel tries to run an 80387 instruction and there's no 80387 chip, the system panics. This presumably is because the emulator can't easily be made to work in kernel mode; also, the kernel shouldn't need to do floating point unless some kind of special device driver is used -- in which case you probably want the 80387 anyway, for speed. (The system'd get pretty slow if the kernel were spending large amounts of its time in the emulator called from kernel mode.)
--
Brandon S. Allbery                          | "Given its constituency, the only
{uunet!marque,sun!mandrill}!ncoast!allbery  | thing I expect to be "open" about
Delphi: ALLBERY MCI Mail: BALLBERY          | [the Open Software Foundation] is
comp.sources.misc: ncoast!sources-misc      | its mouth." --John Gilmore
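The policy described in the last two posts (emulate for user code, panic for kernel code) can be modelled in a few lines of Python. This is a toy model only, not kernel code: `KernelPanic`, `handle_dna_trap` and `fake_emulator` are invented names, and real kernels make this decision in their trap handlers from the saved machine state rather than from a boolean argument.

```python
class KernelPanic(Exception):
    """Stands in for the kernel's panic() call in this toy model."""


def handle_dna_trap(in_kernel_mode, emulate):
    """Model what happens when an FPU instruction traps with no FPU present.

    in_kernel_mode -- True if the faulting instruction was in the kernel
    emulate        -- callable standing in for the software FP emulator
    """
    if in_kernel_mode:
        # No emulator is wired up for kernel code, so the kernel gives up.
        raise KernelPanic("DNA trap in kernel mode")
    # User code: emulate the instruction and let the process carry on.
    return emulate()


def fake_emulator():
    """Placeholder for the real floating point emulator (hypothetical)."""
    return "emulated in software"


if __name__ == "__main__":
    print(handle_dna_trap(False, fake_emulator))
    try:
        handle_dna_trap(True, fake_emulator)
    except KernelPanic as exc:
        print("panic:", exc)
```

This is why ordinary floating point programs never produce the message: their traps take the first branch, and only a trap raised from kernel code reaches the panic.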