mowgli@sioux.cis.ohio-state.edu (Mowgli Assor) (02/28/90)
Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
panics (4 in 5 days). We have not changed anything for weeks! So, can someone
explain a few things to me? First off, what does each of the various codes
below stand for? (I would presume PC is program counter, KSP is some form of
stack pointer, but as for the rest - ?) I do not know much about the Intel
architecture, as far as what specific registers do & what the various traps
stand for. Any information here would be appreciated.

The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
Digiport smart modem boards. Thursday, we got the following error messages:

Trap 0000000E in SYSTEM         error = 00000000
eax = F000272A  ebx = 00000006  ecx = 00000000  edx = 000000A1
esi = 0000C420  edi = F000272A  ebp = 060006F8  fl  = 00010282
uds = 00000018  es  = 00000018  fs  = 0000003F  gs  = 0600003F
pc  = 00000020:0001D40D         ksp = 060006D0
Panic - Non-recoverable Kernal Page Fault

Trap 0000000E in SYSTEM         error = 00000000
eax = F000272A  ebx = 00000006  ecx = 00000000  edx = 000000A1
esi = 0000C420  edi = F000272A  ebp = 060006DC  fl  = 00010282
uds = 06090018  es  = 06000018  fs  = 0000003F  gs  = 0000003F
pc  = 00000020:0001D40D         ksp = 060006B4
Panic - Non-recoverable Kernal Page Fault

Then, today (Tuesday) when I arrived at work, I again found a kernel page
fault waiting for me. All values were the same as in the error messages
listed above, except for the following:

esi = 0000B700  ebp = 06000404  gs  = 0001003F  ksp = 060003DC

Now, the fact that it seems to die with the PC in the same place each time
makes me very suspicious. Of course, it is likely that only SCO can tell me
where the OS is dying (as far as what program causes it).

Has anyone else ever had a problem remotely like this? As I said, we have not
changed any of our hardware or software setup within the last month, & yet
this only started happening last Thursday. Any help would be greatly
appreciated!

Thanks,
<Mowgli>

PS. I RTFM already. It had very little of use to say on the subject.
-=-
Address: mowgli@puffer.cis.ohio-state.edu (Mowgli Assor in real life)
     Or: mowgli@cis.ohio-state.edu, mowgli@osu-20.ircc.ohio-state.edu
The 2 precepts of Semi-Divinity: (1) Mind Thine Own Business.
                                 (2) Don't Worry About It.
brad@microm.UUCP (Bradley W. Fisher) (03/01/90)
In article <77688@tut.cis.ohio-state.edu>, mowgli@sioux.cis.ohio-state.edu (Mowgli Assor) writes:
> The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
> Digiport smart modem boards. Thursday, we got the following error messages:
>
> Trap 0000000E in SYSTEM error = 00000000
<registers deleted>
> pc = 00000020:0001D40D
> ksp = 060006D0

This part of a panic trap is described (and I have to use that word loosely)
in messages(M). The number in "Trap 0000000E in SYSTEM" is the trap number
given by the processor, in this case 0x0E (hex). According to Intel's data
specs on the 386, that is exception 14 (decimal), which is a page fault.
Therefore, from XENIX you get ...

> Panic - Non-recoverable Kernal Page Fault
>
> Now, the fact that it seems to die with the PC in the same place each time
> makes me very suspicious. Of course, it is likely that only SCO can tell me
> where the OS is dying (as far as what program causes it).

I must agree, SCO should include more info on panic traps with the Run Time
System ... in fact there is more info included under messages(M) under 2.3.2
and even more about the specific errors in the appendices of the Development
System:

    panic:non-recoverable kernel page fault
    (The system could not process a page fault.)

Doesn't tell you much, eh? I'm no guru (yet) but a page fault has to do with
virtual memory processing. According to Intel's data manual on the 386
processor, many things can cause a page fault. When one occurs, the processor
"traps" to the operating system, which is supposed to swap the page of memory
back in from the disk. If this process succeeds you'll not see an error, but
if it doesn't ... well, you know that part. See 8.6.1 thru 8.6.3 of the
Operations Guide for more info on when it ain't broke, and how to improve
page swapping.

I know this still doesn't explain WHY it is happening in the first place;
the problem could be either hardware or software related.
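As a rough illustration of the decoding described above, here is a minimal
Python sketch. The exception numbers and the meaning of the three low
error-code bits follow Intel's 386 documentation. The function names are made
up for this example, and it is an assumption (not confirmed anywhere in this
thread) that the "error =" field XENIX prints is the raw processor error code.

```python
# Sketch only: decode the "Trap" and "error" fields of a panic message.
# Assumption: "error" is the raw 386 page-fault error code.

# A few 386 exception vectors (Intel numbering); deliberately not complete.
EXCEPTIONS = {
    0x00: "divide error",
    0x06: "invalid opcode",
    0x08: "double fault",
    0x0D: "general protection fault",
    0x0E: "page fault",
}


def describe_trap(trap_hex):
    """Turn the hex trap field into (decimal vector, exception name)."""
    vector = int(trap_hex, 16)
    name = EXCEPTIONS.get(vector, "not in this short table")
    return vector, name


def decode_page_fault_error(error_hex):
    """Decode the three low bits of a 386 page-fault error code."""
    code = int(error_hex, 16)
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = supervisor mode, 1 = user mode
        "mode": "user" if code & 0x4 else "supervisor",
    }


if __name__ == "__main__":
    print(describe_trap("0000000E"))
    print(decode_page_fault_error("00000000"))
```

For the dump in the original post this gives vector 14 (page fault) and, if
the assumption about the error field holds, a supervisor-mode read of a page
that was not present.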
What you need to do is establish more of a reference point first. When
*exactly* does it happen, what's the system usage when it happens, does it
happen only during certain applications, etc.

Also, have you run any read-only surface scan of the hard disk? You should be
seeing other errors if there is a bad sector, even if it is in the swap area
of the hard disk. You may also want to consider using custom(C) to re-install
the LINK module so you can build a known clean kernel ... then re-add the
drivers for serial cards, tape drive, etc.
--
I'm just a wanna be UNIX guru (IJWBUG) | Micro Maintenance, Inc.
                                       | 2465 W. 12th St. #6
       -== Brad Fisher ==-             | Tempe, Arizona 85281
   ...!asuvax!mcdphx!hrc!microm        | 602/894-5526
rogerk@sco.COM (Roger Knopf 5502) (03/01/90)
In article <77688@tut.cis.ohio-state.edu> Mowgli Assor <mowgli@cis.ohio-state.edu> writes:
>Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
>panics (4 in 5 days). We have not changed anything for weeks! So, can someone
>explain a few things to me? First off, what each of the various codes below
>stands for (I would presume PC is program counter, KSP is some form of stack
>pointer, but as for the rest - ?). I do not know much about the Intel
>architecture, as far as what specific registers do & what the various traps
>stand for. Any information here would be appreciated.

You got PC and KSP right, usually (and for your purposes) none of the
rest are important. Only the driver writer would care. PC is really
the important one.

>The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
>Digiport smart modem boards. Thursday, we got the following error messages:
>
>Trap 0000000E in SYSTEM error = 00000000
> pc = 00000020:0001D40D
>Panic - Non-recoverable Kernal Page Fault
>
>Trap 0000000E in SYSTEM error = 00000000
> pc = 00000020:0001D40D
>Panic - Non-recoverable Kernal Page Fault
>
>Now, the fact that it seems to die with the PC in the same place each time
>makes me very suspicious. Of course, it is likely that only SCO can tell me
>where the OS is dying (as far as what program causes it).

You have discovered an important clue. You are in luck though because
you too can find out where it is dying:

1. Write down the PC (you did that).
2. Bring up your system (OK, I know this is obvious but....)
3. Type the command "adb /xenix". You will get the adb prompt "* ".
4. Type the command (use the offset from the PC in _your_ register dump):

        1d40d?ia

You should get something like:

        _iostart+45     ld      ax,ax

If it looks more like:

        00020:0001d40d  ld      ax,ax

then your kernel is stripped.
Edit the file "/usr/sys/conf/link_xenix" and take the "strip xenix" line
out, make and install a new kernel, wait for the problem to happen again,
and repeat this procedure.

What this tells you is that iostart is the routine it died in. With
any luck it will give you something recognizable that localizes the
problem to either the SCO kernel or some add-on driver.

If you can call SCO Support and tell them this up front, whoever you
talk to will love you forever. It makes it so much easier to figure out
what's going on.

>Has anyone else ever had a problem remotely like this? As I said, we have not
>changed any of our hardware or software setup within the last month, & yet
>this only started happening last Thursday. Any help would be greatly
>appreciated!

Yeah, we had this on a _very_important_ production system in house.
It started when the air conditioning broke and cleared up after it
was fixed. Clearly HW related. That doesn't mean that it is always
HW related, especially when the PC is always the same. PC always the
same is almost always software.

Hope this helps,

Roger Knopf
SCO Consulting Services
--- --
"His potential clients were always giving him the business."
        --Robert Thornton
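The bookkeeping in steps 1 and 4 above (pull the offset out of each "pc ="
field, check that it is the same on every panic, and form the string to type
at the adb prompt) can be sketched in a few lines of Python. This is only an
illustration: the function names are hypothetical, and the regular expression
assumes the "pc = selector:offset" layout seen in the dumps in this thread.

```python
import re

# Matches "pc = 00000020:0001D40D"; the layout is assumed from this thread.
PC_PATTERN = re.compile(r"\bpc\s*=\s*([0-9A-Fa-f]+):([0-9A-Fa-f]+)")


def pc_offsets(panic_text):
    """Return the offset half of every 'pc = selector:offset' field as ints."""
    return [int(offset, 16) for _selector, offset in PC_PATTERN.findall(panic_text)]


def same_pc_every_time(panic_text):
    """True if at least one PC was found and all of them are identical."""
    offsets = pc_offsets(panic_text)
    return len(offsets) > 0 and len(set(offsets)) == 1


def adb_lookup_command(offset):
    """Format the offset as typed in step 4: lowercase hex, no zero padding."""
    return "%x?ia" % offset


if __name__ == "__main__":
    dumps = (
        "pc = 00000020:0001D40D ksp = 060006D0\n"
        "pc = 00000020:0001D40D ksp = 060006B4\n"
    )
    if same_pc_every_time(dumps):
        print(adb_lookup_command(pc_offsets(dumps)[0]))
```

Run against the two dumps from the original post, the sketch reports a single
repeated PC and prints the command given in step 4, 1d40d?ia.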
barton@holston.UUCP (Barton A. Fisk) (03/01/90)
In article <77688@tut.cis.ohio-state.edu>, mowgli@sioux.cis.ohio-state.edu (Mowgli Assor) writes:
> Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
> panics (4 in 5 days). We have not changed anything for weeks! So, can someone
> explain a few things to me? First off, what each of the various codes below
> stands for (I would presume PC is program counter, KSP is some form of stack
> pointer, but as for the rest - ?). I do not know much about the Intel arch-

According to an old Discover article I read (I don't have it handy), you
should be able to take the info generated by a panic and, using a debugger
a la adb, trace the problem back to the source. Get your hands on a copy
if you can find one.

If I remember right, these problems are almost always hardware and/or
device related. One suggestion is to unplug things until it stops.

Sorry I can't be of more help but I'm not at the office now.
--
uucp: holston!barton
bob@consult.UUCP (Bob Willey) (03/02/90)
In article <77688@tut.cis.ohio-state.edu> Mowgli Assor <mowgli@cis.ohio-state.edu> writes:
>Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
>panics (4 in 5 days). We have not changed anything for weeks! So, can someone
>The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
>Digiport smart modem boards. Thursday, we got the following error messages:
>Trap 0000000E in SYSTEM error = 00000000
>Panic - Non-recoverable Kernal Page Fault
>Trap 0000000E in SYSTEM error = 00000000
>Panic - Non-recoverable Kernal Page Fault
>Address: mowgli@puffer.cis.ohio-state.edu (Mowgli Assor in real life)
> Or: mowgli@cis.ohio-state.edu, mowgli@osu-20.ircc.ohio-state.edu

This problem has been reported before. There was a problem with
the dkinit program and the way that the PS/2 drives reported
their info to it for the automatic install. Basically what happened
was that the drive was being formatted as 112 cylinders instead
of 110, and thus the page (swap) area is in non-existent space!!

Check how the drive is configured:

    ESDI Drive           70 MB    115 MB    300 MB
    Cylinders               70       110       300
    Heads                   64        64        64
    Sectors per track       32        32        32

If you have this problem, it will necessitate reformatting the
drive. The reason you probably did not notice it earlier (we
didn't notice it at first either) is the level of activity on the
system. When you finally had enough activity (bunch of tasks in
Pro maybe...), it filled up the swap space, and then the PANIC.

Hope this helps.
--
.. Computer Consulting Service    .. Bob Willey, CDP         ..
.. P.O. Drawer 1690               .. uunet!consult!bob       ..
.. Easton, Maryland 21601         .. (301) 820-4670          ..
...............................................................
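The geometry table above can be checked with a little arithmetic. The Python
sketch below multiplies cylinders by heads by sectors per track, then by a
sector size. The 512-byte sector size is an assumption (the post does not
state it), and the function names are made up for this example.

```python
# Sketch only: capacity arithmetic for the ESDI geometries in the table.
SECTOR_BYTES = 512  # assumption: sector size is not given in the post


def total_sectors(cylinders, heads=64, sectors_per_track=32):
    """Number of sectors addressed by a cylinders x heads x sectors geometry."""
    return cylinders * heads * sectors_per_track


def capacity_bytes(cylinders, heads=64, sectors_per_track=32):
    """Capacity in bytes under the assumed sector size."""
    return total_sectors(cylinders, heads, sectors_per_track) * SECTOR_BYTES


def phantom_sectors(formatted_cylinders, real_cylinders):
    """Sectors the format claims beyond what the drive really has."""
    return total_sectors(formatted_cylinders) - total_sectors(real_cylinders)


if __name__ == "__main__":
    for cylinders in (70, 110, 300):
        print(cylinders, "cylinders:", capacity_bytes(cylinders), "bytes")
    print("112 vs 110 cylinders:", phantom_sectors(112, 110), "extra sectors")
```

Under the 512-byte assumption, 110 cylinders come to 115,343,360 bytes, which
lines up with the "115 MB" column, and a drive formatted as 112 cylinders
claims 4096 sectors (2,097,152 bytes) that do not physically exist. Anything
placed at the very end of the disk, such as a swap area, could land there.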