[comp.unix.xenix] PANIC - Non-recoverable kernel page fault...

mowgli@sioux.cis.ohio-state.edu (Mowgli Assor) (02/28/90)

Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
panics (4 in 5 days). We have not changed anything for weeks! So, can someone
explain a few things to me? First off, what does each of the various codes
below stand for? (I would presume PC is program counter and KSP is some form
of stack pointer, but as for the rest - ?) I do not know much about the Intel
architecture, as far as what specific registers do & what the various traps
stand for. Any information here would be appreciated.

The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
Digiport smart modem boards. Thursday, we got the following error messages:

Trap 0000000E in SYSTEM     error = 00000000
   eax = F000272A
   ebx = 00000006
   ecx = 00000000
   edx = 000000A1
   esi = 0000C420
   edi = F000272A
   ebp = 060006F8
   fl  = 00010282
   uds = 00000018
   es  = 00000018
   fs  = 0000003F
   gs  = 0600003F
   pc  = 00000020:0001D40D
   ksp = 060006D0
Panic - Non-recoverable Kernal Page Fault

Trap 0000000E in SYSTEM     error = 00000000
   eax = F000272A
   ebx = 00000006
   ecx = 00000000
   edx = 000000A1
   esi = 0000C420
   edi = F000272A
   ebp = 060006DC
   fl  = 00010282
   uds = 06090018
   es  = 06000018
   fs  = 0000003F
   gs  = 0000003F
   pc  = 00000020:0001D40D
   ksp = 060006B4
Panic - Non-recoverable Kernal Page Fault

Then, today (Tuesday) when I arrived at work, I again found a kernel page
fault waiting for me. All values were the same as in the error messages
listed above, except for the following:

	esi = 0000B700
	ebp = 06000404
	gs  = 0001003F
	ksp = 060003DC

Now, the fact that it seems to die with the PC in the same place each time
makes me very suspicious. Of course, it is likely that only SCO can tell me
where the OS is dying (as far as what program causes it).
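
For anyone who wants to compare the dumps mechanically rather than by eye,
here is a minimal sketch in Python. The helper names parse_dump and
changed_registers are made up for illustration; the register values are
copied from the two Thursday dumps above.

```python
# Hypothetical helpers: compare two panic register dumps and report which
# registers differ. Values are copied from the two Thursday dumps.

THURSDAY_1 = """
   eax = F000272A
   ebx = 00000006
   ecx = 00000000
   edx = 000000A1
   esi = 0000C420
   edi = F000272A
   ebp = 060006F8
   fl  = 00010282
   uds = 00000018
   es  = 00000018
   fs  = 0000003F
   gs  = 0600003F
   pc  = 00000020:0001D40D
   ksp = 060006D0
"""

THURSDAY_2 = """
   eax = F000272A
   ebx = 00000006
   ecx = 00000000
   edx = 000000A1
   esi = 0000C420
   edi = F000272A
   ebp = 060006DC
   fl  = 00010282
   uds = 06090018
   es  = 06000018
   fs  = 0000003F
   gs  = 0000003F
   pc  = 00000020:0001D40D
   ksp = 060006B4
"""


def parse_dump(text):
    """Return {register: value} for every 'name = value' line in a dump."""
    regs = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not a register
        name, value = line.split("=", 1)
        regs[name.strip()] = value.strip().upper()
    return regs


def changed_registers(first, second):
    """List the registers whose values differ, in the order of the dump."""
    return [name for name in first if first[name] != second.get(name)]


first = parse_dump(THURSDAY_1)
second = parse_dump(THURSDAY_2)
print(changed_registers(first, second))  # ['ebp', 'uds', 'es', 'gs', 'ksp']
```

So between the two Thursday panics, eax through edi, fl, fs, and the PC are
identical; only ebp, uds, es, gs, and ksp differ.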

Has anyone else ever had a problem remotely like this? As I said, we have not
changed any of our hardware or software setup within the last month, & yet
this only started happening last Thursday. Any help would be greatly
appreciated!
					Thanks, <Mowgli>

PS. I RTFM already. It had very little of use to say on the subject.
-=-
Address: mowgli@puffer.cis.ohio-state.edu (Mowgli Assor in real life)
     Or: mowgli@cis.ohio-state.edu, mowgli@osu-20.ircc.ohio-state.edu
The 2 precepts of Semi-Divinity:	(1) Mind Thine Own Business.
					(2) Don't Worry About It.

brad@microm.UUCP (Bradley W. Fisher) (03/01/90)

In article <77688@tut.cis.ohio-state.edu>, mowgli@sioux.cis.ohio-state.edu (Mowgli Assor) writes:

> The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
> Digiport smart modem boards. Thursday, we got the following error messages:
> 
> Trap 0000000E in SYSTEM     error = 00000000
	<registers deleted>
>    pc  = 00000020:0001D40D
>    ksp = 060006D0

This part of a panic trap is described (and I have to use that word loosely)
in messages(M). The number in "Trap 0000000E in SYSTEM" is the trap number
given by the processor, in this case 0x0E (hex), which according to Intel's
data specs on the 386 is exception 14 (decimal), a page fault.
Therefore, from XENIX you get ...

> Panic - Non-recoverable Kernal Page Fault
> 
> Now, the fact that it seems to die with the PC in the same place each time
> makes me very suspicious. Of course, it is likely that only SCO can tell me
> where the OS is dying (as far as what program causes it).

I must agree, SCO should include more info on panic traps with the Run Time
System ... in fact, there is more info included under messages(M) under 2.3.2
and even more about the specific errors in the appendices of the Development
System:
	panic:non-recoverable kernel page fault
	(The system could not process a page fault.)

Doesn't tell you much, eh? I'm no guru (yet) but a page fault has to do
with virtual memory processing. According to Intel's data manual on the 386
processor, many things can cause a page fault. When one occurs, the processor
"traps" to the operating system, which is supposed to swap the page of memory
back in from the disk. If this process succeeds you'll not see an error, but
if it doesn't ... well, you know that part. See 8.6.1 thru 8.6.3 of the
Operations Guide for more info on when it ain't broke, and how to improve
page swapping.
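
To make the numbers concrete, here is a small sketch in Python that decodes
the trap number and the low three bits of the error code. The bit meanings
are the ones Intel documents for the 386 page-fault error code. The names
(EXCEPTIONS_386, describe_trap) are mine, the exception table is only a
partial list, and I am assuming that the "error = " field XENIX prints is
the raw error code pushed by the processor.

```python
# Partial list of 386 exception numbers (illustrative, not complete).
EXCEPTIONS_386 = {
    0: "divide error",
    6: "invalid opcode",
    8: "double fault",
    11: "segment not present",
    12: "stack exception",
    13: "general protection",
    14: "page fault",
}


def describe_trap(trap_hex, error_hex):
    """Decode the 'Trap' and 'error' fields of a panic message.

    Assumption: 'error' is the raw processor error code. For a page fault
    on the 386, bit 0 = protection violation (else page not present),
    bit 1 = write (else read), bit 2 = user mode (else supervisor mode).
    """
    trap = int(trap_hex, 16)
    error = int(error_hex, 16)
    name = EXCEPTIONS_386.get(trap, "not in this partial table")
    details = []
    if trap == 14:
        details.append("protection violation" if error & 1 else "page not present")
        details.append("write" if error & 2 else "read")
        details.append("user mode" if error & 4 else "supervisor mode")
    return trap, name, details


print(describe_trap("0000000E", "00000000"))
# (14, 'page fault', ['page not present', 'read', 'supervisor mode'])
```

Under that assumption, the dump describes a read of a page that was not
present, made in supervisor mode, which would be consistent with the
"in SYSTEM" part of the message.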

I know this still doesn't explain WHY it is happening in the first place;
the problem could be either hardware or software related. What you need to
do is establish more of a reference point first. When *exactly* does it
happen, what's the system usage when it happens, does it happen only during
certain applications, etc. Also, have you run a read-only surface scan
of the hard disk? You should be seeing other errors if there is a bad
sector, even if it is in the swap area of the hard disk. You may also
want to consider using custom(C) to re-install the LINK module so you
can build a known clean kernel ... then re-add the drivers for serial
cards, tape drive, etc.
-- 
I'm just a wanna be UNIX guru (IJWBUG)               | Micro Maintenance, Inc.
						     | 2465 W. 12th St. #6
	   -== Brad Fisher ==- 		             | Tempe, Arizona  85281
     ...!asuvax!mcdphx!hrc!microm		     | 602/894-5526

rogerk@sco.COM (Roger Knopf 5502) (03/01/90)

In article <77688@tut.cis.ohio-state.edu> Mowgli Assor <mowgli@cis.ohio-state.edu> writes:
>Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
>panics (4 in 5 days). We have not changed anything for weeks! So, can someone
>explain a few things to me? First off, what each of the various codes below
>stands for (I would presume PC is program counter, KSP is some form of stack
>pointer, but as for the rest - ?). I do not know much about the Intel arch-
>itecture, as far as what specific registers do & what the various traps stand
>for. Any information here would be appreciated.
 
You got PC and KSP right. Usually (and for your purposes) none of the
rest are important; only the driver writer would care. PC is really
the important one.

>The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
>Digiport smart modem boards. Thursday, we got the following error messages:
>
>Trap 0000000E in SYSTEM     error = 00000000
>   pc  = 00000020:0001D40D
>Panic - Non-recoverable Kernal Page Fault
>
>Trap 0000000E in SYSTEM     error = 00000000
>   pc  = 00000020:0001D40D
>Panic - Non-recoverable Kernal Page Fault
>
>Now, the fact that it seems to die with the PC in the same place each time
>makes me very suspicious. Of course, it is likely that only SCO can tell me
>where the OS is dying (as far as what program causes it).
 
You have discovered an important clue. You are in luck though because
you too can find out where it is dying:

1. Write down the PC (you did that).
2. Bring up your system (OK, I know this is obvious but....)
3. Type the command "adb /xenix"
   You will get the adb prompt "* "
4. Type the command (use the offset from the PC in _your_ register dump):

	1d40d?ia

   You should get something like:

	_iostart+45 	ld ax,ax

   If it looks more like:

	00020:0001d40d	ld ax,ax

   then your kernel is stripped. Edit the file "/usr/sys/conf/link_xenix"
   and take the "strip xenix" line out, make and install a new kernel,
   wait for the problem to happen again, and then repeat this procedure.

What this tells you (in this example) is that iostart is the routine it
died in. With any luck it will be something that is recognizable and can
localize it to either the SCO kernel or some add-on driver. If you can call
SCO Support and tell them this up front, whoever you talk to will
love you forever. Makes it so much easier to figure out what's going on.
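
If you want to avoid typos when turning the PC from a register dump into
the adb command in step 4, the conversion can be sketched as below. This is
just an illustration in Python; the function name adb_command is made up,
and it only builds the string to type at the adb prompt, it does not run
adb.

```python
def adb_command(pc_line):
    """Build the adb '?ia' command from a 'pc = selector:offset' dump line.

    Accepts either the whole line ('   pc  = 00000020:0001D40D') or just
    the 'selector:offset' value. Only the offset is passed to adb, in
    lowercase hex without leading zeros.
    """
    value = pc_line.split("=")[-1].strip()
    selector, offset = value.split(":")
    return format(int(offset, 16), "x") + "?ia"


print(adb_command("   pc  = 00000020:0001D40D"))  # 1d40d?ia
```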

>Has anyone else ever had a problem remotely like this? As I said, we have not
>changed any of our hardware or software setup within the last month, & yet
>this only started happening last Thursday. Any help would be greatly ap-
>preciated!

Yeah, we had this on a _very_important_ production system in house. It
started when the air conditioning broke and cleared up after it was
fixed. Clearly HW related. That doesn't mean that it is always HW
related, especially when the PC is always the same. A PC that is always
the same almost always means software.

Hope this helps,

Roger Knopf
SCO Consulting Services
---
-- 

"His potential clients were always giving him the business."
	--Robert Thornton

barton@holston.UUCP (Barton A. Fisk) (03/01/90)

In article <77688@tut.cis.ohio-state.edu>, mowgli@sioux.cis.ohio-state.edu (Mowgli Assor) writes:
> Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
> panics (4 in 5 days). We have not changed anything for weeks! So, can someone
> explain a few things to me? First off, what each of the various codes below
> stands for (I would presume PC is program counter, KSP is some form of stack
> pointer, but as for the rest - ?). I do not know much about the Intel arch-

According to an old Discover article I read (I don't have it handy),
you should be able to take the info generated by a panic and, using
a debugger a la adb, trace the problem back to the source.
Get your hands on a copy if you can find one.

If I remember right, these problems are almost always hardware and/or
device related.

One suggestion is to unplug things until it stops.

Sorry I can't be of more help but I'm not at the office now.
-- 
uucp: holston!barton

bob@consult.UUCP (Bob Willey) (03/02/90)

In article <77688@tut.cis.ohio-state.edu> Mowgli Assor <mowgli@cis.ohio-state.edu> writes:
>Well, here it goes again! <Sigh> Recently, we have had a rash of page fault
>panics (4 in 5 days). We have not changed anything for weeks! So, can someone
>The machine we are using is an IBM PS/2 Model 80, 6Meg RAM, ~100Meg HD, w/3
>Digiport smart modem boards. Thursday, we got the following error messages:
>Trap 0000000E in SYSTEM     error = 00000000
>Panic - Non-recoverable Kernal Page Fault
>Trap 0000000E in SYSTEM     error = 00000000
>Panic - Non-recoverable Kernal Page Fault
>Address: mowgli@puffer.cis.ohio-state.edu (Mowgli Assor in real life)
>     Or: mowgli@cis.ohio-state.edu, mowgli@osu-20.ircc.ohio-state.edu


This problem has been reported before.
There was a problem with the dkinit program and the
way that the PS/2 drives reported the info to it for
the automatic install. Basically what happened was that
the drive was being formatted as 112 cylinders instead of 110,
and thus the page (swap) area is in non-existent space!!
Check how the drive is configured:
 ESDI Drive             70 MB    115 MB    300 MB
 Cylinders                 70       110       300
 Heads                     64        64        64
 Sectors per track         32        32        32
..
If you have this problem, it will necessitate reformatting the
drive. The reason you probably did not notice it before
(we didn't notice it at first either) is the level of
activity on the system. Once you finally had enough activity
(a bunch of tasks in Pro maybe...), it filled up the
swap space, and then the PANIC.
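
As a rough check on the geometry figures, here is a small sketch in Python.
The 512 bytes per sector is my assumption (it is not stated above), and the
function name capacity_bytes is made up for illustration.

```python
BYTES_PER_SECTOR = 512  # assumption: sector size is not given in the table


def capacity_bytes(cylinders, heads=64, sectors_per_track=32):
    """Capacity implied by a cylinders/heads/sectors geometry."""
    return cylinders * heads * sectors_per_track * BYTES_PER_SECTOR


real = capacity_bytes(110)       # geometry the 115 MB drive should have
formatted = capacity_bytes(112)  # geometry it was actually formatted with
overhang = formatted - real      # space the system believes exists but doesn't

print(real)                          # 115343360
print(overhang)                      # 2097152
print(overhang // BYTES_PER_SECTOR)  # 4096
```

With that sector size, each cylinder is 1 MB (2^20 bytes), 110 cylinders
come out at about 115 million bytes, and the two extra cylinders amount to
2 MB (4096 sectors) of non-existent space.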

Hope this helps.

-- 
.. Computer Consulting Service     ..    Bob Willey, CDP     ..
.. P.O. Drawer 1690                ..    uunet!consult!bob   ..
.. Easton, Maryland  21601         ..    (301) 820-4670      ..
...............................................................