[comp.unix.i386] Kernel trap type 0xE

eric@mks.mks.com (Eric Gisin) (07/23/90)

We have been getting a lot of kernel type 0xE traps
on our Interactive 2.2 system. Under 2.0 the machine
would crash a lot but there were usually no clues as to why.

I looked up interrupt 0xE in the Intel manual and it is a page fault.
Several people have claimed they are hardware related (memory or bus),
but I can't see how this can be the case. You only get this fault
when an address mapping results in a page table entry with the
page-present bit of zero or when page access is denied.
If a hardware problem were causing random page faults,
you would expect some of them to occur in user mode,
resulting in a process being killed with SIGSEGV,
but I haven't seen any of these.
Can anyone explain or speculate how a hardware problem
could be causing page faults?
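
For reference, on the 80386 trap 0xE pushes an error code whose low three
bits say why the fault happened, and CR2 holds the faulting linear address.
The following is only a minimal sketch of decoding those bits; the helper
name describe_page_fault and the PF_* macro names are made up for
illustration and are not taken from the Interactive kernel.

```c
#include <stdio.h>
#include <stdint.h>

/* Page-fault error code bits pushed by the 80386 on trap 0xE. */
#define PF_PROT  0x1u  /* 0 = page not present, 1 = protection violation */
#define PF_WRITE 0x2u  /* 0 = read access,      1 = write access         */
#define PF_USER  0x4u  /* 0 = supervisor mode,  1 = user mode            */

/*
 * Hypothetical helper: format a one-line description of a page fault
 * from the error code and the faulting address (the value of CR2).
 * Returns whatever snprintf returns.
 */
int describe_page_fault(uint32_t err, uint32_t cr2, char *buf, size_t len)
{
    return snprintf(buf, len, "%s-mode %s of 0x%08lx: %s",
                    (err & PF_USER)  ? "user"  : "kernel",
                    (err & PF_WRITE) ? "write" : "read",
                    (unsigned long)cr2,
                    (err & PF_PROT)  ? "protection violation"
                                     : "page not present");
}
```

Under these assumptions, a kernel read through a null pointer would
typically show up as error code 0 with a very low address in CR2, while a
fault taken in user mode has PF_USER set.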

Two crashes I did debug had a null-pointer dereference in namei(),
one called from open(), the other from exece().
Two other crashes occurred during reboot in wdrintr().
Could these be software-bug generated faults?

We also had to upgrade our CPU from the SX211 version to SX219
to get 2.2 boot/install to come up. This required removing the
motherboard to get the CPU to seat properly. Not recommended
for non-hardware gurus. Does Interactive acknowledge that
you may have to upgrade your CPU to install 2.2?

jackv@turnkey.tcc.com (Jack F. Vogel) (07/26/90)

In article <ERIC.90Jul23172008@mks.mks.com> eric@mks.mks.com (Eric Gisin) writes:
>We have been getting a lot of kernel type 0xE traps
>on our Interactive 2.2 system. Under 2.0 the machine
>would crash a lot but there were usually no clues as to why.
 
You see, this is an improvement: now instead of crashing and burning without
any clue, you get a trap type :-} :-}.

>I looked up interrupt 0xE in the Intel manual and it is a page fault.
>Several people have claimed they are hardware related (memory or bus),
>but I can't see how this can be the case. You only get this fault
>when an address mapping results in a page table entry with the
>page-present bit of zero or when page access is denied.

>Can anyone explain or speculate how a hardware problem
>could be causing page faults?
 
Yes, quite simply. First off, understand that the kernel, unlike a user
process, is not demand paged; it does not have pages stolen if they are not
referenced. Therefore a page fault, something quite common and frequent for
a user process, is indicative of a crisis when running in the kernel. All
kernel pages should be in core at all times (note that this is
implementation specific). Now, you wonder how page faulting in the kernel
can be a hardware problem. The most easily imagined case is where the memory
subsystem just isn't quite up to snuff, and when the CPU goes to read
something off the data bus it gets garbage instead of legitimate data. In
fact, case in point, your own words:

>Two crashes I did debug had a null-pointer dereference in namei(),
>one called from open(), the other from exece().
>Two other crashes occurred during reboot in wdrintr().
>Could these be software-bug generated faults?
 
A pointer is a perfect example. We read memory to get an address, and we
expect a legitimate kernel address. However, in that one-in-a-million case
the hardware burps and we read garbage off the data bus. We then dereference
(or branch to) that spurious address and, oh no, guess what, page fault! Of
course, this is only one simple scenario; things could be much more complex,
but I hope it illustrates the point.
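
To make that scenario concrete, here is a rough sketch in C of a single
bit being read wrong in a stored kernel pointer. The region bounds
KERN_BASE and KERN_END and both function names are assumptions picked for
illustration; the real layout depends on the kernel and the machine.

```c
#include <stdint.h>

/* Assumed, illustrative bounds of the mapped kernel region. */
#define KERN_BASE 0xD0000000u
#define KERN_END  0xD0400000u

/* Returns 1 if addr falls inside the assumed mapped kernel region. */
int kernel_addr_mapped(uint32_t addr)
{
    return addr >= KERN_BASE && addr < KERN_END;
}

/* Simulate one bit coming off the data bus wrong. */
uint32_t flip_bit(uint32_t value, unsigned bit)
{
    return value ^ ((uint32_t)1 << bit);
}
```

With these made-up bounds, a good pointer of 0xD0012340 is mapped, but
flip_bit(0xD0012340, 31) gives 0x50012340, which is not, so using it in
kernel mode would trap. A flip in a low bit (say bit 2) leaves the address
mapped, so that kind of error would not fault at all; it would just hand
back the wrong data.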

Indeed, the very fact that your crashes ranged all over the code should
have been a clue. I mean, do you think the kernel code is mysteriously
changing from one minute to the next? The typical kernel software problem
is one where the panic happens at a fixed location, like in some third
party device driver :-}!
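
As a rough illustration of that rule of thumb, one could count how many
distinct faulting instruction addresses turn up across a set of panics.
This is only a sketch: the function name distinct_fault_sites is
hypothetical, and the caller is assumed to have collected the EIP values
from the panic messages by hand.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many different faulting addresses appear in eip[0..n-1].
 * A count of 1 over many panics points at one piece of code; a count
 * close to n suggests the faults are scattered.
 */
size_t distinct_fault_sites(const uint32_t *eip, size_t n)
{
    size_t distinct = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Look for an earlier panic at the same address. */
        for (j = 0; j < i; j++) {
            if (eip[j] == eip[i])
                break;
        }
        if (j == i)   /* no earlier match: first time we see this site */
            distinct++;
    }
    return distinct;
}
```

This is a heuristic only. A small count does not prove a software bug, and
a large one does not prove bad hardware.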

Naturally, there are cases where it is a combination of hardware and software,
like when some code only gets executed when certain unique hardware is present
and that code has a problem and/or doesn't handle the hardware properly
(again, device drivers being the most common). But far more common these days
are scenarios similar to mine, where the code is just fine but the hardware
has heartburn :-}.

Hope this sheds some light.

Disclaimer: Me speak for LCC or IBM ?? I don't wear suits and ties :-}!

-- 
Jack F. Vogel			jackv@locus.com
AIX370 Technical Support	       - or -
Locus Computing Corp.		jackv@turnkey.TCC.COM