eric@mks.mks.com (Eric Gisin) (07/23/90)
We have been getting a lot of kernel type 0xE traps on our Interactive 2.2 system. Under 2.0 the machine would crash a lot, but there were usually no clues as to why.

I looked up interrupt 0xE in the Intel manual and it is a page fault. Several people have claimed these traps are hardware related (memory or bus), but I can't see how this can be the case. You only get this fault when an address translation reaches a page table entry whose page-present bit is zero, or when page access is denied. If a hardware problem were causing random page faults, you would expect some of them to occur in user mode, resulting in a process being killed with SIGSEGV, but I haven't seen any of these. Can anyone explain or speculate how a hardware problem could be causing page faults?

Two crashes I did debug had a null-pointer dereference in namei(), one called from open(), the other from exece(). Two other crashes occurred during reboot in wdrintr(). Could these be software-bug generated faults?

We also had to upgrade our CPU from the SX211 version to SX219 to get the 2.2 boot/install to come up. This required removing the motherboard to get the CPU to seat properly. Not recommended for non-hardware gurus. Does Interactive acknowledge that you may have to upgrade your CPU to install 2.2?
jackv@turnkey.tcc.com (Jack F. Vogel) (07/26/90)
In article <ERIC.90Jul23172008@mks.mks.com> eric@mks.mks.com (Eric Gisin) writes:
>We have been getting a lot of kernel type 0xE traps
>on our Interactive 2.2 system. Under 2.0 the machine
>would crash a lot but there were usually no clues as to why.

You see, this is an improvement: now, instead of crashing and burning without any clue, you get a trap type :-} :-}.

>I looked up interrupt 0xE in the Intel manual and it is a page fault.
>Several people have claimed they are hardware related (memory or bus),
>but I can't see how this can be the case. You only get this fault
>when an address mapping results in a page table entry with the
>page-present bit of zero or when page access is denied.
>Can anyone explain or speculate how a hardware problem
>could be causing page faults?

Yes, quite simply. First off, understand that the kernel, unlike a user process, is not demand paged; it does not have pages stolen if they are not referenced. Therefore a page fault, something quite common and frequent for a user process, is indicative of a crisis when running in the kernel. All kernel pages should be in core at all times (note that this is implementation specific).

Now you wonder how page faulting in the kernel can be a hardware problem. The most easily imagined case is where the memory subsystem just isn't quite up to snuff, and when the CPU goes to read something on the data bus it gets garbage instead of legitimate data. In fact, case in point, your own words:

>Two crashes I did debug had a null-pointer dereference in namei(),
>one called from open() the other for exece().
>Two other crashes occurred during reboot in wdrintr().
>Could these be software-bug generated faults?

A pointer is a perfect example: we read memory to get an address, and we expect a legitimate kernel address. However, in that one-in-a-million case the hardware burps and we read garbage on the data bus, we dereference or branch to that spurious address, and oh no, guess what, page fault!
Of course, this is only one simple scenario; things could be much more complex, but I hope it illustrates the point. In fact, the very fact that your crashes ranged all over the code should have been a clue. I mean, do you think the kernel code is mysteriously changing from one minute to the next? The typical kernel software problem is one where the panic happens at a fixed location, like in some third-party device driver :-}! Naturally, there are cases where it is a combination of hardware and software, such as when some code only gets executed when certain unique hardware is present and that code has a problem and/or doesn't handle the hardware properly (again, device drivers being the most common). But far more common these days are scenarios similar to mine, where the code is just fine but the hardware has heartburn :-}.

Hope this sheds some light.

Disclaimer: Me speak for LCC or IBM ?? I don't wear suits and ties :-}!

--
Jack F. Vogel                   jackv@locus.com
AIX370 Technical Support            - or -
Locus Computing Corp.           jackv@turnkey.TCC.COM