[comp.unix.i386] "PANIC: kernel mode trap. Type 0x0000000E" msg in 386/ix 2.0.2 ?????

jdg0@GTE.COM (Jose Diaz-Gonzalez) (07/17/90)

Hi there,

My machine has been crashing about twice daily for the last week or so.
The msg in the subject line shows up with all the register contents just
before it crashes.  I have contacted my vendor, they contacted ISC, and
all they were able to tell me was that the problem is a hardware
error.  Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
everything appears to be OK.  Does anyone have any idea of what a type
0x0000000E error means?  This might help me to narrow down the
alternatives.   Any pointers will be appreciated.  Thanks,

	-- Jose


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+				     +					+
+	Jose Pedro Diaz-Gonzalez     +		                    	+
+	SrMTS			     +					+
+	GTE Laboratories, Inc.       +	   Tel:   (617) 466-2584	+
+	MS-46                 	     +	   email: jdiaz@gte.com    	+
+	40 Sylvan Rd.		     +					+
+	Waltham, MA 02254	     +					+
+				     +					+
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

peter@ncsbv.UUCP (Peter Jannesen) (07/18/90)

In article <9480@bunny.GTE.COM> jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
>Hi there,
>
>My machine has been crashing about twice daily for the last week or so.
>The msg in the subject line shows up with all the register contents just
>before it crashes.  I have contacted my vendor, they contacted ISC, and
>all they were able to tell me was that the problem is a hardware
>error.  Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
>everything appears to be OK.  Does anyone have any idea of what a type
>0x0000000E error means?  This might help me to narrow down the
>alternatives.   Any pointers will be appreciated.  Thanks,
>

Panic 0x0000000E is a memory violation in the kernel. This can be a hardware
problem. Another possibility is a bug in the TCP/IP modules in the kernel:
there is a bug in the TCP/IP streams module which generates a memory violation.

This is a very old problem in 386/ix version 2.0.2. We hope that the
new version is better.

===============================================================================
Peter Jannesen
Network Communication Systems (N.C.S), The Netherlands
Phone:  +31104130093				 Fax:    +31104146452
Address: Westbaak 96                             Email:  peter@ncsbv
	 3012 KM Rotterdam, The Netherlands
===============================================================================


aland@infmx.UUCP (Colonel Panic) (07/18/90)

In article <9480@bunny.GTE.COM> jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
>Hi there,
>
>My machine has been crashing about twice daily for the last week or so.
>The msg in the subject line shows up with all the register contents just
>before it crashes.  I have contacted my vendor, they contacted ISC, and
>all they were able to tell me was that the problem is a hardware
>error.  

On what basis?  If they can confidently make such a statement, they 
should have a clue as to WHICH piece of hardware is the culprit...

>        Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
>everything appears to be OK.

Don't count on it -- those diags aren't very complete.  (At least you GOT
some -- I'm STILL waiting for the diag disks for the machines we bought
in December!  I have one set from our sales rep to tide me over...)

For example, they don't provide diags for all peripherals, the POST memory 
diags are hosed, etc.  

>                              Does anyone have any idea of what a type
>0x0000000E error means?  This might help me to narrow down the
>alternatives.   Any pointers will be appreciated.  Thanks,

I'm no kernel hacker, but I have seen this error when DMA was attempted
on two channels simultaneously when DMAEXCL was set at default.  Do you
use >1 device that uses DMA?  (Cartridge drives, caching controllers, etc.)
Otherwise, it may help to list all of the hardware in use.

>+	Jose Pedro Diaz-Gonzalez

Cross-posted to comp.sys.att in case the helpful folks there have 
add'l info.  Followups to comp.unix.i386.

--
Alan Denney  @  Informix Software, Inc.          "We're homeward bound
aland@informix.com  {pyramid|uunet}!infmx!aland   ('tis a damn fine sound!)
-----------------------------------------------   with a good ship, taut & free
 Disclaimer:  These opinions are mine alone.      We don't give a damn, 
 If I am caught or killed, the secretary          when we drink our rum
 will disavow any knowledge of my actions.        with the girls of old Maui."

darryl@ism780c.isc.com (Darryl Richman) (07/18/90)

In article <9480@bunny.GTE.COM> jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
"My machine has been crashing about twice daily for the last week or so.
"The msg in the subject line shows up with all the register contents just
"before it crashes.  I have contacted my vendor, they contacted ISC, and
"all they were able to tell me was that the problem is a hardware
"error.  Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
"everything appears to be OK.  Does anyone have any idea of what a type
"0x0000000E error means?  This might help me to narrow down the
"alternatives.   Any pointers will be appreciated.  Thanks,

You can do a bit of tracing yourself to see what is going on.  A trap E
is a page fault--which usually means that there is a bad pointer being
followed in the kernel.  You can discover what routine within the kernel
is causing the problem by noting the EIP value in the register dump,
and after rebooting, do "nm -vexp /unix | sort >/tmp/foo".  Then edit
/tmp/foo and look for the first 5 digits or so of the EIP value...the
greatest address less than or equal to your EIP value is the routine
that was executing.
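
The lookup described above can be sketched in a few lines of Python.  This
is only an illustration under assumptions: it supposes that each line of
the nm listing has a hexadecimal address in its first field and the symbol
name in its last field, which may not match the exact output of nm on
386/ix.  The names parse_symbols and find_routine, and the sample symbols,
are made up for the example.

```python
import bisect

# Hypothetical listing; real "nm" output on 386/ix may be laid out differently.
SAMPLE_LISTING = """\
0xd0010000 T routine_a
0xd0010400 T routine_b
0xd0010900 T routine_c
"""


def parse_symbols(lines):
    """Return sorted (address, name) pairs from nm-style text lines."""
    symbols = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            # int(..., 16) accepts both "d0010000" and "0xd0010000".
            address = int(fields[0], 16)
        except ValueError:
            continue  # skip lines whose first field is not a hex address
        symbols.append((address, fields[-1]))
    symbols.sort()
    return symbols


def find_routine(symbols, eip):
    """Return (name, offset) for the greatest symbol address <= eip, or None."""
    addresses = [address for address, _ in symbols]
    index = bisect.bisect_right(addresses, eip) - 1
    if index < 0:
        return None  # EIP lies below every known symbol
    address, name = symbols[index]
    return name, eip - address


if __name__ == "__main__":
    table = parse_symbols(SAMPLE_LISTING.splitlines())
    print(find_routine(table, 0xD0010420))
```

The offset returned alongside the name says how far into the routine the
faulting instruction was, which is worth passing along to the vendor as
well.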

An even easier way to do this is to configure your kernel
with the kernel debugger.  When the panic occurs, you will drop into the
debugger.  Type "stack" to see a stack backtrace.  You will also see the
instruction that caused the fault.  This will give you much more information
with which to get an answer out of your reseller and, ultimately,
ISC.

A "hardware error" means nothing.  Either your vendor misunderstood the
reply or hasn't pushed very hard on your behalf.  Unix tends to be a
much harder test of the hardware than the vendor's diagnostics;  we had
a case where a certain vendor was shipping cards that worked fine under
DOS and passed all of their tests just fine, but would never send an
interrupt;  needless to say, Unix found this out quickly.  When discussing
a problem like this, it is extremely important to pass along as much
information about your configuration as possible--all of the boards,
their interrupt and DMA numbers, how much memory, the make, model, and
geometry of the disks (if they are involved), whose motherboard, any
coprocessors, and so on.  All of these things tend to interact.

		--Darryl Richman
-- 
Copyright (c) 1990 Darryl Richman    The views expressed are the author's alone
darryl@ism780c.isc.com 		      INTERACTIVE Systems Corp.-A Kodak Company
 "For every problem, there is a solution that is simple, elegant, and wrong."
	-- H. L. Mencken

thssdwv@iitmax.IIT.EDU (David William Vrona) (07/19/90)

In article <45326@ism780c.isc.com> darryl@ism780c.UUCP (Darryl Richman) writes:
>In article <9480@bunny.GTE.COM> jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
>"My machine has been crashing about twice daily for the last week or so.
>"The msg in the subject line shows up with all the register contents just
>"before it crashes.  I have contacted my vendor, they contacted ISC, and
>"all they were able to tell me was that the problem is a hardware
>"error.  Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
>"everything appears to be OK.  Does anyone have any idea of what a type
>"0x0000000E error means?  This might help me to narrow down the
>"alternatives.   Any pointers will be appreciated.  Thanks,
>
>You can do a bit of tracing yourself to see what is going on.  A trap E
>is a page fault--which usually means that there is a bad pointer being
>followed in the kernel.  You can discover what routine within the kernel
>is causing the problem by noting the EIP value in the register dump,
>and after rebooting, do "nm -vexp /unix | sort >/tmp/foo".  Then edit
>/tmp/foo and look for the first 5 digits or so of the EIP value...the
>greatest address less than or equal to your EIP value is the routine
>that was executing.
>
>An even easier way to do this is to configure your kernel
>with the kernel debugger.  When the panic occurs, you will drop into the
>debugger.  Type "stack" to see a stack backtrace.  You will also see the
>instruction that caused the fault.  This will give you much more information
>with which to get an answer out of your reseller and, ultimately,
>ISC.
>
>A "hardware error" means nothing.  Either your vendor misunderstood the
>reply or hasn't pushed very hard on your behalf.  Unix tends to be a
>much harder test of the hardware than the vendor's diagnostics;  we had
>a case where a certain vendor was shipping cards that worked fine under
>DOS and passed all of their tests just fine, but would never send an
>interrupt;  needless to say, Unix found this out quickly.  When discussing
>a problem like this, it is extremely important to pass along as much
>information about your configuration as possible--all of the boards,
>their interrupt and DMA numbers, how much memory, the make, model, and
>geometry of the disks (if they are involved), whose motherboard, any
>coprocessors, and so on.  All of these things tend to interact.

It's a hardware problem.  Exact same thing happened to me.  Took me two 
months to realize it was a noisy (electrically that is) power supply.
Borrow a supply from a friend before you knock yourself out with all the
other stuff.

tyager@maxx.UUCP (Tom Yager) (07/19/90)

In article <9480@bunny.GTE.COM>, jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
> Hi there,
> 
> My machine has been crashing about twice daily for the last week or so.
> ...Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
> everything appears to be OK.  Does anyone have any idea of what a type
> 0x0000000E error means?  This might help me to narrow down the
> alternatives.   Any pointers will be appreciated.  Thanks,

I just went through a round of problems revolving around this error message.
I wish I could tell you what the problem is, but neither the vendor nor I
were able to put our fingers on it. The only thing I can tell you about my
own experience with this error is that it seems to flag a fundamental
problem with the way the system talks to memory. The system in question had
16MB of memory, 8 of which was on a 32-bit expansion card. With the card
installed, the system would panic. Sometimes immediately, sometimes after
running OK for hours. With the extra memory removed, it would hum along and
run forever.

There are probably a thousand conditions that could trigger this error, but
what you've been told so far jibes with my own experience: It is a hardware
problem. See if your dealer/distributor/whoever is willing to swap your
machine for you. My problem wasn't solved until my vendor sent me a system
based on a completely different motherboard design, so you might see how
an older (25 or 20 MHz) 6386 behaves. If, that is, you can live with the
decrease in speed.

(ty)

-- 
+--Tom Yager, Technical Editor, BYTE----Reviewer, UNIX World---------------+
|  UUCP: decvax!maxx!tyager          NET: maxx!tyager@bytepb.byte.com      |
|  Always looking for qualified UNIX,Mac,DOS and OS/2 software reviewers-- |
+--mail to "reviews" instead of "tyager" above.---I speak only for myself.-+

beser@tron.UUCP (Eric Beser) (07/19/90)

I had the same problem. It turned out to be a memory chip that was not
seated properly. What a way to find the problem!

Eric Beser                beser@tron.bwi.wec.com
(301)-765-1010

"Captain, I think we can do it!"
"Make it so, number one!"

overby@plains.UUCP (Glen Overby) (07/19/90)

In article <9480@bunny.GTE.COM>, jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
> My machine has been crashing about twice daily for the last week or so.
> ...Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
> everything appears to be OK.  Does anyone have any idea of what a type
> 0x0000000E error means?  This might help me to narrow down the
> alternatives.   Any pointers will be appreciated.  Thanks,

We have an older Zenith 386 running 2.0.2 that also panics with the same
obscure error message, but only when we yank serial port plugs on our
DigiBoard COM/8.  I looked up the error code in a 386 data sheet, but that
didn't help much :-)
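
For reference, the trap type printed by the panic appears to be the 80386
exception vector number, so type 0x0E is vector 14, the page fault.  A
small lookup table, sketched here in Python, makes the data sheet reading
easier.  The vector names follow Intel's 80386 documentation; the
assumption that the kernel's "Type" field is the raw vector number is drawn
from this thread, not from ISC documentation, and describe_trap is a
made-up helper.

```python
# 80386 processor exception vectors (vector 0x0F and 0x11 upward are reserved).
I386_EXCEPTIONS = {
    0x00: "divide error",
    0x01: "debug exception",
    0x02: "non-maskable interrupt (NMI)",
    0x03: "breakpoint",
    0x04: "overflow (INTO)",
    0x05: "BOUND range exceeded",
    0x06: "invalid opcode",
    0x07: "coprocessor not available",
    0x08: "double fault",
    0x09: "coprocessor segment overrun",
    0x0A: "invalid TSS",
    0x0B: "segment not present",
    0x0C: "stack fault",
    0x0D: "general protection fault",
    0x0E: "page fault",
    0x10: "coprocessor error",
}


def describe_trap(type_field):
    """Map a hex string such as "0x0000000E" to an exception name.

    Assumes the panic's "Type" value is the processor vector number.
    """
    vector = int(type_field, 16)
    return I386_EXCEPTIONS.get(vector, "reserved or unknown vector")


if __name__ == "__main__":
    print(describe_trap("0x0000000E"))
```

The table only names the exception; it says nothing about why the fault
happened, which is why the same trap type shows up for so many unrelated
causes.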

Our DigiBoard, one of the "dumb" ones with no CPU, is set up to use the COM2
interrupt, since COM1 is on the main CPU board.  We have talked to the
people at DigiBoard about the problem, and their solution has always been to
send us a new version of the driver (they release a new one about every two
or three months).

The panic occurs not only when we yank a port plug off, but also when our
Gandalf StarMaster gets a headache and drops carrier on all of our ports
(such as when they power it down for service, or when the power circuit the
Gandalf is on goes down but ours doesn't).  It only happens once a month at
the most, so we've been living with it.  We are, obviously, using the
modem control minor devices (/dev/ttyd1[A-H] rather than /dev/ttyd1[a-h]).
The problem does NOT occur on lines without modem control.

I suspect we're fighting different fires, but this might be something to
look at.

Good Luck!
-- 
		Glen Overby	<overby@plains.nodak.edu>
	uunet!plains!overby (UUCP)  overby@plains (Bitnet)

erik@westworld.esd.sgi.com (Erik Fortune) (07/20/90)

*My* "Kernel mode trap, type 0x00000E" under SCO ODT turned out to be
a motherboard problem.   Never can tell...

-- Erik
   (erik@sgi.com)

dpi@loft386.uucp (Doug Ingraham) (07/20/90)

In article <63@maxx.UUCP>, tyager@maxx.UUCP (Tom Yager) writes:
> In article <9480@bunny.GTE.COM>, jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
> > Hi there,
> > 
> > My machine has been crashing about twice daily for the last week or so.
> > ...Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
> > everything appears to be OK.  Does anyone have any idea of what a type

Nobody has diagnostics that are any good.  In our case Unix uses lots
more of the hardware than any Diagnostic.  Diagnostics only seem to be
able to find really broken hardware.

> > 0x0000000E error means?  This might help me to narrow down the
> > alternatives.   Any pointers will be appreciated.  Thanks,
> 

This is a kernel panic because a parity error occurred while executing in
the kernel.

> I just went through a round of problems revolving around this error message.
> I wish I could tell you what the problem is, but neither the vendor nor I
> were able to put our fingers on it. The only thing I can tell you about my
> own experience with this error is that it seems to flag a fundamental
> problem with the way the system talks to memory. The system in question had
> 16MB of memory, 8 of which was on a 32-bit expansion card. With the card
> installed, the system would panic. Sometimes immediately, sometimes after
> running OK for hours. With the extra memory removed, it would hum along and
> run forever.

The Unix kernel lives up at the top of memory.  You could probably have moved
SIMMs or DIPs around until you found the problem if it was an offending
component.  It might also be a slow bus transceiver on the motherboard or
memory card.

Find vendors you can work with unless you are very hardware and software
savvy.  You will save time, money and graying of the hair.

-- 
Doug Ingraham (SysAdmin)
Lofty Pursuits (Public Access for Rapid City SD USA)
uunet!loft386!dpi

jon@savant.UUCP (Jon Gefaell) (07/22/90)

In article <9480@bunny.GTE.COM> jdg0@GTE.COM (Jose Diaz-Gonzalez) writes:
>Hi there,
>
>My machine has been crashing about twice daily for the last week or so.
>The msg in the subject line shows up with all the register contents just
>before it crashes.  I have contacted my vendor, they contacted ISC, and
>all they were able to tell me was that the problem is a hardware
>error.  Now, I've run my diagnostics (I'm using an AT&T 6386E/33) and
>everything appears to be OK.  Does anyone have any idea of what a type
>0x0000000E error means?  This might help me to narrow down the
>alternatives.   Any pointers will be appreciated.  Thanks,
>

Well, I get the same panic when I boot the bootable install disk from
my ESIX (rev C) distribution on my AGI 33MHz 386 box. It happens EVERY
time, and there has only been one way to alleviate the problem...

I set my bus speed for 8.33MHz instead of 11MHz, boot the disk, and all
works fine. Then, at the point when it reboots and you start back up with
the second disk of the base package, I set the bus speed back to 11MHz and
never see the error again...

Strange, eh? 

Recap: I get the 0000000000000000000E :) error when my bus speed is 11MHz
and I'm booting the base package installation disk. Lowering the bus speed
to 8.33MHz 'cures' the problem.

Oh, I also got the E panic when I used to have a motherboard (Orchid) that
simply WOULD NOT run ESIX... (returned it for a full refund from Orchid 
direct, it didn't work, but they made it right!)

Hope this has helped a tad... I don't think it's gonna work for you, but
it does seem to point out that: 1.) This error seems to be a catch-all,
and/or 2.) These folks (vendors, manufacturers) can't problem-solve
themselves out of a paper bag from the inside even with all the error
messages and register contents in the world... :(
-- 
+----------- Domain? DOMAIN? We Don't Need No Steeeenkin' Domain! -----------+
| __/\                                                                       |
| \/~~                                                                       |
+-savant!jon@virginia.edu {...}!uunet!virginia!savant!jon jeg7e@virginia.edu-+