[comp.unix.microport] --- A System V/AT crash

fortin@zap.UUCP (Denis Fortin) (05/16/88)

Here is a Microport System V/AT, hardware-related question for all of the
`crash dump' fans out there in Usenet-land...

	Remember a while ago, I was complaining how my Microport System V/AT
would crash when running at 10MHz on my machine?  Well, someone suggested the
following:

> To find the routine in the kernel which caused the panic, you do this:
> 
> 	nm -x /system5 >/tmp/xxxxx  (dump list of kernel to file)
> 
> Now, go looking for the address you panic'd at.  You put the 'cs' and 'ip'
> values together to get this number (code segment & instruction pointer).  
> In this case, you get 0x0208005807.
> 
> Find the routine which has the largest address LESS THAN the panic address.
> This is the routine which was executing when the system crashed.
> [...]
> If the routine is NOT 'rmsd' then please post the name of the routine 
> as it's probably a new one... and might give all us net.gurus some ideas!

Well, after many tribulations, I finally got around to trying my system
at 10MHz for a reasonable length of time *without* the memory card in
it.  It runs much better than before (i.e. no more NMI message and it
doesn't crash after 30 seconds), but it still DOES seem to want to
crash all over.

After running for a while (anywhere between 5 minutes and an hour), I seem
to get a crash dump very similar to the following fairly consistently.

	user=0x10
	cs=0x200 ds=0x220 es=0x220 ss=0x200 di=0x0 si=0x5BE0
	bp=0x37C bx=0x0 dx=0xA1 cx=0x0 ax=0x7 ip=0xEAF flags=0x246
	trap type 0xD
	err = 0x1173
	stack frame address = 2208B6A
	400, 8, 0, FFFF, 0, 0, 0, 3ff, 11, 200
	0, 88, 89e2, 220, 400, a, 0, 200, 0, 0
	0, 400, 11, 200, 0, aa, 8a62, 220, 88, 3
	3f9, 0, 1a9, 6, 3f9, 1, 204, 5, 3f9, 2
	26c, 6, 3f9, 3, 295, 1, 3f9, 4, 2af, 5
	3f9, 5, 2b7, 1, 3f9, 6, 2dc, 2, 3f9, 7
	307, 4, 3fa, 0, 30b, 6, 3fa, 1, 326, 1
	3fa, 2, 350, 0, 3fa, 3, 358, 0, 3fa, 4
	359, 0, 3fa, 5, 38c, 0, 3fa, 6, 392, 4
	3fa, 7, 3a9, 0, 3fb, 0, 3aa, 7, 3fb, 1

So...  I did what was suggested, and according to `nm -x /system5`, the
closest thing to 02000eaf is 02000ea6, and that is "clkint1" in "trap.s".
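
The lookup described above can be sketched in a few lines of Python. This is
only an illustration: the function names and the two "hypothetical" symbols
are made up, and only the clkint1 address and the cs/ip values come from the
post. It treats a panic address equal to a symbol's address as belonging to
that symbol.

```python
import bisect

def panic_address(cs, ip):
    """Join 16-bit cs and ip as the thread does: 0x200, 0xEAF -> 0x02000EAF."""
    return ((cs & 0xFFFF) << 16) | (ip & 0xFFFF)

def nearest_symbol(symbols, addr):
    """Return the (address, name) pair with the largest address <= addr."""
    ordered = sorted(symbols)
    addresses = [a for a, _ in ordered]
    i = bisect.bisect_right(addresses, addr) - 1
    return ordered[i] if i >= 0 else None

# Stand-in for the `nm -x /system5` listing.
SYMBOLS = [
    (0x02000E10, "hypothetical_before"),  # made-up entry
    (0x02000EA6, "clkint1"),              # address reported in the post
    (0x02000F40, "hypothetical_after"),   # made-up entry
]

addr = panic_address(0x200, 0xEAF)
start, name = nearest_symbol(SYMBOLS, addr)
print(f"{addr:08x} is {name}+{addr - start:#x}")  # 02000eaf is clkint1+0x9
```

The offset (9 bytes past clkint1 here) is worth noting along with the name,
since the faulting instruction may sit under a local label that nm does not
list.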

Does this give you net.gurus any idea what might be wrong?  

						Denis, hopeful.

PS. How does one go about looking at the code around "clkint1" in the
    kernel?  sdb?  crash?  adb isn't there...

PPS. Is there any way to force a "crash dump" that might be investigated
     with "crash" afterwards?

--
Denis Fortin
fortin@zap.uucp                         | Real-Time Systems Group
philabs!micomvax!zap!fortin             | CAE Electronics Ltd
fortin%zap.uucp@Larry.McRCIM.McGill.EDU | The opinions expressed above are mine

root@uwspan.UUCP (Sue Peru Sr.) (05/22/88)

+---- fortin@zap.UUCP (Denis Fortin) writes in <445@zap.UUCP> ----
| After running for a while (anywhere between 5 minutes and an hour), I seem
| to get a crash dump very similar to the following fairly consistently.
| 
| 	cs=0x200 ds=0x220 es=0x220 ss=0x200 di=0x0 si=0x5BE0
| 	bp=0x37C bx=0x0 dx=0xA1 cx=0x0 ax=0x7 ip=0xEAF flags=0x246
| 	trap type 0xD
| 
| So...  I did what was suggested, and according to `nm -x /system5`, the
| closest thing to 02000eaf is 02000ea6, and that is "clkint1" in "trap.s".
| 
| Does this give you net.gurus any idea what might be wrong?  

  Not me, but I'll pass it along...

| PS. How does one go about looking at the code around "clkint1" in the
|     kernel?  sdb?  crash?  adb isn't there...

  I haven't tried - it's enuf for me to pass the info on to Microport.

| PPS. Is there any way to force a "crash dump" that might be investigated
|      with "crash" afterwards?

  Nope - this is a major gripe of mine...

| Denis Fortin
| fortin@zap.uucp

  -John
-- 
Comp.Unix.Microport is now unmoderated!  Use at your own risk :-)

manzie@ttl.UUCP (Roddy Manzie) (05/24/88)

If you want to look at a routine within the kernel, the easiest way is to
use 'dis' with the -F option, ie

# dis -F clkint1 /unix

Also (as an aside), when doing nm's on the kernel, you may find it useful
to use:

nm -hxe /unix | sort -t\| +1 -2

This will sort the output by hex address, and makes searching for those elusive
trap addresses far easier.
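
A rough Python equivalent of that sort step, for anyone post-processing a
saved listing. It assumes the pipe-separated layout that the sort command
implies, with the symbol name in the first field and the hex value in the
second. The sample lines and function names are invented for illustration
and are not real nm output.

```python
def parse_nm_line(line):
    """Return (address, name) for a 'name|value|...' line, or None if unparsable."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 2:
        return None
    try:
        address = int(fields[1], 16)  # accepts values with or without a 0x prefix
    except ValueError:
        return None  # header line or a symbol with no value
    return address, fields[0]

def sorted_symbols(lines):
    """Parse every line and sort the result numerically by address."""
    parsed = (parse_nm_line(line) for line in lines)
    return sorted(p for p in parsed if p is not None)

# Invented sample in the assumed layout; only clkint1's address is from the thread.
SAMPLE = [
    "Name                |Value       |Class",
    "hypothetical_after  |0x02000f40  |extern",
    "clkint1             |0x02000ea6  |extern",
    "hypothetical_before |0x02000e10  |extern",
]

for address, name in sorted_symbols(SAMPLE):
    print(f"{address:08x} {name}")
```

Sorting on the parsed integer rather than on the text avoids surprises if
the addresses are not all padded to the same width.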

As to the bug - I haven't seen it before, but it didn't seem to be in 1.3.8
which we run at 12.5 MHz with no bugs.

Roderick Manzie
Trading Technology Ltd.

ncegeber@ndsuvax.UUCP (Roger Egeberg) (05/25/88)

I missed replying to the original message, but I have had a similar experience.
In fact, my system crashed at exactly the same place as the original poster's.
 
I 'talked' to Microport about it through their BBS.  It seems that several
of their customers have had this problem, as well as one of their own people.
They believe it happens to only a few machines, and is not brand-specific.  In
fact the machine they had trouble with was a real IBM-AT.  Their 'solution'
was to convince IBM to give them a new motherboard.

By the way, the crash is actually happening at a label called 'idle', which is
a few bytes past 'clkint1'.  When I disassembled the code there, I found out
that it had just 'turned on' all of the interrupts (in the 8259A's) and then
executed (or was about to) a HALT instruction.  It seems to me that it must
be a problem with some component (such as the 8259's) and one should be able
to eliminate it by exchanging the components, instead of obtaining a new
motherboard.
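
One detail in the original dump fits this reading. Below is a small sketch
that decodes the flags word (flags=0x246) using the standard x86 FLAGS bit
positions. The table and function names are made up for this example.

```python
# Standard x86 FLAGS bit positions (the low 12 bits are all that matter here).
FLAG_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF",
}

def decode_flags(value):
    """Return the names of the flag bits that are set in value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if value & (1 << bit)]

print(decode_flags(0x246))  # ['PF', 'ZF', 'IF']
```

IF being set means maskable interrupts were enabled when the trap was taken,
which is consistent with code that had just enabled interrupts before
halting. It does not by itself say which component is at fault.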

I'm going to try some things soon, if anyone has any ideas about this I'd sure
be interested in hearing them.
-- 
Roger Egeberg                 USENET:    ncegeber@ndsuvax.UUCP
NDSU Extension Service        BITNET:    nu062423@ndsuvm1.BITNET