[comp.sys.sun] System crashes

kevin@uunet.uu.net (Kevin Kelleher) (03/01/89)

I have a 3/280 running 4.0.1 and have had two system crashes in the last
two days during periods of heavy NFS activity (there are 7 3/60s connected
to it).  I am hoping that someone can make sense of the panic messages
below.  I have a call in to Sun about it, but we all know how long that can
take.

First crash:
    diff: 
    trap address 0x8, pid 29905, pc = f027448, sr = 2004, stkfmt b, context 1
    Bus Error Reg 80<INVALID>
    data fault address c faultc 0 faultb 0 dfault 1 rw 1 size 0 fcode 5
    KERNEL MODE
    page map 0 pmgrp aa
    D0-D7  0 f 4 0 0 0 100 20
    A0-A7  f117ce8 f117ce8 f117ce8 0 f0b2cc0 f2922cc ffff96b0 ffff9684
    Begin traceback...fp = ffff96b0, sp = ffff9684
    Called from f045d72, fp=ffff96d8, args=ffff9738 1 0 1
    Called from f03da06, fp=ffff973c, args=ffff9738 1 f117560 0
    Called from f03c85c, fp=ffff9768, args=21cdc 0 1 0
    Called from f03c7da, fp=ffff9780, args=21cdc 1 0 ffff97b4
    Called from f067cec, fp=ffff97a8, args=ffff9a18 28c54 281ac 0
    Called from f004768, fp=efffb90, args=5 220f1 fffffffb 4
    End traceback...
    panic: Bus error
    syncing file systems... 
Hung at this point....

Second crash (about 26 hours later):
    syslogd: 
    trap address 0x8, pid 96, pc = f031e4a, sr = 2004, stkfmt b, context 1
    Bus Error Reg 80<INVALID>
    data fault address 0 faultc 0 faultb 0 dfault 1 rw 1 size 0 fcode 5
    KERNEL MODE
    page map 0 pmgrp 28
    D0-D7  0 f19ba37 1 0 0 0 2 f112004
    A0-A7  f0a34e8 f19ba38 0 0 f0b2020 f17ad0c ffff95f8 ffff95e8
    Begin traceback...fp = ffff95f8, sp = ffff95e8
    Called from f028a38, fp=ffff9628, args=f17ad0c f0b2020 38 ffff976c
    Called from f05c9ca, fp=ffff9638, args=f117ce8 ffff976c ffff9684 f0461da
    Called from f0461da, fp=ffff9684, args=0 ffff976c 0 2
    Called from f03b7e4, fp=ffff96b0, args=f1e11c4 ffff976c 1 2
    Called from f02de10, fp=ffff96d8, args=f0e3574 1 ffff976c effffdc
    Called from f02dd20, fp=ffff9780, args=ffff976c 1 ffff97b4 effe29f
    Called from f067cec, fp=ffff97a8, args=ffff9a18 effe250 222f0 0
    Called from f004768, fp=effe268, args=79 d8 ffffffdd 1
    End traceback...
    panic: Bus error
    syncing file systems... [32] 6 [32] [28] [21] [10] done

    dumping to vp f117c74, offset 50648

I kept copies of the core dumps if anyone can tell me how to glean useful
information out of them.

Much thanks

Kevin Kelleher				{uunet,pyramid,altos}!xilinx!kevin
Xilinx Inc.				(408) 559-7778 x269
2069 Hamilton Ave
San Jose CA 95125

[[ Don't rule out a hardware problem.  This could, possibly, be caused by
bad memory (but that's just a guess).  --wnl ]]
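
One thing that can be read off the two panic messages without the core
dumps is which return addresses the tracebacks share.  The sketch below is
only an illustration: the helper name `return_addresses` is made up here,
and the traceback text is copied from the report above.  What the
addresses correspond to cannot be known without the kernel's symbol table,
so the comparison says nothing about which routines are involved.

```python
import re

# "Called from" lines copied from the two panic messages in the report.
FIRST_CRASH = """\
Called from f045d72, fp=ffff96d8, args=ffff9738 1 0 1
Called from f03da06, fp=ffff973c, args=ffff9738 1 f117560 0
Called from f03c85c, fp=ffff9768, args=21cdc 0 1 0
Called from f03c7da, fp=ffff9780, args=21cdc 1 0 ffff97b4
Called from f067cec, fp=ffff97a8, args=ffff9a18 28c54 281ac 0
Called from f004768, fp=efffb90, args=5 220f1 fffffffb 4
"""

SECOND_CRASH = """\
Called from f028a38, fp=ffff9628, args=f17ad0c f0b2020 38 ffff976c
Called from f05c9ca, fp=ffff9638, args=f117ce8 ffff976c ffff9684 f0461da
Called from f0461da, fp=ffff9684, args=0 ffff976c 0 2
Called from f03b7e4, fp=ffff96b0, args=f1e11c4 ffff976c 1 2
Called from f02de10, fp=ffff96d8, args=f0e3574 1 ffff976c effffdc
Called from f02dd20, fp=ffff9780, args=ffff976c 1 ffff97b4 effe29f
Called from f067cec, fp=ffff97a8, args=ffff9a18 effe250 222f0 0
Called from f004768, fp=effe268, args=79 d8 ffffffdd 1
"""


def return_addresses(traceback):
    """Return the hex addresses after 'Called from', innermost frame first."""
    return [int(addr, 16)
            for addr in re.findall(r"Called from ([0-9a-f]+),", traceback)]


first = return_addresses(FIRST_CRASH)
second = return_addresses(SECOND_CRASH)

# Addresses that appear in both tracebacks, in the first crash's order.
common = [addr for addr in first if addr in second]
print([format(addr, "x") for addr in common])  # ['f067cec', 'f004768']
```

Only the two outermost frames match.  That is consistent with both faults
happening beneath a common kernel entry path and then diverging, but it is
an inference from the addresses alone.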

chris@uunet.uu.net (Chris Brown) (03/16/89)

Reference: Kevin Kelleher's query in v7n173

>  diff:
>  trap address 0x8, pid 29905, pc = f027448, sr = 2004, stkfmt b, context 1

>  syslogd:
>  trap address 0x8, pid 96, pc = f031e4a, sr = 2004, stkfmt b, context 1

We saw similar messages whilst we were debugging a device driver.  The
problem was a bus error occurring during an interrupt service routine,
caused by a device interrupting when it wasn't expected to: when the
routine tried to use some pointers into what had been (once upon a time)
a user's data area, it found garbage addresses.

I think the program name (`diff' and `syslogd' in the report given) refers
to the process which was executing at the time of the error.  If the error
is really in an interrupt routine, the name of the current process is a
complete red herring. The fact that it's different in these two cases is
what makes me think your error is an interrupt problem. Have you installed
a new device or device driver recently?
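
To make the comparison of the two trap headers concrete, here is a small
sketch that parses both lines and lists the fields that differ.  The
regular expression and the name `parse_trap` are assumptions written for
this illustration; the field layout is taken only from the two lines
quoted above and may not cover other panic formats.

```python
import re

# Pattern written against the two quoted trap lines only.
TRAP_RE = re.compile(
    r"trap address (?P<vector>0x[0-9a-f]+), pid (?P<pid>\d+), "
    r"pc = (?P<pc>[0-9a-f]+), sr = (?P<sr>[0-9a-f]+), "
    r"stkfmt (?P<stkfmt>[0-9a-f]+), context (?P<context>\d+)"
)


def parse_trap(process, line):
    """Parse one trap header into a dict; 'process' is the name printed before it."""
    match = TRAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognised trap line: %r" % line)
    return {
        "process": process,
        "vector": int(match.group("vector"), 16),
        "pid": int(match.group("pid")),
        "pc": int(match.group("pc"), 16),
        "sr": int(match.group("sr"), 16),
        "stkfmt": match.group("stkfmt"),
        "context": int(match.group("context")),
    }


first = parse_trap(
    "diff",
    "trap address 0x8, pid 29905, pc = f027448, sr = 2004, stkfmt b, context 1",
)
second = parse_trap(
    "syslogd",
    "trap address 0x8, pid 96, pc = f031e4a, sr = 2004, stkfmt b, context 1",
)

# Fields whose values differ between the two crashes.
differing = sorted(key for key in first if first[key] != second[key])
print(differing)  # ['pc', 'pid', 'process']
```

The process name, pid and pc differ, while the trap address, sr, stack
format and context are the same in both crashes.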

Sorry this isn't very explicit, but it's the best I can do!  Hope it
helps.

Chris Brown, A.I. Vision Research Unit, Sheffield University 
(chris@aivru.sheffield.ac.uk)

guy@uunet.uu.net (Guy Harris) (03/31/89)

 >I think the program name (`diff' and `syslogd' in the report given) refers
 >to the process which was executing at the time of the error.

This is entirely correct.

 >If the error is really in an interrupt routine, the name of the current
 >process is a complete red herring.

Precisely.
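
One further field that bears on the interrupt question is `sr`, which both
panic messages report as 2004.  The sketch below decodes it, assuming the
standard MC68020 status register layout (the 3/280 is a 68020 machine);
the function name `decode_sr` is invented for this example.

```python
def decode_sr(sr):
    """Split a 68020 status register value into its documented fields."""
    return {
        "trace": (sr >> 14) & 0x3,           # T1/T0 trace bits
        "supervisor": bool(sr & 0x2000),     # S bit: supervisor state
        "master": bool(sr & 0x1000),         # M bit: master/interrupt stack
        "interrupt_mask": (sr >> 8) & 0x7,   # I2-I0 priority mask
        "ccr": sr & 0x1F,                    # X N Z V C condition codes
    }


decoded = decode_sr(0x2004)
print(decoded)
# {'trace': 0, 'supervisor': True, 'master': False,
#  'interrupt_mask': 0, 'ccr': 4}
```

The supervisor bit is set, which agrees with the "KERNEL MODE" line, and
the interrupt priority mask is 0.  The 68020 raises the mask to the level
of the interrupt it accepts, so a mask of 0 at the time of the fault is
worth weighing against the interrupt-routine theory.  A handler can lower
the mask itself, though, so this is a hint and not proof.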