[net.unix-wizards] Help with 4.3bsd kernel problem needed

news@pucc-j (Usenet news) (04/01/86)

We recently had occasion to investigate a recurring problem in our
kernel; we experienced crashes (about one per day) on one of our vaxen,
with only the message:

panic: smr 0x0666 m_err_in

Oddly, the cpu didn't halt or reboot.  There was still disk activity
and the machine's indicator lights showed the cpu was running, but the
only way we could regain control of the machine was through the lsi-11
halt command.

Using "adb -k", we began to plow through the various core dumps we had
lying about.  We were surprised to find a lot of symbols whose origin
could not be clearly traced back to the kernel sources; along
with the usual "fork()" and "kill()", things such as "horn()", and
"tail()" and "clovn()" started showing up.

[At this point, we were distracted by a bizarre hardware failure; one
of our 9766 disks threw the platters right out the side of the drive,
narrowly missing several programmers and slightly nicking the arm of a
DEC field circus representative.  Although he received immediate
medical attention, the wound became infected and gangrenous and may
require amputation.  We repaired the drive and formatted a new pack;
the problem hasn't recurred, but occasionally we notice a peculiar
sulphurous odor emanating from the drive.]

Anyhow, we ran "nm" over /vmunix, and got things like:

00001384 T __cleanup
00000a74 T __doprnt
00006660 T __exit
0000084c T __filbuf
000011e0 T __possess
00001950 D __iob
00001ae0 D __lastbuf
00001b44 B __siflame
00003b44 B __sobuf
0006667c T _atoi
0000067c T _destruct
00001834 D _baud
000016e0 T _daemon
00001b1c B _charct
00066674 T _close
00001838 D _twis_sis

Last night the machine crashed again, same symptoms.  An attempted reboot
from the lsi-11 failed, with no response to 'HALT'.  We advised our operators
to power it off and back on and reboot, but when the operator tried to turn
the key switch, he received a shock that threw him back against a disk drive.
It's been down since then and we're waiting on the DEC field circus to show up.

We are totally mystified; any help that you can lend us would be great.

rsk/jms/dls/gh3

jso@edison.UUCP (John Owens) (04/08/86)

> We recently had occasion to investigate a recurring problem in our
> kernel; we experienced crashes (about one per day) on one of our vaxen,
> with only the message:
> 
> panic: smr 0x0666 m_err_in
> 

Amazing.  The problem seems to have spread to your news system;
interesting that your article number in decimal has the same special
properties as the smr number in hex....

	-John Owens
	jso@edison.UUCP

jsdy@hadron.UUCP (Joseph S. D. Yao) (04/09/86)

In article <666@pucc-s> news@pucc-j (Usenet news) writes:
>Date: 1 Apr 86 16:23:35
>We recently had occasion to investigate a recurring problem in our
>kernel; we experienced crashes (about one per day) on one of our vaxen,
>with only the message:
>
>panic: smr 0x0666 m_err_in
>	...
>Using "adb -k", we began to plow through the various core dumps we had
>lying about.  We were surprised to find a lot of symbols whose origin
>could not be clearly traced back to the kernel sources; along
>with the usual "fork()" and "kill()", things such as "horn()", and
>"tail()" and "clovn()" started showing up.

[ et cute cetera ]

Obviously, you're not running the correct diagnostics.  I have a tape
labelled EXOR-11: I believe it is a system exorciser.  Perhaps DEC has
one for the VAX as well?  For this procedure, of course, you'll need
a consecrated Host.  This should leave you with all virtuous memory,
and no trace of zombie processes.

"Computrem benedico in nomine Patris et Filii et Spiritus Sancti."