news@pucc-j (Usenet news) (04/01/86)
We recently had occasion to investigate a recurring problem in our kernel; we experienced crashes (about one per day) on one of our vaxen, with only the message:

    panic: smr 0x0666 m_err_in

Oddly, the cpu didn't halt or reboot. There was still disk activity and the machine's indicator lights showed the cpu was running, but the only way we could regain control of the machine was through the lsi-11 halt command.

Using "adb -k", we began to plow through the various core dumps we had lying about. We were surprised to find a lot of symbols whose origin could not be clearly traced back to the kernel sources; along with the usual "fork()" and "kill()", things such as "horn()", "tail()", and "clovn()" started showing up.

[At this point, we were distracted by a bizarre hardware failure; one of our 9766 disks threw the platters right out the side of the drive, narrowly missing several programmers and slightly nicking the arm of a DEC field circus representative. Although he received immediate medical attention, the wound became infected and gangrenous and may require amputation. We repaired the drive and formatted a new pack; the problem hasn't recurred, but occasionally we notice a peculiar sulphurous odor emanating from the drive.]

Anyhow, we ran "nm" over /vmunix, and got things like:

    00001384 T __cleanup
    00000a74 T __doprnt
    00006660 T __exit
    0000084c T __filbuf
    000011e0 T __possess
    00001950 D __iob
    00001ae0 D __lastbuf
    00001b44 B __siflame
    00003b44 B __sobuf
    0006667c T _atoi
    0000067c T _destruct
    00001834 D _baud
    000016e0 T _daemon
    00001b1c B _charct
    00066674 T _close
    00001838 D _twis_sis

Last night the machine crashed again, with the same symptoms. An attempted reboot from the lsi-11 failed, with no response to 'HALT'. We advised our operators to power it off, power it back on, and reboot, but when the operator tried to turn the key switch, he received a shock that threw him back against a disk drive. It's been down since then and we're waiting on the DEC field circus to show up.
We are totally mystified; any help that you can lend us would be great.

rsk/jms/dls/gh3
jso@edison.UUCP (John Owens) (04/08/86)
> We recently had occasion to investigate a recurring problem in our
> kernel; we experienced crashes (about one per day) on one of our vaxen,
> with only the message:
>
> panic: smr 0x0666 m_err_in
>

Amazing. The problem seems to have spread to your news system; interesting that your article number in decimal has the same special properties as the smr number in hex....

-John Owens
jso@edison.UUCP
jsdy@hadron.UUCP (Joseph S. D. Yao) (04/09/86)
In article <666@pucc-s> news@pucc-j (Usenet news) writes:
>Date: 1 Apr 86 16:23:35
>We recently had occasion to investigate a recurring problem in our
>kernel; we experienced crashes (about one per day) on one of our vaxen,
>with only the message:
>
>panic: smr 0x0666 m_err_in
> ...
>Using "adb -k", we began to plow through the various core dumps we had
>lying about. We were surprised to find a lot of symbols whose origin
>could not be clearly traced back to the kernel sources; along
>with the usual "fork()" and "kill()", things such as "horn()", and
>"tail()" and "clovn()" started showing up.
[ et cute cetera ]

Obviously, you're not running the correct diagnostics. I have a tape labelled EXOR-11: I believe it is a system exorciser. Perhaps DEC has one for the VAX as well? For this procedure, of course, you'll need a consecrated Host. This should leave you with all virtuous memory, and no trace of zombie processes.

"Computrem benedico in nomine Patris et Filii et Spiritus Sancti."