page@ulowell.UUCP (Bob Page) (11/01/86)
[This message is somewhat long in the hope that I can provide enough information to get a resolution --Bob]

Our VAX 11/785 (Ultrix 1.2) crashed the other day, with the message:

    machine check 0: cp read timeout or error confirmation fault
    [with 14 more lines of data not reproduced here]
    panic: mchk

The message was printed twice (some of the values were different the second time), then a dump was done (successfully) and the machine rebooted itself. Some files were corrupted & corrected by fsck, until it couldn't do any more. An operator ran fsck by hand and fixed the bad/dup inodes, rebooted and things came back up fine & dandy.

A while later (could have been days, we weren't spitting dates at the console every n minutes) another message:

    uba0: uba error sr=10(IVMR) fmer=71 fubar=772150 [ honest, fubar! --BP ]
    uqssp0 being reset [ uqssp0 ? -- BP ]
    uda50a0: hard error, 100040 0 0 a0100 c9424ec5 1060081 b0005 0 0 0 80065d08 0 0 0 0

which appeared five times, each time exactly the same, followed by two more 'machine check 0: cp read timeout or error confirmation fault' messages as described above, followed by another dump. (Note that the 1's could be 'l's; the LA100 console prints them identically. Also note that 772150 is the address of the uda.)

Upon reboot, the root file system was partially trashed. LOTS of bad/dup inodes, incorrect block counts, etc. We were able to bring up a crippled system and were doing a level 0 dump of the root to make sure we didn't lose anything else when - it crashed again (same error messages as above). After lots of prodding it would come back up and then crash again. This went on about six times. Once, just after the device information, it went down again with:

    panic: sbi0flt

and another dump. Other times it might mchk after printing the device information.

Now it won't finish booting at all.
I can specify ra(0,0)vmunix or our backup vmunix and it will boot them OK, but the machine hangs without a message someplace between autoconfiguring the devices (it gets them all OK) and printing the message "Automatic reboot in progress..." I never see the message. I doubt it's the /etc/rc file because it hangs even when I boot ANY at the `>>>' prompt. The CPU run light stays on.

It's possible a special file in /dev was trashed - we saw some directories (like /usr) that had magically turned into normal files, and also saw /dev/rmt0 turn into a normal file (which sure didn't help us with the level 0 dump! We flooded the root partition!)

So, I have a 785 that won't boot, and don't know why. DEC FS has been in and run every diagnostic, replaced boards, etc, all to no avail. The hardware is clean, according to DEC.

One suggestion DEC made is that Ultrix can't handle our 24-line DMZ's DMA, and that's why it was hanging. I don't buy that, but if I get the system back up I'll turn off the DMA and see what that gets me. I can't shut off the DMA until I get the system back up anyway.

----------------------------------

So, I don't know what to do, short of rebuilding the kernel and restoring the root file system from the Ultrix distribution tape. I'd rather not do that. I'm about to restore a mini-root on the swap partition and look at the ra0 partition, hopefully to discover some critical file that must be restored/fixed before I can reboot.

Even if I am successful in rebooting the system, I don't know how to interpret all the mchk data (I don't have Ultrix source and unfortunately can't put BSD on it for various reasons), so I can't be sure why it's crashing or how to prevent it. Surely, 'cp read timeout or error confirmation fault' errors look like hardware problems.

Any help would be greatly welcomed, acknowledged and appreciated.

..Bob

PS The devices are all DEC:

VAX 11/785, serial no. 2079, hardware level =16
mcr0 (MS780-E) at address 0x20002000, 14Mbytes, internal interleave
uba0 at address 0x20006000
uda0 at uba0 csr 172150 vec 744, ipl 15
ra0 at uda0 slave 0 [ra81 - system disk]
tmscp0 at uba0 csr 174500 vec 770, ipl 15
tms0 at tmscp0 slave 0 [tu81]
lp0 at uba0 csr 177514 vec 200, ipl 14 [lp27]
uba1 at address 0x20008000
uda1 at uba1 csr 172150 vec 774, ipl 15
ra1 at uda1 slave 1 [ra60]
ra2 at uda1 slave 2 [ra60]
de1 at uba1 csr 174510 vec 120, ipl 15 [deuna]

[end of article]
--
UUCP: wanginst!ulowell!page      Bob Page, U of Lowell CS Dept
VOX: +1 617 452 5000 x2976       Lowell MA 01854 USA
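
A note on the two addresses in the post above: the autoconfiguration listing shows the uda0 CSR as 172150, while the uba error reports fubar=772150. Both are octal. The first is the 16-bit form of an I/O-page address and the second is the same register as an 18-bit Unibus address. A minimal sketch in Python of that mapping follows; the helper name `unibus_18bit` is hypothetical, and the constants follow the usual convention that the Unibus I/O page occupies the top 8 Kbytes of the 18-bit address space.

```python
# Sketch only: relates a 16-bit CSR (as printed at autoconfiguration)
# to the 18-bit Unibus address (as printed in the uba error message).
IO_PAGE_BASE = 0o760000   # start of the I/O page in 18-bit Unibus space
IO_PAGE_MASK = 0o017777   # low 13 bits pick a register within the I/O page

def unibus_18bit(csr16):
    """Map a 16-bit I/O-page address to its 18-bit form (hypothetical helper)."""
    return IO_PAGE_BASE | (csr16 & IO_PAGE_MASK)

print(oct(unibus_18bit(0o172150)))  # 0o772150
```

So a fubar of 772150 points at the UDA50's own register, which is consistent with the note that 772150 is the address of the uda.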
amos@instable.UUCP (Amos Shapir) (11/02/86)
We had something similar on a 785 with 4.3 and 2 eagles; in our case the mchk (different messages, though) was due to a memory controller bug not caught by DEC's diagnostics (try moving memory boards around to other controllers to prove it!).

The mem bug might have caused a bad DMA transfer, and your uda just hit a bad block at a soft spot in the file system; as it does its own bad block forwarding, you may have ended up with a corrupted file system.

WARNING: do not attempt a full restore off a tape created on a corrupted file system! Use it only for individual files, and even then *very* cautiously - bad blocks appear as 'holes' in the dump, which restore ignores, thus getting wrong blocks into files.

I hope this may help (or at least encourage) you in your misery. Good Luck - you'll need it
--
Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel
(01-972) 52-522261
amos%nsta@nsc
34.48'E 32.10'N
chris@umcp-cs.UUCP (Chris Torek) (11/03/86)
In article <707@ulowell.UUCP> page@ulowell.UUCP (Bob Page) writes:
>Our VAX 11/785 (Ultrix 1.2) crashed the other day, with the message:
>
>machine check 0: cp read timeout or error confirmation fault

A `cp read timeout' is an SBI fault (usually?). Look at the value printed for `sbifs' (the SBI Fault Signal register), and get a copy of the `red book' (the VAX Hardware Handbook, not the Red Book of Westmarch).

>... Some files were corrupted & corrected by fsck, until
>it couldn't do any more. An operator ran fsck by hand and fixed
>the bad/dup inodes, rebooted and things came back up fine & dandy.
>
>A while later (could have been days, we weren't spitting dates at
>the console every n minutes) another message:
>
>uba0: uba error sr=10(IVMR) fmer=71 fubar=772150 [ honest, fubar! --BP ]

FUBAR stands for Failed UniBus Address Register. No doubt the DEC engineer who ... `engineered?' ... that one is still chortling. IVMR stands for InValid Map Register, which is either a bug in the driver (probably not) or a hardware problem in the Unibus adapter (probably).

>uqssp0 being reset [ uqssp0 ? -- BP ]

Never heard of a uqssp. Probably a typographic error in the kernel; `uba0' makes more sense here.

>uda50a0: hard error, [numbers]

Uh oh.

>... followed by two more 'machine check 0: cp read timeout or error
>confirmation fault' messages.... Also note that 772150 is the
>address of the uda).

I sense a hardware problem in the adapter, similar to the one we had recently. It invariably escapes DEC diagnostics.

>[more troubles]
>
>Now it won't finish booting at all.

Most likely a scrobiculate root file system. (That *is* a word, even if I bent its meaning a bit. Look it up!)

>DEC FS has been in and run every diagnostic, replaced boards, etc,
>all to no avail. The hardware is clean, according to DEC.

Hah.

>One suggestion DEC made is that Ultrix can't handle our 24-line
>DMZ's DMA, and that's why it was hanging.

Probably not.
The DMZ does have a peculiar tendency to drop master sync, confusing the adapter, and it is safest to keep DMZs away from other DMA devices, at least until DEC comes out with revised ROMs. But the symptoms do not match.
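
For readers following the register decoding above, here is a rough Python sketch of how a status value such as sr=10 turns into a name like IVMR. Only the pairing of hex 10 with IVMR comes from the console message quoted in this thread. The remaining bit names and positions are an assumption, recalled from the 4.3BSD Unibus adapter header, and should be checked against the VAX Hardware Handbook before being relied on. The function name `decode_uba_sr` is hypothetical.

```python
# Sketch only. ASSUMPTION: apart from IVMR = 0x10 (shown in the thread's
# console output), these bit names and positions are recalled from the
# 4.3BSD UBA register definitions and are not verified here.
UBASR_BITS = [
    (0x400, "RDTO"),
    (0x200, "RDS"),
    (0x100, "CRD"),
    (0x080, "CXTER"),
    (0x040, "CXTMO"),
    (0x020, "DPPE"),
    (0x010, "IVMR"),      # invalid map register
    (0x008, "MRPF"),
    (0x004, "LEB"),
    (0x002, "UBSTO"),
    (0x001, "UBSSYNTO"),
]

def decode_uba_sr(sr):
    """Return the names of the bits set in a UBA status value (hypothetical helper)."""
    return [name for bit, name in UBASR_BITS if sr & bit]

print(decode_uba_sr(0x10))  # ['IVMR']
```

The kernel prints sr in hex, so the `sr=10` in the message is 0x10, a single bit, which matches the lone IVMR in parentheses.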