[net.unix-wizards] VAX 785 won't fully boot

page@ulowell.UUCP (Bob Page) (11/01/86)

[This message is somewhat long in the hope that I can provide
 enough information to get a resolution --Bob]

Our VAX 11/785 (Ultrix 1.2) crashed the other day, with the message:

machine check 0: cp read timeout or error confirmation fault
	[with 14 more lines of data not reproduced here]
panic: mchk

The message was printed twice (some of the values were different
the second time) then a dump was done (successfully) and the machine
rebooted itself.  Some files were corrupted & corrected by fsck, until
it couldn't do any more.  An operator ran fsck by hand and fixed
the bad/dup inodes, rebooted and things came back up fine & dandy.

A while later (could have been days, we weren't spitting dates at
the console every n minutes) another message:

uba0: uba error sr=10(IVMR) fmer=71 fubar=772150	[ honest, fubar! --BP ]
uqssp0 being reset					[ uqssp0 ? -- BP ]
uda50a0: hard error, 100040 0 0 a0100 c9424ec5 1060081 b0005 0 0 0 80065d08 0 0 0 0

which appeared five times, each time exactly the same, followed by two
more 'machine check 0: cp read timeout or error confirmation fault'
messages as described above, followed by another dump.  (Note that
the 1's could be 'l's; the LA100 console prints them identically.
Also note that 772150 is the address of the uda.)
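
The match between the fubar value (772150) and the uda csr printed at
autoconfiguration (172150) can be checked with a little octal arithmetic.
What follows is only a minimal sketch, assuming the usual convention that a
16-bit I/O-page address becomes an 18-bit Unibus address by setting the top
two address bits; the function name is made up for illustration.

```python
def unibus_18bit(csr16):
    """Extend a 16-bit I/O-page address to an 18-bit Unibus address.

    Assumption: the device register lives in the I/O page (160000-177777
    octal), so Unibus address bits 17 and 16 are both set.
    """
    if not 0o160000 <= csr16 <= 0o177777:
        raise ValueError("not an I/O-page address")
    return csr16 | 0o600000


uda_csr = 0o172150                    # "uda0 at uba0 csr 172150"
print(oct(unibus_18bit(uda_csr)))     # prints 0o772150
```

Under that assumption the failed address is the uda's own register, which
agrees with the note above.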

Upon reboot, the root file system was partially trashed.  LOTS of
bad/dup inodes, incorrect block counts, etc.  We were able to bring
up a crippled system and were doing a level 0 dump of the root
to make sure we didn't lose anything else when - it crashed again
(same error messages as above).  After lots of prodding it would
come back up and then crash again.  This went on about six times.
Once, just after the device information, it went down again with:
	panic: sbi0flt
and another dump.  Other times it might mchk after printing the
device information.

Now it won't finish booting at all.  I can specify ra(0,0)vmunix
or our backup vmunix and it will boot them OK, but the machine
hangs without a message someplace between autoconfiguring the
devices (it gets them all OK) and printing the message "Automatic
reboot in progress..."  I never see the message.  I doubt it's
the /etc/rc file because it hangs even when I boot ANY at the
`>>>' prompt.  The CPU run light stays on.

It's possible a special file in /dev was trashed - we saw some
directories (like /usr) that had magically turned into normal files,
and also saw /dev/rmt0 turn into a normal file (which sure didn't
help us with the level 0 dump!  We flooded the root partition!)
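
One way to look for device nodes that have turned into normal files, once
the damaged root can be mounted from somewhere else, is to list every plain
regular file in its /dev directory. This is a rough sketch only: the function
name is hypothetical, it assumes a system with Python available to inspect
the mounted file system, and a few regular files (a MAKEDEV script, for
instance) may legitimately live in /dev.

```python
import os
import stat


def regular_files_in(dev_dir="/dev"):
    """Return the names directly under dev_dir that are plain regular files."""
    suspects = []
    for name in sorted(os.listdir(dev_dir)):
        # lstat, not stat: we care about the entry itself, not a link target.
        mode = os.lstat(os.path.join(dev_dir, name)).st_mode
        if stat.S_ISREG(mode):
            suspects.append(name)
    return suspects


if __name__ == "__main__" and os.path.isdir("/dev"):
    for name in regular_files_in("/dev"):
        print("regular file in /dev:", name)
```

Any tape or disk name that shows up in the output (an rmt0, say) would be a
candidate for re-creation with mknod.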

So, I have a 785 that won't boot, and don't know why.  DEC FS has
been in and run every diagnostic, replaced boards, etc, all to
no avail.  The hardware is clean, according to DEC.

One suggestion DEC made is that Ultrix can't handle our 24-line
DMZ's DMA, and that's why it was hanging.  I don't buy that, but
if I get the system back up I'll turn off the DMA and see what
that gets me.  I can't shut off the DMA until I get the system
back up anyway.

----------------------------------

So, I don't know what to do, short of rebuilding the kernel and
restoring the root file system from the Ultrix distribution tape.
I'd rather not do that.  I'm about to restore a mini-root on
the swap partition and look at the ra0 partition, hopefully to
discover some critical file that must be restored/fixed before I
can reboot.

Even if I am successful in rebooting the system, I don't know
how to interpret all the mchk data (I don't have Ultrix source
and unfortunately can't put BSD on it for various reasons), so
I can't be sure why it's crashing or how to prevent it.  Surely,
'cp read timeout or error confirmation fault' errors look like
hardware problems.

Any help would be greatly welcomed, acknowledged and appreciated.

..Bob

PS The devices are all DEC:

VAX 11/785, serial no. 2079, hardware level =16
mcr0 (MS780-E) at address 0x20002000, 14Mbytes, internal interleave
uba0 at address 0x20006000
uda0 at uba0 csr 172150 vec 744, ipl 15
ra0 at uda0 slave 0				[ra81 - system disk]
tmscp0 at uba0 csr 174500 vec 770, ipl 15
tms0 at tmscp0 slave 0				[tu81]
lp0 at uba0 csr 177514 vec 200, ipl 14		[lp27]
uba1 at address 0x20008000
uda1 at uba1 csr 172150 vec 774, ipl 15
ra1 at uda1 slave 1				[ra60]
ra2 at uda1 slave 2				[ra60]
de1 at uba1 csr 174510 vec 120, ipl 15		[deuna]

[end of article]
-- 
UUCP: wanginst!ulowell!page	Bob Page, U of Lowell CS Dept
VOX:  +1 617 452 5000 x2976	Lowell MA 01854 USA

amos@instable.UUCP (Amos Shapir) (11/02/86)

We had something similar on a 785 with 4.3 and 2 eagles; the
mchk was in our case (different messages though) due to a memory controller
bug not caught by DEC's diagnostics (try moving memory boards around to
other controllers to prove it!). The mem bug might have caused a bad dma
transfer, and your uda just hit a bad block at a soft spot in the file system;
as it does its own bad block forwarding, you may have ended up with a corrupted
file system.
WARNING: do not attempt a full restore off a tape created on a corrupted
file system! Use it only for individual files, and even then *very* cautiously -
bad blocks appear as 'holes' in the dump, which restore ignores, thus
getting wrong blocks into files.
I hope this may help (or at least encourage) you in your misery.
	Good Luck - you'll need it
-- 
	Amos Shapir

National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel
(01-972) 52-522261  amos%nsta@nsc
34.48'E 32.10'N

chris@umcp-cs.UUCP (Chris Torek) (11/03/86)

In article <707@ulowell.UUCP> page@ulowell.UUCP (Bob Page) writes:
>Our VAX 11/785 (Ultrix 1.2) crashed the other day, with the message:
>
>machine check 0: cp read timeout or error confirmation fault

A `cp read timeout' is an SBI fault (usually?).  Look at the value
printed for `sbifs' (the SBI Fault Signal register), and get a copy
of the `red book' (the Vax Hardware Handbook, not the Red Book of
Westmarch).

>...  Some files were corrupted & corrected by fsck, until
>it couldn't do any more.  An operator ran fsck by hand and fixed
>the bad/dup inodes, rebooted and things came back up fine & dandy.
>
>A while later (could have been days, we weren't spitting dates at
>the console every n minutes) another message:
>
>uba0: uba error sr=10(IVMR) fmer=71 fubar=772150	[ honest, fubar! --BP ]

FUBAR stands for Failed UniBus Address Register.  No doubt the
DEC engineer who ... `engineered?' ... that one is still chortling.
IVMR stands for InValid Map Register, which is either a bug in the
driver (probably not) or a hardware problem in the Unibus adapter
(probably).
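
The `sr=10(IVMR)' part of the message is just the status register in hex
followed by the names of the bits that are set. A small sketch of that
decoding follows. Only IVMR = 0x10 is confirmed by the console message
itself; the other bit names and positions are assumptions recalled from the
4.3BSD ubareg.h header and have not been checked against Ultrix.

```python
# (mask, name) pairs for the UBA status register.
# ASSUMPTION: everything except IVMR = 0x10 is recalled from BSD headers.
UBASR_BITS = [
    (0x400, "RDTO"),      # Unibus-to-SBI read data timeout
    (0x200, "RDS"),       # read data substitute
    (0x100, "CRD"),       # corrected read data
    (0x080, "CXTER"),     # command transmit error
    (0x040, "CXTMO"),     # command transmit timeout
    (0x020, "DPPE"),      # data path parity error
    (0x010, "IVMR"),      # invalid map register (matches the message)
    (0x008, "MRPF"),      # map register parity failure
    (0x004, "LEB"),       # lost error
    (0x002, "UBSTO"),     # Unibus select timeout
    (0x001, "UBSSYNTO"),  # Unibus slave sync timeout
]


def decode_bits(value, table):
    """Format value as hex followed by the names of its set bits."""
    names = [name for mask, name in table if value & mask]
    return "%x(%s)" % (value, ",".join(names))


print(decode_bits(0x10, UBASR_BITS))   # prints 10(IVMR)
```

With only the one bit set, the adapter is reporting the invalid map register
and nothing else, such as a parity error or a timeout.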

>uqssp0 being reset					[ uqssp0 ? -- BP ]

Never heard of a uqssp.  Probably a typographic error in the kernel;
`uba0' makes more sense here.

>uda50a0: hard error, [numbers]

Uh oh.

>... followed by two more 'machine check 0: cp read timeout or error
>confirmation fault' messages....  Also note that 772150 is the
>address of the uda).

I sense a hardware problem in the adapter, similar to the one we
had recently.  It invariably escapes DEC diagnostics.

>[more troubles]
>
>Now it won't finish booting at all.

Most likely a scrobiculate root file system.  (That *is* a word,
even if I bent its meaning a bit.  Look it up!)

>DEC FS has been in and run every diagnostic, replaced boards, etc,
>all to no avail.  The hardware is clean, according to DEC.

Hah.

>One suggestion DEC made is that Ultrix can't handle our 24-line
>DMZ's DMA, and that's why it was hanging.

Probably not.  The DMZ does have a peculiar tendency to drop master
sync, confusing the adapter, and it is safest to keep DMZs away
from other DMA devices, at least until DEC comes out with revised
ROMs.  But the symptoms do not match.