[comp.unix.ultrix] Losing ULTRIX 2.0 File Systems

condyles@talos.UUCP (Nick Condyles) (04/27/88)
VAX 8650 with ra81s, ULTRIX 2.0-1 Question
------------------------------

Has anyone running a VAX minimally meeting the above
description repeatedly experienced the loss of a file system after a
normal shutdown, a panic, or a machine check?

More specifically, we have two 8650s that consistently lose
the file system, /dev/ra2c, on an ra81 located on uda0.  /dev/ra2c is
preceded on uda0 by two ra60s, the first of which contains
/, /usr, and /tmp, and the second of which is a spare spindle.

The typical scenario is that a machine will be shut down
purposely or unavoidably (a crash) and when the machine
reboots fsck will indicate a problem and request that fsck
be run manually on /dev/rra2c.  Occasionally fsck will
tell us that there is a bad magic number in the super block
but more often will start normally with Phase 1.  Soon
thereafter a list of bad inodes will appear, followed by a list of
duplicate inodes and finally the message "Excessive Bad
Blocks   Continue?".

The course of events that usually ensues is one of the
following:

	1.  We will move through the alternate superblocks
	until a good one is found and all will be well.

	2.  We will not find a valid superblock and newfs
	followed by a full restore will ensue.

	3.  The "bad inode" and "continue" prompts will be so
	numerous that fsck -y must be run.  When fsck -y is run
	we will find that one or more cylinder groups (cgs)
	have bad magic numbers.  Fsck -y will run from
	45 minutes to several hours excising bad cylinder
	groups.  At the end, if there were any salvageable
	cylinder groups, we will do an incremental backup
	of what was left and then proceed with newfs and a
	full restore, followed by application of the final
	incremental.
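
Option 1 above amounts to a simple loop, sketched below for
illustration only.  The function name, the injectable "run" hook, and
the list of candidate block numbers are all hypothetical; the real
alternate superblock locations are whatever newfs printed when the
file system was made.  The sketch assumes fsck accepts -b to name an
alternate superblock and that an exit status of 0 means the check
succeeded.

```python
import subprocess


def try_alternate_superblocks(device, candidates, run=None):
    """Try each candidate alternate superblock in order.

    Returns the first block number for which the fsck command exits
    with status 0, or None if every candidate fails.  `run` is a
    hypothetical hook that takes an argv list and returns an exit
    status; it defaults to actually invoking the command.
    """
    if run is None:
        run = subprocess.call
    for block in candidates:
        # Assumption: "fsck -b <block> <device>" checks the file system
        # using the alternate superblock at <block>.
        argv = ["fsck", "-b", str(block), device]
        if run(argv) == 0:
            return block
    return None
```

If the function returns None, no candidate superblock was usable,
which corresponds to option 2: newfs followed by a full restore.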

The problem has occurred several times in a given day.
Digital has not been very helpful in resolving the problem.
We have had a call on it since January.  The best advice so
far has been to wait for all channel lights to extinguish
before halting the processor, but even that has not been a
very reliable prophylaxis.

I am aware of at least one other installation experiencing
this problem.  I would like to hear anecdotes, insights,
remedies, etc. from people who may have experienced the
problem.

One observation that we have made is that this never
happened under ULTRIX 1.2.  A difference between ULTRIX's
behavior during shutdown under 1.2 and 2.0 is that under
1.2, there was a several-minute delay between the "syncing disks"
message and the "Processor can now be halted" message.  Under
2.0 the "Processor can now be halted" message is almost
instantaneous.

Send any information you may have to me and I will post a
summary.
----------------------
nick condyles
mcnc!rti!talos!condyles