condyles@talos.UUCP (Nick Condyles) (04/27/88)
VAX 8650 with ra81s, ULTRIX 2.0-1 Question
------------------------------

Has anyone running a VAX minimally meeting the above description repeatedly experienced the loss of a file system after a normal shutdown, a panic, or a machine check?

More specifically, we have two 8650s that consistently lose the file system, /dev/ra2c, on an ra81 located on uda0. /dev/ra2c is preceded on uda0 by two ra60s, the first of which contains /, /usr, and /tmp, and the second of which is a spare spindle.

The typical scenario is that a machine is shut down, either purposely or unavoidably (a crash), and when the machine reboots fsck indicates a problem and requests that fsck be run manually on /dev/rra2c. Occasionally fsck will tell us that there is a bad magic number in the super block, but more often it will start normally with Phase 1. Soon thereafter a list of bad inodes will appear, followed by a list of duplicate inodes and finally the message "Excessive Bad Blocks Continue?". The course of events that usually ensues is one of the following:

1. We will move through the alternate superblocks until a good one is found and all will be well.

2. We will not find a valid superblock, and newfs followed by a full restore will ensue.

3. The "bad inode" and "continue" prompts will be so numerous that fsck -y must be run. When fsck -y is run we will find that one or more cylinder groups (cg) have bad magic numbers. fsck -y will run from 45 minutes to several hours excising bad cylinder groups. At the end, if there were any salvageable cylinder groups, we will do an incremental backup of what was left and then proceed with newfs and a full restore, followed by application of the final incremental.

The problem has occurred several times in a single day. Digital has not been very helpful in resolving the problem. We have had a call on it since January.
The best advice so far has been to wait for all channel lights to extinguish before halting the processor, but even that has not been a very reliable prophylaxis. I am aware of at least one other installation experiencing this problem. I would like to hear anecdotes, insights, remedies, etc. from people who may have experienced the problem.

One observation that we have made is that this never happened under ULTRIX 1.2. A difference between ULTRIX's shutdown behavior under 1.2 and 2.0 is that under 1.2 there was a delay of several minutes between the "syncing disks" message and the "Processor can now be halted" message. Under 2.0 the "Processor can now be halted" message is almost instantaneous.

Send any information you may have to me and I will post a summary.

----------------------
nick condyles
mcnc!rti!talos!condyles