[comp.os.vms] Repairing disks with corrupted index files

liu@trlsasb.oz (01/06/88)

 We have an eagle disk with a SI 9900 controller which has had its index file
corrupted 4 times in the last 4 months. Has anybody else experienced these
problems? Is it a software or hardware problem?
 The error message I get when I try to mount the disk is
BITMAPERR- I/O error on storage bitmap; volume locked
 Now the system error message manual implies you can repair this by
using the verify utility, but the catch-22 is that you have to mount
the volume Files-11 ... and the mount won't let you do this.
 Is there a way of repairing a disk with a bad index file?

carl@CITHEX.CALTECH.EDU (Carl J Lydick) (01/09/88)

 >  We have an eagle disk with a SI 9900 controller which has had its index file
 > corrupted 4 times in the last 4 months. Has anybody else experienced these
 > problems? Is it a software or hardware problem?

We've had similar problems.  In both cases, it appears to have been a hardware
problem.  On one system, we had a crash resulting from a machine check, and
when the system came up again, the disk was corrupted.  This is not the
problem you're having, presumably.

 >  The error message I get when I try and mount the disk is
 > BITMAPERR- I/O error on storage bitmap; volume locked

On the other system, one of the servo boards for the Eagle in question failed.
The result was that we could read the home block on the volume but not the
BITMAP or the bad block track.  You can check the home block by doing a
MOUNT/FOREIGN: if it works, the home block is readable.  You obviously can't
read the BITMAP; to see if you can read the bad block track, mount the disk
/FOREIGN and try doing a physical backup.

 >  Now the system error message manual implies you can repair this by
 > using the verify utility but the catch 22 is that you have to mount
 > the volume files-11 ... and the mount won't let you do this.
 >  Is there a way of repairing a disk with a bad index file ?

This depends on just how bad the I/O error is.  If you just have trouble with
the bitmap, mount will succeed, but with the volume locked against extending
any files.  You can then run verify and fix the problem, unless you've
actually got a bad block in the bitmap.  In that case, do an image backup, run
EXOR on the disk to reformat it and find the bad blocks, then restore the
saveset.  (You should reformat anyway if the disk hasn't been formatted in the
last few years; if you can't remember when it was formatted, it's due for
reformatting.)

On the other hand, if you can't read anything except the track with the home
block (and it sounds like this is your problem), you should have the hardware
problem repaired (as I said above, this is probably a problem with the servo),
and then run verify on the disk.
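
The two cases above can be summarised as a small decision function. This is
only a sketch in Python restating the advice in this post; the function name,
its parameters and the returned strings are made up for illustration and do
not correspond to any VMS utility.

```python
def triage(home_block_readable, rest_of_disk_readable, bitmap_has_bad_block=False):
    """Suggest a next step for a volume that fails to mount cleanly.

    All three inputs are observations the operator makes by hand
    (hypothetical names, not output of any real tool).
    """
    if not home_block_readable:
        # MOUNT/FOREIGN failing is outside the cases discussed in the post.
        return "not covered: home block unreadable"
    if not rest_of_disk_readable:
        # Only the home block track can be read: suspect the servo.
        return "repair hardware, then run verify"
    if bitmap_has_bad_block:
        # A genuine bad block under the bitmap cannot be fixed by verify.
        return "image backup, reformat and map bad blocks, restore saveset"
    # Trouble limited to the bitmap contents: volume mounts locked.
    return "mount (volume locked), then run verify"


print(triage(True, False))
print(triage(True, True, bitmap_has_bad_block=True))
print(triage(True, True))
```

The order of the checks mirrors the post: hardware faults are ruled out
before any attempt is made to repair the file structure.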

ESMP09@SLACTWGM.BITNET (Ed Miller SLAC x3291 or [415]854-1055) (01/13/88)

>  We have an eagle disk with a SI 9900 controller which has had its index file
> corrupted 4 times in the last 4 months. Has anybody else experienced these
> problems? Is it a software or hardware problem?
>  The error message I get when I try and mount the disk is
> BITMAPERR- I/O error on storage bitmap; volume locked
>  Now the system error message manual implies you can repair this by
> using the verify utility but the catch 22 is that you have to mount
> the volume files-11 ... and the mount won't let you do this.
>  Is there a way of repairing a disk with a bad index file ?

We had a similar problem several months ago with disks on an
SI9900 controller.  The problem happened twice within a few
days (once on a 9751 disk, once on a 9798 disk), but has not
recurred since.  There was no evidence that it was software related
(we'd been running VMS 4.5 for months before, and still are running
it).  There was no obvious connection to hardware changes, but there may
have been some upgrades of CPU (not disk) interfaces in the controller
since the two occurrences.

Our problem took the following form:  when we tried to MOUNT a
disk, MOUNT complained that it was the member of a shadow set, and
proceeded to mount it, but with a software writelock.  (We don't
use shadow volumes, so that was the first puzzle.)  It turns out
that the indication that a disk is a member of a shadow set is
stored in the first block of BITMAP.SYS--when we dumped that file
it was obvious that it had been overwritten with irrelevant data--
not only the first block, but the first few blocks.

For our situation, the fix was easy:

        MOUNT/OVERRIDE=SHADOW
        ANAL/DISK/REPAIR

(There was a lot of repair to be done, since the first few blocks of
BITMAP.SYS needed to be reconstructed, but there was no damage that
could not be repaired.)

If your problem is similar, you might be able to make the same
kind of fix with

        MOUNT/OVERRIDE=LOCK
        ANAL/DISK/REPAIR

                                Ed Miller
                                ESMP09@SLACTWGM.BITNET
                                Stanford Linear Accelerator Center

rde@eagle.ukc.ac.uk (R.D.Eager) (01/15/88)

I believe 4.6 has a fix to allow volumes with trashed bitmaps to be mounted.
-- 
           Bob Eager
           rde@ukc.UUCP
           ...!mcvax!ukc!rde
           Phone: +44 227 764000 ext 7589

scott@stl.stc.co.uk (Mike Scott) (01/20/88)

In article <880109071621.025@CitHex.Caltech.Edu> carl@CITHEX.CALTECH.EDU (Carl J Lydick) writes:
>
> >  We have an eagle disk with a SI 9900 controller which has had its index file
> > corrupted 4 times in the last 4 months. Has anybody else experienced these
> > problems? Is it a software or hardware problem?
>
>We've had similar problems.  In both cases, it appears to have been a hardware

We've also had some nasty problems with a supereagle and QD32 controller (on a
uVAX-II/VMS4.5). We were getting corrupted data without any warning apart from
the disk write-locking itself. After reformatting the disk and restoring from a
backup tape which had the corrupted disk data on it, there were a number of
files apparently entered in two directories, one correctly, one wrongly.  The
symptoms were consistent with the bad block replacement algorithm failing by
revectoring a supposed bad block in the index file, then forgetting it had done
this.  It makes me suspicious of the very idea of automatic bad block
replacement, if this sort of thing happens with no warning: at least the old
badblk.sys was pretty foolproof. 

The killer is, I don't even think it was a media problem: I suspect a head
amplifier - the reformatting program carefully prints out all the replaced
block numbers, and hides the fact that they are all on the same disk head! 
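
One way to spot this pattern is to convert each replaced block number back to
a cylinder/head/sector address and count how often each head appears. The
Python sketch below assumes the simplest possible layout (blocks numbered
sector first, then head, then cylinder, with no spare sectors or
interleaving); the geometry and the block list are invented for illustration
and are not taken from any real Eagle or controller.

```python
from collections import Counter


def block_to_chs(lbn, heads, sectors_per_track):
    """Map a logical block number to (cylinder, head, sector).

    Assumes blocks are laid out sector-first, then head, then cylinder.
    Real controllers may reserve spare sectors or remap blocks.
    """
    cylinder, rest = divmod(lbn, heads * sectors_per_track)
    head, sector = divmod(rest, sectors_per_track)
    return cylinder, head, sector


def replaced_blocks_by_head(lbns, heads, sectors_per_track):
    """Count how many of the replaced blocks fall under each head."""
    return Counter(block_to_chs(lbn, heads, sectors_per_track)[1] for lbn in lbns)


# Hypothetical geometry and replaced-block list, for illustration only.
HEADS, SECTORS = 20, 48
replaced = [149, 1121, 4984]

counts = replaced_blocks_by_head(replaced, HEADS, SECTORS)
for head, n in sorted(counts.items()):
    print(f"head {head}: {n} replaced block(s)")
```

If nearly all replaced blocks land on one head, a fault in that head's
electronics is a more plausible explanation than scattered media defects.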

-- 
Regards. Mike Scott (scott@stl.stc.co.uk <or> ...uunet!mcvax!ukc!stl!scott)
phone +44-279-29531 xtn 3133.

ted@blia.BLI.COM (Ted Marshall) (01/23/88)

In article <613@acer.stl.stc.co.uk>, scott@stl.stc.co.uk (Mike Scott) writes:
> We've also had some nasty problems with a supereagle and QD32 controller (on a
> uVAX-II/VMS4.5). We were getting corrupted data without any warning apart from
> the disk write-locking itself. After reformatting the disk and restoring from a
> backup tape which had the corrupted disk data on it, there were a number of
> files apparently entered in two directories, one correctly, one wrongly.  The
> symptoms were consistent with the bad block replacement algorithm failing by
> revectoring a supposed bad block in the index file, then forgetting it had done
> this.  It makes me suspicious of the very idea of automatic bad block
> replacement, if this sort of thing happens with no warning: at least the old
> badblk.sys was pretty foolproof. 

I had a similar problem on a DEC RA-80 on a massbus controller on a 750.
I found that reading certain blocks yielded garbage with no warning except
that maybe 1 in 30 reads of that block would yield the correct data! Again,
all of this occurred with no indication of errors from the driver!
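
A block that silently returns different data on different reads can be found
by reading it repeatedly and comparing the results. The following is a
minimal Python sketch of that idea; `read_block` is a hypothetical
caller-supplied function returning the raw bytes of one block, and the flaky
reader is a stand-in used only to exercise the check.

```python
import hashlib
from collections import Counter


def read_consistency(read_block, lbn, attempts=30):
    """Read one block `attempts` times and tally the distinct contents seen.

    `read_block(lbn) -> bytes` is supplied by the caller (hypothetical).
    """
    return Counter(
        hashlib.sha256(read_block(lbn)).hexdigest() for _ in range(attempts)
    )


def is_suspect(tally):
    """A healthy block returns identical data on every read."""
    return len(tally) > 1


def make_flaky_reader(good, bad, good_every=30):
    """Simulated reader: returns `good` on every Nth call, `bad` otherwise."""
    calls = 0

    def read_block(lbn):
        nonlocal calls
        calls += 1
        return good if calls % good_every == 0 else bad

    return read_block


tally = read_consistency(make_flaky_reader(b"good" * 128, b"junk" * 128), lbn=0)
print("suspect block" if is_suspect(tally) else "consistent block")
```

This only catches blocks whose contents vary between reads. A block that
returns the same garbage every time looks consistent, and any cache between
the caller and the drive would hide the variation.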

One point on your seeing files in two directories. The backup of this disk
was made on a semi-live system (i.e. I was on, doing work that created files).
When that was restored to the new disk and the system brought up, I noticed
that while all of the directory entries for those files existed, several of
the files themselves didn't. In addition, some of the other directory entries
were linked to files that other people had created since the restore! It
appears that although all of the directory entries were caught in the backup,
some of the INDEXF.SYS entries weren't! Then since the directory entries
specified FIDs with last sequence number + 1, these new files also got the
same FID.
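
The mechanism can be reproduced with a toy model. The Python sketch below
assumes a file ID is just a (file number, sequence number) pair and that a
freed header is reused with its sequence number incremented; the `IndexFile`
class and its allocation policy are simplifications invented for
illustration, not the real INDEXF.SYS format.

```python
import copy


class IndexFile:
    """Toy index file: file number -> (sequence number, in_use, label)."""

    def __init__(self):
        self.headers = {}

    def create(self, label):
        # Reuse the lowest free header, bumping its sequence number.
        for num in sorted(self.headers):
            seq, in_use, _ = self.headers[num]
            if not in_use:
                self.headers[num] = (seq + 1, True, label)
                return (num, seq + 1)
        # No free header: extend the index file.
        num = len(self.headers) + 1
        self.headers[num] = (1, True, label)
        return (num, 1)

    def delete(self, fid):
        num, seq = fid
        self.headers[num] = (seq, False, None)

    def lookup(self, fid):
        num, seq = fid
        header = self.headers.get(num)
        if header and header[1] and header[0] == seq:
            return header[2]
        return None


live = IndexFile()
live.delete(live.create("old.dat"))        # leaves one free header behind
snapshot = copy.deepcopy(live)             # index file as the backup saw it
directory = {"mine.txt": live.create("mine.txt")}  # created during the backup

restored = snapshot                        # restore: stale index, newer directory
before = restored.lookup(directory["mine.txt"])
restored.create("theirs.txt")              # someone creates a file after restore
after = restored.lookup(directory["mine.txt"])
print(before, after)
```

Before the new file is created the directory entry points at nothing; after
it, the same entry resolves to the other user's file, because the reused
header was handed the same number and sequence.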

The bottom line is that the double-entry files you saw may not have had
anything to do with failures of bad-block replacement.

-- 
Ted Marshall       ...!ucbvax!mtxinu!blia!ted <or> mtxinu!blia!ted@Berkeley.EDU
Britton Lee, Inc., 14600 Winchester Blvd, Los Gatos, Ca 95030     (408)378-7000
The opinions expressed above are those of the poster and not his employer.

scott@stl.stc.co.uk (Mike Scott) (02/02/88)

In article <3968@blia.BLI.COM> ted@blia.BLI.COM (Ted Marshall) writes:
>In article <613@acer.stl.stc.co.uk>, scott@stl.stc.co.uk (Mike Scott) writes:
>> We've also had some nasty problems with a supereagle and QD32 controller (on a
.......
>> files apparently entered in two directories, one correctly, one wrongly.  The
>> symptoms were consistent with the bad block replacement algorithm failing by
.......

>One point on your seeing files in two directories. The backup of this disk
>was made on a semi-live system (i.e. I was on, doing work that created files).
>When that was restored to the new disk and the system brought up, I noticed
>that while all of the directory entries for those files existed, several of
>the files themselves didn't. In addition, some of the other directory entries
>were linked to files that other people had created since the restore! It
.......

>The bottom line is that the double-entry files you saw may not have had
>anything to do with failures of bad-block replacement.


I'm afraid I was rather misleading in my article. Certainly, the
restored disk had the problems I noted.  But I know that at least one
of the files was affected before I did the backup/reformat/restore.
It was one of mine, and was why I realised we had a major problem! It
was only after the restore that I carried out a post-mortem. 

I haven't noticed any problems doing backups on live systems, but we
very rarely need to do a restore, so we probably wouldn't notice :-(

-- 
Regards. Mike Scott (scott@stl.stc.co.uk <or> ...uunet!mcvax!ukc!stl!scott)
phone +44-279-29531 xtn 3133.