[comp.sys.apollo] Can anyone help with a salvol problem?

dvadura@watdragon.waterloo.edu (Dennis Vadura) (08/30/90)

Recently we had a bad electrical storm here and it knocked out one of our
DN3500s.  When rebooting it attempts to run salvol and fails.  I have
managed to boot the node over the network and to forcibly mount its disk.
I have dumped the contents of its disk (using wbak) to another node and
will invol and restore the files.  I expect this to remove the file system
errors reported by salvol (can anyone confirm this?).

Also, does anyone know what, if anything, can be done about the following
output from salvol (attached below)?

-thanks
-dennis
----
$ /etc/salvol -c w0:0 -f


Salvol, revision 10.2, October 12, 1989  2:55:07 am


    Preparing file list...

    Salvaging...  % complete
        20
    (chk_hdr) page should be: 37 but is: 112
    (chk_hdr) page should be: 40 but is: 119
    (chk_hdr) page should be: 45 but is: 11C
    (chk_hdr) page should be: 4E but is: 122
    (chk_hdr) page should be: 53 but is: 126
    (chk_hdr) page should be: 54 but is: 127
    (chk_hdr) page should be: 56 but is: 128
    (chk_hdr) page should be: 59 but is: 12A
    (chk_hdr) page should be: 5A but is: 12B
    (chk_hdr) page should be: 5C but is: 12C
    (chk_hdr) page should be: 5D but is: 12D
    (chk_hdr) page should be: 5F but is: 12E
    (chk_hdr) page should be: 60 but is: 12F
    (chk_hdr) page should be: 62 but is: 130
    (chk_hdr) page should be: 64 but is: 131
    (chk_hdr) page should be: 65 but is: 132
    (chk_hdr) page should be: 67 but is: 133
    (chk_hdr) page should be: 68 but is: 134
    (chk_hdr) page should be: 6A but is: 135
    (chk_hdr) page should be: 6B but is: 136
    (chk_hdr) page should be: 6D but is: 137
    (chk_hdr) page should be: 6E but is: 138
    (chk_hdr) page should be: 70 but is: 13B
    (chk_hdr) page should be: 71 but is: 165
    (chk_hdr) page should be: 73 but is: 166
    (chk_hdr) page should be: 74 but is: 167
    (chk_hdr) page should be: 76 but is: 168
    (chk_hdr) page should be: 78 but is: 169
    (chk_hdr) page should be: 79 but is: 16A
    (chk_hdr) page should be: 7B but is: 16C
    (chk_hdr) page should be: 7C but is: 16D
    (chk_hdr) page should be: 7E but is: 16E
    (chk_hdr) page should be: 7F but is: 16F
    (chk_hdr) page should be: 81 but is: 170
    (chk_hdr) page should be: 81 but is: 170
    (chk_hdr) page should be: 84 but is: 149
    (chk_hdr) page should be: 85 but is: 14B
    (chk_hdr) page should be: 8A but is: 14F
    (chk_hdr) page should be: 8D but is: 151
    (chk_hdr) page should be: 8F but is: 152
    (chk_hdr) page should be: 92 but is: 154
    (chk_hdr) page should be: 93 but is: 155
    (chk_hdr) page should be: 96 but is: 180
    (chk_hdr) page should be: 98 but is: 182
    (chk_hdr) page should be: 9B but is: 185
    (chk_hdr) page should be: 9C but is: 187
    (chk_hdr) page should be: 9E but is: 1AF
    (chk_hdr) page should be: A0 but is: 1B0
    (chk_hdr) page should be: A1 but is: 1B1
    (chk_hdr) page should be: A3 but is: 18E
    (chk_hdr) page should be: A4 but is: 194
    (chk_hdr) page should be: A6 but is: 195
    (chk_hdr) page should be: A7 but is: 197
    (chk_hdr) page should be: A9 but is: 199
    (chk_hdr) page should be: AA but is: 1A0
    (chk_hdr) page should be: AC but is: 1A1
    (chk_hdr) page should be: AD but is: 1A2
    (chk_hdr) page should be: B0 but is: 1A8
    (chk_hdr) page should be: B2 but is: 1A9
    (chk_hdr) page should be: B5 but is: 1AE
    (chk_hdr) page should be: B7 but is: 1B2
    (chk_hdr) page should be: B8 but is: 1B5
    (chk_hdr) page should be: C1 but is: 1B6
    (chk_hdr) page should be: CA but is: 1BA
    (chk_hdr) page should be: D3 but is: 1B7
    (chk_hdr) page should be: DC but is: 1B8
    (chk_hdr) page should be: 120 but is: 1C5
    (chk_hdr) blk_type should be: 2 but is: 0
    (chk_hdr) page should be: 120 but is: 1C5
    (chk_hdr) blk_type should be: 2 but is: 0
        40
        60
        80


    Verifying reference counts...

115 multiply allocated blocks were found

Starting second Pass...
    Preparing file list...

    Looking for Mutiply Allocated Blocks
        20
Internal Error: hash table for bad daddrs is full, too many MAB's or header errors.

RUN ABORTED
-- 
--------------------------------------------------------------------------------
"This is almost worth the HIGH blood pressure!" he  |Dennis Vadura
thought as yet another mosquito exploded.-R.Patching|dvadura@dragon.uwaterloo.ca
================================================================================

pha@CAEN.ENGIN.UMICH.EDU (Paul H. Anderson) (08/31/90)

	
	Recently we had a bad electrical storm here and it knocked out one of our
	DN3500s.  When rebooting it attempts to run salvol and fails.  I have
	managed to boot the node over the network and to forcibly mount its disk.
	I have dumped the contents of its disk (using wbak) to another node and
	will invol and restore the files.  I expect this to remove the file system
	errors reported by salvol (can anyone confirm this?).
	 
	Also, does anyone know what, if anything, can be done about the following
	output from salvol (attached below)?
	 
	 
	Starting second Pass...
	    Preparing file list...
	 
	    Looking for Mutiply Allocated Blocks
	        20
	Internal Error: hash table for bad daddrs is full, too many MAB's or header errors.
	 
	RUN ABORTED
	-- 

This is fixed in 10.3 salvol.  Look for it on the August patch tape, as well.

We lost a minimum of 3 gigs of disk to this one.  It may be related to crashes
of salvol that have cost us upwards of 30 gigs of disk space.

If someone from Apollo could elaborate about this bug, and explain what causes
it, I would be very grateful.

Paul Anderson
CAEN
University of Michigan

ced@apollo.HP.COM (Carl Davidson) (08/31/90)

From article <1990Aug30.152435.23881@watdragon.waterloo.edu>, by dvadura@watdragon.waterloo.edu (Dennis Vadura):
> Recently we had a bad electrical storm here and it knocked out one of our
> DN3500s.  When rebooting it attempts to run salvol and fails.  I have
> managed to boot the node over the network and to forcibly mount its disk.
> I have dumped the contents of its disk (using wbak) to another node and
> will invol and restore the files.  I expect this to remove the file system
> errors reported by salvol (can anyone confirm this?).
> 

Dennis,

Presuming that the drive hardware itself was not damaged, and it doesn't appear
from the output of salvol that it was, invol should be able to return your 
drive to a usable state, although, as you expect, all data will be erased.

When you run invol, you should be sure to check the bad spot list that is on 
the media against the hard copy list that came with the drive. On DN3500s, 
the bad spot list for the drive is printed on top of the drive itself, and is
generally, although not always, only a few items -- maybe a dozen or so.

It appears that the electrical storm caused the drive to scribble random
garbage over the surface of the drive media. Salvol, when attempting to 
repair the disk, sees this scribbling as bad formatting (chk_hdr is, I assume, 
the internal salvol routine that validates the disk block header) and flags
the blocks to be replaced. Unfortunately, in the second pass salvol runs out
of space to store the bad blocks in and so can't complete its job. It's just as
well, really, because it's unlikely that the disk media is actually bad here
and you don't really want to throw all those blocks away unnecessarily.
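
For anyone who wants to summarize a long run of these messages before deciding what to do, here is a minimal Python sketch that tallies the (chk_hdr) lines from captured salvol output. It assumes the values are hexadecimal (entries such as 11C suggest so) and that the message shape matches the lines quoted in this thread; the function name tally_chk_hdr is made up for illustration and is not part of any Apollo tool.

```python
import re
from collections import Counter

# Matches lines such as "(chk_hdr) page should be: 37 but is: 112".
# Assumption: both values are hexadecimal, as entries like 11C suggest.
CHK_HDR = re.compile(
    r"\(chk_hdr\)\s+(\w+) should be:\s*([0-9A-Fa-f]+) but is:\s*([0-9A-Fa-f]+)"
)

def tally_chk_hdr(lines):
    """Return (counts per header field, list of (field, expected, found))."""
    counts = Counter()
    mismatches = []
    for line in lines:
        m = CHK_HDR.search(line)
        if m is None:
            continue  # progress figures and other messages are skipped
        field = m.group(1)
        expected = int(m.group(2), 16)
        found = int(m.group(3), 16)
        counts[field] += 1
        mismatches.append((field, expected, found))
    return counts, mismatches

sample = [
    "    (chk_hdr) page should be: 37 but is: 112",
    "    (chk_hdr) page should be: 45 but is: 11C",
    "    (chk_hdr) blk_type should be: 2 but is: 0",
    "        40",
]
counts, mismatches = tally_chk_hdr(sample)
print(dict(counts))
```

Because the pattern is searched for anywhere in the line, it also works on quoted copies of the output that carry a "> " prefix.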

>
> Also, does anyone know what, if anything, can be done about the following
> output from salvol (attached below)?
> 
> -thanks
> -dennis
> ----
> $ /etc/salvol -c w0:0 -f
> 
> 
> Salvol, revision 10.2, October 12, 1989  2:55:07 am
> 
> 
>     Preparing file list...
> 
>     Salvaging...  % complete
>         20
>     (chk_hdr) page should be: 37 but is: 112
>
>      < many lines of similar errors deleted>
>
>         40
>         60
>         80
> 
> 
>     Verifying reference counts...
> 
> 115 multiply allocated blocks were found
> 
> Starting second Pass...
>     Preparing file list...
> 
>     Looking for Mutiply Allocated Blocks
>         20
> Internal Error: hash table for bad daddrs is full, too many MAB's or header errors.
> 
> RUN ABORTED
> -- 

There really isn't much you can do with this output except go on and invol 
the disk. Any repairs attempted by hand would be difficult at best. It's better
to let invol straighten things out itself.

Regards,
Carl

Carl Davidson  (508) 256-6600 x4361    | In the High and Far-Off Time, the
The Apollo Systems Division of         | Elephant, Oh Best Beloved, had no
The Hewlett-Packard Company            | trunk.
DOMAIN: ced@apollo.HP.COM              |  -- Rudyard Kipling, Just So Stories

rees@pisa.ifs.umich.edu (Jim Rees) (08/31/90)

In article <4c83a20b.20b6d@apollo.HP.COM>, ced@apollo.HP.COM (Carl Davidson) writes:
  It appears that the electrical storm caused the drive to scribble random
  garbage over the surface of the drive media. Salvol, when attempting to 
  repair the disk, sees this scribbling as bad formatting (chk_hdr is, I assume, 
  the internal salvol routine that validates the disk block header) and flags
  the blocks to be replaced. Unfortunately, in the second pass salvol runs out
  of space to store the bad blocks in and so can't complete its job. It's just as
  well, really, because it's unlikely that the disk media is actually bad here
  and you don't really want to throw all those blocks away unnecessarily.

I don't think that's quite true.  Bad block headers don't imply bad media.
Salvol should be able to fix these, but sr10.2 salvol doesn't have enough
internal table space to remember them all for pass 2.  sr10.3 salvol should
fix this.
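
To illustrate the failure mode only, here is a rough Python sketch of a fixed-capacity hash table that gives up once every slot is taken. This is not salvol's actual code; the class name BadDaddrTable, the capacity, and the linear-probing scheme are all assumptions chosen to show why a table sized at build time aborts when it sees more bad disk addresses than expected.

```python
class BadDaddrTable:
    """Hypothetical fixed-size table of bad disk addresses (daddrs).

    Open addressing with linear probing; the capacity never grows, so
    the table fails outright once every slot is occupied.
    """

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.used = 0

    def add(self, daddr):
        n = len(self.slots)
        i = daddr % n
        for _ in range(n):
            if self.slots[i] is None:
                self.slots[i] = daddr      # free slot: remember this daddr
                self.used += 1
                return
            if self.slots[i] == daddr:
                return                     # already recorded, nothing to do
            i = (i + 1) % n                # collision: probe the next slot
        raise OverflowError("hash table for bad daddrs is full")

table = BadDaddrTable(capacity=4)
for daddr in (0x112, 0x119, 0x11C, 0x122):
    table.add(daddr)

try:
    table.add(0x126)                       # fifth distinct address, no room
except OverflowError as err:
    print("RUN ABORTED:", err)
```

A larger table, or one that grows on demand, avoids the abort, which fits the description of the later salvol coping with this kind of corruption.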

  There really isn't much you can do with this output except go on and invol 
  the disk. Any repairs attempted by hand would be difficult at best. It's better
  to let invol straighten things out itself.

Fixvol can fix bad block headers, but it's a bit tedious.

Each disk block contains three sets of data.  The first is the format data,
which is used by the drive to find the right sector.  If this is bad, you
need to reformat the sector (fixvol can do this on a track by track basis).
The second set of data is the block header.  This is used by Domain/OS to
store the uid and logical block number of the file the data goes with.  Most
normal operating systems don't have this.  I first came across it in the
Alto operating system, where it was used successfully to recover the disk if
the inodes (vtoc) got corrupted.  Domain/OS uses it somewhat less
successfully.  In fact, it seems to cause more trouble than it's worth.  I'm
not even sure salvol ever uses it to reconstruct a corrupted vtoc.  It also
makes it difficult to build a disk controller for Domain/OS, since the
blocks are non-standard size (not a power of two) and ideally should be
transferred into memory in two pieces.
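
A small Python sketch of the idea, under loudly stated assumptions: the 32-byte header, the 1024-byte data area, and the field order below are illustrative guesses, not the documented Domain/OS on-disk format. It shows a per-block header carrying a uid and logical page number, a consistency check in the spirit of the chk_hdr messages, and why header plus data is not a power of two.

```python
import struct

# ASSUMED layout, for illustration only: 8-byte file uid, 4-byte logical
# page number, 20 bytes of padding, giving a 32-byte header.
HEADER_FMT = ">QI20x"
HEADER_SIZE = struct.calcsize(HEADER_FMT)
DATA_SIZE = 1024                      # assumed data area per block
BLOCK_SIZE = HEADER_SIZE + DATA_SIZE  # header and data travel together

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

def make_block(uid, page, data):
    """Build one block: header followed by zero-padded data."""
    if len(data) > DATA_SIZE:
        raise ValueError("data does not fit in one block")
    return struct.pack(HEADER_FMT, uid, page) + data.ljust(DATA_SIZE, b"\0")

def check_header(block, expected_uid, expected_page):
    """Compare the stored header with what the file map says it should be."""
    uid, page = struct.unpack(HEADER_FMT, block[:HEADER_SIZE])
    problems = []
    if uid != expected_uid:
        problems.append(f"uid should be: {expected_uid:X} but is: {uid:X}")
    if page != expected_page:
        problems.append(f"page should be: {expected_page:X} but is: {page:X}")
    return problems

block = make_block(uid=0x1234, page=0x112, data=b"file contents")
print(BLOCK_SIZE, is_power_of_two(BLOCK_SIZE))
print(check_header(block, expected_uid=0x1234, expected_page=0x37))
```

The same header that lets a checker notice a block in the wrong place is what pushes the on-disk block size past a power of two, which is the controller complication mentioned above.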

ced@apollo.HP.COM (Carl Davidson) (09/03/90)

From article <1990Aug31.124334@pisa.ifs.umich.edu>, by rees@pisa.ifs.umich.edu (Jim Rees):
> In article <4c83a20b.20b6d@apollo.HP.COM>, ced@apollo.HP.COM (Carl Davidson) writes:
>   It appears that the electrical storm caused the drive to scribble random
>   garbage over the surface of the drive media... 
>   ... it's unlikely that the disk media is actually bad here
>   and you don't really want to throw all those blocks away unnecessarily.
> 
> I don't think that's quite true.  Bad block headers don't imply bad media.
> Salvol should be able to fix these, but sr10.2 salvol doesn't have enough
> internal table space to remember them all for pass 2.  sr10.3 salvol should
> fix this.
> 

Sorry. I didn't mean to imply that bad block headers always meant bad disk 
media. It is often true, however, that the first indication of bad media 
is a high rate of disk block header errors in the system error log. In any 
case, Jim, you are correct that the sr10.3 salvol is better at handling this
situation. The patch tape contains a fixed salvol for sr10.2 that is capable
of handling this kind of corruption. 

>
>   There really isn't much you can do with this output except go on and invol 
>   the disk. Any repairs attempted by hand would be difficult at best. It's better
>   to let invol straighten things out itself.
> 
> Fixvol can fix bad block headers, but it's a bit tedious.
> 
> Each disk block contains three sets of data.  The first is the format data,
> which is used by the drive to find the right sector.  If this is bad, you
> need to reformat the sector (fixvol can do this on a track by track basis).
> The second set of data is the block header.  This is used by Domain/OS to
> store the uid and logical block number of the file the data goes with.  Most
> normal operating systems don't have this.  I first came across it in the
> Alto operating system, where it was used successfully to recover the disk if
> the inodes (vtoc) got corrupted.  Domain/OS uses it somewhat less
> successfully.  In fact, it seems to cause more trouble than it's worth. 

I stand by my statement for this user. He already has his files saved to
other media, he doesn't have the improved version of salvol readily
available, and he presumably needs the disk storage that the damaged volume
represents (otherwise he never would have spent the money to purchase it).
Given all that, the best course of action is simply to invol the damaged
volume and restore his files to it. Waiting for a patched salvol to arrive
would probably take much longer, and using fixvol is not a good idea.

Characterizing the use of fixvol to repair bad block headers as 
"a bit tedious" is the understatement of the century. The author of fixvol
thought the program dangerous enough in the hands of the uninitiated to 
have it print the following warning:

   Warning: this program is intended only for the use of Apollo service
   representatives.  Misuse of this program may irreparably damage your
   disk.  To exit, type "q" to the following prompt.

I cannot think of a case where I would recommend that a customer attempt to
repair a disk using fixvol. As Jim points out above, the Domain/OS disk 
structures are different from those of most operating systems and require
that someone using fixvol be knowledgeable about the internals of the file 
system. Please use the standard utilities to handle disk problems and leave
fixvol and its kin to folks like Jim who used to work in Apollo R&D and have
more than a passing acquaintance with Domain/OS internals.

Regards,
Carl

Carl Davidson  (508) 256-6600 x4361    | In the High and Far-Off Time, the
The Apollo Systems Division of         | Elephant, Oh Best Beloved, had no
The Hewlett-Packard Company            | trunk.
DOMAIN: ced@apollo.HP.COM              |  -- Rudyard Kipling, Just So Stories