[comp.sys.apollo] Bad blocks on disk: do not live with them

szabo_p@maths.su.oz.au (Paul Szabo) (08/13/90)
This article describes how to get rid of bad blocks on disks. Bad blocks
will naturally develop during the useful life of the disk. There is no
cause for alarm as long as the total number or the rate of growth of bad
blocks is not excessive.

Once these bad blocks develop, they should be avoided (i.e. should not
be used). While the problems are intermittent or recoverable, you may be
inclined to put up with the problem. But bad blocks usually deteriorate,
and may cause your node to crash. (Our DN10000 developed a bad block in
a directory, and any access to this directory sometimes caused it to
crash.) The remedy is simple: add the block numbers to the bad spot
list using INVOL.

If you are happy to wipe the disk and start from scratch, everything is
easy. Run EX DEX, RUN WIN (no defaults, all disk: start 0, end last
address, write enabled) and this will tell you about every single bad
block. Add these to the bad spot list using INVOL, re-format the disk,
and install the OS. There is no need to go to this extreme, however.

Get a listing of problem blocks using /systest/ssr_util/lsyserr. You
should use this periodically to monitor the behaviour of the disk. Look
for repeated problems with disk blocks; you may want to skip the
once-only problems. Use the physical disk addresses. (In case of striped
disks, ignore the RELATIVE addresses. Run the output of lsyserr through
"grep 'Phys daddr =' | sort | uniq -c".) You could also run EX DEX, RUN
WIN -ENTIRE. This reads the entire disk (without re-formatting or
writing to it).
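
The grep/sort/uniq pipeline above can be narrowed to the blocks that
fail repeatedly. The following is only a sketch: it assumes the lsyserr
listing has been saved to a file, and that each relevant line carries
'Phys daddr =' followed by the address as a single word. The exact
lsyserr layout is not reproduced here, so the function name
repeated_daddrs and the file name in the usage comment are made up for
illustration.

```shell
# Hypothetical helper: read an lsyserr listing on stdin and print
# "count Phys daddr = ADDRESS" for every physical disk address that
# appears more than once (i.e. the repeated problems).
# ASSUMPTION: the address is the single word following 'Phys daddr ='.
repeated_daddrs() {
    grep 'Phys daddr =' |
        # Keep only the address field so that lines differing elsewhere
        # (dates, status words) are still counted together.
        sed 's/.*\(Phys daddr = *[^[:space:]]*\).*/\1/' |
        sort | uniq -c |
        # Drop the once-only problems.
        awk '$1 > 1'
}

# Usage (lsyserr.out is a hypothetical saved listing):
#   repeated_daddrs < lsyserr.out
```

Addresses that show up here are the candidates for the bad spot list;
anything seen only once is left out.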

You may simply tell INVOL about the bad block addresses, and then run
SALVOL to fix up the disk. This seems to work reasonably well, but then
... do you trust them (or any other Apollo utility :-) to work properly?
(Note that SALVOL occasionally uses addresses relative to a logical
volume; these are one smaller than the physical addresses. Then again,
the discrepancy is sometimes not one but two... this may be related to
a physical volume (PV) label on each of our striped disks.)
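
Given that discrepancy, a relative address from SALVOL maps to two
candidate physical addresses. The sketch below just does that
arithmetic. It assumes the addresses are hexadecimal (if your listings
are decimal, drop the 0x prefix and use %d), and the offsets 1 and 2 are
only the two discrepancies observed above, not a documented rule. The
name physical_candidates is made up.

```shell
# Hypothetical helper: given an address that SALVOL printed relative to
# a logical volume, list the physical addresses worth checking.
# ASSUMPTION: addresses are hexadecimal; offsets 1 and 2 are the two
# discrepancies reported in this article.
physical_candidates() {
    for offset in 1 2; do
        printf '%X\n' "$((0x$1 + offset))"
    done
}

# Usage:
#   physical_candidates 12344
```

Both candidates should then be looked up before anything is added to
the bad spot list.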

To give you confidence in what you are doing, you would like to know
what files are at those disk addresses.

You may use /systest/ssr_util/rwvol (select READ, enter DADDR, then just
[RETURN] for start and end) to display UIDs of objects, then
/systest/ssr_util/upath to display pathnames.

Probably it is easier to use /systest/ssr_util/fixvol (this has online
help; type help). Use the read command to display UIDs/pathnames:
(fv [p])> r 12345
   uid:       478771C7.3001A581 /y/sfw/reduce3.3/fasl/int.b
   page:      9
   dtm:       478774A5   Wednesday, December 20, 1989   11:40:12 am (EST)
   blk_type:  0 
   sys_type:  0 (file_$file_type)
   pad:       00000000 00000000
   checksum:  0000
   daddr:      12345 ( 163- 1- 0)  disk# 1
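
With more than a handful of blocks it may help to save the fixvol
session to a file and boil it down to one line per block. A minimal
sketch, assuming the transcript looks like the read output above (a
uid: line followed later by a daddr: line) and that pathnames contain
no spaces. The name daddr_to_path is made up.

```shell
# Hypothetical helper: read a saved fixvol transcript on stdin and print
# "daddr object" pairs, where object is the third field of the uid: line
# (a pathname, or a name such as vtoc_$uid for volume structures).
daddr_to_path() {
    awk '
        $1 == "uid:"   { obj = $3 }
        $1 == "daddr:" { print $2, obj; obj = "" }
    '
}

# Usage (fixvol.out is a hypothetical saved transcript):
#   daddr_to_path < fixvol.out
```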

Now that you know the pathname, you may wish to move the file somewhere
'out of the way' and copy it back to its proper place:
/bin/mv file /lost+found
/bin/cp -pPiov /lost+found/file dir
This may not be necessary, but it is cheap insurance.
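
The same move-and-copy-back step can be wrapped so that it refuses to
overwrite anything and checks the copy. This is a sketch only: it uses
portable mv, cp -p and cmp, and deliberately does not reproduce the
Domain/OS-specific cp options shown above. The name park_and_restore and
its arguments are made up.

```shell
# Hypothetical helper: park FILE in HOLDING_DIR, then copy it back to
# its original place so the copy is written as a new file.
park_and_restore() {
    file=$1
    holding=$2
    dir=$(dirname "$file")
    name=$(basename "$file")

    # Refuse to clobber anything already parked under the same name.
    if [ -e "$holding/$name" ]; then
        echo "park_and_restore: $holding/$name already exists" >&2
        return 1
    fi

    mv "$file" "$holding/$name" || return 1
    cp -p "$holding/$name" "$dir/$name" || return 1

    # Compare the copy with the parked original byte for byte.
    cmp -s "$holding/$name" "$dir/$name"
}

# Usage:
#   park_and_restore /some/dir/file /lost+found
```

If the copy or the comparison fails, the parked original is still in
the holding directory, so nothing is lost by trying.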

It seems to me that you cannot do much about vtoce blocks:
(fv [p])> r 1234
   uid:       202.00000000 vtoc_$uid
   page:      1232
   dtm:       4AF72F18   Wednesday, June 13, 1990   9:53:49 am (EST)
   blk_type:  0 
   sys_type:  0 (file_$file_type)
   pad:       00000000 00000000
   checksum:  0000
   daddr:       1234 (  16- 2- C)  disk# 0

You are now ready to tell INVOL about the bad blocks.

Run SALVOL to fix the disk. SALVOL will find 'multiply allocated blocks'
(since they are also in the bad block list), and then go into a 'second
pass' looking for these multiply allocated blocks. For some objects
SALVOL reports the fix under the correct name, but for others it reports
repairing objects at 'vtocx = something' (perhaps when the block is not
at the beginning of the file?). It will attempt to copy the bad block
somewhere else, and usually it will succeed.

There is one problem with SALVOL. If the bad block is in a directory,
SALVOL will orphan the files catalogued there; but as it succeeds in
copying the bad block, the files will still be catalogued in the
original directory. When you boot the node, find_orphans will catalogue
these files in /lost+found, but the reference count (number of hard
links) will be wrong (one instead of two). If you remove the file
pointed to by /lost+found, then when listing the original directory you
get the message 'object not found'. Admittedly, SALVOL at the end of its
run said '... errors ... require that Salvol be run again ...' which I
did, but that did not seem to do anything. Maybe it needed find_orphans
between the two runs. Anyway, I made another copy of the files...


Paul Szabo       szabo_p@maths.su.oz