szabo_p@maths.su.oz.au (Paul Szabo) (08/13/90)
This article describes how to get rid of bad blocks on disks. Bad
blocks will naturally develop during the useful life of the disk.
There is no cause for alarm as long as neither the total number of
bad blocks nor their rate of growth is excessive. Once bad blocks
develop, they should be avoided (i.e. should not be used).

While the problems are intermittent or recoverable, you may be
inclined to put up with them. But bad blocks usually deteriorate, and
may cause your node to crash. (Our DN10000 developed a bad block in a
directory, and accesses to this directory sometimes caused it to
crash.)

Put simply, you need to add the block numbers to the bad spot list
using INVOL.

If you are happy to wipe the disk and start from scratch, everything
is easy. Run EX DEX, RUN WIN (no defaults, all disk: start 0, end last
address, write enabled) and this will tell you about every single bad
block. Add these to the bad spot list using INVOL, re-format the disk,
and install the OS.

There is no need to go to this extreme, however. Get a listing of
problem blocks using /systest/ssr_util/lsyserr. You should use this
periodically to monitor the behaviour of the disk. Look for repeated
problems with disk blocks; you may want to skip the once-only
problems. Use the physical disk addresses. (In case of striped disks,
ignore the RELATIVE addresses. Run the output of lsyserr through
"grep 'Phys daddr =' | sort | uniq -c".)

You could also run EX DEX, RUN WIN -ENTIRE. This will read all your
disk (without re-formatting or writing it).

You may simply tell INVOL about the bad block addresses, and then run
SALVOL to fix up the disk. This seems to work reasonably well, but
then ... do you trust them (or any other Apollo utility :-) to work
properly? (Note that SALVOL occasionally uses addresses relative to a
logical volume; these are one smaller than the physical addresses.
Then again, the discrepancy is sometimes not one but two...
this may be related to a physical volume (PV) label on each of our
striped disks.)

To give you confidence in what you are doing, you would like to know
what files are at those disk addresses. You may use
/systest/ssr_util/rwvol (select READ, enter DADDR, then just [RETURN]
for start and end) to display UIDs of objects, then
/systest/ssr_util/upath to display pathnames. It is probably easier to
use /systest/ssr_util/fixvol (this has online help, type help). Use
the read command to display UIDs/pathnames:

    (fv [p])> r 12345
    uid: 478771C7.3001A581  /y/sfw/reduce3.3/fasl/int.b
    page: 9  dtm: 478774A5  Wednesday, December 20, 1989 11:40:12 am (EST)
    blk_type: 0  sys_type: 0 (file_$file_type)
    pad: 00000000 00000000  checksum: 0000
    daddr: 12345 ( 163- 1- 0)  disk# 1

Now that you know the pathname, you may wish to move the file
somewhere 'out of the way' and copy it back to its proper place:

    /bin/mv file /lost+found
    /bin/cp -pPiov /lost+found/file dir

This may not be necessary, but it is cheap insurance.

It seems to me that you cannot do much about vtoce blocks:

    (fv [p])> r 1234
    uid: 202.00000000  vtoc_$uid
    page: 1232  dtm: 4AF72F18  Wednesday, June 13, 1990 9:53:49 am (EST)
    blk_type: 0  sys_type: 0 (file_$file_type)
    pad: 00000000 00000000  checksum: 0000
    daddr: 1234 ( 16- 2- C)  disk# 0

You are now ready to tell INVOL about the bad blocks. Run SALVOL to
fix the disk. SALVOL will find 'multiply allocated blocks' (since they
are also in the bad block list), and then go into a 'second pass'
looking for these multiply allocated blocks. SALVOL will report fixing
some objects by their correct names, but for others it will report
repairing objects at 'vtocx = something' (when the block is not at the
beginning of the file?). It will attempt to copy the bad block
somewhere else, and usually it will succeed.

There is one problem with SALVOL.
If the bad block is in a directory, SALVOL will orphan the files
catalogued there; but as it succeeds in copying the bad block, the
files will still be catalogued in the original directory. When you
boot the node, find_orphans will catalogue these files in /lost+found,
but the reference count (number of hard links) will be wrong (one
instead of two). If you remove the file pointed to by /lost+found,
then when listing the original directory you get the message 'object
not found'.

Admittedly, SALVOL at the end of its run said '... errors ... require
that Salvol be run again ...', which I did, but that did not seem to
do anything. Maybe it needed find_orphans between the two runs.
Anyway, I made another copy of the files...

Paul Szabo
szabo_p@maths.su.oz
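
P.S. For the periodic lsyserr check described earlier, the grep/sort/uniq pipeline can be extended a little so that once-only problems are dropped automatically. What follows is only a sketch: the function name repeated_daddrs and the awk filter are my own additions, and it assumes nothing about the lsyserr output beyond what the grep pattern implies (that the interesting lines contain 'Phys daddr ='). Like the original pipeline, it counts identical lines, so if other fields on the line vary between reports of the same block, the counts will be split.

```shell
# Sketch: list 'Phys daddr =' lines that occur at least MIN times.
# ASSUMPTION: lsyserr output arrives on stdin and the relevant lines
# contain the text 'Phys daddr =' (taken from the grep pattern above).
# repeated_daddrs is a hypothetical helper name, not an Apollo utility.
repeated_daddrs() {
    min=${1:-2}    # minimum number of occurrences to report (default 2)
    grep 'Phys daddr =' | sort | uniq -c | awk -v min="$min" '$1 >= min'
}
```

It would be used as "/systest/ssr_util/lsyserr | repeated_daddrs 2", assuming awk is available on your node. The addresses it prints are the candidates to look up with fixvol before telling INVOL about them.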