[net.unix-wizards] Revectoring bad blocks on RA81 disks

grandi@noao.UUCP (Steve Grandi) (06/29/85)

Rumor hath it that a program is available through DEC field service to
revector bad blocks on UDA disk drives (RA81s in particular).  Details
of the rumor are that the program is a "standalone" program written by
the Ultrix folks called /rabads that can be booted instead of vmunix
and that non-Ultrix sites running 4.2BSD can obtain the program
through their field service reps. 

Has any non-Ultrix site obtained this program?  Is there a part number
or any identifying information that our Friendly Field Service Man can use 
to pry it out of DEC's bureaucracy?

We are already running the Riacs UDA driver on our 750's that will try
to revector blocks that generate hard errors; our problem is blocks
that generate lots and lots of soft errors.  Since soft errors tend to
turn into hard errors and since these are rather important blocks (see
below) and revectoring blocks with hard errors often generates data
which is not guaranteed to be correct (a "forced error" in MSCP-speak),
I would dearly love to revector these marginal blocks now and avoid the 
massive pain that a trashed file system can bring. (Once burned, twice
shy; 5 times in the last year burned, 10**6 times shy!!).  Also, the
system REALLY slows down when the disk driver is printing error
messages on the console.

Obviously, we could probably hack the Riacs driver to give us a
utility to revector disk blocks, but another rumor hath it that the
procedure used in the driver is not REALLY correct (since DEC is
incredibly reluctant to reveal details of the very complicated song
and dance that has to be gone through to accomplish this feat, I'm
not surprised).  Also it would be nice to have a tool that our
Friendly Field Service Rep believed in as opposed to the incredulous
looks I get when I explain the history of our disk driver.

Two details of our problems might be of interest to students of MSCP soft 
error datagrams or of the 4.2BSD file system.  The "drive detected error" we
are getting is code 1A39 (that's the contents of word 27 of the SDI
error variant of the MSCP packet) which indicates a "servo fine
position error" generated when "a write command is attempted while the
positioner is off track (not detented)".  The servo boards and the R/W
boards in the drives showing these errors have all been replaced, so
the HDA is obviously showing marginal behavior at these locations.

The disk blocks showing errors are also interesting.  For several file 
systems on several disks on several 750's, relative block numbers 576 
and 577 are repeatedly showing up with fine-positioning errors (and 
these cases constitute about 75% of our total collection of these errors).  
A morning's study of the output from dumpfs(8) and fs.h indicates that for 
our 8K/2K and 8K/1K file systems, blocks 576-7 contain the csum structure,
which contains a summary of information about all the cylinder groups
(number of directories, number of free blocks, number of free inodes,
number of free frags).  Obviously, since our disks are
figuratively digging holes in the oxide at these blocks, this
structure is used a lot, presumably every time a file is created (and
extended?).  Is this structure a single point of failure?  If block
576 is destroyed, is the file-system totally trashed or just incapable
of creating new files?  (in other words, can I dump(8) the
file-system?).  Can fsck completely regenerate the data in the csum
structure? (I know fsck can correct things; one often sees "SUMMARY
INFORMATION ... BAD" messages on a post-crash reboot).

All in all, I think I might have been better off with Eagles....

Steve Grandi, National Optical Astronomy Observatories, Tucson, AZ, 602-325-9228
{arizona,decvax,hao,ihnp4,seismo}!noao!grandi  noao!grandi@lbl-csam.ARPA

sdyer@bbnccv.UUCP (Steve Dyer) (06/30/85)

> Rumor hath it that a program is available through DEC field service to
> revector bad blocks on UDA disk drives (RA81s in particular).  Details
> of the rumor are that the program is a "standalone" program written by
> the Ultrix folks called /rabads that can be booted instead of vmunix
> and that non-Ultrix sites running 4.2BSD can obtain the program
> through their field service reps. 

We have Ultrix and RA81's, which means that we were incredibly eager to
get our hands on "rabads" as soon as it was available from our field
service organization.  At least on the particular RA81's we had, "rabads"
was no solution at all, promptly crapping out as soon as it hit our
bad spots.  Same thing happened with a newer version of the program
"which works", as our field rep quaintly told me.

We are still having trouble with one of our RA81's almost 9 months
after installation of the machine.  DEC seems eager to help, but
remains confounded by the complexity of the drive and the UDA50 MSCP.

Result: the next two VAX 785's we got had an Emulex/Eagle combination.
Not a problem from the day they were installed.
-- 
/Steve Dyer
{decvax,linus,ima,ihnp4}!bbncca!sdyer
sdyer@bbnccv.ARPA

chris@umcp-cs.UUCP (Chris Torek) (06/30/85)

Important announcement:  I have set up a new mailing list,
info-uda50@maryland, a.k.a. umcp-cs!info-uda50.  Anyone who wants
to subscribe, please send mail to info-uda50-request@maryland
(or umcp-cs!info-uda50-request).
------
(enough of that)

Speaking of forced error data errors, I found that doing a replace
operation with a forced error modifier made our UDA50 controller
say "yes, I did the replacement", but the bad block descriptors
were unchanged.  Removing the forced error modifier made it work.
Since then the controller microcode has been "upgraded", so I
suppose it could have been just another UDA50 bug....

The cylinder group summary information must (obviously) be rewritten
every time the cylinder groups change, which is certain to be very
often, but it can be regenerated, so I wouldn't worry about losing
it.  Interesting that the sectors containing that info are the first
ones to start showing errors.  Perhaps the oxide is getting
attracted to the disk heads :-).
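
A rough illustration of why the summary area is regenerable: going by
fs.h, each cylinder group block appears to keep its own copy of the
four counts (the cg_cs member of struct cg), so the summary can be
rebuilt by rereading the groups.  The sketch below shows only the
totalling step.  The field names follow struct csum in fs.h;
total_csum() and the in-memory array of groups are hypothetical and
are not fsck's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Per-cylinder-group summary; field names as in 4.2BSD <sys/fs.h>. */
struct csum {
    long cs_ndir;    /* number of directories */
    long cs_nbfree;  /* number of free blocks */
    long cs_nifree;  /* number of free inodes */
    long cs_nffree;  /* number of free frags */
};

/*
 * Hypothetical helper: add up the per-group counts to get the
 * file-system-wide totals.  `groups` stands in for the copies of
 * the counts read back from each cylinder group.
 */
struct csum total_csum(const struct csum *groups, int ncg)
{
    struct csum total = {0, 0, 0, 0};
    int i;

    assert(ncg >= 0);
    assert(groups != NULL || ncg == 0);

    for (i = 0; i < ncg; i++) {
        total.cs_ndir   += groups[i].cs_ndir;
        total.cs_nbfree += groups[i].cs_nbfree;
        total.cs_nifree += groups[i].cs_nifree;
        total.cs_nffree += groups[i].cs_nffree;
    }
    return total;
}
```

Nothing in the sketch reads the damaged summary blocks themselves,
which is the point: the totals depend only on the per-group data.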
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland