[net.unix-wizards] UDA50 and bad block forwarding

Mike Caplinger <mike@rice.ARPA> (01/29/85)

Could somebody with a real UDA50 programmer's manual tell me two things?

1)  Does the UDA50 do its own bad block forwarding?  This is a
    persistent rumor, but totally false as far as I can tell.

2)  Is there DEC 144 forwarding information on the pack?  Does bad144(8)
    understand it?

We have what appear to be bad blocks in the middle of one of our swap
regions, so I assume that the answer to (1) is no.  Is there any way,
short of reformatting the pack, to get those blocks forwarded?

I thought I heard something about a bad block forwarding uda.c for 4.2
a while ago.  Does it work?  Can somebody send me a copy?

Thanks for the information.

	- Mike

ps.  If you have the choice, buy Eagles.

David L. Gehrt <dave@RIACS.ARPA> (01/29/85)

There is a flag mentioned in the mscp header file in 4.2:

#define	M_UF_REPLC 0100000 /* Controller initiated bad block replacement */

(sys/vax/mscp.h) which seems to imply that the controller will do bad
block replacement on its own.  I don't know what that flag means, and
I couldn't find anyone who does, so I looked at VMS; as far as my
microfiche-worn eyes could tell, the flag is not used there.
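
For what it's worth, here is a minimal sketch of the sort of check a
driver could make.  The struct layout and field name below are
stand-ins of my own invention, not the real sys/vax/mscp.h; only the
M_UF_REPLC define comes from the header:

#include <stdio.h>

#define	M_UF_REPLC 0100000 /* Controller initiated bad block replacement */

struct online_end {		/* hypothetical slice of an ONLINE end message */
	unsigned short unitflgs;	/* unit flags returned by the controller */
};

/* Does the controller claim to do its own replacement? */
int
controller_replaces(struct online_end *mp)
{
	return (mp->unitflgs & M_UF_REPLC) != 0;
}

int
main(void)
{
	struct online_end end;

	end.unitflgs = M_UF_REPLC;	/* pretend the controller set it */
	printf("controller initiated replacement: %s\n",
	    controller_replaces(&end) ? "yes" : "no");
	return 0;
}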

VMS goes through a very elaborate tap dance with the controller to
accomplish "host initiated bad block replacement."  The work done in
this tap dance implies that bad block reports are not to be taken at
face value.  The allegedly bad block must first be checked out.  If it
is in fact bad, records are maintained (solely for the benefit of the
host, as I understand it) for finding a suitable replacement block,
for the subsequent replacement of bad replacement blocks, and as a
precaution against a crash in mid-replacement.  Finally, a replacement
block is located, and the controller is directed to replace the bad
block with it.  As distributed, 4.2 supports *neither* host- nor
controller-initiated bad block replacement.
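
In outline, the tap dance looks something like the sketch below.  The
helper names are hypothetical and the stubs just narrate; only the
ordering of the steps comes from what VMS appears to do:

#include <stdio.h>

/* Stubs standing in for real driver work; each just narrates its step. */
static int
verify_block(long lbn)
{
	printf("re-read and test lbn %ld\n", lbn);
	return 1;			/* pretend it really is bad */
}

static void
rct_mark_pending(long lbn)
{
	printf("note in the RCT that replacement of %ld is under way\n", lbn);
}

static long
rct_find_replacement(long lbn)
{
	printf("search the RCT for a free replacement for %ld\n", lbn);
	return 999999L;			/* hypothetical replacement block */
}

static void
copy_block(long from, long to)
{
	printf("salvage what data %ld still yields into %ld\n", from, to);
}

static void
mscp_replace(long lbn, long rbn)
{
	printf("direct the controller to revector %ld to %ld\n", lbn, rbn);
}

static void
rct_mark_done(long lbn)
{
	printf("clear the in-progress record for %ld\n", lbn);
}

int
main(void)
{
	long lbn = 12345L;		/* the allegedly bad block */
	long rbn;

	if (!verify_block(lbn))		/* reports not taken at face value */
		return 0;		/* false alarm, nothing to do */
	rct_mark_pending(lbn);		/* so a crash here is recoverable */
	rbn = rct_find_replacement(lbn);
	copy_block(lbn, rbn);
	mscp_replace(lbn, rbn);
	rct_mark_done(lbn);
	return 0;
}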

As far as the DEC standard 144 information being on disk: there is
data beyond the end of the user area (starting at logical block 891072
on an RA81) which complies at least in part with standard 144.  I've
not seen the standard itself, but the information looks a lot like the
structs bad144 references, as I recall them.  These tables are called
the RCT (replacement and caching tables), and they are crucial to host
initiated bad block replacement, so even if you figure out how to mess
with them, don't, unless you know what you are doing.
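
If you just want to look at the RCT (look, don't touch), something
like the following will do.  The raw device name and the 512-byte
block size are assumptions of mine, and whether your partition even
extends that far is another; only the 891072 figure is the RA81 number
above:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define	RCT_START	891072L	/* first RCT block on an RA81 */
#define	BSIZE		512	/* bytes per logical block (assumed) */

int
main(void)
{
	char buf[BSIZE];
	int fd;

	fd = open("/dev/rra0c", O_RDONLY);	/* hypothetical raw device */
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (lseek(fd, (off_t)RCT_START * BSIZE, SEEK_SET) == (off_t)-1 ||
	    read(fd, buf, BSIZE) != BSIZE) {
		perror("rct read");
		return 1;
	}
	printf("read RCT block %ld; don't write it back\n", RCT_START);
	return 0;
}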

By the way, a bad block report is *not* just any hard error; it is an
end message with the M_EF_BBLKR flag bit set.  Although it does sound
like you may be experiencing bad blocks, you need to make sure that
the end message flag above is being checked and reported properly.
Also, the distributed driver reports all errors from datagrams, as
opposed to end messages, as hard errors.  I think just about everybody
has that one fixed, but unless you are getting processes killed on
swap errors, you may be seeing hard error messages from datagrams
which are really soft.
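
A sketch of that check, for concreteness.  The struct is a
simplification of mine, and I have given M_EF_BBLKR an assumed value,
so take the real one from sys/vax/mscp.h:

#include <stdio.h>

#define	M_EF_BBLKR 0100000	/* bad block reported -- value assumed here */

struct err_pkt {		/* hypothetical slice of an error packet */
	int is_datagram;	/* arrived as a datagram, not an end message */
	unsigned short endflags;	/* end message flags */
};

/*
 * Only an end message with M_EF_BBLKR set is a bad block report;
 * a datagram error may well be soft and should not be logged as
 * "hard" out of hand.
 */
const char *
classify(struct err_pkt *mp)
{
	if (mp->is_datagram)
		return "datagram error (possibly soft)";
	if (mp->endflags & M_EF_BBLKR)
		return "bad block reported";
	return "other hard error";
}

int
main(void)
{
	struct err_pkt dg = { 1, 0 };
	struct err_pkt bb = { 0, M_EF_BBLKR };

	printf("%s\n%s\n", classify(&dg), classify(&bb));
	return 0;
}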

If you check and you are *sure* you have bad blocks, the hope is to
switch to a driver which does host initiated bad block replacement.
One should be announced here within a week (I hope).  In the meantime
there are a few things anyone should do before they start replacing
bad blocks wholesale.  They all involve getting DEC involved, and they
apply to RA81s, but then that is the only kind of drive on which I
have much experience; our RA60 is a new kid on the block.  Also, as
far as I know they apply pretty much to drives manufactured before
last summer.

1.  There is a field upgrade to the HDA hub grounding brush.  Early
    ones had carbon contacts; new ones have space age materials
    (rejected shuttle tiles?) for contacts.  The failure of the brush
    can cause ECC errors (mostly soft, but now and again a hard one).

2.  The next piece of business is to have the rev level of the
    read/write board checked.  This board sits right on top of the
    HDA, and there is an eyeball check for the newer rev level, which
    your CE can make while removing the board to replace the brush.
    This upgrade is to increase the read/write current, that is,
    "press harder while writing" to make things easier to read.

3.  The final thing might be to have the CE do an audit to make sure
    you are up to snuff as to the rev level of all the boards.  For
    example, I did the brush business a while back on a drive on which
    I had replaced about 500 blocks, the r/w board was replaced, and
    we were still taking a number of 5-8 symbol ECC errors (soft) in
    the swap partition.  No bad blocks, just annoying messages, and a
    little degradation in system performance while the kernel printf
    runs, and runs (10 or 12 times a day).  It turns out the rev
    upgrade of the r/w board should be accompanied by an upgrade of
    the HDA.  Actually, they are checking all the boards in all three
    drives.  When our audit is done we will take a look at what to do.
    One drive is giving almost no problems, and it is probably as out
    of rev as the other two which are reporting soft errors.  "If it
    isn't broke, don't fix it," are sound words of advice.

My advice is to find out for sure what you have; we can communicate
privately about that before you rush to start doing bad block
replacements.  One word of warning: there is a bad block replacement
driver which circulated a while back that used a cut-down algorithm
which, while effective, has some problems.  The search algorithm is
probably non-optimal, it makes no attempt to save the data and restore
it after replacement, and I think it *may* be worse than nothing, but
I am not all that sure.  There is so little hard, reliable information
about these devices that we have set up a mailing list for those who
are in deep with them.  I don't know how long it will last, probably
only until DEC helps clear the smoke away.  To sign up, write
uda-request@RIACS-icarus.arpa.

dave
----------