[net.unix-wizards] more tales of RA81s--handling of bad sectors

sdyer@bbncca.ARPA (Steve Dyer) (11/07/84)

We have three RA81 drives on a single UDA50.  We installed ULTRIX last week
without any problems and have been burning the system in before opening it
up for general use.  Within the last few days, two of the drives have developed
"hard errors" which were not present at the installation.  Naturally a few
of them reside in the swap area, thus randomly killing processes, and a few
reside in files like /usr/lib/aliases.pag.  Only a minor headache!

With an "ordinary" disk system, I'd probably reformat the drives (thus
marking these new sectors as bad).  This does not seem to be an option with
the RA81 series--my field service guy is recommending replacement of the
head/disk assembly, which seems reasonable given their early mortality, but
it seems unwise as a general practice.

My questions are:

	Is replacement the only solution to post-factory hard errors?
	Is there a formatter available for RA81's?
	Does it mark newly found bad sectors?
	Does the RA81 driver in ULTRIX handle bad sectors as claimed?

Some comments:

	You might as well be running pure AT&T System V for all the DEC
	field service people know about how to interpret ULTRIX console
	messages.

	There is apparently no "warranty" period for the ULTRIX software,
	at least as regards software support.  We haven't yet purchased
	a software maintenance agreement, since it isn't yet clear to me
	that ULTRIX is preferable to vanilla 4.2, if you have a source
	license.  But when I tried to call about this problem, I got the
	runaround about not having a software support agreement.  Naturally,
	a call to my DEC salesman, who knows the value of our account, was
	able to bypass this, but the person I spoke to was unable to offer
	any comment, having to promise to get back to me.
-- 
/Steve Dyer
{decvax,linus,ima,ihnp4}!bbncca!sdyer
sdyer@bbncca.ARPA

chris@umcp-cs.UUCP (Chris Torek) (11/09/84)

A few comments:

- There are some versions of uda drivers that mistakenly print "hard
  error" for every error (but---I *think*---really "know" the difference)
  (but if you're getting "sorry, pid foo killed due to swap error" then
  that isn't it).

- We *seem* to have a copy of the ULTRIX RA81 driver (can't be positive),
  and our copy doesn't do bad block replacement.

- Someone has finally done a "complete" driver that *does* do bad block
  replacement; it's in beta test now, and it seems that it will be given
  away.  (If you want to plead for a beta test copy, I'll mail you
  the address.)
-- 
(This mind accidently left blank.)

In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

David L. Gehrt <dave@RIACS.ARPA> (11/15/84)

If you haven't discovered it by now, there is (probably) more
misinformation and disinformation about the UDA50 and the attached
devices circulating than about any other kind of computing machinery or
peripherals I have encountered in 20 or so years of being "around".  I
will not claim special immunity from the effects of the great
information void, but I will give what I believe to be the best
information available in response to your questions.  Perhaps someone
who has real, hard data, can correct any misstatements contained herein.

First, C. Torek is correct that a number of drivers report hard
errors when in fact the drive or controller only found
a soft error.  For example our RA81's have frequent soft errors of the
type "[1-8] symbol ecc error" reported in a datagram.  The indication
is that the controller/drive found an error, and corrected it using its
error recovery logic.  Early versions would have reported these errors
as "hard".  Another thing that is difficult to discern from looking at
only the console output is whether the error message is the result of
an "end message" or a "datagram".  Multiple datagrams can be generated
from a single transaction to disk, and in fact are not guaranteed to be
delivered to the host at all.  There is, on the other hand, only one
end message per transaction to disk, and it is guaranteed to be
delivered.  Another problem is that unless you have real clout with the
powers that be, getting documentation about what is really going on
based on the error messages is a real hassle (read impossible as I
understand the current situation).  DEC is trying to prevent
competitors from entering the UDA50/RA?? market, I guess.  The final
error message problem is that there are few drivers (one?) which will
report the existence of a "bad block" on the console, and there are
hard errors which do not result in, or flow from, bad blocks.
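
To make the soft/hard and datagram/end-message distinctions concrete,
here is a minimal sketch in Python.  It is not taken from any driver or
from DEC documentation: the Message layout, the "kind" and "status"
values, and the console format are all assumptions made up for
illustration.  It only shows the idea that datagrams are advisory (they
may never arrive), while the single end message decides the fate of the
transfer.

```python
from dataclasses import dataclass

@dataclass
class Message:
    kind: str    # hypothetical tag: "datagram" or "end"
    status: str  # hypothetical: "ok", "corrected", or "failed"
    text: str

def severity(msg):
    """Return 'soft' for a corrected error, 'hard' for a failure, None if ok."""
    if msg.status == "corrected":
        return "soft"
    if msg.status == "failed":
        return "hard"
    return None

def console_line(msg):
    """Format a console report that names both the severity and the message kind."""
    return f"uda: {severity(msg)} error ({msg.kind}): {msg.text}"

def transaction_failed(messages):
    """Decide the outcome from the end message alone.

    Datagrams are ignored here because delivery is not guaranteed;
    exactly one end message is expected per transaction.
    """
    ends = [m for m in messages if m.kind == "end"]
    if len(ends) != 1:
        raise ValueError("expected exactly one end message per transaction")
    return ends[0].status == "failed"
```

With these assumptions, a corrected ecc datagram followed by a
successful end message prints as a soft error and the transfer is not
treated as failed.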

Your questions:

Is replacement the only solution to post-factory hard errors?

No.  It does seem to me that you first need to rule out electrical and
mechanical problems, which might indicate that a repair or replacement
is in order.  I have heard of a number of "bad block" problems on RA?? drives
which went away when grounding straps were cinched down, or power
supplies were tweaked or replaced.  Also, internal electronics problems
in the drives, can give problems not much different in appearance from
media going bad (according to legend).  If the problem still appears to
be bad blocks, for real, there will be a driver around soon which will
handle the bad block reports, and arrange for revectoring.

Is there a formatter available for RA81's?

I don't think so.  As nearly as I can tell the drives are formatted
using commands in a protocol (*NOT* mscp) for which I have never seen
any documentation.  This means that the formatter available from DEC is
what there is, and it isn't too great.  It will format any amount of
the disk surface you would like, as long as it is the entire surface.
It is not clear that it will correctly handle bad blocks, except that
if it is true that the drive will not *write* a bad sector (a claim of
which I am very skeptical), then perhaps formatting and restoring
might be a way out.  I am skeptical. There is a mode in which the
standalone formatter for the UDA50/RA?? devices will start from scratch
and reinitialize an entire pack, supposedly rebuilding the RCT, and
otherwise handling bad blocks.  My CE says that once done, there is no
guarantee that the disk will *ever* be usable again.  Sounds like a real
slick piece of software to me.

Does it mark newly found bad sectors?

No, not on its own.  There are flags in the end message which
indicate that a bad block was detected, and whether or not there were
more which couldn't be reported, and a field which indicates which
logical block has been found "bad".  The action taken, in most current
drivers, is to set an error flag in a struct buf, and hang it up.
There is a fairly complicated dance the host can engage in with the
hardware, and have the block revectored.  The driver in beta test does
this little dance.  If the host throws away the bad block report the
controller couldn't care less.  The legend that the controller handles bad
blocks on its own is a myth.  I have never heard of a way to get the
controller to do the revectoring on its own.  There is nothing to keep
a unix system from doing the revectoring.  Contrary to the comments in
/etc/disktab, the RCTs required for the bad block forwarding operation
lie safely out of reach beyond the user accessible disk surface during
normal disk operations by the 4.2 driver.
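
As a rough illustration of the end-message handling described above,
here is a minimal sketch in Python.  The flag names and bit values are
hypothetical placeholders, not the real protocol definitions, and the
returned strings merely stand in for what a driver would do.  The
default path mirrors what most current drivers are said to do (fail the
buf); the can_revector path stands in for the "dance" done by the driver
in beta test, whose details are not shown here.

```python
# Hypothetical flag bits, chosen only for this example.
BAD_BLOCK_REPORTED = 0x80    # a bad block was detected; its LBN is in the message
BAD_BLOCK_UNREPORTED = 0x40  # further bad blocks were seen but could not be reported

def decode_end_message(flags, bad_lbn):
    """Return (reported bad LBN or None, whether more bad blocks went unreported)."""
    reported = bool(flags & BAD_BLOCK_REPORTED)
    more = bool(flags & BAD_BLOCK_UNREPORTED)
    return (bad_lbn if reported else None, more)

def driver_action(flags, bad_lbn, can_revector=False):
    """Pick an action for one end message.

    Without revectoring support the request is simply failed; with it,
    the reported logical block is handed off for replacement.
    """
    lbn, more = decode_end_message(flags, bad_lbn)
    if lbn is None and not more:
        return "complete"
    if can_revector and lbn is not None:
        return f"revector lbn {lbn}"
    return "fail buf with error"
```

Note that nothing in this sketch happens unless the host looks at the
flags; a driver that ignores them leaves the bad block in place.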

Does the RA81 driver in ULTRIX handle bad sectors as claimed?

I have never heard any informed person, knowledgeable in ULTRIX, claim
it did.  Several months ago I saw a copy of a driver purporting to be
from ULTRIX (miles of copyright notices, and disclaimers and so on) it
had no code for bad block revectoring in it.  I have heard that the
ULTRIX folks are going to come up with a standalone program to do bad
block revectoring, but that is an unsubstantiated rumor, and the
persons whom I tried to contact, did not return my call.

I hope this helps, but if you have more questions, drop me a line.

dave

P.S.

Oh, a person who responded to your message couldn't understand bad
blocks in the swap area.  Actually, I suspect that there is probably more
i/o done in swap space than in other areas, and I would expect media
deterioration and bad blocks there first.

As for the claim that a dump(8), newfs(8) followed by a restore(8)
cleared up the bad blocks, I am skeptical that what is reported
represents reality.  The restore probably just picked different
blocks, or the errors reported were not in fact bad blocks. The
controller, at least our micro code version, makes a best effort
attempt to write where you tell it to, and to report errors detected.
Also, my experience has been that real "bad blocks" do not just go
away.  So, although such a strategy might be worth a try, I wouldn't
get my hopes up too high.
----------

ronb@natmlab.OZ (Ron Baxter) (11/23/84)

In article <bbncca.1111> sdyer@bbncca.ARPA (Steve Dyer) writes:
>.......  Within the last few days, two of the drives have developed
>"hard errors" which were not present at the installation.  Naturally a few
>of them reside in the swap area, thus randomly killing processes, and a few
>reside in files like /usr/lib/aliases.pag.  Only a minor headache!

Some months back one of our RA81s developed bad-blocks. I had assumed
that RA81s were immune from bad-blocks due to their intelligence.  Our
Field Engineer said that while the RA81 would not write on a bad-block
(ie a block that does not read-check after writing), it has no special
magic to cope with blocks in an existing file that were "good" and then
go "bad".  His advice was to dump the whole file-system if possible (it
wasn't really), and then restore from backup and the bad-blocks should
go away.  THEY DID!  So I do not understand how "bad-blocks" in the
swap area could occur, while bad-blocks in an aliases file are easier
to understand.

PS it turned out later that the appearance of "bad-blocks" on our RA81
seemed to be associated with the gradual failure of a power supply (the
voltages were just going too low).  Besides bad blocks this problem also
made the drive go off-line by itself.