sdyer@bbncca.ARPA (Steve Dyer) (11/07/84)
We have three RA81 drives on a single UDA50. We installed ULTRIX last week without any problems and have been burning the system in before opening it up for general use. Within the last few days, two of the drives have developed "hard errors" which were not present at the installation. Naturally a few of them reside in the swap area, thus randomly killing processes, and a few reside in files like /usr/lib/aliases.pag. Only a minor headache!

With an "ordinary" disk system, I'd probably reformat the drives (thus marking these new sectors as bad). This does not seem to be an option with the RA81 series--my field service guy is recommending replacement of the head/disk assembly, which seems reasonable given their early mortality, but it seems unwise as a general practice.

My questions are:

Is replacement the only solution to post-factory hard errors?
Is there a formatter available for RA81's?
Does it mark newly found bad sectors?
Does the RA81 driver in ULTRIX handle bad sectors as claimed?

Some comments:

You might as well be running pure AT&T System V for all the DEC field service people know about how to interpret ULTRIX console messages.

There is apparently no "warranty" period for the ULTRIX software, at least as regards software support. We haven't yet purchased a software maintenance agreement, since it isn't yet clear to me that ULTRIX is preferable to vanilla 4.2, if you have a source license. But when I tried to call about this problem, I got the runaround about not having a software support agreement. Naturally, a call to my DEC salesman, who knows the value of our account, was able to bypass this, but the person I spoke to was unable to offer any comment, having to promise to get back to me.
--
/Steve Dyer
{decvax,linus,ima,ihnp4}!bbncca!sdyer
sdyer@bbncca.ARPA
chris@umcp-cs.UUCP (Chris Torek) (11/09/84)
A few comments:

- There are some versions of uda drivers that mistakenly print "hard error" for every error (but---I *think*---really "know" the difference) (but if you're getting "sorry, pid foo killed due to swap error" then that isn't it).

- We *seem* to have a copy of the ULTRIX RA81 driver (can't be positive), and our copy doesn't do bad block replacement.

- Someone has finally done a "complete" driver that *does* do bad block replacement; it's in beta test now, and it seems that it will be given away. (If you want to plead for a beta test copy, I'll mail you the address.)
--
(This mind accidently left blank.)
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690
UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@maryland
David L. Gehrt <dave@RIACS.ARPA> (11/15/84)
If you haven't discovered it by now, there is (probably) more misinformation and disinformation circulating about the UDA50 and the attached devices than about any other kind of computing machinery or peripherals I have encountered in 20 or so years of being "around". I will not claim special immunity from the effects of the great information void, but I will give what I believe to be the best information available in response to your questions. Perhaps someone who has real, hard data can correct any misstatements contained herein.

First, C. Torek is correct that a number of drivers report hard errors when in fact the drive or controller only found a soft error. For example, our RA81's have frequent soft errors of the type "[1-8] symbol ecc error" reported in a datagram. The indication is that the controller/drive found an error, and corrected it using its error recovery logic. Early versions would have reported these errors as "hard".

Another thing that is difficult to discern from looking at only the console output is whether the error message is the result of an "end message" or a "datagram". Multiple datagrams can be generated from a single transaction to disk, and in fact are not guaranteed to be delivered to the host at all. There is, on the other hand, only one end message per transaction to disk, and it is guaranteed to be delivered.

Another problem is that unless you have real clout with the powers that be, getting documentation about what is really going on based on the error messages is a real hassle (read impossible as I understand the current situation). DEC is trying to prevent competitors from entering the UDA50/RA?? market, I guess.

The final error message problem is that there are few drivers (one?) which will report the existence of a "bad block" on the console, and there are hard errors which do not result in, or flow from, bad blocks.

Your questions:

Is replacement the only solution to post-factory hard errors? No.
It does seem to me that you first need to eliminate electrical and mechanical problems which might indicate a repair or replacement is in order. I have heard of a number of "bad block" problems on RA?? drives which went away when grounding straps were cinched down, or power supplies were tweaked or replaced. Also, internal electronics problems in the drives can give problems not much different in appearance from media going bad (according to legend). If the problem still appears to be bad blocks, for real, there will be a driver around soon which will handle the bad block reports, and arrange for revectoring.

Is there a formatter available for RA81's? I don't think so. As nearly as I can tell the drives are formatted using commands in a protocol (*NOT* mscp) for which I have never seen any documentation. This means that the formatter available from DEC is what there is, and it isn't too great. It will format any amount of the disk surface you would like, as long as it is the entire surface. It is not clear that it will correctly handle bad blocks, except that if it is true that the drive will not *write* a bad sector (a claim of which I am very skeptical), then perhaps formatting and restoring might be a way out. I am skeptical. There is a mode in which the standalone formatter for the UDA50/RA?? devices will start from scratch and reinitialize an entire pack, supposedly rebuilding the RCT, and otherwise handling bad blocks. My CE says that once done, there is no guarantee that the disk will *ever* be usable again. Sounds like a real slick piece of software to me.

Does it mark newly found bad sectors? No, not on its own. There are flags in the end message which indicate that a bad block was detected, and whether or not there were more which couldn't be reported, and a field which indicates which logical block has been found "bad". The action taken, in most current drivers, is to set an error flag in a struct buf, and hang it up.
There is a fairly complicated dance the host can engage in with the hardware to have the block revectored. The driver in beta test does this little dance. If the host throws away the bad block report, the controller couldn't care less. The legend that the controller handles bad blocks on its own is a myth. I have never heard of a way to get the controller to do the revectoring on its own. There is nothing to keep a unix system from doing the revectoring. Contrary to the comments in /etc/disktab, the RCTs required for the bad block forwarding operation lie safely out of reach beyond the user accessible disk surface during normal disk operations by the 4.2 driver.

Does the RA81 driver in ULTRIX handle bad sectors as claimed? I have never heard any informed person, knowledgeable in ULTRIX, claim it did. Several months ago I saw a copy of a driver purporting to be from ULTRIX (miles of copyright notices, and disclaimers and so on); it had no code for bad block revectoring in it. I have heard that the ULTRIX folks are going to come up with a standalone program to do bad block revectoring, but that is an unsubstantiated rumor, and the persons whom I tried to contact did not return my call.

I hope this helps, but if you have more questions, drop me a line.

dave

P.S. Oh, a person who responded to your message couldn't understand bad blocks in the swap area. Actually, I suspect that there is probably more i/o done in swap space than in other areas, and I would expect media deterioration and bad blocks there first. As for the claim that a dump(8), newfs(8) followed by a restore(8) cleared up the bad blocks, I am skeptical that what is reported represents reality. The restore probably just picked different blocks, or the errors reported were not in fact bad blocks. The controller, at least our micro code version, makes a best effort attempt to write where you tell it to, and to report errors detected. Also, my experience has been that real "bad blocks" do not just go away.
So, although such a strategy might be worth a try, I wouldn't get my hopes up too high. ----------
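The end-message handling described in the post above (one flag for a detected bad block, one for further bad blocks that could not be reported, and a field naming the bad logical block) can be sketched roughly as follows. This is an illustration in Python only, not driver code. The flag values, the `EndMessage` and `Pending` types, and `handle_end_message` are all hypothetical names invented for this sketch, since the real message layout is not given anywhere in this thread. Unlike the drivers described above, which reportedly just mark the buffer in error, the sketch also remembers the report so that a host could later attempt revectoring.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical flag bits, chosen for illustration only.
EF_BAD_BLOCK_REPORTED = 0x80    # a bad block was detected; bad_lbn is valid
EF_BAD_BLOCK_UNREPORTED = 0x40  # more bad blocks were seen but not listed


@dataclass
class EndMessage:
    """Exactly one per disk transaction, per the description above."""
    ok: bool                       # did the transfer itself succeed?
    flags: int                     # bitwise OR of the EF_* bits
    bad_lbn: Optional[int] = None  # logical block number reported as bad


@dataclass
class Pending:
    """Bad-block work the host has not acted on yet."""
    to_revector: List[int] = field(default_factory=list)
    rescan_needed: bool = False


def handle_end_message(msg: EndMessage, pending: Pending) -> bool:
    """Record any bad-block report carried by one end message.

    Returns True when the transfer failed, i.e. when the caller should
    mark the buffer in error.
    """
    if msg.flags & EF_BAD_BLOCK_REPORTED and msg.bad_lbn is not None:
        # Keep the report instead of discarding it; avoid duplicates.
        if msg.bad_lbn not in pending.to_revector:
            pending.to_revector.append(msg.bad_lbn)
    if msg.flags & EF_BAD_BLOCK_UNREPORTED:
        # The controller saw more bad blocks than it could name.
        pending.rescan_needed = True
    return not msg.ok
```

The point of keeping the `Pending` record separate from the error return is the one made above: if the host throws the report away, nothing else will act on it. The revectoring step itself is deliberately left out, because the thread does not describe it.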
ronb@natmlab.OZ (Ron Baxter) (11/23/84)
In article <bbncca.1111> sdyer@bbncca.ARPA (Steve Dyer) writes:
>....... Within the last few days, two of the drives have developed
>"hard errors" which were not present at the installation. Naturally a few
>of them reside in the swap area, thus randomly killing processes, and a few
>reside in files like /usr/lib/aliases.pag. Only a minor headache!

Some months back one of our RA81s developed bad-blocks. I had assumed that RA81s were immune from bad-blocks due to their intelligence. Our Field Engineer said that while the RA81 would not write on a bad-block (i.e. a block that does not read-check after writing), it has no special magic to cope with blocks in an existing file that were "good" and then go "bad". His advice was to dump the whole file-system if possible (it wasn't really), and then restore from backup and the bad-blocks should go away. THEY DID! So I do not understand how "bad-blocks" in the swap area could occur, while bad-blocks in an aliases file are easier to understand.

PS it turned out later that the appearance of "bad-blocks" on our RA81 seemed to be associated with the gradual failure of a power supply (the voltages were just going too low). Besides bad blocks, this problem also made the drive go off-line by itself.