grandi@noao.UUCP (Steve Grandi) (06/29/85)
Rumor hath it that a program is available through DEC field service to revector bad blocks on UDA disk drives (RA81s in particular). Details of the rumor are that the program is a "standalone" program written by the Ultrix folks called /rabads that can be booted instead of vmunix and that non-Ultrix sites running 4.2BSD can obtain the program through their field service reps. Has any non-Ultrix site obtained this program? Is there a part number or any identifying information that our Friendly Field Service Man can use to pry it out of DEC's bureaucracy?

We are already running the Riacs UDA driver on our 750's, which will try to revector blocks that generate hard errors; our problem is blocks that generate lots and lots of soft errors. Since soft errors tend to turn into hard errors, since these are rather important blocks (see below), and since revectoring blocks with hard errors often generates data which is not guaranteed to be correct (a "forced error" in MSCP-speak), I would dearly love to revector these marginal blocks now and avoid the massive pain that a trashed file system can bring. (Once burned, twice shy; 5 times in the last year burned, 10**6 times shy!!). Also, the system REALLY slows down when the disk driver is printing error messages on the console.

Obviously, we could probably hack the Riacs driver to give us a utility to revector disk blocks, but another rumor hath it that the procedure used in the driver is not REALLY correct (since DEC is incredibly reluctant to reveal details of the very complicated song and dance that has to be gone through to accomplish this feat, I'm not surprised). Also it would be nice to have a tool that our Friendly Field Service Rep believed in, as opposed to the incredulous looks I get when I explain the history of our disk driver.

Two details of our problems might be of interest to students of MSCP soft error datagrams or of the 4.2BSD file system.
The "drive detected error" we are getting is code 1A39 (that's the contents of word 27 of the SDI error variant of the MSCP packet), which indicates a "servo fine position error" generated when "a write command is attempted while the positioner is off track (not detented)". The servo boards and the R/W boards in the drives showing these errors have all been replaced, so the HDA is obviously showing marginal behavior at these locations.

The disk blocks showing errors are also interesting. For several file systems on several disks on several 750's, relative block numbers 576 and 577 are repeatedly showing up with fine-positioning errors (and these cases constitute about 75% of our total collection of these errors). A morning's study of the output from dumpfs(8) and fs.h indicates that for our 8K/2K and 8K/1K file systems, blocks 576-577 contain the csum structure, which contains a summary of information about all the cylinder groups (number of directories, number of free blocks, number of free inodes, number of free frags).

Obviously, since our disks are figuratively digging holes in the oxide at these blocks, this structure is used a lot, presumably every time a file is created (and extended?). Is this structure a single point of failure? If block 576 is destroyed, is the file system totally trashed or just incapable of creating new files? (In other words, can I dump(8) the file system?) Can fsck completely regenerate the data in the csum structure? (I know fsck can correct things; one often sees "SUMMARY INFORMATION ... BAD" messages on a post-crash reboot.)

All in all, I think I might have been better off with Eagles....

--
Steve Grandi, National Optical Astronomy Observatories, Tucson, AZ, 602-325-9228
{arizona,decvax,hao,ihnp4,seismo}!noao!grandi
noao!grandi@lbl-csam.ARPA
sdyer@bbnccv.UUCP (Steve Dyer) (06/30/85)
> Rumor hath it that a program is available through DEC field service to
> revector bad blocks on UDA disk drives (RA81s in particular). Details
> of the rumor are that the program is a "standalone" program written by
> the Ultrix folks called /rabads that can be booted instead of vmunix
> and that non-Ultrix sites running 4.2BSD can obtain the program
> through their field service reps.

We have Ultrix and RA81's, which means that we were incredibly eager to get our hands on "rabads" as soon as it was available from our field service organization. At least on the particular RA81's we had, "rabads" was no solution at all, promptly crapping out as soon as it hit our bad spots. Same thing happened with a newer version of the program "which works", as our field rep quaintly told me.

We are still having trouble with one of our RA81's almost 9 months after installation of the machine. DEC seems eager to help, but remains confounded by the complexity of the drive and the UDA50 MSCP.

Result: the next two VAX 785's we got had an Emulex/Eagle combination. Not a problem from the day they were installed.

--
/Steve Dyer
{decvax,linus,ima,ihnp4}!bbncca!sdyer
sdyer@bbnccv.ARPA
chris@umcp-cs.UUCP (Chris Torek) (06/30/85)
Important announcement: I have set up a new mailing list, info-uda50@maryland, a.k.a. umcp-cs!info-uda50. Anyone who wants to subscribe, please send mail to info-uda50-request@maryland (or umcp-cs!info-uda50-request).

------
(enough of that)

Speaking of forced error data errors, I found that doing a replace operation with a forced error modifier made our UDA50 controller say "yes, I did the replacement", but the bad block descriptors were unchanged. Removing the forced error modifier made it work. Since then the controller microcode has been "upgraded", so I suppose it could have been just another UDA50 bug....

The cylinder group summary information must (obviously) be rewritten every time the cylinder groups change, which is certain to be very often, but it can be regenerated, so I wouldn't worry about losing it. Interesting that the sectors containing that info are the first to start showing errors. Perhaps the oxide is getting attracted to the disk heads :-).

--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs
ARPA: chris@maryland
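The reason the summary can be regenerated is that it is redundant: it is only a tally of counts that can be recounted from the cylinder groups themselves. A minimal sketch of that idea follows. It assumes the per-group counts have already been recounted by some other means (for example a scan of the free maps and inodes); that recount is the hard part and is not shown. The names `Csum`, `total`, and `repair_summary` are hypothetical, and this is not fsck's actual code.

```python
from typing import List, NamedTuple, Tuple


class Csum(NamedTuple):
    """Per-cylinder-group counts, in the order used by fs.h."""
    ndir: int    # number of directories
    nbfree: int  # number of free blocks
    nifree: int  # number of free inodes
    nffree: int  # number of free frags


def total(groups: List[Csum]) -> Csum:
    """Sum each field over all cylinder groups."""
    if not groups:
        return Csum(0, 0, 0, 0)
    return Csum(*(sum(col) for col in zip(*groups)))


def repair_summary(stored: List[Csum],
                   recounted: List[Csum]) -> Tuple[List[Csum], List[int]]:
    """Replace the stored summary with the recounted one.

    Returns the regenerated summary plus the indices of groups whose
    stored entry was missing or disagreed with the recount.
    """
    bad = [i for i, fresh in enumerate(recounted)
           if i >= len(stored) or stored[i] != fresh]
    return list(recounted), bad
```

A non-empty list of bad indices corresponds, loosely, to the situation in which one sees a "SUMMARY INFORMATION ... BAD" complaint; nothing in the stored copy is needed to produce the repaired one.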