Mike Caplinger <mike@rice.ARPA> (01/29/85)
Could somebody with a real UDA50 programmer's manual tell me two things?

1) Does the UDA50 do its own bad block forwarding? This is a persistent rumor, but totally false as far as I can tell.
2) Is there DEC 144 forwarding information on the pack? Does bad144(8) understand it?

We have what appear to be bad blocks in the middle of one of our swap regions, so I assume that the answer to (1) is no. Is there any way, short of reformatting the pack, to get those blocks forwarded? I thought I heard something about a bad block forwarding uda.c for 4.2 a while ago. Does it work? Can somebody send me a copy?

Thanks for the information.

- Mike

P.S. If you have the choice, buy Eagles.
David L. Gehrt <dave@RIACS.ARPA> (01/29/85)
There is a flag mentioned in the mscp header file in 4.2 (sys/vax/mscp):

#define M_UF_REPLC 0100000 /* Controller initiated bad block replacement */

which seems to imply that the controller will do bad block replacement on its own. I don't know what that means. VMS goes through a very elaborate tap dance with the controller to accomplish "host initiated bad block replacement." The work done in this tap dance implies that bad block reports (a bad block report is *not* just any hard error, it is an end message flag with the M
----------
David L. Gehrt <dave@RIACS.ARPA> (01/29/85)
To all wizards recipients, sorry about the earlier partial message.

There is a flag mentioned in the mscp header file in 4.2 (sys/vax/mscp.h):

#define M_UF_REPLC 0100000 /* Controller initiated bad block replacement */

which seems to imply that the controller will do bad block replacement on its own. I don't know what that flag means. I couldn't find anyone who knows what it means, so I looked at VMS; it is not used there as far as my microfiched-out eyes could tell.

VMS goes through a very elaborate tap dance with the controller to accomplish "host initiated bad block replacement." The work done in this tap dance implies that bad block reports are not to be taken at face value. The allegedly bad block must be checked out. If it is in fact a bad block, then there are records maintained (as I understand it, solely for the benefit of the host) for finding a suitable replacement block, for the subsequent replacement of bad replacement blocks, and as precautions against a crash in mid-replacement. Finally, a replacement block is located, and the controller is directed to replace the bad block with the selected replacement block. As distributed, 4.2 supports *neither* host nor controller initiated bad block replacement.

As far as the DEC 144 information being on disk: there is data beyond the end of the user area (starting at logical block 891072 on an RA81) which complies at least in part with DEC 144. Although I've not seen 144, the info looks a lot like what I recall of the structs referenced in bad144. These tables are called the RCT (replacement and caching tables), and they are crucial to host initiated bad block replacement, so even if you figure out how to mess with them, don't, unless you know what you are doing.

By the way, a bad block report is *not* just any hard error; it is an end message flag with the M_EF_BBLKR bit set. Although it does sound like you may be experiencing bad blocks, you need to make sure that the end message flag above is being checked and reported properly.
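A minimal sketch of the end message check described above, assuming a cut-down, hypothetical end message struct (`struct end_msg` and `is_bad_block_report` are illustrative names, not the real driver's structures). The value shown for M_EF_BBLKR is what I believe 4.2's mscp.h defines; treat it as an assumption and verify it against your own header.

```c
#include <stdio.h>

/* Assumed value; check sys/vax/mscp.h on your own system. */
#define M_EF_BBLKR 0200 /* end message flag: bad block reported */

/* Hypothetical, cut-down stand-in for an MSCP end message. */
struct end_msg {
    unsigned char  flags;  /* end message flags */
    unsigned short status; /* status code; nonzero means some error here */
};

/*
 * Returns 1 only when the controller set the bad block report bit.
 * A nonzero status alone (an ordinary hard error) is not enough.
 */
static int is_bad_block_report(const struct end_msg *m)
{
    return (m->flags & M_EF_BBLKR) != 0;
}

/* Log an error, distinguishing a bad block report from other errors. */
static void report_end_msg(const struct end_msg *m)
{
    if (is_bad_block_report(m))
        printf("bad block reported (status %u)\n", (unsigned)m->status);
    else if (m->status != 0)
        printf("error without bad block report (status %u)\n",
               (unsigned)m->status);
}
```

The point of keeping the flag test separate from the status test is that a driver which looks only at the status would count every hard error as a bad block.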
Also, the distributed driver reports all errors from datagrams, as opposed to end messages, as hard errors. I think just about everybody has that one fixed. But unless you are getting processes killed on swap errors, you may be seeing hard error messages from datagrams which are really soft.

If you check and you are *sure* you have bad blocks, the hope is to switch to a driver which does host initiated bad block replacement. One should be announced here within a week (I hope). In the meantime there are a few things anyone should do before they start replacing bad blocks wholesale. They all involve getting DEC involved, and they apply to RA81s, but then that is the only kind with which I have much experience; our RA60 is a new kid on the block. Also, as far as I know they apply pretty much to drives manufactured before last summer.

1. There is a field upgrade to the HDA hub grounding brush. Early ones had carbon contacts, new ones have space age materials (rejected shuttle tiles?) for contacts. The failure of the brush can cause ECC errors (mostly soft, but now and again a hard one).

2. The next piece of business is to have the rev level of the read/write board checked. This board sits right on top of the HDA, and there is an eyeball check for the newer rev level, which your CE can make while removing the board to replace the brush. This upgrade is to increase the read/write current, that is, "press harder while writing" to make things easier to read.

3. The final thing might be to have the CE do an audit to make sure you are up to snuff as to the rev level of all the boards. For example, I did the brush business a while back on a drive on which I had replaced about 500 blocks, the r/w board was replaced, and we were still taking a number of 5-8 symbol ECC errors (soft) in the swap partition. No bad blocks, just annoying messages, and a little degradation in system performance while the kernel printf runs, and runs (10 or 12 times a day).
I guess it turns out the rev upgrade of the r/w board should be accompanied by an upgrade of the HDA. Actually, they are checking all the boards in all three drives. When our audit is done we will take a look at what to do. One drive is giving almost no problems, and it is probably as out of rev as the other two, which are reporting soft errors. "If it isn't broke don't fix it" are sound words of advice. My advice is to find out for sure what you have before rushing to start doing bad block replacements; we can communicate privately about that.

One word of warning: there is a bad block replacement driver which circulated a while back, which used a cut-down algorithm which, while effective, has some problems. The search algorithm is probably non-optimal, it makes no attempt to save the data and restore it after replacement, and I think it *may* be worse than nothing, but I am not all that sure.

There is so little hard, reliable information about these devices that we have set up a mailing list for those who are in deep with these devices. I don't know how long it will last, probably only until DEC helps clear the smoke away. To sign up, write uda-request@RIACS-icarus.arpa.

dave
----------
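For readers trying to picture the order of operations in host initiated replacement as outlined in this thread (check the block, save the data, find a replacement, guard against a crash, replace, restore), here is a toy in-memory sketch. Everything in it is hypothetical: `struct toy_disk`, `toy_init`, `toy_replace`, the sizes, and the first-fit search are illustration only. It is not the VMS algorithm, does not talk to a controller, and does not use the real RCT layout.

```c
#include <string.h>

#define NBLOCKS 16  /* hypothetical user-area size, in blocks */
#define NSPARES 4   /* hypothetical pool of replacement blocks */
#define BLKSIZE 512

struct toy_disk {
    unsigned char data[NBLOCKS + NSPARES][BLKSIZE];
    int really_bad[NBLOCKS + NSPARES]; /* mock media state, 1 = bad */
    int remap[NBLOCKS];                /* -1 = not replaced, else new block */
    int spare_used[NSPARES];
    int in_progress;                   /* block being replaced, -1 if none */
};

static void toy_init(struct toy_disk *d)
{
    int i;

    memset(d, 0, sizeof *d);
    for (i = 0; i < NBLOCKS; i++)
        d->remap[i] = -1;
    d->in_progress = -1;
}

/*
 * Returns the index of the replacement block, or -1 if the block
 * tested good, is out of range, or no usable spare is left.
 */
static int toy_replace(struct toy_disk *d, int lbn)
{
    unsigned char saved[BLKSIZE];
    int s;

    if (lbn < 0 || lbn >= NBLOCKS)
        return -1;

    /* 1. Do not take the report at face value: re-test the block. */
    if (!d->really_bad[lbn])
        return -1;

    /* 2. Save whatever data is still there before touching anything. */
    memcpy(saved, d->data[lbn], BLKSIZE);

    /* 3. First-fit search for a good, unused replacement block. */
    for (s = 0; s < NSPARES; s++)
        if (!d->spare_used[s] && !d->really_bad[NBLOCKS + s])
            break;
    if (s == NSPARES)
        return -1;

    /* 4. Record the replacement in progress, as a guard against a crash. */
    d->in_progress = lbn;

    /* 5. Point the bad block at its replacement. */
    d->spare_used[s] = 1;
    d->remap[lbn] = NBLOCKS + s;

    /* 6. Restore the saved data, then clear the in-progress marker. */
    memcpy(d->data[NBLOCKS + s], saved, BLKSIZE);
    d->in_progress = -1;

    return NBLOCKS + s;
}
```

Steps 2 and 6 are the save and restore that the cut-down driver mentioned above is said to skip. In a real driver the in-progress marker would have to live on the disk to survive a crash; here it is only a field in memory.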